My name is Hamed Nazari. I'm the Principal Scientist of AI & Computer Vision at Comcast Silicon Valley Innovation Center.

In this article, I’ll talk about a pipeline for computer vision, both on-premise and on the edge. There’ll be no math or stats; I’ll be sharing what I learned over the course of a year of development.

What is computer vision?

Computer vision means extracting and understanding the content of an image. The image can be a single frame of a video. If you combine the knowledge you get from each frame, you get a good understanding of the content of the whole video. The foundation of computer vision is therefore a single frame.

We extract the description from the image. For example, if you have an image of a dog walking on the beach, you can build a model to extract the description of that image and apply it to each frame of a video. That way, you can describe the whole content of a video as well.

If you have a few frames, you can group them into one and come up with the topic of the scene in the video. This is called grouping and searching image content. There are also a lot of GANs (Generative Adversarial Networks), which nowadays help with turning a 2D image into 3D.

Technology is moving down that path for sales and production, specifically at big companies that sell their products online. They’re trying to use AI to present the products they have as 2D images in 3D.

Not that this can’t be done manually with 360-degree cameras, but doing it this way can streamline their processes if they have a platform for third-party sellers: the sellers upload their images, and a 3D view of each image is generated automatically.

After cleaning up your images, you push them to your computer vision models for inference. You can also save the frames for later use, or save your data in a database for reporting purposes.

What is image processing?

Image processing isn't the same as computer vision. Image processing means taking an image and manipulating it so that it fits your computer vision models.

Image processing can create a new image of a different size from an existing image. It can also be about simplifying the content in some way, as sometimes you run into images with a dark background you want to remove.

Removing noise, applying filters, cropping, and resizing are all considered image processing, up to the point where you take the image, convert it to a 2D or 3D array, and prepare it for your model.
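To make these steps concrete, here is a minimal sketch of cropping, converting to grayscale, downscaling, and normalizing. It is an illustration only: in my project this work is done with OpenCV, whereas here plain nested lists stand in for the image array so the example has no dependencies. The function names, the tiny 4x4 test image, and the choice of grayscale weights (the common BT.601 luma weights) are assumptions made for the example.

```python
# A tiny 4x4 colour "image": rows of (R, G, B) pixels, values 0..255.
# A bright 2x2 square sits on a dark background.
DARK, BRIGHT = (10, 10, 10), (200, 200, 200)
image = [
    [BRIGHT if 1 <= y <= 2 and 1 <= x <= 2 else DARK for x in range(4)]
    for y in range(4)
]


def crop(img, top, left, height, width):
    """Keep only the rectangle starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def to_grayscale(img):
    """Collapse each (R, G, B) pixel into one brightness value (3D -> 2D array)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in img
    ]


def downscale(img, factor):
    """Nearest-neighbour resize: keep every `factor`-th pixel in each direction."""
    return [row[::factor] for row in img[::factor]]


def normalize(gray):
    """Scale 0..255 brightness values to 0..1, a common model input range."""
    return [[value / 255.0 for value in row] for row in gray]


# Crop away the dark border, then prepare the patch for a model.
patch = crop(image, top=1, left=1, height=2, width=2)
model_input = normalize(to_grayscale(patch))
print(model_input)
```

With a real library the same steps would be single calls on an array, but the order of operations (clean up, resize, convert to an array of numbers) is the same idea.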

Challenges of computer vision

The visual world is complex. Our own vision is sophisticated enough to tell each person apart, but computer vision is not yet at that stage.

For example, suppose you’re interested in learning whether someone is paying attention to specific content. Say I want to make sure Person A is looking at me and paying attention; what’s the best possible way to do that?

I can take the image in a different orientation and then interpret it in a different way. It sounds simple, but in practice it’s complex because of a variety of factors, such as lighting conditions. In poor lighting, the cameras are blinded very quickly. Even small changes in light will drastically affect your model's output.

Modern applications of computer vision

One of the main modern applications of computer vision is object classification. At Comcast, we’re very interested in video content analysis, because we want to offer products to our customers. We run analytics or computer vision techniques on video frames to understand what our customers are interested in, and what we should offer them as an upsell.

Object identification can mean identifying a person by their face, or spotting guns in a crowd. If you want to differentiate two people, we go with landmarks, an approach I implemented and that is working fine.

Deep learning

Deep learning differs from traditional machine learning in that it helps when you have a bunch of sparse data and you don't know how to manage it. Nowadays, deep learning offers a black-box solution: you push your data in, and it extracts the features that are most important to that specific frame.

For example, if you have an image of a person with a dark background, then the dark background will be removed automatically and the face gets detected as a feature or gets extracted, and will be passed or pushed to the next layers of the network.

Deep learning is an end-to-end model where you build one model, push your image and your data, and then get an output. You can reuse the model many times.

If you have many layers and want to understand them, you have to visualize each layer and each neuron to see what happened on that neuron at a specific point in time, and how the filters affect your images.

Deep learning rests on a mathematically proven, derivative-based process and offers superior performance and accuracy. There’s a general method that’s easy to learn; you can take a course on it for three to four months and become a junior deep learning engineer. However, if you want to build a model the way that big companies like Apple, Facebook, and Google do, it takes a lifetime to get to that level.

Tools I use and recommend

The cameras I used for the proof of concept are the Logitech 1080p and the Logitech 4K. In the Comcast labs, we build a proof of concept and then turn it into production if other teams across the company are interested.

In the production phase in our facilities, I then replaced the 4K cameras with 8K ones to get more features.

Backstreet Ultra 4K is another product that’s very powerful at generating 4K frames. In terms of action detection, 4K cameras give you enough features to process very accurately.

I use OpenCV for image processing, and for the models I tried deep neural networks such as CNNs, YOLO, Faster R-CNN, LSTMs, and many more. I then picked the one that gave the best result.

The need for GPU Processors

I was using 4K images, and you get a lot of pixels in each frame, so you can't process this data as it is, for two reasons:

  • First, you can't make a model that processes that many pixels.
  • Second, you have to downsize the images, and even then you get three times as many values as pixels, because color images are 3D arrays (height × width × three color channels).
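A quick back-of-the-envelope calculation shows the scale involved. This is only an illustrative sketch: the 3840x2160 resolution is the standard 4K UHD size, while the 416x416 downsized resolution is an assumed example input size, not necessarily the one I used.

```python
CHANNELS = 3  # red, green, blue


def values_per_frame(width, height, channels=CHANNELS):
    """Number of array entries the model sees for one colour frame."""
    return width * height * channels


full_4k = values_per_frame(3840, 2160)   # 24,883,200 values
downsized = values_per_frame(416, 416)   # 519,168 values (assumed input size)

print(f"4K frame:        {full_4k:,} values")
print(f"downsized frame: {downsized:,} values")
print(f"reduction:       {full_4k / downsized:.1f}x")
```

Even after downsizing, every frame is still hundreds of thousands of numbers, and that is before multiplying by the frame rate and the number of cameras.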

A GPU processes data such as NumPy arrays as tensors, and it runs these computations quickly on the graphics chip. If you use a simple algorithm with four layers of neurons (hidden layers in the deep neural net), you're dealing with only a few features.

However, if you go to a YOLO-based architecture, you're dealing with millions of features, which require more processing power.

Tensors are multi-dimensional arrays, and this many features in a complicated architecture result in a heavier weight matrix to process, which is why we need the GPU.

The first architecture I implemented

This is the architecture that I implemented first. I had one server with a GPU for the computer vision models and the image processing, and one server to save data in a database. There was another server, a Synology, for the X-Axis cameras. So we’re dealing with three servers here: three machines to maintain, three operating systems to install, and everything to set up from scratch.

The cameras grab frames and push these frames to image processing. In my case, I had 15 cameras. You have to have a process to synchronize these frames; otherwise, you’re not going to extract data from the same scene at the same time, so synchronization was a big challenge.
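One common way to approach this kind of synchronization is to match frames across cameras by capture timestamp. The sketch below is an assumption about how that could look, not the process I actually ran: the function names, the use of millisecond timestamps, and the tolerance value are all hypothetical.

```python
from bisect import bisect_left


def nearest(timestamps, target):
    """Index of the timestamp closest to `target` in a sorted, non-empty list."""
    i = bisect_left(timestamps, target)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(timestamps)]
    return min(candidates, key=lambda j: abs(timestamps[j] - target))


def synchronize(streams, tolerance_ms):
    """Group frames that show the same moment of the scene.

    `streams` maps a camera id to its sorted capture timestamps (milliseconds).
    One camera acts as the reference clock; a group is kept only if every
    camera has a frame within `tolerance_ms` of the reference frame.
    """
    reference = min(streams)  # pick the reference camera deterministically
    groups = []
    for t in streams[reference]:
        group = {}
        for camera, stamps in streams.items():
            if not stamps:
                break  # this camera has no frames at all
            closest = stamps[nearest(stamps, t)]
            if abs(closest - t) > tolerance_ms:
                break  # this camera has no frame for that moment
            group[camera] = closest
        else:
            groups.append(group)  # reached only if no camera was missing
    return groups


streams = {
    "cam_a": [0, 100, 200],
    "cam_b": [10, 120, 350],  # dropped the frame around t=200
    "cam_c": [20, 90, 210],
}
for group in synchronize(streams, tolerance_ms=30):
    print(group)
```

In this example the moment around 200 ms is discarded because one camera has no matching frame. With 15 cameras, deciding what to do with such incomplete groups is part of what makes synchronization hard.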

Then you pass each frame to the computer vision models, save the data in the database, and aggregate it by time. For that process, I used an NVIDIA GPU, which you can also find on Amazon Web Services; nowadays, they offer NVIDIA products as a service.

The requirements for the first pipeline, which was on-premise, were: three servers to maintain; maintenance of the GPUs (by which I mean installing CUDA and the other software that NVIDIA offers); operating system setup and installation; and moving frames from the cameras to the server for processing and synchronization.
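As a rough sketch of the "save and aggregate by time" step, the example below stores one row per detection and counts detections per second. SQLite is used here only as a stand-in so the example runs anywhere; the table layout, column names, and labels are hypothetical and not the schema from my project.

```python
import sqlite3


def open_store():
    """Create an in-memory database with one row per detection."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE detections ("
        "camera_id TEXT, captured_at INTEGER, label TEXT)"
    )
    return conn


def save(conn, camera_id, captured_at, label):
    """Store what a model found in one frame (captured_at is Unix seconds)."""
    conn.execute(
        "INSERT INTO detections VALUES (?, ?, ?)",
        (camera_id, captured_at, label),
    )


def counts_per_second(conn):
    """Aggregate across all cameras: how often each label was seen per second."""
    return conn.execute(
        "SELECT captured_at, label, COUNT(*) FROM detections "
        "GROUP BY captured_at, label ORDER BY captured_at, label"
    ).fetchall()


conn = open_store()
save(conn, "cam_1", 1700000000, "person")
save(conn, "cam_2", 1700000000, "person")
save(conn, "cam_1", 1700000001, "dog")
for row in counts_per_second(conn):
    print(row)
```

Because only labels and timestamps are stored, the same table can feed reporting without keeping the frames themselves.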

We also had to manage the speed and the security of the frames, as you don't want the frames to be shared with anybody.

Getting around maintenance time

I attended the GTC conference, which is held by NVIDIA. There I ran into NVIDIA's Jetson chips (the Jetson Nano and Jetson TX2), which are very powerful and have a GPU built in.

I bought five of them and experimented with changing the architecture and pipeline. This is the approach I chose: grab the frames, push them to the Jetson, process them right there on the device, and then send only the final analytic result to the server.

No servers, no CentOS, no speed problem, no bandwidth problem: it was really convenient. If I could make this work, it would be a huge breakthrough.

The only problem left was synchronization, and I was able to get rid of the frame synchronization process entirely. When processing each frame on the device, I saved the result with its date and time, pushed it to the server, and then aggregated the results by time, so it was really easy.
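The idea can be sketched as follows: each edge device stamps its result with the capture time, and the server merges results that share a timestamp. This is a simplified illustration under stated assumptions: the message fields, the JSON format, and the one-second granularity are hypothetical, and it presumes the devices' clocks are kept in sync.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def make_result(camera_id, captured_at, labels):
    """Runs on the edge device: package what the model found in one frame."""
    return json.dumps({
        "camera_id": camera_id,
        # Drop sub-second precision so frames from the same second line up.
        "captured_at": captured_at.replace(microsecond=0).isoformat(),
        "labels": labels,
    })


def aggregate(messages):
    """Runs on the server: merge results from all devices by capture time."""
    by_time = defaultdict(dict)
    for raw in messages:
        result = json.loads(raw)
        by_time[result["captured_at"]][result["camera_id"]] = result["labels"]
    return dict(by_time)


def at(second, microsecond):
    """Helper for the example: a fixed UTC time on an arbitrary day."""
    return datetime(2024, 1, 1, 12, 0, second, microsecond, tzinfo=timezone.utc)


messages = [
    make_result("cam_1", at(0, 250000), ["person"]),
    make_result("cam_2", at(0, 800000), ["person", "dog"]),
    make_result("cam_1", at(1, 100000), ["dog"]),
]
for timestamp, per_camera in aggregate(messages).items():
    print(timestamp, per_camera)
```

Each message is a short string rather than a full frame, which is also why this design needs so little bandwidth and keeps the frames themselves on the device.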

GPU pioneers in the market

NVIDIA is the one I use. I got a Google TPU at a conference, but it was hard for me to get working, and by the time I had, it was too late, as I was already using the Jetson TX2.

The only requirement for on-edge compute is a commodity machine. You don't need a server, because your servers are replaced with the Jetson Nano. Instead, you need a commodity computer on which to install a database engine to save your data. On-edge computing chips are very cheap.

You still need cameras and minimal network bandwidth. It initially cost us a lot to have a fiber line from the cameras to our servers, but we don't need that anymore. With a few optimizations using TensorRT, I reduced the model size to be compatible with the Jetson Nano. The synchronization issue was gone, and the setup complies with security regulations as well.

Which one is better, on edge or on-premise? We don't know. It depends. If you have low bandwidth and strict security requirements, then on edge is better. If you have plenty of bandwidth and less strict security requirements, then I believe on-premise is better.