With modern advancements in artificial intelligence and computational power, computer vision has become an integral part of everyday life. Computers’ ability to ‘see’ and interpret the world around them helps in the analysis of the massive amounts of data created in daily operations.

Modern computer vision uses machine learning algorithms, more specifically neural networks, to draw insights from this data. These neural networks extract patterns from samples, in a way loosely modeled on how human brains work.

What is computer vision?

Computer vision is a field of computer science that aims to create digital systems that can process, analyze, and use visual data in the way that a human does. It derives important information from digital images or other visual inputs, and then takes actions or makes recommendations based on that information.

Artificial intelligence lets computers think, and computer vision lets them see by training machines to undertake specific tasks.

Read all about how Graphcore is helping to scale the training of computer vision models on the IPU:

Training Edge AI Solutions at Scale
Lakshmi Krishnan talks about how Graphcore helps scale the training of computer vision models on the IPU.

Or you can read more about machine learning in our guide below:

Your guide to machine learning
A branch of artificial intelligence and computer science, machine learning uses algorithms and data to copy how humans learn, in order to improve its accuracy.

How does computer vision work?

1. Image acquisition

Through real-time photos, video, or 3D technology, images are acquired in both small and large sets.

2. Image processing

Image processing is a mostly automated process due to deep learning models, which need to first be trained by being fed tens of thousands of pre-identified or labeled images.

3. Image understanding

Objects are classified or identified.

Learn more about computer vision with our on-demand talks by AI experts:

Computer Vision Festival 2021
Catch up on all the sessions from this year’s festival and learn from visionary AI experts from NASA, Apple, Nike, Intel, Graphcore, and more...

Even though computer vision can be summarized in three simple steps, image processing and understanding can be challenging. A single image is composed of many pixels, or picture elements, which are the smallest units into which an image can be divided.

Computers then process an image as an array of pixels, with each pixel having a set of values that represent the intensity of each of the primary colors: red, green, and blue. The RGB color model is often used in color images, with each pixel being a mix of these three colors.

As computers can only understand numbers, each pixel is then represented by three numbers, which correspond to the amount of red, green, and blue in each pixel. With grayscale images, each pixel is just one number, representing the intensity or amount of light it has. This scale is often represented from 0 (black) to 255 (white), with everything in between being various shades of gray.
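To make the pixel representation above concrete, here is a minimal sketch in plain Python. The tiny 2x2 image and the to_grayscale helper are hypothetical, made up for illustration, and the weighting used to turn a color pixel into a single intensity is one common convention (the ITU-R BT.601 luma coefficients), not something this guide prescribes.

```python
# A tiny 2x2 color image: each pixel is a (red, green, blue) triple in 0..255.
rgb_image = [
    [(255, 0, 0), (0, 255, 0)],      # a red pixel, a green pixel
    [(0, 0, 255), (255, 255, 255)],  # a blue pixel, a white pixel
]


def to_grayscale(image):
    """Convert an RGB image to one intensity per pixel (0 = black, 255 = white).

    The weights are the common ITU-R BT.601 luma coefficients. This is an
    assumption for illustration; other weightings exist.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


gray_image = to_grayscale(rgb_image)
print(gray_image)
```

Each color pixel holds three numbers, while each grayscale pixel holds just one.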

The array of pixels forms a digital image, which becomes a matrix. Complex applications will have operations such as downsampling via pooling and convolutions with learnable kernels, while simpler algorithms make use of linear algebra to manipulate a matrix.

Computers have to use algorithms that can recognize complex patterns in images by performing complex calculations on the matrices to extrapolate relationships between pixels.

Three operations from deep learning that are often used in computer vision are:

  • Convolution. An operation in which a learnable kernel is ‘convolved’ with an image, meaning that the kernel slides pixel by pixel across the image. At every pixel group, an element-wise multiplication is made between the image and the kernel, and the products are summed into one output value.
  • Pooling. The dimensions of an image are reduced by operations at the pixel level. A kernel slides across the image, and only one pixel from each corresponding pixel group is kept for further processing, which reduces the size of the image.
  • Non-linear activations. These introduce non-linearity into the neural network, which allows multiple convolution and pooling blocks to be stacked to increase model depth.
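The three operations above can be sketched in a few lines of plain Python. This is a simplified illustration rather than how a deep learning library implements them: the kernel values are fixed by hand (a real network would learn them), there is no padding, the pooling step keeps the maximum of each group, and ReLU stands in as one common choice of non-linear activation. As in most deep learning libraries, the kernel is not flipped before sliding. All function names are our own.

```python
def convolve2d(image, kernel):
    """Slide the kernel across the image (stride 1, no padding).

    At each position, multiply kernel and image values element-wise
    and sum the products into a single output value.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Non-linear activation: negative values become zero."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map, size=2):
    """Keep only the largest value of each non-overlapping size x size group."""
    return [
        [
            max(feature_map[i + di][j + dj]
                for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


image = [
    [3, 1, 0, 2, 1],
    [1, 5, 2, 0, 3],
    [0, 2, 4, 1, 0],
    [2, 0, 1, 6, 2],
    [1, 3, 0, 2, 5],
]
kernel = [[1, 0], [0, -1]]  # hand-picked values; a CNN would learn these

features = convolve2d(image, kernel)     # 5x5 image -> 4x4 feature map
pooled = max_pool2d(relu(features))      # 4x4 feature map -> 2x2
print(features)
print(pooled)
```

A convolutional network chains many such convolution, activation, and pooling blocks, so the image shrinks while the features become more abstract.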

A brief history of computer vision

1959: Computer vision experimentation starts, with neurophysiologists showing an array of images to a cat and trying to correlate them with responses in its brain. They found that the cat first responded to hard edges or lines, which suggested that image processing begins with simple shapes. The first computer image scanning technology was developed around the same time, which let computers digitize and acquire images.

1963: Computers are able to transform 2D images into 3D forms. The 1960s also see AI emerge as an academic field of study and begin to be used to tackle human vision problems.

Want to know more about AI? Read our guide below:

Your guide to artificial intelligence
Artificial intelligence (AI) helps to build smart machines that can perform a variety of tasks that would otherwise require human intelligence.

1974: Optical character recognition (OCR) technology is introduced, able to recognize printed text in any typeface or font. Intelligent character recognition (ICR) can decode hand-written text using neural networks. Both OCR and ICR have since been applied to mobile payments, vehicle plate recognition, and more.

1982: Neuroscientist David Marr determines that vision is hierarchical, and introduces machine algorithms to detect corners, edges, curves, and other basic shapes. Computer scientist Kunihiko Fukushima develops a network of cells that is able to recognize patterns, called Neocognitron, which has convolutional layers in a neural network.

2000s: In 2001, the first real-time face recognition applications begin to appear. Object recognition becomes the focus throughout the decade, alongside emerging standards for how visual data sets are tagged and annotated.

2010: The ImageNet data set becomes available, containing millions of tagged images across a thousand object classes. It provides the foundation for CNNs (Convolutional Neural Networks) and deep learning models.

2012: A team from the University of Toronto enters a CNN into an image recognition contest. The model, AlexNet, significantly reduces the error rate for image recognition. In the years that follow, error rates fall to just a few percent.

Computer vision applications

Improving quality control by automating visual inspections
Computer vision plays a big part in deploying automated visual inspection, making it possible to process the large amounts of data that this automation produces.

Object detection

Object detection uses bounding boxes to detect and locate objects, looking for class-specific details in a video or image and identifying an object whenever those details appear. The classes are whatever the detection model has been trained to classify, such as animals. Earlier object detection methods relied on classical machine learning approaches built on HOG features, Haar features, and SIFT.
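As a small illustration of working with bounding boxes, the sketch below computes intersection over union (IoU), a measure commonly used to compare a predicted box with a reference box. IoU is not mentioned in this guide, so treat it as our own addition. The (x_min, y_min, x_max, y_max) box format and the function name are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max).

    Returns 1.0 for identical boxes and 0.0 for boxes that do not overlap.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if there is none).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
reference = (5, 5, 15, 15)
print(iou(predicted, reference))
```

A detector's output is often counted as correct when its IoU with the labeled box passes a chosen threshold. The threshold itself is a design choice.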

Face recognition

A subset of object detection, face recognition treats the human face as the primary object to detect. As an application, it’s similar to object detection, but it also undertakes object recognition. These systems look for common features and landmarks, such as lips or eyes, and classify a face by using those features and the positioning of the landmarks.

Scene reconstruction

An extremely complex application, scene reconstruction relates to the 3D reconstruction of objects from photos. Algorithms usually reconstruct objects through the formation of point clouds on object surfaces and then reconstruct a mesh from the point cloud.

Video motion analysis

Video motion analysis studies moving animals or objects and their trajectories, combining tracking, object detection, pose estimation, and segmentation. It can be used in areas like manufacturing, medicine, sports, and more.

Image classification

Probably the most popular application in computer vision, image classification sorts a group of images into a set of predefined classes, using only a set of sample images that are already classified. It processes entire images and assigns specific labels to them.
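The idea of labeling a new image from already-classified samples can be sketched with a one-nearest-neighbor classifier. This is a deliberately simple stand-in for the neural networks used in practice, and the sample images, labels, and function names are all hypothetical.

```python
def flatten(image):
    """Turn a 2D grayscale image into a flat list of pixel values."""
    return [pixel for row in image for pixel in row]


def squared_distance(a, b):
    """Sum of squared differences between two equally long pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(image, labeled_samples):
    """Return the label of the sample image closest to the given image."""
    pixels = flatten(image)
    best_label, _ = min(
        ((label, squared_distance(pixels, flatten(sample)))
         for sample, label in labeled_samples),
        key=lambda pair: pair[1],
    )
    return best_label


# Already-classified sample images (2x2 grayscale) and their labels.
labeled_samples = [
    ([[0, 0], [0, 0]], "dark"),
    ([[255, 255], [255, 255]], "bright"),
]

prediction = classify([[200, 220], [240, 210]], labeled_samples)
print(prediction)
```

The whole image is compared at once and receives a single label, which is what separates classification from detection or segmentation.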

Read more about how the Volkswagen Group is using computer vision:

Scaling of industrial computer vision within Volkswagen Group
Jakob Engelmann, Product Owner Industrial Computer Vision at Volkswagen Group, outlines his strategy for effectively implementing ICV in his organization and describes the successes and challenges along the way.

Image restoration

Image restoration is the restoration or reconstruction of old or faded hard copies of images that have lost their quality. This process usually involves reducing additive noise with mathematical tools or, where further analysis is needed, image inpainting.

With image inpainting, generative models estimate what the damaged parts of an image should contain and fill them in. If the images are in black and white, a realistic colorization process usually follows.
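As an example of a simple mathematical tool for noise reduction, the sketch below applies a median filter to a grayscale image. The median filter is our own choice for illustration, not a method named in this guide, and it is best suited to isolated specks of noise. The function name and border handling are assumptions.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel with the median of its size x size neighborhood.

    Neighborhoods at the border are clipped to the image, so they
    contain fewer pixels than those in the interior.
    """
    height, width = len(image), len(image[0])
    radius = size // 2
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [
                image[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            row.append(median(window))
        result.append(row)
    return result


# A flat gray patch with one bright speck of noise in the middle.
noisy = [
    [10, 10, 10],
    [10, 255, 10],
    [10, 10, 10],
]
restored = median_filter(noisy)
print(restored)
```

The speck disappears because a single outlier never becomes the median of its neighborhood.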

Edge detection

Edge detection finds the boundaries of objects by using mathematical methods that detect sharp changes or discontinuities in image brightness. It is usually a pre-processing step in many applications, and is mainly done through convolutions with specially designed edge detection filters or through traditional image processing algorithms such as the Canny edge detector.

Edges in images provide valuable information about the images’ contents, which is why deep learning methods perform edge detection internally, capturing low-level features through learnable kernels.
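Here is a minimal sketch of edge detection through convolution with a specially designed filter. It uses the Sobel kernels, a well-known pair of edge detection filters that we picked for illustration, and approximates edge strength as the sum of the absolute horizontal and vertical responses. The helper names are our own.

```python
# Sobel kernels: SOBEL_X responds to left-right brightness changes,
# SOBEL_Y to top-bottom brightness changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and sum products."""
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(len(image[0]) - kw + 1)
        ]
        for i in range(len(image) - kh + 1)
    ]


def edge_strength(image):
    """Approximate gradient magnitude as |gx| + |gy| at each position."""
    gx = filter2d(image, SOBEL_X)
    gy = filter2d(image, SOBEL_Y)
    return [
        [abs(x) + abs(y) for x, y in zip(row_x, row_y)]
        for row_x, row_y in zip(gx, gy)
    ]


# Dark on the left, bright on the right: one vertical edge in the middle.
image = [
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
]
edges = edge_strength(image)
print(edges)
```

The response is large only where the brightness jumps and zero in the flat regions on either side.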

Want to learn about edge computing? Read more in our guide below:

Your guide to edge computing
With the unprecedented volume of data and devices connected to the internet, cloud and AI services that automate and speed up innovation through insights are no longer enough.

Image segmentation

Image segmentation relates to the division of an image into sub-objects or subparts to show that the machine can distinguish an object from either the background or another object in the image. An image ‘segment’ represents a certain class of object that has been identified in an image by the neural network and is then represented by a pixel mask that can be utilized to extract it.
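To show what a pixel mask looks like and how it can be used to extract a segment, here is a minimal sketch. A plain brightness threshold stands in for the neural network that would normally produce the mask. That substitution, the threshold value, and the function names are all assumptions for illustration.

```python
def threshold_mask(image, threshold):
    """Build a pixel mask: 1 where the pixel belongs to the segment, else 0.

    A simple brightness threshold is a stand-in here for a trained model.
    """
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


def apply_mask(image, mask, background=0):
    """Keep the pixels selected by the mask and blank out everything else."""
    return [
        [pixel if keep else background for pixel, keep in zip(row, mask_row)]
        for row, mask_row in zip(image, mask)
    ]


image = [
    [10, 200, 30],
    [220, 240, 20],
    [15, 25, 35],
]
mask = threshold_mask(image, threshold=128)
segment = apply_mask(image, mask)
print(mask)
print(segment)
```

The mask has the same shape as the image, so it marks pixel by pixel which parts belong to the object.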

Both modern deep learning architectures (such as FPN and SegNet) and traditional image processing algorithms have been used to study image segmentation.

Want to know more about deep learning? Read our guide below:

Your guide to deep learning
Deep learning teaches computers to do what humans can do - learning by example. It’s the driving factor behind things like self-driving cars, allowing them to distinguish between pedestrians and other objects on the road.

Feature matching

Features are regions in images that provide the most information about specific objects in the images. Edges and corners can be strong indicators of object details, making them vital features. Feature matching correlates the features in a region of one image with those in similar regions of another image. It is typically used for camera calibration and object identification, and tends to be performed in the following order:

  • Feature detection. Regions of interest are detected by image processing algorithms like SIFT.
  • Formation of local descriptors. Once features are detected, the regions that surround keypoints are captured and local descriptors are obtained. These are the representations of a point’s local neighborhood.
  • Feature matching. Local descriptors and features are matched in corresponding images.
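The final matching step can be sketched as a nearest-neighbor search over descriptors. The example assumes the descriptors have already been computed by a detector such as SIFT and uses short made-up vectors in their place. The function names and the optional distance cutoff are hypothetical.

```python
def squared_distance(a, b):
    """Sum of squared differences between two descriptor vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_descriptors(descriptors_a, descriptors_b, max_distance=float("inf")):
    """Pair each descriptor in image A with its nearest descriptor in image B.

    Returns (index_in_a, index_in_b) pairs. A pair is dropped when its
    squared distance exceeds max_distance.
    """
    matches = []
    for i, desc_a in enumerate(descriptors_a):
        j, distance = min(
            ((j, squared_distance(desc_a, desc_b))
             for j, desc_b in enumerate(descriptors_b)),
            key=lambda pair: pair[1],
        )
        if distance <= max_distance:
            matches.append((i, j))
    return matches


# Made-up local descriptors for keypoints found in two images.
descriptors_a = [[10, 0, 0], [0, 10, 0]]
descriptors_b = [[0, 9, 1], [9, 1, 0], [5, 5, 5]]

matches = match_descriptors(descriptors_a, descriptors_b)
print(matches)
```

Real pipelines usually add extra checks to reject ambiguous matches, but the core idea is the same: similar descriptors point to the same physical feature.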


Image segmentation can be effective during the analysis of medical scans, by detecting disease and rating its severity. With image information being around 90% of all medical data, computer vision is an important process in diagnosis.

Watch on-demand our healthcare interview series with experts:

Transforming Healthcare with AI: Interview series
Our interview series is here to deliver you digestible intelligence from the organizations and innovators leading the world of AI in healthcare - through expert and in-depth interviews.

Self-driving cars

Smart vehicles use their cameras to capture video from several angles, sending it as an input signal to computer vision software. The video is then processed in real time to detect objects such as traffic lights and pedestrians.

Augmented reality

Computer vision helps augmented reality apps detect physical objects and surfaces in real time, so that virtual objects can be placed in the physical environment.

Content organization

Apple Photos, for example, automatically tags photos and lets users browse structured photograph collections, alongside creating curated views of users’ best moments.
