We’ll examine which computer vision techniques surveyed companies considered the most exciting, as highlighted in our Computer Vision Landscape 2022 report.

Object detection and tracking was the computer vision technique respondents were most excited about, with 52.8% of the votes. Instance segmentation and semantic segmentation came second at 30.6%, and image classification came third at 16.6%.

Now, let’s take a closer look at the computer vision techniques that were seen as the most exciting by this year’s respondents:

  • Object detection and tracking
  • Image classification
  • Instance segmentation

1. Object detection and tracking

Object detection is needed to start the tracking process and is continually applied in every frame. One popular approach extracts temporal information from a sequence of images by learning a model of the static background scene and comparing it with the current scene.

Change detection

Changes in pixel states are identified by examining discrepancies between appearance values across sets of video frames. Common techniques are:

  1. Frame differencing. Measures the intensity dissimilarity between two frames, on the assumption that a change in a pixel’s intensity indicates that something has changed in the image.
  2. Background subtraction. A scene representation, or background model, is built, and every incoming frame is checked for deviations from it. Any significant deviation from the model is assumed to be a moving object.
  3. Motion segmentation. Assigns groups of pixels to one of several classes, based on the direction and speed of their movements.
  4. Matrix decomposition. The entire image is vectorized and used in background modeling. The background is represented by the most descriptive eigenvectors, and foreground objects are detected by projecting the current image onto the eigenspace and then looking for differences between the reconstructed and actual images.
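
To make the first two techniques more concrete, here’s a minimal sketch in plain Python. It assumes grayscale frames stored as nested lists, and the `threshold` and `alpha` values are illustrative choices rather than recommended settings. The background model here is a simple running average, which is only one of many ways to build one.

```python
def frame_difference(reference, frame, threshold=25):
    """Mark pixels whose intensity differs from the reference by more than threshold."""
    return [
        [1 if abs(f - r) > threshold else 0 for r, f in zip(ref_row, frame_row)]
        for ref_row, frame_row in zip(reference, frame)
    ]


def update_background(background, frame, alpha=0.1):
    """Blend the new frame into a running-average background model."""
    return [
        [(1 - alpha) * b + alpha * f for b, f in zip(bg_row, frame_row)]
        for bg_row, frame_row in zip(background, frame)
    ]


# Two tiny 3x3 grayscale frames: a bright pixel appears in the second one.
frame_a = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame_b = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]

# Frame differencing compares consecutive frames directly.
mask = frame_difference(frame_a, frame_b)

# Background subtraction compares each frame with the model, then updates it.
background = frame_a
foreground = frame_difference(background, frame_b)
background = update_background(background, frame_b)
```

With a static camera both masks flag the same pixel. The two approaches diverge once the background model has absorbed several frames of history.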

Object modeling

For object detection and tracking, you need an internal representation of a suitable object to act as a prototype, so that detected objects can be matched to descriptors in image features.

1. Model representations

Model representations are chosen according to the application domain. The model you choose to represent the tracked object limits the type of motion and the deformation that object can undergo.

  • Point and region. Objects can be represented by a set of points or a predefined shape around their centroid.
  • Silhouette. The region inside the object’s contouring boundary, with the most common representation being a binary indicator function that marks the object region by ones, and the non-object regions by zeros. Contour-based methods represent the silhouette implicitly (on a grid) or explicitly (by a set of control points).
  • Connected parts. Articulated objects are made of parts held together by joints, with the relationship between the parts being governed by kinematic motion models.
  • Graph and skeletal. Skeletal models are used to animate humans and characters in graphics. The object skeleton is extracted by applying a medial axis transform.
  • Spatiotemporal. Although they carry no explicit motion indicators, these representations are defined in spatiotemporal space, which means they inherently portray motion information.

2. Model descriptors

Model descriptors are mathematical embodiments of object regions. The region size, imaging noise, dynamic range, and artifacts all play an important role in attaining discriminative descriptors.

  • Template. The most intuitive and commonly adopted descriptor, often formed from silhouettes or geometric shapes. Can be 2D (spatial) or 3D (spatiotemporal).
  • Histogram, SIFT, and HOG. Distribution-based descriptors that estimate a probability distribution from observations within a spatial or spatiotemporal region defined by a silhouette, a template, or a volume. HOG (Histogram of Oriented Gradients) and SIFT (Scale Invariant Feature Transform) are two closely related approaches.
  • Region covariance. A region covariance matrix provides a natural way to fuse multiple features, with diagonal entries representing the variance of each feature and nondiagonal entries representing the correlations between features.
  • Ensembles and eigenspaces. Ensemble descriptors are a combination of partial or weak descriptors, working by constantly updating weak classifier collections to separate objects from backgrounds. Eigenspaces are compact view-based object representations learned from a set of input images.
  • Appearance models. Generated by modeling the object’s shape and appearance at the same time, with the object’s shape defined by a set of landmarks. Landmarks can live on the object’s boundary or inside the object region.
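
As an illustration of the region covariance descriptor, the sketch below computes the covariance matrix of per-pixel feature vectors in plain Python. The feature choice (x, y, intensity) and the four sample pixels are hypothetical; real descriptors typically include gradients or color channels as well.

```python
def region_covariance(features):
    """Covariance matrix of per-pixel feature vectors (one row per pixel)."""
    n = len(features)
    d = len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in features:
        for i in range(d):
            for j in range(d):
                cov[i][j] += (row[i] - means[i]) * (row[j] - means[j])
    # Unbiased estimate: divide by n - 1.
    return [[value / (n - 1) for value in cov_row] for cov_row in cov]


# Hypothetical features for four pixels in a region: (x, y, intensity).
pixels = [(0, 0, 10), (1, 0, 20), (0, 1, 30), (1, 1, 40)]
C = region_covariance(pixels)

# C[i][i] is the variance of feature i; C[i][j] relates feature i to feature j.
```

The descriptor’s size depends only on the number of features, not on the size of the region, which is part of what makes it convenient for comparing regions of different shapes.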

3. Model features

The features used for tracking can affect performance. The features that best distinguish between several objects, and between objects and backgrounds, are also best at tracking objects.

  • Gradient. Object boundaries create strong image intensity changes, and edge gradients identify them.
  • Color. An object’s apparent color is influenced by its surface reflectance properties and the spectral power distribution of the illuminant. The RGB (red, green, blue) color space is commonly used in image acquisition, but unlike L*u*v* and L*a*b* (LAB), it isn’t perceptually uniform. HSV (Hue, Saturation, Value) is an approximately uniform color space.
  • Optical flow. A dense field of displacement vectors defines the translation of each pixel in a region.
  • Texture. The measure of intensity variation of a surface, which quantifies properties like regularity and smoothness. It needs a processing step to generate descriptors.
  • Corner points. One of the first and most commonly used features, as it has low computational complexity and is easy to implement.
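
To show what a gradient feature looks like in practice, here’s a small sketch that estimates gradient magnitude with central differences. The function name and the 4x4 test image are illustrative, and border pixels are simply left at zero to keep the code short.

```python
def gradient_magnitude(image):
    """Central-difference gradient magnitude; border pixels are left at 0."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2
            gy = (image[r + 1][c] - image[r - 1][c]) / 2
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out


# A vertical edge: dark on the left, bright on the right.
image = [[0, 0, 100, 100] for _ in range(4)]
grad = gradient_magnitude(image)
```

The interior pixels next to the edge get a large magnitude, which is the signal an edge-based tracker would pick up on.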

Object tracking occurs after an object is detected. It’s an important component of various computer vision applications, as it can provide complete regions in images that contain objects.

1. Common tracking techniques

  • Template matching. Template, or blob, matching is the most common approach. This brute-force method searches an image for the region that’s most similar to the object template defined in the previous frame.
  • Density estimation: mean-shift. This is a nonparametric density gradient estimator used to find image windows that are most like the object’s color histogram in the current frame.
  • Regression. Models the relationship between multiple variables. In tracking, this relationship can be used to estimate an object’s motion from its observed appearance.

Other common techniques include motion estimation, Kalman filtering, particle filtering, multiple hypothesis tracking, and silhouette tracking.
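
Template matching is simple enough to sketch in a few lines. The version below is a brute-force search that scores each window by the sum of squared differences (SSD). SSD is one possible similarity measure chosen for illustration, and the image and template values are made up.

```python
def match_template(image, template):
    """Return the (row, col) of the best-matching window and its SSD score."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = None, None
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            score = sum(
                (image[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if best_score is None or score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score


image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [[9, 8], [7, 9]]  # object appearance taken from the previous frame
position, score = match_template(image, template)
```

Because every window is tested, the cost grows with both the image and template size, which is why practical trackers usually restrict the search to a neighborhood around the object’s last known position.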

2. Common object tracking methods


ByteTrack

ByteTrack is an effective association method that uses all detection boxes, from high scores to low scores, in the matching process. It’s built on the premise that similarity with tracklets offers a strong cue for telling objects apart from the background in low-score detection boxes.

Using a Kalman filter, ByteTrack predicts the location of each tracklet in the new frame. The motion similarity is then computed as the IoU (intersection over union) of the predicted box and the detection box. After a first matching pass against the high-score detection boxes, it performs a second matching between the unmatched tracklets and the low-score detection boxes.
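
The sketch below illustrates the idea of matching high-score boxes first and low-score boxes second. It’s a simplification under several assumptions: the predicted tracklet boxes are given rather than produced by a Kalman filter, matching is greedy rather than an optimal assignment, and the `high` and `min_iou` thresholds are hypothetical values.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair track ids with detection indices by descending IoU."""
    pairs = sorted(
        ((iou(box, det["box"]), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, di in pairs:
        if score < min_iou:
            break
        if tid in matches or di in used:
            continue
        matches[tid] = di
        used.add(di)
    return matches


def associate(tracks, detections, high=0.6, min_iou=0.3):
    """Match high-score detections first, then low-score ones to leftover tracks."""
    high_dets = [d for d in detections if d["score"] >= high]
    low_dets = [d for d in detections if d["score"] < high]
    first = greedy_match(tracks, high_dets, min_iou)
    remaining = {t: b for t, b in tracks.items() if t not in first}
    second = greedy_match(remaining, low_dets, min_iou)
    result = {t: high_dets[i] for t, i in first.items()}
    result.update({t: low_dets[i] for t, i in second.items()})
    return result


# Predicted tracklet boxes for the new frame (integer track ids).
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    {"box": (1, 1, 11, 11), "score": 0.9},    # clearly visible object
    {"box": (21, 21, 31, 31), "score": 0.2},  # e.g. a partly occluded object
    {"box": (50, 50, 60, 60), "score": 0.1},  # background noise
]
assigned = associate(tracks, detections)
```

In this example the low-score box that overlaps tracklet 2 is kept, while the low-score box with no matching tracklet is left unassigned.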

Simple Online And Realtime Tracking (SORT)

SORT is a lean implementation of a tracking-by-detection framework that ignores appearance features beyond the detection component. It uses the size and position of bounding boxes for both data association and motion estimation across frames.

Faster R-CNN is used as the object detector, and the displacement of each object between consecutive frames is estimated with a linear constant-velocity model that is independent of camera motion and other objects.
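
Here’s a rough sketch of what a linear constant-velocity prediction looks like. It covers only the motion model, not a full Kalman filter with uncertainty estimates, and the state layout (center, area, aspect ratio, and their velocities) is an assumption made for this example.

```python
def predict(state, dt=1.0):
    """Advance a (cx, cy, area, aspect, vx, vy, v_area) state by dt frames."""
    cx, cy, area, aspect, vx, vy, v_area = state
    # The aspect ratio is assumed constant, so it has no velocity term.
    return (cx + vx * dt, cy + vy * dt, area + v_area * dt, aspect, vx, vy, v_area)


def to_box(state):
    """Convert a state to an (x1, y1, x2, y2) bounding box."""
    cx, cy, area, aspect = state[:4]
    width = (area * aspect) ** 0.5  # aspect = width / height
    height = area / width
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


# A 20x20 box centered at (50, 50), moving 2 px right and 1 px up per frame.
state = (50.0, 50.0, 400.0, 1.0, 2.0, -1.0, 0.0)
predicted = predict(state)
box = to_box(predicted)
```

The predicted box is what gets compared with the new frame’s detections during data association.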


DeepSORT

DeepSORT was built to overcome SORT’s limitations, replacing the association metric with a more informed one that combines motion and appearance information. A “deep appearance” descriptor is added, aiming to obtain a vector that can represent a given image.

This method trains a classifier and then strips its final classification layer, which leaves a dense layer that produces a single feature vector.
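
Once each detection is reduced to a feature vector, appearance matching comes down to comparing vectors. The sketch below uses cosine distance and a nearest-neighbor lookup over stored vectors; the three-dimensional embeddings and the `gallery` structure are hypothetical, and real appearance vectors are much longer.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_track(gallery, embedding):
    """Return the track id whose stored embeddings are closest to the query."""
    return min(
        gallery,
        key=lambda tid: min(cosine_distance(e, embedding) for e in gallery[tid]),
    )


# Hypothetical appearance vectors stored for two existing tracks.
gallery = {
    "a": [(1.0, 0.0, 0.0)],
    "b": [(0.0, 1.0, 0.0), (0.0, 0.9, 0.1)],
}
query = (0.1, 1.0, 0.0)  # embedding of a new detection
best = nearest_track(gallery, query)
```

In a full tracker this appearance distance would be combined with the motion-based distance before the final assignment is made.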


TransMOT

A new spatial-temporal graph transformer that solves the following issues of transformer-based trackers:

  • Videos can contain a large number of objects. Modeling their spatial-temporal relationships with a general transformer isn’t efficient as it doesn’t account for the objects’ spatial-temporal structure.
  • A transformer needs a lot of data and computational resources to model long-term temporal dependencies.
  • The DETR-based object detector that’s used in these applications isn’t state-of-the-art.

This is where TransMOT comes in, arranging trajectories of tracked objects as a series of sparse weighted graphs built by using the targets’ spatial relationships. It then uses the graphs to create a spatial graph transformer encoder layer, a spatial transformer decoder layer, and a temporal transformer encoder layer to model the objects’ spatial-temporal relationships.

Due to the sparsity of the weighted graph representations, TransMOT is more computationally efficient during both training and inference.


FairMOT

A new tracking approach that’s built on top of CenterNet, the anchor-free object detection architecture. The detection and re-ID tasks are treated equally with this method, which diverges from the previous “detection first, re-ID second” frameworks.

With a simple network structure of two homogeneous branches for detecting objects and extracting re-ID features, it adopts ResNet-34 as its backbone for a balance between speed and accuracy.

2. Image classification

Image classification tries to understand an image as a whole, with the goal of classifying the image by assigning specific labels to it.

Image classification structure

  1. Image pre-processing. Improves image data (features) by enhancing vital image features and suppressing unwanted distortions. Steps include reading images, resizing images, and data augmentation.
  2. Object detection. Localizes objects by segmenting an image and identifying the position of the object of interest.
  3. Feature extraction and training. Deep learning or statistical methods help to identify the most interesting patterns in images, which will lead to the differentiation between separate classes.
  4. Object classification. Categorizes detected objects into predefined classes through suitable classification techniques that compare image patterns with target patterns.

Supervised classification

Uses the spectral signatures that are obtained from training samples in order to classify an image. The three basic steps for supervised classification are:

  1. Selecting training areas.
  2. Generating signature files.
  3. Classifying.

Unsupervised classification

Finds spectral classes, or clusters, in a multiband image without needing human intervention. This is the most basic technique, as it doesn’t need samples and is an easy way of segmenting and understanding images. The two basic steps for unsupervised classification are:

  1. Generating clusters.
  2. Assigning classes.
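
The two steps can be sketched with a tiny k-means clustering routine. For brevity it clusters single-band intensity values rather than multiband pixels, and the starting centers and class names are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster scalar values around the given initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    labels = [
        min(range(len(centers)), key=lambda i: abs(v - centers[i])) for v in values
    ]
    return centers, labels


# Step 1: generate clusters from pixel intensities.
pixels = [12, 15, 11, 200, 210, 205]
centers, labels = kmeans_1d(pixels, centers=[0.0, 255.0])

# Step 2: assign a class to each cluster (names chosen by an analyst).
class_names = {0: "dark surface", 1: "bright surface"}
classified = [class_names[label] for label in labels]
```

The algorithm finds the groups without any training samples, but a person still has to decide what each cluster means.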

Convolutional Neural Network (CNN)

Uses a few of the features of the visual cortex to achieve results in computer vision tasks. Composed of convolutional layers and pooling layers, CNNs are multi-layer neural networks that recognize visual patterns straight from pixel images with minimal pre-processing.

Artificial neural networks

Based on biological neural networks, these are statistical learning algorithms for various tasks like simple classifications or computer vision. They’re implemented as a system of interconnected processing elements, or nodes, that are functionally analogous to biological neurons.

Support Vector Machine (SVM)

Support vector machines are extremely popular due to their ability to handle various continuous and categorical variables. These are powerful and flexible supervised machine learning algorithms for both classification and regression.

K-nearest Neighbor

This non-parametric method is used for classification and regression, in which the input consists of the k closest training examples in the feature space. K-nearest Neighbor is one of the simplest machine learning algorithms.
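
A minimal classification sketch shows how little machinery the method needs. It assumes squared Euclidean distance and a simple majority vote; the two-dimensional feature vectors and labels are made up for illustration.

```python
from collections import Counter


def knn_predict(samples, labels, query, k=3):
    """Label the query by majority vote among its k closest training samples."""
    order = sorted(
        range(len(samples)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(samples[i], query)),
    )
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2D feature vectors extracted from labeled images.
samples = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

near_cats = knn_predict(samples, labels, (2, 2))
near_dogs = knn_predict(samples, labels, (7, 7))
```

There’s no training step at all: the “model” is just the stored samples, which is why prediction gets slower as the data set grows.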

Naïve Bayes algorithm

These are a collection of classification algorithms based on Bayes’ theorem. All algorithms in the family share a common principle: each feature is assumed to be independent of the others, given the class. This makes it a simple technique for constructing classifiers.

Random Forest algorithm

This is a supervised learning algorithm used for classification and regression. It builds decision trees on samples of the data set, gets a prediction from each tree, and then chooses the final answer by voting.

3. Instance segmentation and semantic segmentation

Semantic segmentation classifies every pixel in a given image into a class. This differs from instance segmentation, which is the task of detecting and delineating every distinct object of interest in a given image.

Instance segmentation deals with detecting instances of objects and demarcating their boundaries. It requires detecting multiple instances of different objects in a single image, alongside a per-pixel segmentation mask for each. Methods can be both R-CNN and FCN (Fully Convolutional Network) driven.
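
The difference between the two outputs is easiest to see on a toy mask. The sketch below turns a binary semantic mask into an instance map by labeling connected regions. This is only an illustration of the output formats, not how instance segmentation models work, and it would merge objects of the same class that touch each other.

```python
def label_instances(mask):
    """Give each 4-connected foreground region in a binary mask its own id."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and labels[r][c] == 0:
                next_id += 1
                labels[r][c] = next_id
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = next_id
                            stack.append((ny, nx))
    return labels, next_id


# Semantic output: every "object" pixel has the same class value (1).
semantic = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
# Instance output: each separate object gets its own id.
instances, count = label_instances(semantic)
```

The semantic mask says which pixels belong to the class, while the instance map also says which object each pixel belongs to.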


1. U-Net

This is a convolutional neural network originally created for segmenting biomedical images. When it’s visualized, it looks like the letter U – hence its name. The architecture is made of two parts:

  • Left part: the contracting path, which captures context. Made of repeated blocks of two three-by-three convolutions, each followed by a rectified linear unit, and then a two-by-two max-pooling operation for downsampling.
  • Right part: the expansive path, which helps in precise localization.
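
A quick bit of arithmetic shows how the contracting path shrinks the feature maps. The sketch assumes unpadded convolutions, a pooling stride of two, four downsampling steps, and a 572-pixel input, all of which are assumptions for this example rather than requirements of the architecture.

```python
def contracting_step(size):
    """Spatial size after two unpadded 3x3 convolutions and one 2x2 max pool."""
    after_convs = size - 2 - 2  # each unpadded 3x3 convolution trims 2 pixels
    return after_convs // 2     # 2x2 max pooling with stride 2 halves the size


sizes = [572]
for _ in range(4):
    sizes.append(contracting_step(sizes[-1]))
```

Each step trades spatial resolution for context, and the expansive path then has to recover that resolution to produce a per-pixel output.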

2. Fast Fully Convolutional Network (FastFCN)

A Joint Pyramid Upsampling, or JPU, module replaces dilated convolutions, as these consume a lot of time and memory. JPU upsamples low-resolution feature maps to high-resolution ones.

3. Gated-SCNN

Composed of a two-stream CNN architecture in which a separate branch, the shape stream, processes image shape information. The shape stream then helps to process boundary information.

4. DeepLab

Convolutions with upsampled filters, known as atrous convolutions, are used in tasks with dense prediction. Filters are upsampled through the insertion of zeros, or equivalently the input feature maps are sparsely sampled, to achieve atrous convolution. Object segmentation is completed at multiple scales through atrous spatial pyramid pooling, and object boundary localization is improved by combining DCNNs (deep convolutional neural networks) with probabilistic graphical models.
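
The two descriptions of atrous convolution, inserting zeros into the filter or sparsely sampling the input, give the same result. Here’s a one-dimensional sketch that checks this; the signal, the filter weights, and the rate of 2 are arbitrary, and the function names are made up for this example.

```python
def dilate_filter(weights, rate):
    """Insert rate - 1 zeros between consecutive filter taps."""
    out = []
    for i, w in enumerate(weights):
        out.append(w)
        if i < len(weights) - 1:
            out.extend([0] * (rate - 1))
    return out


def convolve_valid(signal, weights):
    """Sliding dot product without padding (as used in CNN layers)."""
    n = len(weights)
    return [
        sum(signal[i + j] * weights[j] for j in range(n))
        for i in range(len(signal) - n + 1)
    ]


def atrous_convolve(signal, weights, rate):
    """Sample the input with gaps of size rate instead of upsampling the filter."""
    span = (len(weights) - 1) * rate + 1
    return [
        sum(signal[i + j * rate] * weights[j] for j in range(len(weights)))
        for i in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
weights = [1, 0, -1]

upsampled = dilate_filter(weights, rate=2)
via_filter = convolve_valid(signal, upsampled)
via_sampling = atrous_convolve(signal, weights, rate=2)
```

The sparse-sampling version never multiplies by the inserted zeros, so it widens the receptive field without adding parameters or extra multiplications.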

5. Mask R-CNN

Objects are classified and localized through a bounding box and semantic segmentation, which classifies each individual pixel into a set of categories. Every region of interest gets a segmentation mask, and a class label and bounding box are produced as the final output. This architecture is an extension of Faster R-CNN, which is composed of a deep convolutional network that proposes regions and a detector that uses those regions.