Computers can process information very quickly, but detecting specific objects in images and videos is difficult for them. That’s because a computer sees an image only as an array of numeric pixel values, not as a scene made up of objects. This is where object detection comes in.

Object detection is a computer vision technique that locates objects in an image by drawing bounding boxes around them and then classifies each one. It’s a combination of both object localization and image classification.

In this article, we’ll have a look at object detection models and applications.

Object detection models


R-CNN

R-CNN, or Region-Based Convolutional Neural Network, was probably the first large and successful application of convolutional neural networks to object detection, localization, and segmentation.

Demonstrated on benchmark datasets, it provided state-of-the-art results on both the 200-class ILSVRC-2013 object detection dataset and the VOC-2012 dataset. This proposed model contains three modules:

  • Module 1: Region Proposal. This generates and extracts category-independent region proposals, like candidate bounding boxes.
  • Module 2: Feature Extractor. This module extracts features from every candidate region by using, for example, a deep convolutional neural network.
  • Module 3: Classifier. This module classifies the features into one of the known classes, for example by using a linear SVM classifier model.
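The three modules can be sketched as a simple pipeline. This is a toy illustration under stated assumptions, not the original implementation: `propose_regions`, `extract_features`, and `classify` are hypothetical stand-ins (a sliding window, a mean-intensity feature, and a fixed threshold) for the region proposal method, the deep CNN, and the linear SVM.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), end coordinates exclusive
Image = List[List[float]]        # grayscale image as rows of pixel intensities


def propose_regions(image: Image, size: int = 2) -> List[Box]:
    """Module 1 (stand-in): category-independent candidate boxes from a sliding window."""
    h, w = len(image), len(image[0])
    return [(x, y, x + size, y + size)
            for y in range(0, h - size + 1, size)
            for x in range(0, w - size + 1, size)]


def extract_features(image: Image, box: Box) -> List[float]:
    """Module 2 (stand-in): one feature per region; a real system would run a deep CNN."""
    x1, y1, x2, y2 = box
    pixels = [image[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return [sum(pixels) / len(pixels)]


def classify(features: List[float], threshold: float = 0.5) -> str:
    """Module 3 (stand-in): a fixed threshold in place of a trained linear SVM."""
    return "object" if features[0] > threshold else "background"


def detect(image: Image) -> List[Tuple[Box, str]]:
    """Run the three modules in sequence and keep the non-background regions."""
    results = []
    for box in propose_regions(image):
        label = classify(extract_features(image, box))
        if label != "background":
            results.append((box, label))
    return results


image = [
    [0.0, 0.0, 0.9, 0.8],
    [0.0, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
]
detections = detect(image)
print(detections)  # [((2, 0, 4, 2), 'object')]
```

Note that every candidate region goes through feature extraction separately, which is the main reason this kind of pipeline is slow.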

Fast R-CNN

Fast R-CNN was proposed as an extension to address speed issues with R-CNN. The original R-CNN model has a few limitations:

  • Slow object detection, as it makes predictions using a deep CNN on a lot of region proposals.
  • Multi-stage pipeline training requires preparation and operation of three separate models.
  • Expensive training in both space and time, as deep CNN training on a lot of region proposals per image is extremely slow.

Fast R-CNN is presented as a single model in place of a pipeline, learning and outputting both regions and classifications directly. The model’s architecture takes a photo and a set of region proposals as input and passes them through a deep CNN. Feature extraction is performed by a pre-trained CNN, like VGG-16.

Region of Interest Pooling Layer, or ROI Pooling, is a custom layer at the end of the deep CNN that extracts features for a given input candidate region. The output is interpreted by a fully connected layer, which is separated into two outputs: a linear one for the bounding box, and one for the class prediction through a softmax layer.
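To make the idea of ROI pooling concrete, here is a minimal sketch in plain Python. It assumes a single-channel feature map, max pooling, and a 2x2 output grid; the function names and the example class scores are hypothetical, and real implementations work on multi-channel tensors.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[float]]


def roi_max_pool(feature_map: FeatureMap, roi: Tuple[int, int, int, int],
                 out_size: int = 2) -> List[List[float]]:
    """Max-pool the region (x1, y1, x2, y2), end-exclusive, into an out_size x out_size grid."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        # Split the region into roughly equal bins along each axis.
        ys = y1 + (i * h) // out_size
        ye = max(ys + 1, y1 + ((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * w) // out_size
            xe = max(xs + 1, x1 + ((j + 1) * w) // out_size)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


def softmax(logits: List[float]) -> List[float]:
    """Turn raw class scores into probabilities, as the classification output does."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
full = roi_max_pool(feature_map, (0, 0, 4, 4))   # 4x4 region
small = roi_max_pool(feature_map, (1, 1, 4, 4))  # 3x3 region
class_probs = softmax([2.0, 1.0, 0.1])           # hypothetical class scores
```

Both regions come out as a 2x2 grid even though they differ in size, which is what lets a fully connected layer with a fixed input size interpret candidate regions of any shape.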

Faster R-CNN

To further improve training and detection speed, the Faster R-CNN architecture adds a network that proposes and refines region proposals, known as a Region Proposal Network. These regions are then used with a Fast R-CNN model in a single design, which reduces the number of region proposals and speeds up the test-time operation of the model to almost real-time.

This single unified model has two modules, which both work on the same output of a deep CNN. The modules are:

  • Module 1: Region Proposal Network. A CNN that proposes regions and the type of object to consider in each region. It acts as an attention mechanism for the Fast R-CNN network.
  • Module 2: Fast R-CNN. A CNN that extracts features from the proposed regions and outputs bounding boxes and class labels.
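In the Faster R-CNN design, the Region Proposal Network scores and refines a fixed set of reference boxes, called anchors, placed at every location of the shared feature map. The sketch below only generates those anchors. The stride, scales, and aspect ratios are illustrative assumptions, and `make_anchors` is a hypothetical helper rather than part of any library.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels


def make_anchors(feat_h: int, feat_w: int, stride: int = 16,
                 scales: Sequence[float] = (128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Place one anchor per (scale, ratio) pair at the center of every feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # ratio = height / width
                    # Keep the area at scale**2 while changing the shape.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 54 = 2 * 3 locations * 9 anchors each
```

Because the anchors are computed from the same feature map that the second module consumes, both modules can share one pass through the deep CNN.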


CenterNet

Object detection models in the CenterNet family, which follow a key point-based approach to object detection, have become more popular in recent years. Compared to R-CNN, for example, CenterNet2 is more efficient and more accurate. A drawback, however, is its slow training process.


YOLO

YOLO, or You Only Look Once, involves a single neural network that’s trained end-to-end, taking a photo as input and directly predicting bounding boxes along with a class label for each box. It provides lower predictive accuracy than the region-based models, but works at 45 frames per second.

The model works by splitting the input image into a grid of cells. Each cell is responsible for predicting a bounding box if the center of that box falls within the cell. Each prediction consists of the x and y coordinates of the box, along with its width, height, and a confidence score.
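The grid assignment can be sketched as follows. This assumes a 7x7 grid, center offsets measured relative to the responsible cell, and width and height measured relative to the whole image; `encode_box` is a hypothetical helper, and the confidence score is left out because it is a model output rather than something derived from the box.

```python
from typing import Tuple


def encode_box(box: Tuple[float, float, float, float],
               image_size: Tuple[int, int],
               grid_size: int = 7) -> Tuple[int, int, float, float, float, float]:
    """Map an (x1, y1, x2, y2) pixel box to (row, col, x, y, w, h) for its grid cell."""
    x1, y1, x2, y2 = box
    width, height = image_size
    # Box center in grid units, e.g. 3.125 means 12.5% of the way into column 3.
    gx = (x1 + x2) / 2 * grid_size / width
    gy = (y1 + y2) / 2 * grid_size / height
    # The cell containing the center is the one responsible for this box.
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    return (row, col, gx - col, gy - row, (x2 - x1) / width, (y2 - y1) / height)


row, col, x, y, w, h = encode_box((100, 150, 300, 350), image_size=(448, 448))
print(row, col)  # 3 3
```

Here the box center (200, 250) lands in the cell at row 3, column 3, so only that cell is expected to predict this box.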

The model was later updated as YOLO v2, and a variation of it was trained on two object recognition datasets in parallel, making it capable of predicting 9,000 object classes. YOLO v2 also uses anchor boxes, like Faster R-CNN, which are pre-defined boxes with useful sizes and shapes that are tailored during training.

The predicted representation of the bounding boxes is also changed so that small changes in the network’s output have less of an effect on predictions, which leads to a more stable model.
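In YOLO v2, this is done by predicting offsets relative to a grid cell and an anchor box instead of raw coordinates, with a logistic function keeping the box center inside its cell. The following is a minimal sketch of that decoding step; the function and variable names are hypothetical, and the units are grid cells.

```python
import math
from typing import Tuple


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t: Tuple[float, float, float, float],
               cell: Tuple[int, int],
               anchor: Tuple[float, float]) -> Tuple[float, float, float, float]:
    """Convert raw network outputs (tx, ty, tw, th) into a box (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell    # column and row index of the grid cell
    pw, ph = anchor  # anchor box width and height
    bx = sigmoid(tx) + cx    # center x stays within the cell
    by = sigmoid(ty) + cy    # center y stays within the cell
    bw = pw * math.exp(tw)   # width scales the anchor width
    bh = ph * math.exp(th)   # height scales the anchor height
    return bx, by, bw, bh


box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 3), anchor=(2.0, 1.5))
print(box)  # (3.5, 3.5, 2.0, 1.5)
```

However large `tx` or `ty` becomes, the decoded center cannot leave the cell, which is why small changes in the raw outputs produce only small changes in the predicted box.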

Object detection applications

Crowd counting

Crowd counting can be useful in heavily populated areas like airports and shopping malls. The same counting approach can also help to track factors such as road traffic and the number of passing vehicles.
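Once a detector has produced its output, counting reduces to tallying the detections per class. The sketch below assumes detections arrive as hypothetical `(label, confidence, box)` tuples and that a confidence threshold of 0.5 is acceptable; both are illustrative choices.

```python
from collections import Counter
from typing import List, Tuple

Detection = Tuple[str, float, Tuple[int, int, int, int]]  # (label, confidence, box)


def count_objects(detections: List[Detection], min_confidence: float = 0.5) -> Counter:
    """Count detections per class label, ignoring low-confidence ones."""
    counts = Counter()
    for label, confidence, _box in detections:
        if confidence >= min_confidence:
            counts[label] += 1
    return counts


# Hypothetical detector output for one video frame.
frame_detections = [
    ("person", 0.92, (10, 20, 50, 120)),
    ("person", 0.81, (60, 25, 100, 130)),
    ("person", 0.30, (200, 40, 230, 110)),  # below the threshold, not counted
    ("car", 0.77, (120, 80, 260, 160)),
]
counts = count_objects(frame_detections)
print(counts["person"], counts["car"])  # 2 1
```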

Autonomous cars

Autonomous vehicles depend on real-time object detection capabilities. The artificial intelligence-based models allow for the location, identification, and tracking of objects around the vehicles, for increased efficiency and safety.

Anomaly detection

Object detection models are extremely useful in agriculture, as they can accurately detect potential instances of plant disease. This lets farmers know as soon as it becomes an issue so that crops aren’t completely lost. The models can also help in healthcare, detecting skin lesions before they become more dangerous.

Video surveillance

Detecting objects and tracking their movements in real-time helps surveillance cameras monitor scenes in specific locations. This modern technology can accurately recognize and locate various instances of objects in the video. The system stores the data with real-time tracking feeds as objects move.