YOLO or You Only Look Once.

Posted: Tue Jan 21, 2025 10:24 am
by Maksudasm
This neural network is regarded as one of the most accurate for real-time object detection and recognition, as benchmarked on the Microsoft COCO dataset.

YOLO neural network

YOLO is a deep convolutional neural network that works as follows: first, the input image is divided into a set of cells forming a grid. The network then downsamples the image until it obtains a square 13-by-13 map, where each cell encodes whether an object is present in the corresponding part of the image and which class it belongs to. As a result, YOLO needs to look at the image only once, rather than making a separate region-proposal pass as two-stage detectors do. This single-pass approach significantly increases processing speed and requires far less computing power.
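The grid idea above can be sketched in a few lines: the cell that contains an object's centre is the one "responsible" for predicting it. This is an illustrative sketch of the assignment rule, not YOLO's actual implementation; the 416-pixel image size in the example is an assumption chosen to match a 13x13 grid.

```python
# Illustrative YOLO-style grid assignment: an S x S grid is laid over the
# image, and the cell containing an object's centre is responsible for it.

S = 13  # grid size mentioned in the text (13 x 13)

def responsible_cell(cx, cy, img_w, img_h, s=S):
    """Return (row, col) of the grid cell containing the point (cx, cy)."""
    col = min(int(cx / img_w * s), s - 1)  # clamp so cx == img_w stays in grid
    row = min(int(cy / img_h * s), s - 1)
    return row, col

# An object centred at (208, 208) in a 416 x 416 image lands in the
# middle cell of the grid.
print(responsible_cell(208, 208, 416, 416))  # (6, 6)
```

Because each cell makes its own presence/class prediction, a single forward pass yields detections for the whole image at once.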

YOLOv5 is an improved fifth version of the model, implemented on the PyTorch framework. YOLOv5 is available as the eponymous Python 3 module, which can be installed from PyPI. One advantage of this model is that several online data-annotation services support its label format, so YOLO can be trained on custom data: in just two to three hours, the network can be taught to find objects of a particular class.
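Whatever the version, a YOLO-family detector's raw output contains many overlapping candidate boxes per object, and a standard post-processing step, non-maximum suppression (NMS), keeps only the best-scoring box in each cluster. Below is a plain-Python sketch of greedy IoU-based NMS, for illustration only; it is not YOLOv5's own implementation.

```python
# Greedy non-maximum suppression: keep the highest-scoring box, then
# discard every remaining box that overlaps it too much, and repeat.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Return indices of the boxes kept after suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box overlaps the first
```

The IoU threshold controls how aggressively overlapping detections are merged; 0.5 is a common default, not a YOLOv5-specific value.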

FairMOT
FairMOT, or Fair Multi-Object Tracking. The distinctive feature of this high-performance method is its ability to track not one but several objects in a video using machine learning. It was developed by Microsoft researchers together with scientists from Huazhong University of Science and Technology in central China.

FairMOT neural network

The method uses a single-stage implementation based on a deformable convolutional network (DCNv2, Deformable Convolutional Networks v2), which significantly increases object-tracking speed. FairMOT was trained on a combination of six public person-detection and person-search datasets, including ETH, CityPersons, CalTech, MOT17, and CUHK-SYSU. In testing, FairMOT ran faster than the competing TrackR-CNN and JDE models on video streams at 30 frames per second.
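A core sub-problem in any multi-object tracker is associating the boxes detected in the current frame with the tracks from previous frames. The sketch below shows a simple greedy IoU-based association, a deliberately simplified illustration: FairMOT itself additionally uses learned re-identification features for matching, which are omitted here.

```python
# Greedy frame-to-frame association: pair each existing track with the
# detection it overlaps most, highest-overlap pairs first.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Map track index -> detection index for sufficiently overlapping pairs."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    matches, used = {}, set()
    for score, ti, di in pairs:
        if score < iou_thresh or ti in matches or di in used:
            continue
        matches[ti] = di
        used.add(di)
    return matches

tracks = [(0, 0, 10, 10), (100, 100, 110, 110)]
dets = [(101, 101, 111, 111), (1, 1, 11, 11)]
print(associate(tracks, dets))
```

Unmatched detections would start new tracks, and tracks left unmatched for several frames would be dropped; production trackers typically replace the greedy loop with the Hungarian algorithm.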

MediaPipe
The vast majority of neural networks can detect and recognize only 2D objects, even in video footage; the bounding boxes drawn around detected objects are likewise 2D. But many applications now demand a more accurate, spatial method of detecting and tracking objects.

To solve this problem, Google AI developed MediaPipe Objectron, a mobile pipeline that detects 3D bounding boxes for everyday objects in real time, working from ordinary 2D images. Objectron uses a single-stage model to predict object poses; structurally, it is an encoder-decoder built on a MobileNetV2 backbone.
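The jump from 2D to 3D detection changes what the network must output: a 2D box is two corner points, while a 3D box has eight corners whose positions in the image depend on the camera. The sketch below illustrates this geometry by projecting a 3D box's corners onto the image plane with a pinhole camera model; it is a generic geometric illustration, not Objectron's code, and the focal length value is an arbitrary assumption.

```python
from itertools import product

def box_corners(cx, cy, cz, w, h, d):
    """The 8 corners of an axis-aligned 3D box centred at (cx, cy, cz)."""
    return [(cx + sx * w / 2, cy + sy * h / 2, cz + sz * d / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def project(point, f=500.0):
    """Pinhole projection of a camera-space 3D point onto the image plane."""
    x, y, z = point
    return (f * x / z, f * y / z)

# A 1 m cube 4 m in front of the camera: its 8 corners become 8 image
# points, which is why a 3D detector must predict more than a rectangle.
corners = box_corners(0.0, 0.0, 4.0, 1.0, 1.0, 1.0)
print([project(p) for p in corners])
```

Recovering the 3D box from those projected points is the inverse problem a 3D detector learns to solve from a single image.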