Target tracking refers to following one or more targets of interest in a given scene. Traditionally it has been applied to video streams, where a target is first detected and then tracked through subsequent frames. It is now essential to autonomous driving systems, such as the self-driving vehicles built by companies like Uber and Tesla.
Tracking methods can be divided into two categories according to their observation model: generative methods and discriminative methods. A generative method describes the target's appearance with a generative model (PCA, for example) and searches for the target by minimizing reconstruction error. A discriminative method instead learns to distinguish the target from the background; it is generally more robust and has gradually become the dominant approach. Discriminative methods are also called tracking-by-detection, and deep learning approaches fall into this category. To track by detection, candidate targets are detected in every frame, and a deep network identifies the desired target among the candidates. Two basic network models are common: the stacked autoencoder (SAE) and the convolutional neural network (CNN).
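The tracking-by-detection loop above can be sketched in a few lines. This is a minimal NumPy illustration, not any paper's implementation: `sample_candidates` and `score` are hypothetical stand-ins for the candidate generator and the learned classifier, and the "target" at (50, 50) is invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidates(prev_box, n=64, std=8.0):
    """Draw candidate boxes (x, y, w, h) around the previous target state."""
    boxes = np.tile(prev_box, (n, 1)).astype(float)
    boxes[:, :2] += rng.normal(0.0, std, size=(n, 2))  # perturb position only
    return boxes

def score(boxes):
    """Placeholder classifier: higher score for boxes centered near (50, 50)."""
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    return -np.linalg.norm(centers - np.array([50.0, 50.0]), axis=1)

prev_box = np.array([40.0, 40.0, 20.0, 20.0])   # initial target state
for frame in range(3):                           # tracking-by-detection loop
    candidates = sample_candidates(prev_box)     # detect candidates per frame
    scores = score(candidates)                   # classify each candidate
    prev_box = candidates[np.argmax(scores)]     # best-scoring candidate wins
print(prev_box)
```

In a real tracker the classifier would be a deep network and the candidates would come from a detector or a motion model, but the select-the-highest-scoring-candidate structure is the same.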
The most popular SAE-based deep network for tracking is the Deep Learning Tracker (DLT), which introduced offline pre-training followed by online fine-tuning. The process is as follows:
Offline unsupervised pre-training: a stacked denoising autoencoder is pre-trained on a large-scale natural-image dataset to obtain a generic object representation. By adding noise to the input image and reconstructing the original, the stacked denoising autoencoder learns a more robust feature representation.
The encoder portion of the pre-trained network is then combined with a classifier to obtain a classification network, which is fine-tuned on positive and negative samples drawn from the initial frame so that it can distinguish the current target from the background. DLT uses a particle filter as its motion model to generate candidate patches for the current frame. The classification network outputs a probability score for each patch, indicating the confidence that it contains the target, and the patch with the highest score is selected as the target.
For model updates, DLT uses a threshold-based scheme: when the confidence of the best patch falls below a threshold, the network is fine-tuned online, which lets the tracker adapt to appearance changes without updating on every frame.
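The threshold-based update can be sketched as follows. The threshold `TAU`, the scores, and the `model_version` counter (standing in for an online fine-tuning step) are illustrative values for this sketch, not taken from the DLT paper.

```python
import numpy as np

TAU = 0.9  # confidence threshold (illustrative value)

def track_frame(scores, model_version):
    """Pick the best candidate; trigger an update only when confidence drops."""
    best = int(np.argmax(scores))
    confidence = float(scores[best])
    if confidence < TAU:
        model_version += 1  # stand-in for fine-tuning the network online
    return best, confidence, model_version

version = 0
for scores in ([0.2, 0.95, 0.4], [0.3, 0.6, 0.55], [0.97, 0.1, 0.2]):
    best, conf, version = track_frame(np.array(scores), version)
    print(best, conf, version)
```

Here the model is updated only on the second frame, where the best confidence (0.6) falls below the threshold; confident frames leave the model untouched.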
Owing to its success in image classification and object detection, the CNN has become the mainstream deep model for computer vision and for visual tracking. In general, a large-scale CNN can be trained both as a classifier and as a tracker. Two representative CNN-based tracking algorithms are the Fully Convolutional Network Tracker (FCNT) and the Multi-Domain Network (MDNet).
FCNT analyzes and exploits the feature maps of the VGG model pre-trained on ImageNet, and makes the following observations:
CNN feature maps can be used for localization and tracking.
Many CNN feature maps are noisy or irrelevant to the task of distinguishing a particular target from its background.
Higher layers capture the semantic concepts of object classes, while lower layers encode more discriminative features that capture intra-class variation.
Based on these observations, FCNT designs a feature selection network to choose the most relevant feature maps from the conv4-3 and conv5-3 layers of the VGG network. Then, to avoid overfitting to noise, it builds two additional channels (called SNet and GNet) on the selected features of the two layers: GNet captures the category information of the target, while SNet distinguishes the target from distractors with a similar appearance. Finally, SNet and GNet each produce a predicted heat map, and the tracker decides which heat map to use for the final tracking result depending on whether a distractor is present. The flow of FCNT is as follows.
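The feature-selection idea can be illustrated with a small NumPy sketch. The feature maps, the target mask, the relevance measure (fraction of a map's energy inside the target region), and the boosted channels are all invented for the example; FCNT's actual selection network is learned, not hand-scored like this.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 32, 14, 14
feats = rng.normal(size=(C, H, W))               # stand-in for conv4-3 maps
mask = np.zeros((H, W))
mask[4:10, 4:10] = 1.0                           # target region

for c in (3, 11, 25):                            # make a few channels
    feats[c] += 2.0 * mask                       # genuinely target-related

def channel_relevance(feats, mask):
    """Score each map by how much of its energy falls inside the target."""
    energy = np.abs(feats)
    inside = (energy * mask).sum(axis=(1, 2))
    total = energy.sum(axis=(1, 2))
    return inside / total

rel = channel_relevance(feats, mask)
top_k = np.argsort(rel)[::-1][:8]                # keep the 8 most relevant maps
selected = feats[top_k]
print(sorted(int(c) for c in top_k))
```

Discarding the noisy, irrelevant maps before building SNet and GNet is what keeps the two branches from overfitting to background clutter.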
Unlike FCNT, MDNet uses entire video sequences to track moving objects. Networks like those above rely on unrelated image data to reduce the demand for tracking data, an idea that deviates somewhat from tracking itself: an object that is the target in one video may be background in another. MDNet therefore introduces the concept of multiple domains so that targets and backgrounds are distinguished independently within each domain, where a domain denotes a set of videos containing the same type of target.
As shown below, MDNet is divided into two parts: shared layers and K branches of domain-specific layers. Each branch contains a binary classification layer with softmax loss, used to distinguish the target from the background in its own domain, while the shared layers are shared across all domains to ensure a generic representation.
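The shared-plus-branches structure can be sketched as below. This is a toy NumPy version under invented dimensions: the single shared matrix stands in for MDNet's shared convolutional and fully connected layers, and each entry of `heads` stands in for one domain-specific binary branch.

```python
import numpy as np

rng = np.random.default_rng(3)
K, feat_dim = 4, 32                               # K video domains

W_shared = rng.normal(scale=0.1, size=(feat_dim, feat_dim))        # shared layer
heads = [rng.normal(scale=0.1, size=(2, feat_dim)) for _ in range(K)]  # K binary heads

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, domain):
    """Shared representation, then only the branch for this video's domain."""
    h = np.maximum(W_shared @ x, 0.0)             # shared ReLU layer
    return softmax(heads[domain] @ h)             # target-vs-background probs

p = forward(rng.normal(size=feat_dim), domain=2)
print(p)
```

During multi-domain training, each minibatch comes from a single video and updates the shared layers plus only that video's branch; at test time the trained branches are discarded and a fresh binary layer is fine-tuned online for the new sequence.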
In recent years, deep learning researchers have tried different approaches to adapt to the characteristics of the visual tracking task. Many directions have been explored: applying other network models, such as recurrent neural networks and deep belief networks; designing network structures suited to video processing and end-to-end learning; optimizing processes, structures, and parameters; and even combining deep learning with traditional computer vision methods, or with techniques from other fields such as language processing and speech recognition.
To be continued.