What helps self-driving cars “see”? At their core, autonomous vehicle perception systems rely on object tracking algorithms to help understand their surroundings and ensure optimal safety.
By tracking surrounding cars, we can understand their speed and distance, and thus decide if an autonomous vehicle should switch lanes or slow down. From the tracking result of pedestrians, we can understand their motion (are they standing still or crossing the street?), and decide accordingly if it’s time to hit the brakes.
Visual object tracking is an important topic in computer vision, extending far beyond autonomous driving to traffic monitoring, robotics, medical imaging, and more — but how does this technology work?
In this post, we’ll examine some of the key concepts, models, and research involved in high-performing object tracking algorithms, and how Sama is applying them to our data annotation platform to help our customers achieve greater labeling throughput.
Key concept 1: Single object tracking
Depending on the desired tracking output, object tracking can be divided into a series of more specific tasks: single object tracking, multiple object tracking, video object segmentation, and multiple object tracking and segmentation.
For the purpose of this post, we will focus on single object tracking, which uses the initial state of a target in the first frame to automatically obtain the object states in subsequent video frames.
With single object tracking for video annotation, annotators label a single keyframe — in this case, a bounding box around a road sign — and ML extrapolation then accurately predicts and annotates a set number of subsequent frames.
Key concept 2: Siamese networks
If you were asked to identify a certain car in a still image, what would you do?
Intuitively, you would likely scan through the image and identify vehicle-like objects, then compare their appearance with your reference image until you found the one that looked most similar.
In machine learning, this similarity matching process can be done using a Siamese network — an artificial neural network that typically contains two identical subnetworks that share the same architecture and parameters.
When comparing object 1 and object 2, we extract their feature maps (s1 and s2) using the Siamese network. From there, we can calculate the similarity between the feature maps to decide whether they represent the same object or not.
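The comparison step can be sketched in a few lines. This is a toy illustration, not a real tracker: the feature extractor here just flattens pixel values, where a production Siamese network would use a shared CNN, and cosine similarity stands in for the learned similarity function.

```python
import math

def extract_features(image_patch):
    """Stand-in for the shared Siamese subnetwork. In practice this would
    be a CNN applied identically to both inputs; here we simply flatten
    the patch into a feature vector (illustrative only)."""
    return [float(v) for row in image_patch for v in row]

def cosine_similarity(s1, s2):
    """Compare two feature vectors; values near 1.0 suggest the same object."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norm1 = math.sqrt(sum(a * a for a in s1))
    norm2 = math.sqrt(sum(b * b for b in s2))
    return dot / (norm1 * norm2)

patch1 = [[1, 2], [3, 4]]
patch2 = [[1, 2], [3, 5]]  # slightly different appearance
s1, s2 = extract_features(patch1), extract_features(patch2)
score = cosine_similarity(s1, s2)  # high score -> likely the same object
```

Because both patches pass through the *same* extractor with the same weights, their features live in a common space where similarity comparison is meaningful — that weight sharing is the defining property of a Siamese network.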
Key concept 3: Search windows
Using a Siamese network, we are able to identify a tracking object in a single frame. Now it’s time to identify that same object in subsequent frames. Let’s say we want to track the black car in the red box below:
You’ll notice that the car is quite small relative to the size of the entire frame.
Traditional sliding window approaches would compare every possible patch in the frame with the car, resulting in needlessly slow processing times. Luckily, there’s a more efficient way to track the black car across subsequent frames, using something called a search window.
In most cases, it can be assumed that the object size and location in two adjacent frames will be similar. Therefore, based on the object size and location in the previous frame, we can generate a search window and only perform tracking in that region.
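Generating such a window is simple geometry. The sketch below assumes a `(x, y, w, h)` box convention and a scale factor of 2; real trackers tune the scale per dataset, and the function name is ours, not from any particular library.

```python
def search_window(prev_box, frame_w, frame_h, scale=2.0):
    """Given the previous-frame bounding box (x, y, w, h), return a
    search region `scale` times larger, centered on the box and
    clipped to the frame boundaries."""
    x, y, w, h = prev_box
    cx, cy = x + w / 2, y + h / 2          # box center
    sw, sh = w * scale, h * scale          # enlarged window size
    x0 = max(0, cx - sw / 2)
    y0 = max(0, cy - sh / 2)
    x1 = min(frame_w, cx + sw / 2)
    y1 = min(frame_h, cy + sh / 2)
    return (x0, y0, x1 - x0, y1 - y0)

# The tracker then only compares patches inside this window,
# instead of scanning the entire 1920x1080 frame:
window = search_window((100, 100, 40, 20), frame_w=1920, frame_h=1080)
```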
This commonly-used search window strategy can measurably improve tracking efficiency but is not without its drawbacks — more on that later.
Siamese trackers literature review
In the field of tracking, Siamese approaches have gained a lot of attention in recent years due to their balance of accuracy and speed. As we mentioned above, Siamese trackers look for the target in the search frame using similarity comparison, treating tracking as a target matching problem.
A multitude of Siamese trackers have been proposed, and we will cover a subset of them below, along with their strengths and limitations.
We'd be remiss not to mention SiamFC, one of the pioneering works in the field. SiamFC estimates the region-wise feature similarity between two frames in a fully convolutional manner. However, it lacks bounding box regression and requires multi-scale testing, which makes it less elegant and its bounding box predictions less precise.
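The "fully convolutional" idea amounts to sliding the template's feature map over the search region's feature map and recording a similarity score at every offset, yielding a response map whose peak marks the target. The toy version below uses raw dot products on plain 2D lists in place of learned CNN features:

```python
def cross_correlation(search, template):
    """Slide `template` over `search` and record the dot product at each
    offset, producing a response map like the one SiamFC uses to
    locate the target (toy version on raw values, not CNN features)."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            score = sum(
                search[i + di][j + dj] * template[di][dj]
                for di in range(th) for dj in range(tw)
            )
            row.append(score)
        response.append(row)
    return response

template = [[1, 1], [1, 1]]
search = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
resp = cross_correlation(search, template)
# The peak of the response map marks the best-matching offset.
```

Note that the response map gives a *location* but not a box size — which is exactly why SiamFC has to run at multiple scales, the weakness that motivated the trackers below.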
SiamRPN, DaSiamRPN, and SiamRPN++
SiamRPN addresses this limitation by introducing a region proposal network (RPN) after the original Siamese subnetwork for feature extraction. The RPN jointly learns a classification branch and a regression branch for region proposals, then selects the best proposal as the prediction according to the classification scores.
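The selection step described above reduces to an argmax over classification scores. This sketch assumes proposals are `(x, y, w, h)` boxes already refined by the regression branch; the function name is illustrative, not from the SiamRPN codebase:

```python
def select_best_proposal(proposals, cls_scores):
    """The RPN's classification branch assigns each proposal a
    foreground score; the tracker keeps the highest-scoring proposal
    as its prediction for the frame."""
    best = max(range(len(proposals)), key=lambda i: cls_scores[i])
    return proposals[best]

proposals = [(10, 10, 30, 20), (12, 11, 28, 19), (50, 60, 30, 20)]
cls_scores = [0.35, 0.92, 0.10]
best_box = select_best_proposal(proposals, cls_scores)
```

Because the regression branch has already adjusted each box's position and size, the chosen proposal is a precise bounding box — no multi-scale testing required.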
SiamRPN has achieved great performance on challenging benchmarks. For example, on VOT2015, SiamRPN runs at 160 FPS — nearly twice the speed of SiamFC (86 FPS) — while gaining a 23% relative increase in EAO (expected average overlap). On VOT2017, SiamRPN surpasses SiamFC by 33% in EAO. Since then, much work has been done to further improve it. For example:
- DaSiamRPN was proposed to improve the model’s ability to deal with distractors in cluttered backgrounds.
- SiamRPN++ and SiamDW improve SiamRPN through better model architecture design. For example, SiamRPN++ introduced a ResNet-driven Siamese tracker and proposed layer-wise and depth-wise aggregations, which lead to significant performance gains.
One notable weakness of SiamRPN is that its performance is very sensitive to the hyperparameters of anchors used in training. This means that it needs to be carefully tuned to achieve ideal performance.
To avoid this tricky hyperparameter tuning and reduce human intervention, SiamCAR was proposed. Free of both proposals and anchors, SiamCAR takes one unique response map to predict an object’s location and its bounding box. Observing that locations far away from the center of a target tend to produce low-quality predictions, the authors also add a center-ness branch to remove outliers. With these changes, the number of hyperparameters is significantly reduced, resulting in a more accurate and faster tracker. On the GOT-10K dataset, SiamCAR is much faster than most evaluated trackers, and it improves the average overlap (AO) score by 5.2% compared to SiamRPN++.
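The center-ness idea is easy to see in code. In anchor-free detectors of this family (FCOS, which SiamCAR builds on), each location predicts its distances `l, t, r, b` to the four sides of the box, and center-ness scores how centered that location is — 1.0 at the exact center, decaying toward 0 at the edges:

```python
import math

def centerness(l, t, r, b):
    """FCOS-style center-ness: l, t, r, b are a location's distances to
    the left, top, right, and bottom sides of the ground-truth box.
    Off-center locations get a low score, down-weighting their
    (typically low-quality) box predictions."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

center = centerness(10, 10, 10, 10)  # dead center of a 20x20 box
edge = centerness(1, 10, 19, 10)     # far off to one side
```

Multiplying this score into the classification response suppresses outlier boxes without any anchor hyperparameters to tune.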
Overcoming unique challenges introduced by diverse datasets
While applying object tracking in Sama for our video annotation offering, we realized that designing sophisticated model architecture for training was not enough to obtain a high-performance tracker.
Different datasets introduce unique challenges that demand customized inference techniques.
Tackling low frame rates or high-speed objects
When we introduced the search window concept earlier, we made the assumption that the object in adjacent frames wouldn’t change drastically in terms of location and size. However, there are cases in which a video will consist of low frame rates or objects moving at a very fast speed, making it challenging to find a suitable search window.
To overcome this limitation, it is possible to use a Kalman filter to predict where the tracked object will be in the next frame based on its learned trajectory. In cases where no clear trajectory can be found (such as a mosquito flying unpredictably in any direction at any time), other techniques such as cross-correlation can come to the rescue.
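The intuition behind the motion-based prediction can be shown with just the prediction step of a constant-velocity model. This is a heavy simplification — a full Kalman filter also maintains covariances and a measurement-update step — but it captures why the search window can be re-centered ahead of a fast-moving object:

```python
def predict_constant_velocity(state, dt=1.0):
    """Prediction step of a constant-velocity motion model, the common
    simplification used inside Kalman-filter trackers.
    state = (x, y, vx, vy): position in pixels, velocity in px/frame."""
    x, y, vx, vy = state
    return (x + vx * dt, y + vy * dt, vx, vy)

# A car moving roughly 12 px right and 3 px down per frame:
state = (100.0, 50.0, 12.0, 3.0)
predicted = predict_constant_velocity(state)
# Center the next search window on (predicted[0], predicted[1])
# rather than on the previous-frame location.
```

At low frame rates the per-frame displacement grows, so anchoring the search window on the predicted position instead of the last observed one keeps the target inside the window.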
Updating the tracking object template
Typically, for the sake of efficiency, an object template is initialized in the first frame and is fixed for the remainder of the video tracking. However, when there is a drastic appearance change (for example, the appearance of a running dog can change drastically from one frame to the next), failing to update the template could lead to early failure of the tracker.
To address this issue, UpdateNet takes the initial template, the accumulated template, and the template of the current frame as inputs, and estimates the optimal template for the next frame using a convolutional neural network.
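To see what UpdateNet improves on, here is the classical linear update it replaces: a fixed running-average blend of the accumulated template with the current frame's template. The blend rate below is illustrative, and templates are plain 2D lists standing in for feature maps:

```python
def linear_template_update(accumulated, current, rate=0.1):
    """Simple running-average template update: blend the accumulated
    template with the current frame's template. `rate` controls how
    quickly appearance changes are absorbed (illustrative value)."""
    return [
        [(1 - rate) * a + rate * c for a, c in zip(row_a, row_c)]
        for row_a, row_c in zip(accumulated, current)
    ]

accumulated = [[1.0, 1.0], [1.0, 1.0]]
current = [[2.0, 2.0], [2.0, 2.0]]  # the target's appearance has shifted
updated = linear_template_update(accumulated, current)
# UpdateNet replaces this fixed blend with a small CNN that predicts
# the best next-frame template from the initial, accumulated, and
# current templates together.
```

A fixed rate must trade drift (updating too fast) against staleness (updating too slowly); letting a network learn the update sidesteps that trade-off.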
Single object tracking in the Sama platform
Manual annotation for a large volume of video frames is not only time-consuming but also quite expensive.
The good news is that the field of visual object tracking is flourishing, with many papers published per year and even challenges to pit methods against one another. Sama’s team of ML researchers and scientists is dedicated to following and contributing to the latest techniques to equip our annotators with best-in-class trackers that ultimately increase their productivity.
Learn more about Sama ML Object Tracking for Video Annotation, which balances labeling speed and accuracy so our clients can bring their models to market faster.
Special thanks to Arman Kizilkale, Bingqing Yu, Pascal Jauffret, and Frederic Ratle for their contributions to the literature review, research, and development mentioned in this article. Thanks to Jean-François Marcil and Megan McNeil for their contributions to this article.
- Bertinetto, Luca, et al. “Fully-convolutional siamese networks for object tracking.” European Conference on Computer Vision. Springer, Cham, 2016.
- Li, Bo, et al. “High performance visual tracking with siamese region proposal network.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
- Zhu, Zheng, et al. “Distractor-aware siamese networks for visual object tracking.” Proceedings of the European Conference on Computer Vision (ECCV). 2018.
- Li, Bo, et al. “SiamRPN++: Evolution of siamese visual tracking with very deep networks.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
- Zhang, Zhipeng, and Houwen Peng. “Deeper and wider siamese networks for real-time visual tracking.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
- Guo, Dongyan, et al. “SiamCAR: Siamese fully convolutional classification and regression for visual tracking.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
- Zhang, Lichao, et al. “Learning the model update for siamese trackers.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.