Short intro from our CEO, Steve Callanan:
'Since the very beginning of WIREWAX we've been constantly innovating and pushing technology out of its comfort zone to revolutionize how we process, distribute and interact with video. We've never stopped experimenting, developing and inventing new things. Our technology has won many awards, been adopted by many others and is repeatedly used in teaching to demonstrate the potential of modern web technologies. Where possible, we will share our discoveries and research with you too; for example, our step-by-step tutorial to build your own adaptive streaming video player has been shared over 38,000 times and continues to help thousands of budding developers and even corporations creating their own players. We believe sharing these discoveries is the right thing to do. We'll all be better off working together and sharing knowledge to build bigger, better and faster technologies that benefit us all.'
'As world leaders in motion-tracking hotspot technology, the ability for our machines to track the motion of moving people and objects in the video, through a variety of challenging conditions (changes in orientation, lighting, size and resolution), is one of the biggest challenges we face. The results must be impeccable and we constantly strive for perfection. However, there are rare times when a human may need to provide a little assistance and whenever this is required, we want to make sure we only ask very little, just a point in the right direction and our machines will take over. This latest post by our exceptional Lead Vision Scientist is another chance for you to see how we tackled this and invented something to solve one of computer vision's biggest bugbears. Warning, it's tier 1 science stuff...'
Author: Tom Rackham, Lead Vision Scientist at WIREWAX
While WIREWAX has one of the world’s most robust and accurate motion tracking engines, there are rare occasions when the tracked motion of a person or object needs to be manually corrected. The Edit Tracking tool is designed to minimise the amount of manual work required to make amendments to a track and to second-guess the corrections you’re making.
When an adjustment is made to any frame in the path, an intelligent interpolation method works behind the scenes to figure out the best new track position. Repositioning the bounding box creates a new visual reference of your object that may differ from the one provided when the object was originally tagged. A model can be built from these two or more references and, given the manually defined positions, the probable position of the object between those points can be predicted using a 4D construct of the temporal movement of that model.
Separating out the frames into a third dimension (time) and mapping a heatmap of probability onto them reveals a four-dimensional probability map. A line of most probable position can then be drawn through the three dimensions, providing a two-dimensional position path. You can see a visualisation of the 4D probability map below:
Figure 1: Visualisation of the 4D Probability map
This method means that you may only need to provide one or two additional visual references at evenly spaced times along your path for a very accurate interpolation between anchor points to be generated.
The method is limited to finding the best path within an optimisation range: the section of curve between the two anchor points on either side of the most recently edited anchor. An initial guess of the current track segment between anchor frames is calculated using a univariate spline through a set of anchor and knot points along the tracking path. While anchor points are fixed points that the curve is required to pass through, knots are used to guide the curvature of the path. Anchor points used for the spline are those selected and modified by the user in the Edit Tracking application. Knots are created automatically along the length of the path, with increased likelihood of occurring at points of high curvature. A linear interpolation of the bounding box size at each point in this new track is also calculated to fill in any missing frames.
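As a rough illustration of the last step only, the linear interpolation of bounding box size between anchor frames might look like the sketch below. The function name `interpolate_box_sizes` and the dictionary layout are assumptions made for this example, not part of the Edit Tracking code, and the spline itself is not shown.

```python
def interpolate_box_sizes(anchors, frames):
    """Linearly interpolate (width, height) for each requested frame.

    anchors: dict mapping an anchor frame index to its (width, height).
    frames:  iterable of frame indices to fill in.
    Frames outside the anchor range take the size of the nearest anchor
    (an assumption made for this sketch).
    """
    keys = sorted(anchors)
    sizes = {}
    for f in frames:
        if f <= keys[0]:
            sizes[f] = anchors[keys[0]]
            continue
        if f >= keys[-1]:
            sizes[f] = anchors[keys[-1]]
            continue
        # Find the pair of anchors that surrounds this frame.
        for lo, hi in zip(keys, keys[1:]):
            if lo <= f <= hi:
                t = (f - lo) / (hi - lo)
                w = anchors[lo][0] + t * (anchors[hi][0] - anchors[lo][0])
                h = anchors[lo][1] + t * (anchors[hi][1] - anchors[lo][1])
                sizes[f] = (w, h)
                break
    return sizes


# Two anchors ten frames apart; the box doubles in size between them.
sizes = interpolate_box_sizes({0: (100, 50), 10: (200, 100)}, range(0, 11))
```

Here the frame halfway between the two anchors receives a box halfway between the two anchor sizes.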
The Euclidean distance of the repositioned anchor point relative to bounding box size is calculated; if this is greater than 1.5, a Dynamic Programming method is used for track optimisation. Otherwise, Global Optimisation is used.
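A minimal sketch of that decision is shown below. The text does not say exactly how the distance is made relative to the bounding box size, so this example assumes each axis of the displacement is divided by the corresponding box dimension before the Euclidean norm is taken; `choose_optimiser` is a hypothetical name.

```python
import math

DISPLACEMENT_THRESHOLD = 1.5  # threshold quoted in the text


def choose_optimiser(old_centre, new_centre, box_size):
    """Return 'dynamic' for large anchor moves and 'global' otherwise.

    old_centre, new_centre: (x, y) of the anchor before and after the edit.
    box_size: (width, height) of the bounding box.
    Assumption: displacement is normalised per axis by the box dimensions.
    """
    dx = (new_centre[0] - old_centre[0]) / box_size[0]
    dy = (new_centre[1] - old_centre[1]) / box_size[1]
    relative_distance = math.hypot(dx, dy)
    if relative_distance > DISPLACEMENT_THRESHOLD:
        return "dynamic"
    return "global"


small_move = choose_optimiser((100, 100), (130, 100), (40, 40))
large_move = choose_optimiser((100, 100), (200, 100), (40, 40))
```

With a 40 x 40 box, a 30 pixel move gives a relative distance of 0.75 and selects global optimisation, while a 100 pixel move gives 2.5 and selects the dynamic method.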
An important input for both dynamic and global optimisation methods is the Appearance Volume. This is a model trained on the properties of image pixels within bounding boxes at the anchor points that occur within the optimisation range.
For each image frame along the segment of tracking curve, candidate bounding boxes are created; these are boxes centred at every point of the original bounding box, with the same dimensions as the original. The K-Nearest Neighbours of these candidate boxes to the learned features are found using squared norm distance. The results are stored in a volume, the size of which is the image height x image width x number of frames across the optimisation range. The volume is set to zero outside of 1.5 x the dimensions of the bounding box.
Figure 2: Cross section of the Appearance Volume (in the x dimension) with increasing deviation of a central anchor point.
Figure 2 shows how a cross section of the volume changes when varying the distance of the anchor point from its initial position. Since the volume is calculated by comparing the pixels within the bounding box at the anchor point to pixels surrounding the existing path, larger movements in the bounding box position result in increased dissimilarity along the path.
The core components of the optimisation calculation are as follows:
- trackDist is the distance between the new track and the original. This cost term is designed to limit vast deviation from the initially detected trajectory. Weighted by wDist (default = 5).
- app is the dissimilarity between the track appearance and the appearance model. This cost encourages the track to follow the path which has the greatest visual similarity to the targeted regions at the anchor points. Weighted by wApp (default = 1.4).
- C1 is the curve length of the new track. This prevents the new track from adopting a long path with wide deviation. Weighted by wC1 (default = 1).
- C2 is the smoothness of the new track. This aims to limit sharp changes in trajectory, but is not currently used as part of the implementation. Weighted by wC2 (default = 0).
For small changes in anchor point position with respect to the bounding box size, a straightforward global optimisation method is used. In this case, a track is created with the aim of globally minimising the value of F, defined as;
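$$F = \frac{\mathit{wDist} \times \mathit{trackDist} + \mathit{wApp} \times \mathit{app} + \mathit{wC1} \times \mathit{C1} + \mathit{wC2} \times \mathit{C2}}{\mathit{wDist} + \mathit{wApp} + \mathit{wC1} + \mathit{wC2}}$$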
using the initially interpolated track as a starting point. At each iteration of the minimisation process, the curve’s knots are repositioned and a new interpolated track is created (retaining the same anchor points as before). F is recalculated for this curve until minimisation conditions are satisfied; in our case, that the change in value of F between successive iterations is less than 0.005.
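To make the cost concrete, here is a small sketch of F as a weighted sum of the four terms, reading the trailing sum of weights in the formula as the denominator, together with the stopping test on successive values of F. The function names are hypothetical and the inputs are assumed to be already-computed scalars. The knot repositioning itself is not shown.

```python
def objective(track_dist, app, c1, c2=0.0,
              w_dist=5.0, w_app=1.4, w_c1=1.0, w_c2=0.0):
    """Weighted cost F, normalised by the sum of the weights.

    Default weights are the defaults quoted in the text.
    """
    weighted = w_dist * track_dist + w_app * app + w_c1 * c1 + w_c2 * c2
    return weighted / (w_dist + w_app + w_c1 + w_c2)


def has_converged(previous_f, current_f, tolerance=0.005):
    """True once F changes by less than the tolerance between iterations."""
    return abs(previous_f - current_f) < tolerance


f_value = objective(track_dist=2.0, app=0.0, c1=0.0)
```

With the default weights, the distance term dominates: a track that only differs in trackDist is penalised 5 / 7.4 of that distance.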
In the case of large changes of position from the initial track, a Dynamic Optimisation method is used. The steps taken to complete this optimisation are:
- An empty volume is created with the same dimensions as the Appearance Volume.
- The locations where an anchor point exists are set to 100, thereby always making them the first candidate locations for the new track position in their respective frames.
- For each frame, the top K candidate locations for the track position are selected based on their values in the Appearance Volume (K=500).
- Costs for each frame are calculated for each of the K candidate locations in the frame as: C = wDist x trackDist + wApp x app
- For each candidate point in a new frame, the cost in relation to every candidate point in the previous frame is calculated as: P = wDist x candidateDist x distWeight + wApp x app + 0.01 x wC1 x C1, where candidateDist is the distance between the current and previous candidate points.
- The lowest non-zero value of this vector is added to the current C, and the location of this point within the previous frame is stored. The total represents the minimum possible cost to move from the previous frame to the current one, using that particular candidate point.
- Now, every point in every frame has a cost-based connection to a point in the previous frame. The lowest cost path through the entire volume can be found by backtracking from the final frame to the first frame, at each stage finding the coordinates for the location with the lowest previous value of P.
- In cases where the final frame contains an anchor point, this backtracking process begins at that point. Otherwise, it begins at K=0 (the most likely candidate location).
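The steps above follow the general shape of a dynamic programming search for a lowest-cost path. The sketch below is a simplified, textbook version of that idea rather than the WIREWAX implementation: it assumes each frame already has a short list of candidate points with a per-candidate cost (standing in for C), uses only a distance-based transition cost (standing in for P), and starts backtracking from the cheapest candidate in the final frame. All names are hypothetical.

```python
import math


def lowest_cost_path(candidates, unary_costs, w_dist=5.0):
    """Return one (x, y) per frame forming the lowest-cost path.

    candidates[t]:     list of (x, y) candidate points in frame t.
    unary_costs[t][k]: cost of candidate k in frame t.
    Transition cost (simplified): w_dist times the distance moved.
    """
    n_frames = len(candidates)
    total = [list(unary_costs[0])]            # best cost to reach each point
    back = [[None] * len(candidates[0])]      # index of the best predecessor
    for t in range(1, n_frames):
        row_total, row_back = [], []
        for k, (x, y) in enumerate(candidates[t]):
            best_j, best_cost = None, math.inf
            for j, (px, py) in enumerate(candidates[t - 1]):
                cost = total[t - 1][j] + w_dist * math.hypot(x - px, y - py)
                if cost < best_cost:
                    best_j, best_cost = j, cost
            row_total.append(unary_costs[t][k] + best_cost)
            row_back.append(best_j)
        total.append(row_total)
        back.append(row_back)
    # Backtrack from the cheapest candidate in the final frame.
    k = min(range(len(total[-1])), key=total[-1].__getitem__)
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(candidates[t][k])
        k = back[t][k]
    path.reverse()
    return path


# Two candidates per frame: one near the x axis, one far away.
frames = [[(0, 0), (10, 10)], [(1, 0), (10, 11)], [(2, 0), (10, 12)]]
costs = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
path = lowest_cost_path(frames, costs)
```

In this toy example the path stays with the candidates near the x axis, because jumping to the distant candidate in the middle frame costs far more in distance than it saves in per-candidate cost.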
Figure 3 shows a standard interpolation (red) next to a 4D Interpolated track (green) when the original track (black) is offset. This can be seen for both forms of optimisation (although this example would normally default to Dynamic Optimisation). With both examples, the influence of the Appearance Volume on the position of the optimised track can be seen; as a different appearance has been learnt for the bounding box at the anchor point, the new track remains below the original.
(a) Global Optimisation
(b) Dynamic Optimisation
Figure 3: Initial track with altered anchor point (black), standard spline-based interpolation of the anchor points (red) and the final optimised path (green)
With the use of 4D Interpolation, image data is used to reevaluate and create a track closer to the user’s intention, rather than just spatially interpolating between a set of user-defined points. This should result in fewer anchor point modifications, and ultimately a far better user experience.
(a) Before 4D Interpolation
(b) After 4D Interpolation
Figure 4: An example of object tracking before and after 4D Interpolation is used to clean up the trajectory