# PyTorch implementation of "Fully-Convolutional Siamese Networks for Object Tracking"
This project is a PyTorch implementation of the object tracker presented in
Fully-Convolutional Siamese Networks for Object Tracking,
also available at their project page.
The original version was written in MATLAB with the MatConvNet framework, available
here (training and tracking), but this
Python version is adapted from the TensorFlow port (tracking only),
available here.
The project is divided into three major parts: Training, Tracking and a Visualization Application.
The Training part, the main focus of this work, deals with training the Siamese network to learn a similarity metric between image patches, as
described in the paper.
The only training metric comparable between different parameters and implementations is the average center error (or center displacement). The authors provide this metric in appendix B of their paper Learning feed-forward one-shot learners, which is 7.40 pixels for validation and 6.26 pixels for training.
Our Baseline Results are shown below:
We are around 4 pixels behind the authors, which we hypothesize is mainly due to:
On the other hand, we are much less prone to overfitting, because we sample the pairs differently on each training epoch, as opposed to the authors, who choose all the pairs beforehand and reuse the same pairs on every training epoch.
This trained model is made available as BaselinePretrained.pth.tar.
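The checkpoints are regular PyTorch files, so they can be inspected or reused outside the training script. Below is a minimal, hypothetical loading sketch: it assumes the file is a dictionary whose weights live under a `state_dict` key (as described in the training outputs below); adjust the key and the model class to your setup.

```python
import torch

# Hypothetical loading sketch -- the dictionary keys are assumptions, not a documented API.
checkpoint = torch.load("BaselinePretrained.pth.tar", map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)   # fall back to a raw state dict

# 'model' would be one of the embedding networks defined in models.py
# (e.g. BaselineEmbeddingNet); its instantiation is omitted here.
# model.load_state_dict(state_dict)
```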
```
pip install pyqt5
pip install pyqtgraph
```
In case you have Anaconda, install the conda virtual environment with:
```
conda env create -f environment.yaml
conda activate siamfc
pip install pyqtgraph
```
(**OPTIONAL:** To accelerate the data loading, refer to [this section](#accelerating-data-loading).)
2. Download the ImageNet VID dataset from http://bvisionweb1.cs.unc.edu/ILSVRC2017/download-videos-1p39.php and extract it to a folder of your choice (*OBS: the data is read at execution time, so if possible extract the dataset to an SSD partition*). You can get rid of the *test* part of the dataset, since it has no annotations.
3. For each new training we must create an *experiment folder* (the folder stores the training parameters and the training outputs):
```
cd training/experiments
mkdir <EXPERIMENT_NAME>
cp default/parameters.json <EXPERIMENT_NAME>/
```
4. Edit the *parameters.json* file with the desired parameters. The description of each parameter can be found [here](#training-parameters).
5. Run the *train.py* script:
```
python train.py --data_dir <DATASET_DIR>
```
<center>
<figure>
<img src="images/quick_train_screen.gif" height="60%" width="100%">
<figcaption>This gif illustrates the execution of the training script. It uses very few epochs, just to give a feel for the execution; a serious training run could take a whole day.
</figcaption>
</figure>
</center>
* Use `--time` in case you want to profile the execution times. They will be saved in the train.log file.
6. The outputs will be:
* `train.log`: The log of the training, most of which is also displayed in the terminal.
* `metadata.train` and `metadata.val`: The metadata of the training and validation datasets, written at the start of the program. **Simply copy these files to any new experiment folder to save time on setup (about 15 minutes in my case).**
* `metrics_val_last_weights.json`: The json containing the metrics of the most recent validation epoch. Human readable.
* `metrics_val_best_weights.json`: The json containing the metrics of the validation epoch with the best AUC score. Human readable.
* `best.pth.tar` and `last.pth.tar`: Dictionaries containing the model's state_dict among other information about the model. They can be loaded again later for training, validation or inference. Again, *last* corresponds to the most recent epoch and *best* to the best one.
* `tensorboard`: Folder containing the **tensorboard** files summarizing the training. It is separated into a *val* and a *train* folder so that the curves can be plotted in the same plot. To launch it, type:
```
tensorboard --logdir <EXPERIMENT_FOLDER>/tensorboard
```
<center>
<figure>
<img src="images/scalars.png" height="60%" width="60%">
<figcaption>The three metrics stored are the mean <b>AUC</b> of the ROC curve of the binary classification error between the label correlation map (defined by the parameters) and the actual correlation map, as well as the <b>Center Error</b>, which is the distance in pixels between the peak position of the correlation map and the actual center. Lastly, we also plot the mean <b>Binary Cross-Entropy Loss</b>, used to optimize the model.
</figcaption>
</figure>
</center>
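For reference, the Center Error can be computed directly from a score map once it is expressed in pixel coordinates. The snippet below is a minimal NumPy sketch (not the project's code); `score_map` and `true_center` are assumed to live in the same pixel space.

```python
import numpy as np

def center_error(score_map, true_center):
    """Distance in pixels between the score map's peak and the true target center.

    score_map:   2D array of correlation scores, already mapped to pixel coordinates.
    true_center: (row, col) of the ground-truth center in the same coordinates.
    """
    peak = np.unravel_index(np.argmax(score_map), score_map.shape)
    return float(np.hypot(peak[0] - true_center[0], peak[1] - true_center[1]))

# Example: a synthetic 255x255 score map whose peak is 5 pixels off-center.
score = np.zeros((255, 255))
score[127, 132] = 1.0
print(center_error(score, (127, 127)))  # -> 5.0
```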
**OBS**: Our definition of the loss is slightly different from the authors', but should be equivalent. Refer to [Loss Definition](#loss-definition).
<center>
<figure>
<img src="images/pairs.png" height="60%" width="60%">
<figcaption>We also store Ref/Search/Correlation Map trios for each epoch, for debugging and exploration purposes. They are collected in each validation epoch, thus the first image corresponds to the validation before the first training epoch. They allow us to see the evolution of the network's estimation after each training epoch. </figcaption>
</figure>
</center>
**OBS:** To set the number of trios stored in each epoch, use the `--summary_samples <NUMBER_OF_TRIOS>` flag:
```
python train.py -s 10 -d <DATASET_DIR>
```
The images might take a lot of space though, especially if the number of epochs is large.
### Additional Uses
#### Retraining/Loading Pretrained Weights
You can continue training a network or load pretrained weights by calling the train script with the flag `--restore_file <NAME_OF_MODEL>`, where `<NAME_OF_MODEL>` is the filename **without** the *.pth.tar* extension (e.g. *best*, for *best.pth.tar*). The program then searches for the file `<NAME_OF_MODEL>.pth.tar` inside the experiment folder and loads its state as the initial state; the rest of the training script continues normally.
```
python train.py -r <NAME_OF_MODEL>
```
#### Evaluation Only
Once you have finished training and have a *.pth.tar file containing the network's weights, you can evaluate it on the dataset by using `--mode eval` combined with `--restore_file <NAME_OF_MODEL>`:
```
python train.py -m eval -r <NAME_OF_MODEL>
```
The results of the evaluation are then stored in `metrics_test_best.json`.
### Tracking
[Under Development]
## Datasets
The dataset used for the tracker evaluation is a compilation of sequences from
the datasets [TempleColor](http://www.dabi.temple.edu/~hbling/data/TColor-128/TColor-128.html),
[VOT2013, VOT2014, and VOT2016](http://www.votchallenge.net/challenges.html). The Temple
Color and VOT2013 datasets are annotated with upright rectangular bounding boxes
(4 numbers), while VOT2014 and VOT2016 are annotated with rotated bounding boxes
(8 numbers). The order of the annotations seems to be the following:
* TC: LowerLeft(x , y), Width, Height
* VOT2013: LowerLeft(x, y), Width, Height
* VOT2014: UpperLeft(x, y), LowerLeft(x, y), LowerRight(x, y), UpperRight(x, y)
* VOT2016: LowerLeft(x, y), LowerRight(x, y), UpperRight(x, y), UpperLeft(x, y)
OBS: It is possible that the annotations in VOT2014 and VOT2016 simply represent a
sequence of points that define the contour of the bounding box, in no particular
order (but respecting the adjacency of the points in the rectangle). I didn't
check all the ground-truths to guarantee that all the annotations are in the
particular order described here. If something goes very wrong, you might want to
confirm this (a simple conversion sketch is given below).
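If you need upright boxes for all datasets, one simple option (given the caveat above about point order) is to take the min/max over the four corners of the 8-number annotations. This is a generic sketch, not part of this repository:

```python
def polygon_to_upright_box(values):
    """Convert an 8-number annotation (x1, y1, ..., x4, y4) into an upright box
    (x_min, y_min, width, height), regardless of the order of the corners."""
    xs, ys = values[0::2], values[1::2]
    x_min, y_min = min(xs), min(ys)
    return x_min, y_min, max(xs) - x_min, max(ys) - y_min

# Example with a slightly rotated rectangle:
print(polygon_to_upright_box([10, 20, 50, 24, 48, 60, 8, 56]))  # (8, 20, 42, 40)
```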
The dataset used for training is the 2015 ImageNet VID dataset, which contains
videos in which each frame is labeled with bounding boxes around the
targets.
## Visualization Application
<center>
<figure>
<img src="images/viz_app.gif" height="60%" width="80%">
<figcaption>
</figcaption>
</figure>
</center>
I've also developed a visualization application to see the output of the network in real time. The objective is to observe the merits and limitations of the network **as a similarity function**, **NOT** to evaluate or visualize the complete tracker.
The vis_app simply compares the reference embedding **with the whole frame** for each frame of a given sequence, instead of comparing it with a restricted search area as the tracker does. The basic pipeline is as follows (a minimal code sketch is given after the list):
* Get the first frame in which the target appears.
* Calculate a resizing factor such that the reference image (bounding box + context region) has a side of 127 pixels, just as during training. All the subsequent frames are resized by the same factor.
* Pass the reference image through the embedding branch to get the reference embedding. There is **no update** of the reference embedding.
* For each frame of the sequence we pad the sides with zeros, so that the output score map has the same dimensions as the frame itself.
* Convolve the reference embedding with the frame's embedding (which can be quite big) to get the full score map. The score map is upscaled to "undo" the effect of the stride and is overlaid on the frame. The score map's intensity is encoded as color with the [inferno colormap](https://matplotlib.org/users/colormaps.html).
* We also plot in each frame the peak of the score map as the network's guess of the current target position. We plot the Center Error curve as well.
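The sketch below illustrates the core of this pipeline (pad, embed, cross-correlate, upscale, locate the peak). It is a simplified approximation, not the actual vis_app code: `embedding_net` stands for any fully-convolutional embedding branch, and the padding/stride arithmetic is only indicative.

```python
import torch
import torch.nn.functional as F

def full_frame_score_map(embedding_net, reference, frame, total_stride=8):
    """Simplified sketch: score map of a reference patch over a whole frame.

    reference: (1, 3, 127, 127) tensor, already resized as described above.
    frame:     (1, 3, H, W) tensor, resized by the same factor.
    embedding_net: any fully-convolutional embedding branch (placeholder here).
    """
    ref_emb = embedding_net(reference)                 # (1, C, h, w)
    # Zero-pad the frame so the correlation output (roughly) covers the whole frame.
    pad = (ref_emb.shape[-1] // 2) * total_stride
    frame_emb = embedding_net(F.pad(frame, (pad, pad, pad, pad)))
    # Cross-correlate the reference embedding with the frame's embedding.
    score = F.conv2d(frame_emb, ref_emb)               # (1, 1, H', W')
    # Upscale to "undo" the effect of the network stride, then locate the peak.
    score = F.interpolate(score, scale_factor=total_stride,
                          mode="bilinear", align_corners=False)[0, 0]
    peak = torch.nonzero(score == score.max())[0]      # (row, col) of the peak
    return score, peak
```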
The application is implemented using two threads: the **display thread** that takes care of the GUI and the **producer thread** that does all the computations.
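As an illustration of that design (and not the actual vis_app code), here is a minimal producer/consumer sketch using Python's `threading` and `queue` modules; the per-frame work and the GUI update are replaced by placeholders.

```python
import queue
import threading
import time

# Bounded buffer between the producer (computation) thread and the display thread.
frame_buffer = queue.Queue(maxsize=32)

def producer(num_frames):
    """Stand-in for the producer thread: does the heavy per-frame computation."""
    for i in range(num_frames):
        time.sleep(0.01)                        # placeholder for embedding + correlation
        frame_buffer.put(f"score map for frame {i}")
    frame_buffer.put(None)                      # sentinel: the sequence is finished

def display():
    """Stand-in for the display thread: consumes results and updates the GUI."""
    while (item := frame_buffer.get()) is not None:
        print("displaying", item)               # placeholder for the pyqtgraph update

threading.Thread(target=producer, args=(5,), daemon=True).start()
display()  # in the real application the GUI loop runs here
```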
<center>
<figure>
<img src="images/first_line.png" height="60%" width="100%">
<figcaption> At the top of the window we show the current fps rate of the display; the visibility status of the target (true if the target is inside the frame); and the number of frames in the buffer that stores the frames produced by the producer thread.
</figcaption>
</figure>
</center>
<center>
<figure>
<img src="images/second_line.png" height="60%" width="80%">
<figcaption> At the bottom of this image we have the full path to the current frame; the Reference Image shows the image being used as reference by the network; the Ground Truth shows the current frame with the bounding box overlaid; and the Score Map shows the score map calculated by the network over the current frame. The blend between frame and score map can be controlled with an <b>alpha argument</b>. A green cross is placed at the peak of the score map.
</figcaption>
</figure>
</center>
<center>
<figure>
<img src="images/third_line.png" height="60%" width="100%">
<figcaption> Finally, we have a plot of the Center Error per frame, i.e. the distance between the score map's peak and the center of the ground truth bounding box. We draw a blue line at 63 pixels (half of the Reference Image side), which serves as a threshold above which the estimated position might be outside the bounding box.
</figcaption>
</figure>
</center>
### How to Run - Visualization Application
Assuming the ImageNet VID dataset is at the path \<ILSVRC>, and that the PyQt5 and pyqtgraph packages are installed:
```
python vis_app.py -d <ILSVRC>
```
Also, the sequences are separated into `train` and `val` sequences according to the dataset's organization, and numbered in increasing order. So to get the 150th training sequence, use the arguments `-t train -s 149` (the index starts at zero). The `train` sequences range from 0 to 3861 and the `val` sequences from 0 to 554.
You can also set the alpha value from 0.0 to 1.0, which controls how much of the frame is overlaid with the score map. To get the score map without the frame behind it, use the argument `-a 1`. To see all the arguments, use `python vis_app.py -h`.
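The alpha blending itself is straightforward; the sketch below shows one way to colorize a score map with the inferno colormap and blend it over a frame using matplotlib and NumPy. It is a generic example, not the vis_app implementation.

```python
import numpy as np
from matplotlib import cm

def overlay_score_map(frame, score_map, alpha=0.5):
    """Blend a score map, colorized with the inferno colormap, over an RGB frame.

    frame:     (H, W, 3) float array in [0, 1].
    score_map: (H, W) float array (any range; normalized here).
    alpha:     0.0 shows only the frame, 1.0 only the score map.
    """
    span = score_map.max() - score_map.min()
    norm = (score_map - score_map.min()) / (span + 1e-12)
    heat = cm.inferno(norm)[..., :3]            # colorize and drop the alpha channel
    return (1.0 - alpha) * frame + alpha * heat

# Example with random data:
frame = np.random.rand(255, 255, 3)
score = np.random.rand(255, 255)
blended = overlay_score_map(frame, score, alpha=0.7)
```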
## Training Parameters
Both the tracking and training scripts are defined in terms of user-defined parameters, which define much of their behaviour. The parameters are defined inside .json files and can be directly modified by the user. In both tracking and training, a given set of parameters defines what we call an *experiment*, and thus they are placed inside folders called `experiments` inside both `training` and `tracking`.
To define new parameters for a new experiment, copy the `default` experiment folder with its .json files and name it accordingly, always placing it inside the `training/experiments` (or `tracking/experiments`) folder. Below we give a brief description of the basic parameters:
* `model`: The embedding network to be used as the branch of the Siamese network. The models are defined in `models.py`; the available models are `BaselineEmbeddingNet`, `VGG11EmbeddingNet_5c` and `VGG16EmbeddingNet_8c`.
* `parameter_freeze`: A list of the layers of the embedding net that will be frozen (their parameters will not be changed). The numbering refers to the `nn.Sequential` module that defines the network in `models.py`. E.g. `[0, 1, 4, 5]` with `BaselineEmbeddingNet` freezes the first two convolutional layers (0 and 4) along with the first two BatchNorm layers (1 and 5).
* `batch_size`: The batch size in terms of reference/search region pairs.
* `num_epochs`: Total number of training epochs. One validation epoch is done before training starts and after each training epoch.
* `train_epoch_size`: The number of iterations in each training epoch.
* `eval_epoch_size`: The number of iterations in each validation epoch.
* `save_summary_steps`: The number of batches between metric evaluations.
* `optim`: The optimizer to be used during training. Options include `SGD`.
* `optim_kwargs`: The keyword arguments used to initialize the chosen optimizer. For `SGD`, for example, we could specify `{lr: 1e-3, momentum: 0.9}` for a learning rate of 0.001 and a momentum of 0.9.
* `max_frame_sep`: The maximum frame separation, i.e. the maximum distance (in frames) between the two frames of a training pair.
* `reference_sz`: The reference region size in pixels. Default is 127.
* `search_sz`: The search region size in pixels. Default is 255.
* `final_sz`: The final size after the pairs are passed through the model.
* `upscale`: A boolean indicating whether the network should have a bilinear upscale layer at its output.
* `pos_thr`: The positive threshold distance of the label, i.e. the distance from the center below which the label elements are considered positive.
* `neg_thr`: The negative threshold distance of the label, i.e. the distance from the center beyond which the label elements are considered negative.
* `context_margin`: The context margin for the reference region.

[Under Development]
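To make the list of training parameters above concrete, here is a hypothetical way to generate a parameters.json with these fields. Every value is an arbitrary placeholder, not a recommended setting, and `my_experiment` is a hypothetical folder; take the actual schema and defaults from the `default` experiment folder.

```python
import json

# Hypothetical example only -- every value below is a placeholder, not a recommendation.
params = {
    "model": "BaselineEmbeddingNet",
    "parameter_freeze": [],
    "batch_size": 8,
    "num_epochs": 10,
    "train_epoch_size": 1000,
    "eval_epoch_size": 200,
    "save_summary_steps": 50,
    "optim": "SGD",
    "optim_kwargs": {"lr": 1e-3, "momentum": 0.9},
    "max_frame_sep": 50,
    "reference_sz": 127,
    "search_sz": 255,
    "final_sz": 33,
    "upscale": False,
    "pos_thr": 25,
    "neg_thr": 50,
    "context_margin": 0.5,
}

# "my_experiment" stands for an experiment folder created as in the Training steps above.
with open("training/experiments/my_experiment/parameters.json", "w") as f:
    json.dump(params, f, indent=4)
```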
## Loss Definition
[Coming Soon]
## Accelerating Data Loading
[Coming Soon]
## Acknowledgements
This project was originally developed and open-sourced by Parrot Drones SAS.