Sunday 13 January 2019

Deep learning 12: Train a detector based on yolo v3 (by gluoncv) with custom data

GluonCV comes with lots of useful pretrained models for object detection, including ssd, yolo v3 and faster-rcnn. Their website provides an example showing how to fine tune ssd on your own dataset, but it does not show how to do it with yolo v3. If you, like me, have been struggling to train yolo v3 on custom data, this post may ease your pain, since I have already modified the script to help you train on your own data.

1. Select a tool to draw your bounding boxes

I use labelImg for this purpose; it is easy to install and use on windows and ubuntu.

2. Convert the xml files generated by labelImg to lst format

I wrote some small classes to perform the conversion task; you can find them on github. If you cannot compile them, please open an issue on github. You do not need opencv or mxnet for this task.
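If you prefer a pure Python stand-in for the converter above, the conversion can be sketched with nothing but the standard library. This assumes the detection .lst layout used by GluonCV's record-file tutorial (index, a 4/5 header, image width and height, then class id plus normalized corner coordinates per object, and finally the image path); the function name here is mine, not from the author's code:

```python
import xml.etree.ElementTree as ET

def xml_to_lst_line(xml_path, idx, class_to_id):
    """Turn one labelImg (Pascal VOC style) xml file into one detection
    .lst line: idx, header (A=4, B=5, width, height), then per object
    class id and xmin/ymin/xmax/ymax normalized to [0, 1], then the path."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find('size/width').text)
    h = float(root.find('size/height').text)
    img_path = root.find('filename').text
    fields = [str(idx), '4', '5', str(int(w)), str(int(h))]
    for obj in root.iter('object'):
        cid = class_to_id[obj.find('name').text]
        bb = obj.find('bndbox')
        xmin = float(bb.find('xmin').text) / w
        ymin = float(bb.find('ymin').text) / h
        xmax = float(bb.find('xmax').text) / w
        ymax = float(bb.find('ymax').text) / h
        fields += ['%.4f' % v for v in (float(cid), xmin, ymin, xmax, ymax)]
    fields.append(img_path)
    return '\t'.join(fields)
```

One line is written per image; the ids in class_to_id must match the classes_list file described later in this post.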

3. Convert the lst file to rec format

Follow the instructions here; learning how to use the tool should be enough.
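As a concrete sketch (the file and folder names below are placeholders, not from this post), the conversion is done with MXNet's im2rec.py tool:

```shell
# Assuming im2rec.py has been downloaded from the MXNet repo (tools/im2rec.py),
# face_person.lst was produced in the previous step, and the images it
# references live under ./images:
python im2rec.py face_person.lst ./images --pack-label
# --pack-label is required for detection labels with variable length;
# the command produces face_person.rec and face_person.idx.
```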

4. Adjust the script so it can read files in rec format

I have done this part for you already; you can download the script from github. Before you use it, you will need to

  1. Copy voc_detection.py from github
  2. Change the file name to voc_detection_2.py
  3. Move it into the folder of gluoncv.utils.metrics (mine is C:\my_folder\Anaconda3\Lib\site-packages\gluoncv\utils\metrics)
  4. Change the import from "from gluoncv.utils.metrics.voc_detection import VOC07MApMetric" to "from gluoncv.utils.metrics.voc_detection_2 import VOC07MApMetric"
    This is a workaround for a bug in Anaconda on windows; if yours is fine, you can omit these steps and keep the original import, "from gluoncv.utils.metrics.voc_detection import VOC07MApMetric".

You may prefer to install the nightly release instead, if you want to save some trouble.

5. Enter the command to train on your data

I added a few command line options to this script; they are:

--train_dataset : location of the rec file for training
--validate_dataset : location of the rec file for validation
--pretrained : if you pass this option, the script will use weights pretrained on the coco dataset; otherwise it only uses weights pretrained on imageNet.
--classes_list : location of the file with the names of the classes. Every line contains one class, and each class must sit on the line matching its own id. Example:


The ID of face is 0, so it is put on line 0; the ID of person is 1, so it is put on line 1.
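Concretely, for the face/person example above, the classes_list file contains just the two names, and loading it is trivial (load_classes is a hypothetical helper, not part of the training script):

```python
# The classes_list file for this post would contain exactly:
#
#   face      <- line 0, so id 0
#   person    <- line 1, so id 1
#
# The 0-based line number is the class id, so the order of lines matters.
def load_classes(path):
    """Read one class name per line; the line index is the class id."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```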

    Example : python --epochs 20 --lr 0.0001 --train_dataset face_person.rec --validate_dataset face_person.rec --classes_list face_person_list.txt --batch-size 3 --val-interval 5 --mixup


1. If you do not pass --no-random-shape, you had better lower your learning rate (ex: 0.0001 instead of 0.001), otherwise the loss can easily explode (become nan).
2. Not every dataset works better with random shapes; run a few epochs (ex: 5) on a smaller dataset (ex: 300~400 images) to find out which parameters work well.
3. Enabling random shapes eats much more ram: without it I can set my batch size to 8, with it I can only set my batch size to 3.

7. Measure performance

In order to measure the performance, we need a test set. Unfortunately, no test set designed for both human and face detection exists, so I picked two datasets to measure the performance of the trained model: FDDB for face detection (the labels of FDDB are more like heads than faces) and Pascal Voc for human detection. You can find the scripts at github.

The model I use was trained for 40 epochs; the mAP of this model on the training set is close to 0.9. Both of the experiments are based on IOU = 0.5.
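For reference, the IOU threshold above means a detection only counts as correct when its box overlaps a ground-truth box by at least 50% intersection over union. A minimal sketch of that computation (boxes as (xmin, ymin, xmax, ymax) tuples; the function name is mine, not from the evaluation scripts):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes,
    each given as (xmin, ymin, xmax, ymax)."""
    # width and height of the intersection rectangle (0 if disjoint)
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```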

7.1 Performance of face detection

The mAP is close to 1.0 when IOU is 0.5. This looks too good to be real, so let us check the inference results with our own eyes to find out what is happening. The following images were inferred with input shape 320, by the model trained for 40 epochs.


From these pictures (pic01~pic09), we can see that the model works quite well.

Usually the mAP on a test set is not higher than on the training set, so either this test set is far easier than the training set, or the detector is overfitting to it. Either way, this test set is not good enough to measure the performance of the face detector.

7.2 Performance of person detection

Unlike the face detector, the person detector only reaches 0.583 mAP on the images listed in person_val.txt (I only evaluate on the images that contain a person), so there is still big room to improve the accuracy.

Adding more data may improve the performance, since these test results tell us the model suffers from high variance. To find out what kind of data we should add, one solution is to inspect, by eye, the misclassified persons or the persons that could not be detected, then write down the reasons.

After we gather the data, we can create a table that lists the names of the images and describes the errors (pic15 shows a small example).

With the help of error analysis like this, we can find out which parts we should focus on and what kind of data we should collect. From the experiments we can see that the accuracy of yolo v3 is very high, although its recall still has lots of room to improve.

8. Model and data

You can find the model and data at mega. I do not publish the images, only the annotations; you need to download the images by yourself (since I worry publishing the data may cause legal issues). Besides adding the bounding boxes for persons, I also adjusted the bounding boxes of the faces a lot; the original bounding boxes provided by kaggle look more like they were designed for a "head detector" than a "face detector".

9. Conclusion

This detector still has a lot of room to improve, especially the mAP for humans, but that would take a lot of time, so I decided to stop here. Many of my annotations of humans overlap heavily with other persons; bounding boxes with less overlap may help the model detect more persons.

You are free to use the annotations and model as you wish; please do me a favor and reference this site if you do use them, thanks.

The source code can be found at github.