Sunday, 13 January 2019

Train a detector based on yolo v3(by gluoncv) by custom data

    GluonCV come with lots of useful pretrained model for object detection, including ssd, yolo v3 and faster-rcnn. Their website come with an example to show you how to fine tune your own data set with ssd, but they do not show us how to do it with yolo v3. If you were like me, struggling in training your custom data with yolo v3, this post may ease your pain since I already modify the script to help you train your custom data.

1.Select a tool to draw your bounding box

 I use labelImg for my purpose, easy to install and use on windows and ubuntu.

  2.Convert  the xml files generated by labelImg to lst format

I write some small classes to perform the conversion task, you can find them on github, if you cannot compile it, please open an issue at github.You don't need opencv and mxnet for this task after all.

3.Convert lst file to rec format

 Follow the instructions at here, study how to use should be enough.

 4. Adjust the, make it able to read file with rec format

I do this part for you already, you can download the script( from github. Before you use that, you will need to

  1. Copy on github
  2. Change the file name to
  3. Move it to the folder of gluoncv.utils.metrics(mine is C:\my_folder\Anaconda3\Lib\site-packages\gluoncv\utils\metrics)
  4. Change the codes from from gluoncv.utils.metrics.voc_detection import VOC07MApMetric to from gluoncv.utils.metrics.voc_detection_2 import VOC07MApMetric 
    This is because the Anaconda on windows got bug, if your is fine, you can omit this step and change from gluoncv.utils.metrics.voc_detection_2 to import VOC07MApMetric gluoncv.utils.metrics.voc_detection import VOC07MApMetric

  You can prefer to install nightly release too if you want to save some troubles.

5.Enter command to train your data

I add a few command line options in this script, they are

--train_dataset : Location of the rec file for training
--validate_dataset: Location of the rec file validate
--pretrained: If you enter this, it will use the pretrained weights of coco dataset, else only use the pretrained weights of imageNet.
--classes_list : Location of the file with the name of classes. Every line present one class, each class should match with their own id. Example :


ID of face is 0 so it is put at line 0, ID of person is 1 so it is put at line 1

    Example : python --epochs 20 --lr 0.0001 --train_dataset face_person.rec --validate_dataset face_person.rec --classes_list face_person_list.txt --batch-size 3 --val-interval 5 --mixup


1. If you do not enter --no-random-shape, you better make your learning rate lower(ex : 0.0001 instead of 0.001), else it is very easy to explode(loss become nan).
2. Not every dataset works better with random-shape, run a few epoch(ex : 5) with smaller data(ex , 300~400 images) to find out which parameters works well.
3. Enable random-shape will eat much more ram, without it I can set my batch-size as 8, with it I could only set my batch-size as 3.

7. Measure performance

In order to measure the performance, we need a test set,unfortunately there do not exist a test set which designed for human and face detection, therefore I pick two data sets to measure the performance of the trained model. FDDB for face detection(the label of FDDB is more like head rather than face), Pascal Voc for human detection. You can find the at github.

    The model I used train with 40 epoch, the mAP of this on model on the training set is close to 0.9. Both of the experiments are based on IOU = 0.5.

7.1 Performance of face detection

    mAP close to 1.0 when IOU is 0.5, this looks too good for real, let us check the inference results by our eyes to find out what is happening. Following images are inference with input-shape as 320 with the model trained with 40 epochs.


    From these pictures(pic01~pic09), we know the model works quite well.

    Usually mAP of test would not higher than training set, either this test set is far too easy than the training set, or this detector is overfit to this test set. No matter what, this test set is not good enough to measure the performance of the face detector.

7.2 Performance of person detection

    Unlike face detector, mAP of person detector only got 0.583mAP on the images listed by person_val.txt(I only apply on the images contain person), there are still a big room to improve the accuracy.

     Adding more data may improve the performance since this test results tell us this model got high variance, in order to find out what kind of data we should add, one of the solution is study the mis-classify or the person cannot detected by eyes, then write down the reasons.

    After we gather the data, we can create a table, list out the name of the images and describe the errors(pic15, a small example).

    With the helps of error analysis like this, we could find out which part we should put more focus into, and what kind of data we should collect. From the experiments we can find out accuracy of yolo v3 is very high, although recall got lots of space to improve.

8. Model and data

    You can find the model and data at mega. I do not put the data with images but the annotations only, you need to download the images by yourself(since I worry this may have legal issues if I publish the data). Not only added the bounding boxes for person, I also adjust the bounding boxes of faces a lot, original bounding boxes provided by kaggle are more like designed for "head detector" rather than "face detector".

9. Conclusion

    This detector got a lot of space to improve, especially the mAP of human, but it will take a lot of times to do it so I decide to stop at here. My annotations of the humans, you can find a lot of them are overlap with other person a lot, bounding boxes without overlap with the person may help the models detect more persons.

    You can use the annotations and model as your free will, do  me a favor by reference this site if you do use them, thanks.

    The source codes can find at github.





Thursday, 4 October 2018

Person detection(Yolo v3) with the helps of mxnet, able to run on gpu/cpu

    In this post I will show you how to do object detection with the helps of the cpp-package of mxnet. Why do I introduce mxnet? Because following advantages make it a decent library for standalone project development

1. It is open source and royalty free
2. Got decent support for GPU/CPU
3. Scaling efficiently to multiple GPU and machines
4. Support cpp api, which means you do not need to ask the users to install python environment , shipped the source codes in order to run your apps
5. mxnet support many platforms, including windows, linux, mac, aws, android, ios
6. It got a lot of pre-trained models
7. MMDNN support mxnet, which mean we can convert the models trained by different libraries to mxnet(although not all of the models could be converted).

Step 1 : Download model and convert it to the format can load by cpp package

1. Install anaconda(the version come with python3)
2. Install mxnet from the terminal of anaconda
3. Install gluon-cv from the terminal of anaconda
4. Download model and convert it by following scripts

import gluoncv as gcv
from gluoncv.utils import export_block

net = gcv.model_zoo.get_model('yolo3_darknet53_coco', pretrained=True)
export_block('yolo3_darknet53_coco', net)

 Step 2 : Load the models after convert

void load_check_point(std::string const &model_params,
                      std::string const &model_symbol,
                      Symbol *symbol,
                      std::map<std::string, NDArray> *arg_params,
                      std::map<std::string, NDArray> *aux_params,
                      Context const &ctx)
    Symbol new_symbol = Symbol::Load(model_symbol);
    std::map<std::string, NDArray> params = NDArray::LoadToMap(model_params);
    std::map<std::string, NDArray> args;
    std::map<std::string, NDArray> auxs;
    for (auto iter : params) {
        std::string type = iter.first.substr(0, 4);
        std::string name = iter.first.substr(4);
        if (type == "arg:")
            args[name] = iter.second.Copy(ctx);
        else if (type == "aux:")
            auxs[name] = iter.second.Copy(ctx);

    *symbol = new_symbol;
    *arg_params = args;
    *aux_params = auxs;

    You could use the load_check_point function as following

    Symbol net;
    std::map<std::string, NDArray> args, auxs;
    load_check_point(model_params, model_symbols, &net, &args, &auxs, context);

    //The shape of the input data must be the same, if you need different size,
    //you could rebind the Executor or create a pool of Executor.
    //In order to create input layer of the Executor, I make a dummy NDArray.
    //The value of the "data" could be change later
    args["data"] = NDArray(Shape(1, static_cast<unsigned>(input_size.height),
                                 static_cast<unsigned>(input_size.width), 3), context);
    executor_.reset(net.SimpleBind(context, args, std::map<std::string, NDArray>(),
                                   std::map<std::string, OpReqType>(), auxs));

    model_params is the location of the weights(ex : yolo3_darknet53_coco.params), model_symbols(ex : yolo3_darknet53_coco.json) is the location of the symbols saved as json.

Step 3: Convert image format

    Before we feed the image into the executor of mxnet, we need to convert them.

NDArray cvmat_to_ndarray(cv::Mat const &bgr_image, Context const &ctx)
    cv::Mat rgb_image;
    cv::cvtColor(bgr_image, rgb_image, cv::COLOR_BGR2RGB); 
    rgb_image.convertTo(rgb_image, CV_32FC3);
    //This api copy the data of rgb_image into NDArray. As far as I know,
    //opencv guarantee continuous of cv::Mat unless it is sub matrix of cv::Mat
    return NDArray(rgb_image.ptr<float>(),
                   Shape(1, static_cast<unsigned>(rgb_image.rows), static_cast<unsigned>(rgb_image.cols), 3),

Step 4 : Perform object detection on video

void object_detector::forward(const cv::Mat &input)
    //By default, input_size_.height equal to 256 input_size_.width equal to 320.
    //Yolo v3 has a limitation, width and height of the image must be divided by 32.
    if(input.rows != input_size_.height || input.cols != input_size_.width){
        cv::resize(input, resize_img_, input_size_);
        resize_img_ = input;

    auto data = cvmat_to_ndarray(resize_img_, *context_);
    //Copy the data of the image to the "data"
    //Forward is an async api.

Step 5 : Draw bounding boxes on image

void plot_object_detector_bboxes::plot(cv::Mat &inout,
                                       std::vector<mxnet::cpp::NDArray> const &predict_results,
                                       bool normalize)
    using namespace mxnet::cpp;

    //1. predict_results get from the output of Executor(executor_->outputs)
    //2. Must Set Context as cpu because we need process data by cpu later
    auto labels = predict_results[0].Copy(Context(kCPU, 0));
    auto scores = predict_results[1].Copy(Context(kCPU, 0));
    auto bboxes = predict_results[2].Copy(Context(kCPU, 0));
    //1. Should call wait because Forward api of Executor is async
    //2. scores and labels could treat as one dimension array
    //3. BBoxes can treat as 2 dimensions array

    size_t const num = bboxes.GetShape()[1];
    for(size_t i = 0; i < num; ++i) {
        float const score = scores.At(0, 0, i);
        if (score < thresh_) break;

        size_t const cls_id = static_cast<size_t>(labels.At(0, 0, i));
        auto const color = colors_[cls_id];
        //pt1 : top left; pt2 : bottom right
        cv::Point pt1, pt2;
        //get_points perform normalization
        std::tie(pt1, pt2) = normalize_points(bboxes.At(0, i, 0), bboxes.At(0, i, 1),
                                              bboxes.At(0, i, 2), bboxes.At(0, i, 3),
                                              normalize, cv::Size(inout.cols, inout.rows));
        cv::rectangle(inout, pt1, pt2, color, 2);

        std::string txt;
        if (labels_.size() > cls_id) {
            txt += labels_[cls_id];
        std::stringstream ss;
        ss << std::fixed << std::setprecision(3) << score;
        txt += " " + ss.str();
        put_label(inout, txt, pt1, color);

     I only mentioned the key points in this post, if you want to study the details, please check it on github.

Saturday, 29 September 2018

Install cpp package of mxnet on windows 10, with cuda and opencv

    Compile and install cpp-package of mxnet on windows 10 is a little bit tricky when I writing this post.

     The install page of mxnet tell us almost everything we need to know, but there are something left behind haven't wrote into the pages yet, today I would like to write down the pitfalls I met and share with you how do I solved them.


1. Remember to download the mingw dll from the openBLAS download page, put those dll into some place could be found by the system, else you wouldn't be able to generate op.h for cpp-package.

2. Install Anaconda(recommended) or the python package on the mxnet install page on your , machines and register the path(the path with python.exe), else you wouldn't be able to generate op.h for cpp-package.

3. Compile the project without cpp-package first, else you may not able to generate op.h.

Cmake command for reference, change it to suit your own need

a : Run these command first

cmake -G "Visual Studio 14 2015 Win64" ^

cmake --build . --config Release

b : Run these command, with cpp package on

cmake -G "Visual Studio 14 2015 Win64" ^

cmake --build . --config Release --target INSTALL

4. After you compile and install the libs, you may find out you missed some headers in the install
path, I missed nnvm and mxnet-cpp. What I did is copy the folders to the install folder.

    Hope these could help someone who is pulling their head when compile cpp-package of mxnet on windows 10.

Saturday, 4 August 2018

Qt and computer vision 2 : Build a simple computer vision application with Qt5 and opencv3

    In this post, I will show you how to build a dead simple computer vision application with Qt Creator and opencv3 step by step.

Install opencv3.4.1(or newer version) on windows

0. Go to source forge, download prebuild binary of opencv3.4.2. or you could build it by yourself

1. Double click on the opencv-3.4.2-vc14_vc15.exe and extract it to your favorite folder(pic_00)


2. Open the folder you extract(assume you extract it to /your_path/opencv_3_4_2). You will see a folder call "opencv" .

3. Open your QtCreator you installed.

Create a new project by Qt Creator

4. Create a new project

5. You will see a lot of options, for simplicity, let us choose "Application->Non-Qt project->Plain c++ application". This tell the QtCreator, we want to create a c++ program without using any Qt components.


6. Enter the path of the folder and name of the project.

7. Click the Next button and use qmake as your build system by now(you can prefer cmake too, but I always prefer qmake when I am working with Qt).

8. You will see a page ask you to select your kits, kits is a tool QtCreator use to group different settings like device, compiler, Qt version etc.

9. Click on next, QtCreator may ask you want to add to version control or not, for simplicity, select None. Click on finish.

10. If you see a screen like this, that means you are success.


11. Write codes to read an image by opencv

#include <iostream>

#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>

//propose of namespace are
//1. Decrease the chance of name collison
//2. Help you organizes your codes into logical groups
//Without declaring using namespace std, everytime when you are using
//the classes, functions in the namespace, you have to call with the
//prefix "std::".
using namespace cv;
using namespace std;

 * main function is the global, designated start function of c++.
 * @param argc Number of the parameters of command line
 * @param argv Content of the parameters of command line.
 * @return any integer within the range of int, meaning of the return value is
 * defined by the users
int main(int argc, char *argv[])
    if(argc != 2){
        cout<<"Run this example by invoking it like this: "<<endl;
        cout<<"./step_02.exe lena.jpg"<<endl;
        return -1;

    //If you execute by Ctrl+R, argv[0] == "step_02.exe", argv[1] == lena.jpg

    //Open the image
    auto const img = imread(argv[1]);
        imshow("img", img); //Show the image on screen
        waitKey(); //Do not exist the program until users press a key
        cout<<"cannot open image:"<<argv[1]<<endl;

        return -1;

    return 0; //usually we return 0 if everything are normal

How to compile and link the opencv lib with the help of Qt Creator and qmake

  Before you can execute the app, you will need to compile and link to the libraries of opencv. Let me show you how to do it. If you missed steps a and b, you will see a lot of error messages like Pic07 or Pic09 show.

12. Tell the compiler, where are the header files, this could be done by adding following command in the

INCLUDEPATH += your_install_path_of_opencv/opencv/opencv_3_4_2/opencv/build/include

  The compiler will tell you it can't locate the header files if you do not add this line(see Pic07).


  If your INCLUDEPATH is correct, QtCreator should be able to find the headers and use the auto complete to help you type less words(Pic08).


13. Tell linker which libraries of the opencv it should link to by following command.

LIBS += your_install_path_of_opencv/opencv/opencv_3_4_2/opencv/build/x64/vc14/lib/opencv_world342.lib

Without this step, you will see the errors of "unresolved external symbols"(Pic08).

14. Change from debug to release.


  Click the icon surrounded by the red region and change it from debug to release. Why do we do that? Because

  • Release mode is much more faster than debug mode in many cases
  • The library we link to is build as release library, do not mixed debug and release libraries in your project unless you are asking for trouble
  I will introduce more details of compile, link, release, debug in the future, for now, just click Ctrl+B to compile and link the app.

Execute the app

  After we compile and link the app, we already have the exe in the folder(in the folder show at Pic11).


  We are almost done now, just few more steps the app could up and run.

13. Copy the dll opencv_world342.dll and opencv_ffmpeg342_64.dll(they place in /your_path/opencv/opencv_3_4_2/opencv/build/bin) into a new folder(we called it global_dll).

14. Add the path of this folder into system path. Without step 13 and 14, the exe wouldn't be able to find the dll when we execute the app, and you may see following error when you execute the app from command line(Pic12). I recommend you use the tool--Rapid environment editor(Pic13) to edit your path on windows.


15. Add command line argument in the QtCreator, without it, the app do not know where is the image when you click Ctrl+R to execute the program.

16. If success, you should see the app open an image specify from the command line arguments list(Pic15).


  These are easy but could be annoying at first. I hope this post could leverage your frustration. You can find the source codes located at github.

Sunday, 22 April 2018

Qt and computer vision 1 : Setup environment of Qt5 on windows step by step

    Long time haven't updated my blog, today rather than write a newer, advanced deep learning topics like "Modern way to estimate homography matrix(by lightweight cnn)" or "Let us create a semantic segmentation model by PyTorch", I prefer to start a series of topics for new comers who struggling to build a computer vision app by c++. I hope my posts could help more people find out  use c++ to develop application could as easy as another "much easier to use languages"(ex : python).

    Rather than introduce you most of the features of Qt and opencv like other books did, these topics will introduce the subsets of Qt which could help us develop decent computer vision application step by step.

c++ is as easy to use as python, really?

    Many programmers may found this nonsense, but my own experience told me it is not, because I never found languages like python, java or c# are "much easier to use compare with c++". What make our perspective become so different? I think the answers are

1. Know how to use c++ effectively.
2. There exist great libraries for the tasks(ex : Qt, boost, opencv, dlib, spdlog etc).

    As long as these two conditions are satisfy, I believe many programmers will have the same conclusion as mine. I will try my best to help you learn how to develop easy to maintain application by c++ in these series, show you how to solve those "small yet annoying issues" which may scare away many new comers.

Install visual c++ 2015

    Some of you may ask "why 2015 but not 2017"? Because when I am writing this post, cuda do not have decent support on visual c++ 2017 or mingw yet, cuda is very important to computer vision app, especially when deep learning take over many computer vision tasks today.

1. Go to this page, click on the download button of visual studio 2015.

2. Download visual studio 2015 community(you may need to open an account before you can enter this page)

3. Double click on the exe "en_visual_studio_community_2015_with_update_3_x86_x64_web_installer_xxxx" and wait a few minutes.

4. Install visual c++ tool box as following shown, make sure you select all of them.

    Install Qt5 on windows

1. Go to the download page of Qt
2. Select open source version of Qt

3. Click download button, wait until qt-unified-windows downloaded.

4. Double click on the installer, click next->skip->next

5. Select the path you want to install Qt

6. Select the version of Qt you want to install, every version of Qt(Qt5.x) have a lot of binary files to download, only select the one you need. We prefer to install Qt5.9.5 at here. Why Qt5.9.5? Because Qt5.9 is a long term support version of Qt, in theory long term support should be more stable.

7. Click next and install.

Test Qt5 installed or not

1. Open QtCreator and run an example. Go to your install path(ex: C:/Qt/3rdLibs/Qt), navigate to your_install_path/Tools/QtCreator/bin and double click on the qtcreator.exe.

2. Select Welcome->Example->Qt Quick Controls 2 - Gallery

3. Click on the example, it may pop out a message box to ask you some questions, you can click on yes or no.

4. Every example you open would pop out help page like this, keep it or not is your choices, sometimes they are helpful.

5. First, select the version of Qt you want to use(surrounded by red bounding box). Second, keep the shadow build option on(surrounded by green bounding box), why keep it on? Because shadow build could help you separate your source codes and build binary. Third select you want to build your binary as debug or release version(surrounded by blue bounding box). Usually we cannot mix debug/release libraries together, I will open another topic to discuss the benefits of debug/release, explain what is MT/MD, which one you should choose etc.

6. Click on the run button or Ctrl + R, then you should see the example running on your computer.



Tuesday, 10 October 2017

Deep learning 11-Modern way to estimate homography matrix(by light weight cnn)

  Today I want to introduce a modern way to estimate relative homography between a pair of images. It is a solution introduced by the paper titled Deep Image Homography Estimation.


Q : What is a homography matrix?

A : Homography matrix is a 3x3 transformation matrix that maps the points in one image to the corresponding points in another image.

Q : What are the use of homography matrix?

A : There are many applications depend on the homography matrix, a few of them are image stitching, camera calibration, augmented reality.

Q : How to calculate a homography matrix between two images?

A : Traditional solution are based on two steps, corner estimation and robust homography estimation. In corner detection step, You need at least  4 points correspondences between the two images, usually we would find out these points by matching features like AKAZE, SIFT, SURF.  Generally, the features found by those algorithms are over complete, we would prune out the outliers(ex : by RANSAC) after corner estimation. If you are interesting about the whole process describe by c++, take a look at this project.

Q : Traditional solution require heavy computation, do we have another way to obtain homography matrix between two images?

A : This is the question the paper want to answer, instead of design the features by hand, this paper design an algorithm to learn the homography between two images. The biggest selling point of the paper is they turn the homography estimation problem into a machine learning problem.


  This paper use VGG style CNN to measure the homography matrix between two images, they call it HomographyNet. This model is trained in an end to end fashion, quite simple and neat.

  HomographyNet come with two versions, classification and regression. Regression network produces eight real value numbers and use L2 loss as the final layer. Classification network use softmax as the final layer and quantize every real values into 21bins. First version has better accuracy, while average accuracy of second version is much worse than first version, it can produce confidences. 


4-Point Homography Parameterization

  Instead of using a 3x3 homography matrix as the label(ground truth), this paper use 4-point parameterization as label.

Q : What is 4-point parameterization?

A : 4-point parameterization store the different of 4 corresponding points between two images, Fig03 and Fig04 explain it well.


Q : Why do they use 4-point parameterization but not 3x3 matrix?

A : Because the 3x3 homography is very difficult to train, the problem is the 3x3 matrix mixing 
rotation and translation together, the paper explain why. 

The submatrix [H11, H12; H21, H22] represents the rotational terms in the homography., while the vector [H13, H23] is the translational offset. Balancing the rotational and translational terms as part of an optimization problem is difficult.


  Data Generation

Q : Training deep convolution neural networks from scratch requires a large amount of data, where could we obtain the data?

A : The paper invent a smart solution to generate nearly unlimited number of labeled training examples. Fig05 summarize the whole process



  Results of my implementation is outperform the paper, average loss of mine is 2.58, while the paper is 9.2. Largest loss of my model is 19.53. Performance of my model are better than the paper more than 3.5 times(9.2/2.58 = 3.57).  What makes the performance improve so much?A few of reasons I could think of are

1. I change the network architectures from vgg like to squeezeNet1.1 like.
2. I do not apply any data augmentation, maybe blurring or occlusion cause the model harder to train.
3. The paper use data augmentation to generate 500000 data for training, but I use 500032 images from imagenet as my training set. I guess this potentially increase variety of the data, the end result is network become easier to train and more robust(but they may not work well for blur or occlusion).

  Following are some of the results, the region estimated by the model(red rectangle) is very close to the real regions(blue rectangle).


Final thoughts

  The results looks great, but this paper do not answer two important questions.

1. The paper only test on synthesis images, do they work on real world images?
2. How should I use the trained model to predict a homography matrix?

  I would like to know the answer, if anyone find out, please leave me a message.

Codes and model

  As usual, I place my codes at github, model at mega.

  If you liked this article, please help others find it by clicking the little g+ icon below. Thanks a lot!