Wednesday 27 March 2019

Asynchronous computer vision algorithm

    In the last post I introduced how to create an asynchronous class that captures frames with cv::VideoCapture. Today I will show you how to create an asynchronous algorithm that can be called many times without spawning a new thread each time.

  Main flow of async_to_gray_algo 

    async_to_gray_algo is a small class that converts an image from BGR to grayscale in another thread. If you have used a thread pool library before, it uses similar logic under the hood, only with a more generic and flexible api.

async_to_gray_algo::async_to_gray_algo(cv::Mat &result, std::mutex &result_mutex) :
    result_(result)
{
    auto func = [&]()
    {
        //1. In order to reuse the thread, we need to keep it alive,
        //that is why we put the work in an infinite for loop
        for(;;){
            unique_lock<mutex> lock(mutex_);
            //2. waiting on a condition_variable is more efficient than sleep(x milliseconds)
            wait_.wait(lock, [&]() //wait_ holds the lock again once the condition is satisfied
            {
                return stop_ || !input_.empty();
            });
            //3. stop the thread in destructor
            if(stop_){
                return;
            }

            //4. convert and write the results into result_
            //we need result_mutex to synchronize the result_, else it may incur
            //race condition in the main thread.
            {
                lock_guard<mutex> glock(result_mutex);
                cv::cvtColor(input_, result_, COLOR_BGR2GRAY);
            }
            //5: clear the input_, else the condition stays satisfied and the
            //same image is converted again after the next (spurious) wake up
            input_.release();
        }
    };
    thread_ = std::thread(func);
}

    After the thread is initialized, all we need to do is call the process api whenever we need to convert an image from BGR to grayscale.

void async_to_gray_algo::process(Mat input)
{
    {
        lock_guard<mutex> lock(mutex_);
        input_ = input;
    }
    //the waiting thread acquires the mutex again after it receives the notification
    wait_.notify_one();
}

     If we do not need this class anymore, we can and should stop the thread in the destructor. Following RAII whenever you can is a best practice that keeps your code clean, robust and much easier to maintain (let the machine do the bookkeeping for humans).

        lock_guard<mutex> lock(mutex_);
        stop_ = true;

What is spurious wake up?

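    It means a thread waiting on a condition_variable may wake up even though no notification (notify_one or notify_all) happened. This is one of the reasons why we should never wait without a condition (the other reason is the lost wake up, where the notification is sent before the thread starts to wait).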
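    To make the flow easier to try without opencv, here is a minimal sketch of the same idea using only the standard library. It is not the class from the post: async_sum_algo, wait_result and the "sum a vector of ints" task are hypothetical names I made up for illustration, standing in for the BGR to gray conversion.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for async_to_gray_algo: it sums a vector of ints
// instead of converting a cv::Mat, but reuses one thread in the same way.
class async_sum_algo
{
public:
    async_sum_algo()
    {
        thread_ = std::thread([this]{ run(); });
    }

    ~async_sum_algo()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        wait_.notify_all();
        thread_.join();
    }

    // Hand new input to the worker; no new thread is spawned.
    void process(std::vector<int> input)
    {
        assert(!input.empty()); // an empty input would never wake the worker
        {
            std::lock_guard<std::mutex> lock(mutex_);
            input_ = std::move(input);
            has_result_ = false;
        }
        wait_.notify_all();
    }

    // Block until the worker has finished the most recent input.
    int wait_result()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        wait_.wait(lock, [this]{ return has_result_; });
        return result_;
    }

private:
    void run()
    {
        for(;;){
            std::unique_lock<std::mutex> lock(mutex_);
            // The predicate protects us from spurious and lost wake ups.
            wait_.wait(lock, [this]{ return stop_ || !input_.empty(); });
            if(stop_){
                return;
            }
            int sum = 0;
            for(int v : input_){
                sum += v;
            }
            result_ = sum;
            has_result_ = true;
            input_.clear(); // otherwise the predicate stays true
            lock.unlock();
            wait_.notify_all(); // wake up wait_result()
        }
    }

    std::mutex mutex_;
    std::condition_variable wait_;
    std::vector<int> input_;
    int result_ = 0;
    bool has_result_ = false;
    bool stop_ = false;
    std::thread thread_;
};
```

    Calling process several times on the same object reuses the same thread; the destructor sets stop_, notifies the worker and joins it.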

Do we have a better way to reuse the thread?

    Yes, we do. The easiest solution is to create a generic thread pool; you can check the code of a simple thread pool here. I will show you how to use it in the future.

Better way to pass the variable between different thread?

    As you can see, the way I communicate between the main thread and the worker thread is awkward, and source code like this becomes hell to maintain as the program grows. Fortunately, Qt5 offers a better way to pass variables between threads: the signal and slot mechanism. Not to mention, Qt5 can make the code much easier to maintain.


      The source code of async_opencv_video_capture can be found on github.

Saturday 23 March 2019

Asynchronous videoCapture of opencv

    Today I would like to introduce how to create an asynchronous videoCapture with opencv and the c++ standard library. Capturing frames from an HD video, especially an HD video streamed from the internet, can be time consuming. It is not a good idea to waste cpu cycles waiting for the next frame to arrive; to speed up our app, or to keep the gui alive, we had better put the video capture part into another thread.

    With the help of the thread facilities added in c++11, making the videoCapture of opencv support cross platform asynchronous reads becomes a simple task. Let us start with a simple example.

#include <ocv_libs/camera/async_opencv_video_capture.hpp>
#include <opencv2/core.hpp>
#include <opencv2/highgui.hpp>

#include <cctype>
#include <iostream>
#include <mutex>

int main(int argc, char *argv[])
{
    if(argc != 2){
        std::cerr<<"must enter url of media\n";
        return -1;
    }

    std::mutex emutex;
    //create the functor to handle the exception when cv::VideoCapture fail
    //to capture the frame and wait 30 msec between each frame
    long long constexpr wait_msec = 30;
    ocv::camera::async_opencv_video_capture<> cl([&](std::exception const &ex)
    {
        //cerr of c++ is not a thread safe class, so we need to lock the mutex
        std::lock_guard<std::mutex> lock(emutex);
        std::cerr<<"camera exception:"<<ex.what()<<std::endl;

        return true;
    }, wait_msec);

    //add listener to process captured frame
    //the listener could process the task in another thread too,
    //to make things easier to explain, I prefer to process it in
    //the same thread of videoCapture
    cv::Mat img;
    cl.add_listener([&](cv::Mat input)
    {
        std::lock_guard<std::mutex> lock(emutex);
        img = input;
    }, &emutex);

    //execute the task(s);
    cl.run();

    //We must display the captured image in the main thread and not
    //in the listener, because every manipulation related to the gui
    //must be performed in the main thread (also called the gui thread)
    for(int finished = 0; finished != 'q';){
        finished = std::tolower(cv::waitKey(30));
        std::lock_guard<std::mutex> lock(emutex);
        if(!img.empty()){
            cv::imshow("frame", img);
        }
    }
}

Important details of async_opencv_video_capture

1. Create an infinite for loop to read the frame in another thread

void run()
        //before we start the thread,
        //we need to stop it
        //call join before task(s)
        //of the thread done

    //create a new thread

    void create_thread()
        thread_ = std::make_unique<std::thread>([this]()
            //read the frames in infinite for loop
            for(cv::Mat frame;;){
                std::lock_guard<Mutex> lock(mutex_);
                if(!stop_ && !listeners_.empty()){
                    }catch(std::exception const &ex){
                        //reopen the camera if exception thrown ,this may happen frequently when you
                        //receive frames from network

                        for(auto &val : listeners_){

    listeners_ is a vector that stores the std::function<void(cv::Mat)> objects called in the infinite loop whenever the frame read by the videoCapture is not empty. Users must handle the exceptions thrown by those functors themselves, else the app will crash.
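    The dispatch step and the "handle your own exceptions" rule can be sketched without opencv. Everything below is hypothetical and single threaded: frame_dispatcher and make_safe_listener are names I invented for this example, a frame is just an int, and frame 0 plays the role of an empty cv::Mat.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical miniature of the listener dispatch, not the real class.
class frame_dispatcher
{
public:
    using listener = std::function<void(int)>;

    void add_listener(listener func)
    {
        assert(static_cast<bool>(func)); // an empty std::function would throw when called
        listeners_.emplace_back(std::move(func));
    }

    // Calls every listener with the frame, returns how many were called.
    std::size_t dispatch(int frame)
    {
        if(frame == 0){ // skip "empty" frames
            return 0;
        }
        std::size_t called = 0;
        for(auto &val : listeners_){
            val(frame);
            ++called;
        }
        return called;
    }

private:
    std::vector<listener> listeners_;
};

// Wrap a listener so that an exception never escapes into the dispatch loop.
frame_dispatcher::listener
make_safe_listener(frame_dispatcher::listener func,
                   std::function<void(std::exception const&)> on_error)
{
    return [func, on_error](int frame)
    {
        try{
            func(frame);
        }catch(std::exception const &ex){
            on_error(ex);
        }
    };
}
```

    A listener that may throw is registered through make_safe_listener; a listener registered directly would let the exception propagate out of dispatch.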

2. Stop the thread in the destructor

void set_stop(bool val)
    std::lock_guard<Mutex> lock(mutex_);
    stop_ = val;

void stop()

template<typename Mutex>

    We must stop and join the thread in the destructor, else the thread may never end and cause the app to freeze.

3. Select mutex type by template

    By default, async_opencv_video_capture uses std::mutex. It is more efficient, but may cause a deadlock if you call the api of async_opencv_video_capture inside the listeners. If you want to avoid this deadlock, use std::recursive_mutex instead of std::mutex.
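    Here is a small sketch of why the mutex type matters. listener_host is a hypothetical class written for this example only: notify() calls the listeners while it holds the mutex, so a listener that calls back into the same object locks the mutex a second time on the same thread. That is fine with std::recursive_mutex, while with std::mutex it is undefined behavior that usually shows up as a deadlock.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical miniature of a class whose mutex type is a template parameter.
template<typename Mutex = std::mutex>
class listener_host
{
public:
    void add_listener(std::function<void()> func)
    {
        assert(static_cast<bool>(func));
        std::lock_guard<Mutex> lock(mutex_);
        listeners_.emplace_back(std::move(func));
    }

    std::size_t listener_count() const
    {
        std::lock_guard<Mutex> lock(mutex_);
        return listeners_.size();
    }

    // Calls every listener while holding the mutex.
    void notify()
    {
        std::lock_guard<Mutex> lock(mutex_);
        for(auto &val : listeners_){
            val();
        }
    }

private:
    mutable Mutex mutex_;
    std::vector<std::function<void()>> listeners_;
};
```

    With listener_host<std::recursive_mutex>, a listener may safely call listener_count(); with the default listener_host<>, the same listener must not touch the host at all.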


      The source code of async_opencv_video_capture can be found on github.

Saturday 9 March 2019

Build mxnet 1.3.1 on windows

    If you have tried to build mxnet 1.3.1 on windows like me, you have probably suffered a lot, since mxnet does not have decent support for windows. Apparently the developers of mxnet do not run enough tests (maybe none) on windows before they release a stable version. Despite all the trouble mxnet brings, it is still a nice deep learning tool, which is why I still prefer to work with it.

    I believe one of the best ways to make an open source project better is to contribute something back to it, which is why I would like to write down how to build mxnet 1.3.1 on windows step by step.

1. Do not build mxnet on windows with intel mkl

    Do not do this unless you are asking for trouble; please check the details on stackoverflow and issue 14343.

2. Build openBLAS with native msvc ABI

    The prebuilt openBLAS mentioned here does not work with vc2015 anymore (if you have updated your vc2015), because its abi is not compatible with msvc. The easiest way to solve this issue is to build openBLAS yourself. The steps are

a. Clone openBLAS of xianyi from github
b. Compile openBLAS following the instructions shown here. Do not install Anaconda and Miniconda together, just pick one of them. If you do not know where vcvars64.bat is on your pc, I suggest you use Everything to find the path.
c. Copy the files (cblas.h, f77blas.h) from the generated folder into the build folder.

3. Clone mxnet fork by me

git clone --recursive
cd mxnet 
git checkout 1.3.1_win_compile_fix
    This branch fixes some type mismatch errors.

4. Comment out codes in

     This file is under the folder "mxnet\src\operator\random". It contains a function ShuffleForwardGPU; 
comment out its implementation, else there will be a lot of compile time errors (no suitable 
user-defined conversion from "mshadow::Tensor<mxnet::gpu, 1, mxnet::index_t>" to 
"const mshadow::Tensor<mshadow::gpu, 1, unsigned int>" exists).
     I guess this function is not called when doing inference; after all, who would like to 
make their inference results unpredictable? If, like me, you only want to use 
cpp_package to do inference, it should be safe to comment out the code. 

5. Open cmake

  Open cmake and select the 64 bits msvc generator.

6. Configuration

   The most important notes are

1. Do not use anything related to intel MKL.

2. Do not build cpp_package at the first time

    Without mkl, mxnet cannot exploit the full power of the cpu, but with it your app cannot run at all. 
Depending on how you build it, your app may throw the error 
"Intel MKL FATAL ERROR: Cannot load mkl_intel_thread.dll." or 
"Check failed: MXNDArrayWaitToRead(blob_ptr_->handle_) == 0 (-1 vs. 0)"

    If you do not need cuda, uncheck the options USE_CUDA and USE_CUDNN. After that, click Configure->uncheck BUILD_TESTING->click Configure->click Generate.

7. Build mxnet without cpp_package

    a. Open your ALL_BUILD.vcxproj
    b. Navigate to the project "mxnet"
    c. Right click your mouse, select "Properties"
    d. Select Linker->Input
    e. Link to flangmain.lib, flangrti.lib, flang.lib, ompstub.lib. For example, my paths are


    If you do not know where they are, use Everything to find the path.

    f. Navigate to the project "ALL_BUILD"
    g. Right click your mouse, click build.

8. Configure cmake to build with cpp_package

    Now we can build mxnet with cpp_package, let us go to cmake again and change some settings.

a. (Optional) Change your install path, else you may not be able to install (ex: change to C:/Users/yyyy/programs/Qt/3rdLibs/mxnet/build_gpu_1_3_1_temp/install).

b. Make sure you have set the PATH of python. If you are building the 32/64 bits version of mxnet, you need the 32/64 bits version of python, else you will not be able to generate op.h. I suggest you use Rapid Environment Editor to manage your path on windows.

If your vc complains that it cannot find the python exe, reopen your vc.

c. Check USE_CPP_PACKAGE->uncheck BUILD_TESTING->configure->generate

d. Remove the example projects since they hinder the build process. Those projects are alexnet, charRNN, googleNet, inception_bn, lenet, lenet_with_mxdataiter, mlp, mlp_cpu, mlp_gpu, resnet.

e. Go to your build/Release folder and copy libmxnet.dll into any folder that windows can find
(a folder listed in the Path). Let us call that folder global_path.

f. Open your Developer Command Prompt (mine is Developer Command Prompt for VS2015); let us call it DCP.

g. Navigate your DCP to global_path

h. Enter "dumpbin /dependents libmxnet.dll"; this command shows you the dependencies of this
dll. In my case, it shows


We only need to copy flangrti.dll, flang.dll, ompstub.dll into global_path in order to generate op.h, because the other dlls already exist in the PATH. Again, please use Everything to find the path.

i. Your mxnet project needs to link to flangmain.lib, flangrti.lib, flang.lib, ompstub.lib again, since Generate clears them.

j. Navigate to the project "ALL_BUILD"

k. Right click your mouse, click build.

l. Navigate to the project "INSTALL"

m. Right click your mouse, click build.

n. Copy cpp-package\include\mxnet-cpp into build/install/include

o. Copy mxnet\3rdparty\tvm\nnvm\include\nnvm into build/install/include

9. Add mx_float for scale, int for num_filter

    In the op.h generated by this solution, two parameters lack a type declaration; you need to add the types yourself.


    Congratulations, you have now built mxnet successfully. To tell you the truth, this was not a pleasant journey: I hit too many bugs/issues when building mxnet 1.3.1 on windows (1.4.0 has even more bugs when you try to build it on windows), and many of them would have been found before the release if the developers had tried to build mxnet on windows. I believe windows and cpp_package are not their main concern yet; let us hope that 

a. Someday they put more love into windows and cpp_package. Windows still dominates the desktop/laptop market, and cpp_package is a much better choice than python if you want to do edge deployment.

b. They adopt a commit system like the one of opencv (whenever you commit your code, opencv builds it on every platform it supports). This could prevent a lot of bugs; the later you adopt it, the more you pay for cross-platform support.

c. They fix all of these bugs before the next release. Let us cross our fingers.

Sunday 24 February 2019

Age and gender classification by opencv, dlib and mxnet

    In this post, I will show you how to build an age and gender classification application on top of the infrastructure I created in the last post. Almost everything is the same as before, except the part that parses the NDArray in the forward function.

    Before we dive into the source code, let us look at some examples. Each image is predicted by two networks and the results are concatenated: the left side is predicted by the light model, the right side by the heavy model based on resnet50.


    The results of both models do not look bad if we don't know the real ages :). Let us use the model to predict the age of a famous person, like Trump (with resnet50).


    Unfortunately, the results of age classification are not that good under different angles and expressions. This is because age classification from an image is very difficult; even a human cannot accurately predict the age of a person by looking at a single image.

1. Difference of face recognition and age gender classification

    The code that parses the face recognition features is

std::vector<insight_face_key> result;
size_t constexpr feature_size = 512;
Shape const shape(1, feature_size);
for(size_t i = 0; i != batch_size; ++i){
    NDArray feature(features.GetData() + i * feature_size, shape, Context(kCPU, 0));

    The code that parses the age and gender classification is

std::vector<insight_age_gender_info> result;
int constexpr features_size = 202;
for(size_t i = 0; i != batch_size; ++i){
    auto const *ptr = features.GetData() + i * features_size;
    insight_age_gender_info info;
    info.gender_ = ptr[0] > ptr[1] ? gender_info::female_ : gender_info::male_;
    //j instead of i, so the inner loop does not shadow the batch index
    for(int j = 2; j < features_size; j += 2){
        if(ptr[j + 1] > ptr[j]){
            info.age_ += 1;
        }
    }
    result.emplace_back(info);
}

    Except for these parts, everything is the same as before.
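    To make the layout of the 202 values concrete, here is a self-contained sketch of the same parsing on a plain std::vector<float>. The names gender, age_gender and decode_age_gender are hypothetical; the only things taken from the code above are the layout (2 gender scores followed by 100 pairs) and the rule that every pair whose second score beats the first adds one year.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class gender { female, male };

struct age_gender
{
    gender gender_ = gender::male;
    int age_ = 0;
};

// features holds batch_size * 202 floats: per face, 2 gender scores
// followed by 100 pairs of scores.
std::vector<age_gender> decode_age_gender(std::vector<float> const &features,
                                          std::size_t batch_size)
{
    constexpr std::size_t features_size = 202;
    assert(features.size() >= batch_size * features_size);

    std::vector<age_gender> result;
    result.reserve(batch_size);
    for(std::size_t i = 0; i != batch_size; ++i){
        float const *ptr = features.data() + i * features_size;
        age_gender info;
        info.gender_ = ptr[0] > ptr[1] ? gender::female : gender::male;
        // every pair whose second score is larger adds one year
        for(std::size_t j = 2; j < features_size; j += 2){
            if(ptr[j + 1] > ptr[j]){
                info.age_ += 1;
            }
        }
        result.push_back(info);
    }
    return result;
}
```

    With this layout the predicted age is always between 0 and 100, because there are exactly 100 pairs to count.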

2. Make codes easier to reuse

    It is a pain to maintain similar code with minor differences. To reduce the cost of maintenance, I created a generic predictor as a template class with three policies, and implemented face recognition and age/gender classification with this generic predictor.

template<typename Return, typename ProcessFeature, typename ImageConvert = dlib_mat_to_separate_rgb>
class generic_predictor
{
/*please check the details on github*/
};

    We can use it to create an age gender predictor as follows

struct predict_age_gender_functor
{
    std::vector<insight_age_gender_info>
    operator()(const mxnet::cpp::NDArray &features, size_t batch_size) const
    {
        std::vector<insight_age_gender_info> result;
        int constexpr features_size = 202;
        for(size_t i = 0; i != batch_size; ++i){
            auto const *ptr = features.GetData() + i * features_size;
            insight_age_gender_info info;
            info.gender_ = ptr[0] > ptr[1] ? gender_info::female_ : gender_info::male_;
            for(int j = 2; j < features_size; j += 2){
                if(ptr[j + 1] > ptr[j]){
                    info.age_ += 1;
                }
            }
            result.emplace_back(info);
        }
        return result;
    }
};

using insight_age_gender_predict = mxnet_aux::generic_predictor<insight_age_gender_info, predict_age_gender_functor>;

    Please check github if you want to know the implementation details.
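    If you just want to see the shape of the policy based design without reading the whole repository, here is a toy version. It is only a sketch under my own assumptions: toy_predictor, identity_convert and sum_features are hypothetical names, images are plain float vectors of equal size, and the "network" simply passes the converted pixels through.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical image conversion policy: returns the image unchanged.
struct identity_convert
{
    std::vector<float> operator()(std::vector<float> const &image) const
    {
        return image;
    }
};

// Hypothetical feature policy: one result per image, the sum of its values.
struct sum_features
{
    std::vector<float> operator()(std::vector<float> const &features,
                                  std::size_t batch_size) const
    {
        assert(batch_size == 0 || features.size() % batch_size == 0);
        std::size_t const step = batch_size == 0 ? 0 : features.size() / batch_size;
        std::vector<float> result;
        for(std::size_t i = 0; i != batch_size; ++i){
            float sum = 0;
            for(std::size_t j = 0; j != step; ++j){
                sum += features[i * step + j];
            }
            result.push_back(sum);
        }
        return result;
    }
};

template<typename Return, typename ProcessFeature, typename ImageConvert = identity_convert>
class toy_predictor
{
public:
    std::vector<Return> forward(std::vector<std::vector<float>> const &images) const
    {
        std::vector<float> batch; // stands in for the output of the network
        for(auto const &img : images){
            auto const converted = convert_(img);
            batch.insert(batch.end(), converted.begin(), converted.end());
        }
        return process_(batch, images.size());
    }

private:
    ImageConvert convert_;
    ProcessFeature process_;
};

using toy_sum_predictor = toy_predictor<float, sum_features>;
```

    Swapping sum_features for another functor changes how the output is parsed without touching the rest of the class, which is the point of the design.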

3. Summary

    Gender prediction works very well; unfortunately, age prediction is far from ideal. If we could obtain a huge data set that contains faces of the same person at different ages, with different expressions and angles, and in which the races are not badly imbalanced, the accuracy of age prediction might improve a lot, but a huge data set like this is very hard to collect. 

    The source code can be found on github.

Monday 18 February 2019

Face recognition with mxnet, dlib and opencv

   In this post I will show you how to implement an industrial level, portable face recognition application with a small, reusable example, without relying on any commercial library (except for Qt5, unless the modules I use in this example support the LGPL license).

    Before deep learning became a mainstream technology in the computer vision field, 2D face recognition only worked well under strict conditions, which made it an impractical technology.

    Thanks to the contributions of open source communities like dlib, opencv and mxnet, high accuracy 2D face recognition is not a difficult problem anymore.

    Before we start, let us see an interesting example(video_00).


     Although different angles and expressions affect the confidence value a lot, most of the time the algorithm is still able to find the most similar face among 25 faces.

     The flow of the face recognition on github is composed of 4 critical steps.


Detect face by dlib   

std::vector<mmod_rect> face_detector::forward_lazy(const cv::Mat &input)
{
    //make sure input image got 3 channels
    CV_Assert(input.channels() == 3);

    //Resize the input image to certain width.
    //The bigger the face_detect_width_, the more
    //faces could be detected, but it will consume
    //more memory and run slower
    if(input.cols != face_detect_width_){
        //resize_cache_ is a simple trick to reduce the
        //number of memory allocation
        double const ratio = face_detect_width_ / 
                             static_cast<double>(input.cols);
        cv::resize(input, resize_cache_, {}, ratio, ratio);
    }else{
        resize_cache_ = input;
    }

    //1. convert cv::Mat to dlib::matrix
    //2. Swap bgr channel to rgb
    img_.set_size(resize_cache_.rows, resize_cache_.cols);
    dlib::assign_image(img_, dlib::cv_image<bgr_pixel>(resize_cache_));

    return net_(img_);
}

    The face detector of dlib performs very well; you can check the results in their post.

    If you want to know the details, please study the example provided by dlib; if you want to know more options, please study the excellent post of Learn OpenCV.

Perform face alignment by dlib

    We can treat face alignment as a data normalization technique developed for face recognition. Usually you align the faces before training your model and align them again at prediction time; this helps you obtain higher accuracy.

    With dlib, face alignment becomes very simple. It takes just a few lines of code.

//rect contain the roi of the face
dlib::matrix<rgb_pixel> face_detector::
get_aligned_face(const mmod_rect &rect)
{
    //Type of pose_model_ is dlib::shape_predictor
    //It return the landmarks of the face
    auto shape = pose_model_(img_, rect);
    matrix<rgb_pixel> face_chip;
    auto const details = 
          get_face_chip_details(shape, face_aligned_size_, 0.25);
    //extract face after aligned from the image
    extract_image_chip(img_, details, face_chip);
    return face_chip;
}

Extract features of face by mxnet

    This section needs to load the mxnet model. Unlike dlib or opencv, the c++ api of mxnet is more complicated; if you do not know how to load an mxnet model yet, I recommend you study this post.

    This section is the most complicated part, because it contains three main points

1.  Extract the features of faces.
2.  Perform batch processing.
3.  Convert the aligned face of dlib (stored as matrix<rgb_pixel>) to a float array in contiguous memory, with
the format expected by the mxnet model.

A.Load the model with variable batch size

    In order to load a model that supports variable batch size, all we need to do is add one more argument to the argument list.

std::unique_ptr<Executor> create_executor(const std::string &model_params,
                                          const std::string &model_symbols,
                                          const Context &context,
                                          const Shape &input_shape)
{
    Symbol net;
    std::map<std::string, NDArray> args, auxs;
    load_check_point(model_params, model_symbols, &net, 
                     &args, &auxs, context);

    //if "data" throw exception, try another key, like "data0"
    args["data"] = NDArray(input_shape, context, false);
    //we only need to add the new key if batch size larger than 1
    if(input_shape[0] > 1){
        //all we need is the new key "data1"
        args["data1"] = NDArray(Shape(1), context, false);
    }

    std::unique_ptr<Executor> executor;
    executor.reset(net.SimpleBind(context, args, 
                                  std::map<std::string, NDArray>(),
                                  std::map<std::string, OpReqType>(), 
                                  auxs));

    return executor;
}

B.Convert aligned face to array

    Unlike the yolo v3 example, the input data of deepsight needs more preprocessing steps before you can feed the aligned face into the model. Instead of arranging the pixels in interleaved rgb order, you need to split each channel of the face into a separate "page". Simply put, instead of arranging the pixels as


We should arrange the pixels as


//using dlib_const_images_ptr = std::vector<matrix<rgb_pixel> const*>;
void face_key_extractor::
dlib_matrix_to_float_array(dlib_const_images_ptr const &rgb_image)
{
    size_t index = 0;
    for(size_t i = 0; i != rgb_image.size(); ++i){
        for(size_t ch = 0; ch != 3; ++ch){
            for(long row = 0; row != rgb_image[i]->nr(); ++row){
                for(long col = 0; col != rgb_image[i]->nc(); ++col){
                    auto const &pix = (*rgb_image[i])(row, col);
                    switch(ch){
                    case 0:
                        //image_vector_ is a std::vector<float> resized to
                        //params_->shape_.Size(), which returns the total number
                        //of elements in the tensor
                        image_vector_[index++] = pix.red;
                        break;
                    case 1:
                        image_vector_[index++] = pix.green;
                        break;
                    case 2:
                        image_vector_[index++] = pix.blue;
                        break;
                    }
                }
            }
        }
    }
}
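    The "one page per channel" layout is easy to verify on a tiny image. The sketch below does the same rearrangement with standard containers only; rgb and to_planar are hypothetical names for this example (rgb stands in for dlib::rgb_pixel), and I assume the pages are ordered red, green, blue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for dlib::rgb_pixel.
struct rgb
{
    unsigned char red, green, blue;
};

// Converts interleaved images (r,g,b,r,g,b,...) into one planar float array:
// for every image, all red values, then all green values, then all blue values.
std::vector<float> to_planar(std::vector<std::vector<rgb>> const &images)
{
    std::size_t total = 0;
    for(auto const &img : images){
        total += img.size() * 3;
    }

    std::vector<float> out;
    out.reserve(total);
    for(auto const &img : images){
        for(int ch = 0; ch != 3; ++ch){
            for(rgb const &pix : img){
                switch(ch){
                case 0: out.push_back(pix.red); break;
                case 1: out.push_back(pix.green); break;
                default: out.push_back(pix.blue); break;
                }
            }
        }
    }
    assert(out.size() == total);
    return out;
}
```

    For a two pixel image with pixels (1, 2, 3) and (4, 5, 6), the interleaved order 1 2 3 4 5 6 becomes 1 4 2 5 3 6.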

C.Forward aligned faces with variable batch size

    There are two things you must know before we dive into the source code.

1. To avoid memory reallocation, we must allocate memory for the largest possible batch size and reuse that same memory when the batch size is smaller.
2. The batch size of the float array fed to the model must be the same as the largest possible batch size.

//input contains all of the aligned faces detected from the image
std::vector<face_key> face_key_extractor::
forward(const std::vector<dlib::matrix<dlib::rgb_pixel> > &input)
{
    if(input.empty()){
        return {};
    }

    //Size of the input may not be divisible by the batch size,
    //that is why we need some preprocessing to make sure
    //the features of every face are extracted
    auto const forward_count = static_cast<size_t>(
         std::ceil(input.size() / static_cast<float>(params_->shape_[0])));
    std::vector<face_key> result;
    for(size_t i = 0, index = 0; i != forward_count; ++i){
        dlib_const_images_ptr faces;
        for(size_t j = 0; 
            j != params_->shape_[0] && index < input.size(); ++j){
            faces.emplace_back(&input[index++]);
        }
        dlib_matrix_to_float_array(faces);
        auto features = 
             forward(image_vector_, static_cast<size_t>(faces.size()));
        std::move(std::begin(features), std::end(features), 
                  std::back_inserter(result));
    }

    return result;
}
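    The batching logic above boils down to "split N faces into chunks of at most B faces". Here is a tiny sketch of that arithmetic alone; batch_sizes is a hypothetical helper, and it uses the integer form of the ceiling division instead of std::ceil.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Splits item_count items into chunks of at most max_batch items and
// returns the size of every chunk, in order.
std::vector<std::size_t> batch_sizes(std::size_t item_count, std::size_t max_batch)
{
    assert(max_batch != 0);
    // integer form of ceil(item_count / max_batch)
    std::size_t const forward_count = (item_count + max_batch - 1) / max_batch;

    std::vector<std::size_t> sizes;
    for(std::size_t i = 0, index = 0; i != forward_count; ++i){
        std::size_t current = 0;
        while(current != max_batch && index < item_count){
            ++current;
            ++index;
        }
        sizes.push_back(current);
    }
    return sizes;
}
```

    With 7 faces and a maximum batch size of 3, the executor is called three times with 3, 3 and 1 faces; only the last call is smaller than the allocated batch.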

D.Extract features of faces

std::vector<face_key> face_key_extractor::
forward(const std::vector<float> &input, size_t batch_size)
    //data1 tell the executor, how many face(s) need to process
    executor_->arg_dict()["data1"] = batch_size;
    std::vector<face_key> result;
        //shape of features is [batch_size, 512]
        auto features = executor_->outputs[0].Copy(Context(kCPU, 0));
        Shape const shape(1, step_per_feature);
        //split features into and array
        for(size_t i = 0; i != batch_size; ++i){
            //step_per_feature is 512, memory 
            //of NDArray is continuous make things easier
            NDArray feature(features.GetData() + i * step_per_feature, 
                            shape, Context(kCPU, 0));
        return result;

    return result;

Find most similar faces from database

    I use cosine similarity to compare similarity in this small example, it is quite easy with the help of 

A.Similarity compare

double face_key::similarity(const face_key &input) const
{
    CV_Assert(key_.GetData() != nullptr && 
              input.key_.GetData() != nullptr);

    //wrap the feature memory without copying it
    cv::Mat_<float> const key1(1, 512, 
                               const_cast<float*>(input.key_.GetData()), 0);
    cv::Mat_<float> const key2(1, 512, 
                               const_cast<float*>(key_.GetData()), 0);
    auto const denominator = std::sqrt(key1.dot(key1) * key2.dot(key2));
    if(denominator != 0.0){
        return key1.dot(key2) / denominator;
    }

    return 0;
}
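    For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency free sketch (cosine_similarity is a hypothetical free function, not part of face_key) could look like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns dot(a, b) / (|a| * |b|), or 0 if one of the vectors has zero length.
double cosine_similarity(std::vector<float> const &a, std::vector<float> const &b)
{
    assert(a.size() == b.size());
    double dot = 0;
    double norm_a = 0;
    double norm_b = 0;
    for(std::size_t i = 0; i != a.size(); ++i){
        dot += static_cast<double>(a[i]) * b[i];
        norm_a += static_cast<double>(a[i]) * a[i];
        norm_b += static_cast<double>(b[i]) * b[i];
    }
    double const denominator = std::sqrt(norm_a * norm_b);
    return denominator != 0.0 ? dot / denominator : 0.0;
}
```

    Vectors pointing in the same direction score close to 1 whatever their length, orthogonal vectors score 0, which is why the value can be used as a confidence.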

B.Find most similar face

    Finding the most similar face is really easy: all we need to do is compare the input with the features stored in the array one by one and return the one with the highest confidence.

//for simplicity, I put the structs here in this blog
struct id_info
{
   double confident_ = -1.0;
   std::string id_;
};

struct face_info
{
   face_key key_;
   std::string id_;
};

face_reg_db::id_info face_reg_db::
find_most_similar_face(const face_key &input) const
{
    id_info result;
    //type of face_keys_ is std::vector<face_info>
    for(size_t i = 0; i != face_keys_.size(); ++i){
        auto const confident = face_keys_[i].key_.similarity(input);
        if(confident > result.confident_){
            result.confident_ = confident;
            result.id_ = face_keys_[i].id_;
        }
    }

    return result;
}
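    The same linear search can be written generically, so it is easy to test with any key type. The names match, entry and find_most_similar below are hypothetical, and the similarity is passed in as a callable where a higher score means a closer match.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct match
{
    double confident_ = -1.0;
    std::string id_;
};

template<typename Key>
struct entry
{
    Key key_;
    std::string id_;
};

// Linear scan over the database; returns the entry with the highest score.
// An empty database gives back the default match (confident_ == -1, empty id_).
template<typename Key, typename Similarity>
match find_most_similar(std::vector<entry<Key>> const &db,
                        Key const &input, Similarity similarity)
{
    match result;
    for(auto const &e : db){
        double const confident = similarity(e.key_, input);
        if(confident > result.confident_){
            result.confident_ = confident;
            result.id_ = e.id_;
        }
    }
    return result;
}
```

    The search is O(n) in the number of stored faces, which is fine for a small database such as the 25 faces in the video.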


    In today's post, I showed you the most critical parts of face recognition with opencv, dlib and mxnet. I believe this is a great starting point if you want to build a high quality face recognition app in c++.

    Real world applications are much more complicated than this small example, since they always need to support more features and are required to be efficient. But no matter how complex they are, the main flow of 2D face recognition is almost the same as what this post shows you.