Monday, 28 August 2017

Deep learning 10 - Let us create a semantic segmentation model (LinkNet) with PyTorch

  In recent years, deep learning has taken over many difficult computer vision tasks, and semantic segmentation is one of them. The first segmentation network I implemented is LinkNet, a fast and accurate segmentation network.

Introduction


Q : What is LinkNet?

A : LinkNet is a convolutional neural network designed for semantic segmentation. This network is 10 times faster than SegNet while being more accurate.

Q : What is semantic segmentation? Is there any difference from segmentation?

A : Of course there is a difference. Segmentation partitions an image into several "similar" parts, but you do not know what those parts represent. On the other hand, semantic segmentation partitions the image into regions with different pre-determined labels, and those labels are presented as colors in the end results. For example, check out the following images (from CamVid).



Q : Semantic segmentation sounds like object detection; are they the same thing?

A : No, they are not, although you may achieve the same goal with either of them.
From a technical point of view, they use different approaches. From the view of the end results, semantic segmentation tells you what each pixel is, but it does not tell you how many instances are in your image; object detection shows you how many instances are in your image with minimal bounding boxes, but it does not give you the delineation of objects. For example, check out the images below (from YOLO).




Network architectures


  The LinkNet paper describes its network architecture with excellent figures and simple descriptions; the following figures are shamelessly copied from the paper.





  LinkNet adopts an encoder-decoder architecture. According to the paper, much of LinkNet's performance comes from adding the output of each encoder block to the input of the corresponding decoder block, which helps the decoder recover spatial information more easily. If you want to know the details, please study section 3 of the paper; it is nicely written and very easy to understand.
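  To make the idea concrete, below is a heavily simplified sketch of that encoder-decoder "link" in PyTorch. The block internals, channel counts, and depth here are my own illustration for clarity; the real blocks are described in section 3 of the paper.

```python
import torch
import torch.nn as nn

class TinyLinkNet(nn.Module):
    """Two encoder blocks, two decoder blocks; each decoder receives the
    previous decoder's output *plus* the matching encoder's output."""
    def __init__(self, num_classes=12):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 64, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(64, 128, 3, stride=2, padding=1)
        self.dec2 = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(64, num_classes, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = torch.relu(self.enc1(x))         # 1/2 resolution
        e2 = torch.relu(self.enc2(e1))        # 1/4 resolution
        d2 = torch.relu(self.dec2(e2)) + e1   # the "link": add encoder output
        return self.dec1(d2)                  # back to full resolution
```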

Q : The paper is easy to read, but it does not explain what fully convolutional means. Could you tell me what that is?

A : Fully convolutional indicates that the neural network is composed of convolution layers and activations only, without any fully connected or pooling layers.

Q : How do they perform down-sampling without pooling layers?

A : Make the stride of the convolution 2 and apply zero padding. If you cannot figure out why this works, I suggest you open a spreadsheet, write down some numbers, and experiment.
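If the spreadsheet route feels tedious, a few lines of PyTorch show the effect directly (the channel counts and input size here are arbitrary):

```python
import torch
import torch.nn as nn

# A 3x3 convolution with stride 2 and zero padding of 1 halves the
# spatial resolution, doing the job a 2x2 pooling layer usually does.
conv = nn.Conv2d(in_channels=3, out_channels=64,
                 kernel_size=3, stride=2, padding=1)

x = torch.randn(1, 3, 480, 320)  # (batch, channels, height, width)
print(conv(x).shape)             # torch.Size([1, 64, 240, 160])
```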

Q : Which optimizer works best?

A : According to the paper, RMSprop is the winner, and my experiments told me the same thing. In case you are interested, below are the graphs of training loss; from left to right: RMSprop, Adam, SGD. The hyperparameters are listed below (a PyTorch setup sketch follows the list).

Initial learning rate : 5e-4 for Adam and RMSprop, 1e-3 for SGD
Augmentation : random crop (480, 320) and horizontal flip
Normalization : subtract the mean (based on ImageNet mean values) and divide by 255
Batch size : 16
Epochs : 800
Training examples : 368
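For reference, setting these optimizers up in PyTorch looks roughly like this; the model below is a stand-in for whatever network you are training:

```python
import torch.nn as nn
import torch.optim as optim

linknet = nn.Conv2d(3, 12, 3, padding=1)  # stands in for the real LinkNet model

# Learning rates follow the experiment above.
rmsprop = optim.RMSprop(linknet.parameters(), lr=5e-4)
adam = optim.Adam(linknet.parameters(), lr=5e-4)
sgd = optim.SGD(linknet.parameters(), lr=1e-3)
```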



  The results of Adam and RMSprop are very close. The loss of SGD decreases steadily, but it converges very slowly even with a higher learning rate; maybe an even higher learning rate would work better for SGD.

Data pre-processing


  Almost every computer vision task needs you to pre-process your data, and segmentation is no exception. The following are my steps (a small sketch follows the list).

1 : Convert colors that do not exist in the categories into void (0, 0, 0)
2 : Convert each color into an integer label
3 : Zero-mean the input (the mean values come from ImageNet)
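Here is a minimal numpy sketch of those three steps. The palette below is a made-up fragment, not the full CamVid palette, and the mean values are an assumption based on the commonly quoted ImageNet statistics:

```python
import numpy as np

# A made-up fragment of a CamVid-style palette: label color -> integer id.
# (0, 0, 0) is void and gets id 0; colors outside the palette also become void.
palette = {(0, 0, 0): 0, (128, 128, 128): 1, (128, 0, 0): 2}

# Common ImageNet per-channel means on the 0-255 scale (RGB).
imagenet_mean = np.array([123.68, 116.78, 103.94], dtype=np.float32)

def preprocess(image, label_image):
    """image and label_image are HxWx3 uint8 arrays."""
    # Steps 1 and 2: unknown colors stay void (0), known colors map to their id.
    labels = np.zeros(label_image.shape[:2], dtype=np.int64)
    for color, idx in palette.items():
        labels[np.all(label_image == color, axis=-1)] = idx

    # Step 3: subtract the ImageNet mean, then divide by 255.
    image = (image.astype(np.float32) - imagenet_mean) / 255.0
    return image, labels
```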

Experiment on CamVid


  Enough of Q&A, let us have some benchmarks and pictures😊.

Performance
  Models 1, 2, and 3 are all trained with the same parameters and pre-processing but with different input sizes during training: (128, 128), (256, 256), and (512, 512). When testing, the size of the images is (960, 720).

  Following are some examples; from left to right: original image, ground truth, and predicted image.













  The results look quite good and the IoU is much better than the paper's. Possible reasons are:

1 : I augment the data by random crop and horizontal flip; the paper may use other methods or may not perform augmentation at all(?).

2 : My pre-processing is different from the paper's.

3 : I did not omit void when training.

4 : My measurement of IoU is wrong (see the sketch after this list).

5 : My model is more complicated than the paper's (wrong implementation).

6 : It is overfitting.

7 : Randomly shuffling the training and testing data creates data leakage, because many images of CamVid are very similar to each other.
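For reference, here is the straightforward way to compute mean IoU from integer label maps. It is only a sketch of what such a measurement looks like; note that whether void is skipped via `ignore_label` changes the number noticeably, which connects reasons 3 and 4:

```python
import numpy as np

def mean_iou(prediction, target, num_classes, ignore_label=None):
    """prediction and target are integer label maps of the same shape."""
    ious = []
    for cls in range(num_classes):
        if cls == ignore_label:
            continue
        pred_mask = prediction == cls
        true_mask = target == cls
        union = np.logical_or(pred_mask, true_mask).sum()
        if union == 0:
            continue  # class absent from both images; skip it
        intersection = np.logical_and(pred_mask, true_mask).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```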


Trained models and codes


1 : As usual, located at github.
2 : Model trained with 368 images, 12 labels (including void), random crop (128x128), 800 epochs
3 : Model trained with 368 images, 12 labels (including void), random crop (480x320), 800 epochs
4 : Model trained with 368 images, 12 labels (including void), random crop (512x512), 800 epochs

Miscellaneous


Q : Is it possible to create a portable model with PyTorch?

A : It is possible, but not easy. You could check out ONNX and Caffe2 if you want to try it. Someone managed to convert a PyTorch model to a Caffe model and load it with the OpenCV dnn module. Right now OpenCV dnn does not support PyTorch directly, but thank goodness it can import models trained by Torch with ease (it does not support nngraph, though).
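If you want to try the ONNX route, the export call itself is short. This is only a sketch: ONNX support in PyTorch was brand new at the time of writing, and the model below is a stand-in, not one of the trained LinkNet models:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 12, 3, padding=1)   # stands in for a trained model
dummy_input = torch.randn(1, 3, 480, 320)

# Trace the model with a dummy input and write the ONNX graph to disk.
torch.onnx.export(model, dummy_input, "model.onnx")
```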

Q : What do IoU and iIoU in the paper refer to?

A : This page gives a good definition, although I still can't figure out how to calculate iIoU.


  If you liked this article, please help others find it by clicking the little g+ icon below. Thanks a lot!

Monday, 7 August 2017

Deep learning 09-Performance of perceptual losses for super resolution

    Have you ever scratched your head when upscaling low resolution images? I have, because we all know that image quality degrades after upscaling. Thanks to the rise of machine learning in recent years, we are able to upscale a single image with better results compared with traditional solutions (e.g. bilinear or bicubic; you do not need to know what they are, other than that they are widely applied in many products). We call this technique super resolution.

    This sounds great, but how could we do it? I did not know either until I studied the tutorials of part 2 of the marvelous Practical Deep Learning for Coders; this course is fantastic for getting your feet wet with deep learning.

    I will try my best to explain everything with minimal prerequisite knowledge of machine learning and computer vision; however, some knowledge of convolutional neural networks (CNN) is needed. The course of part 1 is excellent if you want to learn CNNs in depth. If you are in a hurry, pyimagesearch and Medium have short tutorials about CNNs.

What is super resolution and how does it work


Q : What is super resolution?

A : Super resolution is a class of techniques to enhance the resolution of images or videos.

Q : There are many software packages that can help us upscale images; why do we need super resolution?

A : Traditional solutions for upscaling an image apply an interpolation algorithm to one image only (e.g. bilinear or bicubic). In contrast, super resolution exploits information from another source: contiguous frames, a model trained by machine learning, or different scales of one image.

Q : How does super resolution work?

A : The super resolution technique I want to introduce today is based on Perceptual Losses for Real-Time Style Transfer and Super-Resolution (please consult the wiki if you want to study other types of super resolution). The most interesting part of this solution is that it treats super resolution as an image transformation problem (a process where an input image is transformed into an output image). This means we may use the same technique to solve colorization, denoising, depth estimation, semantic segmentation, and other tasks (it is not a problem if you do not know what those are).

Q : How do we transform a low resolution image into a high resolution image?

A : A picture is worth a thousand words.



    This network is composed of two components: an image transformation network and a loss network. The image transformation network transforms the low resolution image into a high resolution image, while the loss network measures the difference between the predicted high resolution image and the true high resolution image.

Q : What is the loss network anyway? Why do we use it to measure the loss?

A : The loss network is an image classification network trained on ImageNet (e.g. VGG16, ResNet, DenseNet). We use it to measure the loss because we want our network to better measure the perceptual and semantic differences between images. The paper calls the loss measured by this loss network the perceptual loss.

Q : What makes the loss network able to generate a better loss?

A : The loss network can generate a better loss because a convolutional neural network trained for image classification has already learned to encode the perceptual and semantic information we want.
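As a concrete illustration, here is a minimal perceptual loss in PyTorch. Cutting VGG16 at relu2_2 is one common choice from the paper's family of losses, not necessarily the exact layer combination used in my experiments:

```python
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    """Compare VGG16 feature maps of the predicted and target images
    instead of comparing raw pixels."""
    def __init__(self):
        super().__init__()
        features = models.vgg16(pretrained=True).features
        self.vgg = nn.Sequential(*list(features)[:9])  # layers up to relu2_2
        for param in self.vgg.parameters():
            param.requires_grad = False                # the loss net stays frozen
        self.mse = nn.MSELoss()

    def forward(self, predicted, target):
        return self.mse(self.vgg(predicted), self.vgg(target))
```

During training, this loss replaces (or supplements) a plain per-pixel loss between the transformation network's output and the true high resolution image.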

Q : The color of the image is different after upscaling; how could I fix it?

A : You could apply histogram matching as the paper mentions; this should be able to deal with most cases.

Q : Any drawbacks to this algorithm?

A : Of course, nothing is perfect.

1 : Not all images work; some may look very ugly after upscaling.
2 : The result may be ice cream for your eyes, but the network is not reconstructing the photo exactly; it creates details based on its training from example images. It is impossible to reconstruct the image perfectly, because we have no way to retrieve information that did not exist in the first place.
3 : The colors of parts of some images change after upscaling, and even histogram matching cannot fix it.

Q : What is histogram matching?

A : It is a way to make the color distribution of image A look like that of image B.
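For the curious, here is a minimal single-channel version in numpy; for color images you would apply it per channel. This is the textbook CDF-matching approach, not necessarily the exact routine the paper or my code uses:

```python
import numpy as np

def match_histogram(source, reference):
    """Remap the pixel values of `source` so its cumulative distribution
    follows that of `reference` (both are single-channel arrays)."""
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Cumulative distribution functions, normalized to [0, 1].
    src_cdf = np.cumsum(src_counts).astype(np.float64) / source.size
    ref_cdf = np.cumsum(ref_counts).astype(np.float64) / reference.size

    # For each source value, find the reference value with the closest CDF.
    matched = np.interp(src_cdf, ref_cdf, ref_values)
    return matched[src_idx].reshape(source.shape)
```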

Experiment

    All of the experiments use the same network architecture and are trained on 80,000 images from ImageNet for 2 epochs. From left to right: original image, image upscaled 4x by bicubic, and image upscaled 4x by super resolution.











    The results are not perfect, but this is not the end; super resolution is a hot research topic, and every paper is a stepping stone for the next algorithm. We will see more and better, more advanced techniques pop up in the future.

Sharing trained model and codes

1 : Notebook to transform the ImageNet data into training data
2 : Notebook to train and use the super resolution model
3 : Network model with transformation network and loss network, trained on 80,000 images

    If you liked this article, please help others find it by clicking the little g+ icon below. Thanks a lot!