In recent years deep learning has taken over many difficult computer vision tasks, and semantic segmentation is one of them. The first segmentation network I implemented is LinkNet, a fast and accurate segmentation network.
Introduction
Q :
What is LinkNet?
A : LinkNet is a convolutional neural network designed for semantic segmentation. It is about 10 times faster than SegNet and more accurate.
Q :
What is semantic segmentation? Is there any difference from segmentation?
A : Of course there is a difference. Segmentation partitions an image into several "similar" parts, but you do not know what those parts represent. Semantic segmentation, on the other hand, partitions the image into a set of pre-determined labels, and those labels are shown as colors in the end result. For example, check out the following images (from CamVid).
Q :
Semantic segmentation sounds like object detection. Are they the same thing?
A : No, they are not, although you may achieve the same goal with either of them.
Technically they use different approaches. Looking at the end results, semantic segmentation tells you what every pixel is, but it does not tell you how many instances are in your image; object detection shows you how many instances are in your image with minimal bounding boxes, but it does not give you a pixel-level outline of each object.
Network architectures
The LinkNet paper describes its network architecture with excellent figures and simple descriptions; the following figures are copied shamelessly from the paper.
LinkNet adopts an encoder-decoder architecture. According to the paper, much of LinkNet's performance comes from adding the output of each encoder block to the corresponding decoder block, which makes it easier for the decoder to recover spatial information. If you want to know the details, please study section 3 of the paper; it is nicely written and very easy to understand.
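To make this concrete, here is a minimal PyTorch sketch of the idea. The block layout and channel numbers are my own guesses for illustration, not copied exactly from the paper; the point is only the addition of the encoder feature map to the decoder output.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Rough sketch of a LinkNet-style decoder block: 1x1 conv to shrink
    channels, transposed conv to upsample, 1x1 conv to set the output channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        mid = in_channels // 4
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(mid, mid, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# The key idea: add the matching encoder feature map to the decoder output
# before feeding it to the next decoder block.
decoder = DecoderBlock(256, 128)
encoder_feature = torch.randn(1, 128, 64, 64)   # dummy encoder output
deeper_feature = torch.randn(1, 256, 32, 32)    # dummy output of the deeper decoder
out = decoder(deeper_feature) + encoder_feature  # skip connection by addition
print(out.shape)  # torch.Size([1, 128, 64, 64])
```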
Q :
The paper is easy to read, but it does not explain what full convolution is. Could you tell me what that means?
A :
Full convolution indicates that the neural network is composed of convolution layers and activations only, without any fully connected or pooling layers.
Q : How do they perform down-sampling without pooling layers?
A : Set the stride of the convolution to 2 x 2 and use zero padding. If you cannot figure out why this works, I suggest you create an excel file, write down some data and do some experiments.
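If you do not feel like opening excel, a tiny PyTorch check shows the effect; the input size below is just an example.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with stride 2 halves the height and width,
# much like a 2x2 max pooling would.
x = torch.randn(1, 3, 480, 320)
down = nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1)
print(down(x).shape)  # torch.Size([1, 64, 240, 160])
```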
Q :
Which optimizer works best?
A : According to the paper, rmsprop is the winner, and my experiments told me the same thing. In case you are interested, below are the graphs of training loss; from left to right they are rmsprop, adam and sgd. The hyper parameters are
Initial learning rate : 5e-4 for adam and rmsprop, 1e-3 for sgd
Augmentation : random crop (480, 320) and horizontal flip
Normalization : subtract the mean (based on the ImageNet mean values) and divide by 255
Batch size : 16
Epochs : 800
Training examples : 368
The results of adam and rmsprop are very close. The loss of sgd decreases steadily, but it converges very slowly even though it starts with a higher learning rate; maybe an even higher learning rate would work better for sgd.
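For reference, here is a minimal sketch of how those optimizers are set up in PyTorch. `model` stands for whatever LinkNet implementation you are training, and the momentum value for sgd is my assumption, not something from the paper.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # placeholder for the real LinkNet

# rmsprop and adam use 5e-4, sgd uses 1e-3 (the settings listed above)
optimizer = torch.optim.RMSprop(model.parameters(), lr=5e-4)
# optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # momentum is my guess
```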
Data pre-processing
Almost every computer vision task needs you to pre-process your data, and segmentation is no exception. Following are my steps.
1 : Convert any color that does not exist in the categories into void (0, 0, 0)
2 : Convert each color into an integer label
3 : Zero mean (the mean values come from ImageNet)
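A rough sketch of those three steps in numpy is below. The color table is only a tiny illustrative subset of the CamVid palette, and the mean values are the usual ImageNet RGB means; the division by 255 matches the normalization listed in the optimizer section.

```python
import numpy as np

# Illustrative subset of the palette, not the full CamVid color table.
COLOR_TO_LABEL = {
    (0, 0, 0): 0,      # void
    (128, 0, 0): 1,    # building (example entry)
    (64, 64, 128): 2,  # car (example entry)
}
IMAGENET_MEAN = np.array([123.68, 116.78, 103.94], dtype=np.float32)

def preprocess(image, label_image):
    """image and label_image are HxWx3 uint8 arrays."""
    # Steps 1 and 2: labels start as void (0); every known color gets its
    # integer label, so unknown colors automatically stay void.
    h, w, _ = label_image.shape
    labels = np.zeros((h, w), dtype=np.int64)
    for color, index in COLOR_TO_LABEL.items():
        mask = np.all(label_image == color, axis=-1)
        labels[mask] = index
    # Step 3: zero mean the input image and scale it.
    image = (image.astype(np.float32) - IMAGENET_MEAN) / 255.0
    return image, labels
```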
Experiment on camvid
Enough of Q&A, let us have some benchmarks and pictures😊.
Performance
Models 1, 2 and 3 were all trained with the same parameters and pre-processing but with different input sizes during training: (128, 128), (256, 256) and (512, 512). When testing, the size of the images is (960, 720).
Following are some examples; from left to right are the original image, the ground truth and the predicted image.
The results look quite good and the IoU is much better than the paper's. Possible reasons are
1 : I augment the data by random crop and horizontal flip; the paper may use other methods or no augmentation at all(?).
2 : My pre-processing is different from the paper's.
3 : I did not omit void when training.
4 : My measurement of IoU is wrong.
5 : My model is more complicated than the paper's (wrong implementation).
6 : It is overfitting.
7 : Random shuffling of the training and testing data creates data leakage, because many images of CamVid are very similar to each other.
Trained models and code
1 : As usual, located at github.
2 : Model trained with 368 images, 12 labels (including void), random crop (128x128), 800 epochs
3 : Model trained with 368 images, 12 labels (including void), random crop (480x320), 800 epochs
4 : Model trained with 368 images, 12 labels (including void), random crop (512x512), 800 epochs
Miscellaneous
Q :
Is it possible to create a portable model with PyTorch?
A : It is possible, but not easy. You could check out ONNX and caffe2 if you want to try it. Someone managed to convert a PyTorch model to a Caffe model and load it with opencv dnn. Right now opencv dnn does not support PyTorch, but thank god it can import models trained by Torch with ease (right now opencv dnn does not support nngraph).
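If you want to try the ONNX route, the export call looks roughly like this; the file name and input size are just examples, and `model` stands for the trained network.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 12, kernel_size=3, padding=1)  # placeholder for the trained LinkNet
model.eval()

dummy_input = torch.randn(1, 3, 720, 960)  # one CamVid-sized image
torch.onnx.export(model, dummy_input, "linknet.onnx")
```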
Q :
What do IoU and iIoU in the paper refer to?
A : This page gives a good definition, although I still can't figure out how to calculate iIoU.
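For what it is worth, here is a minimal sketch of one way to compute the mean of the per-class IoU. It is only illustrative and not necessarily the exact measurement I used in the experiments, which is why reason 4 above is on the list.

```python
import numpy as np

def mean_iou(prediction, target, num_classes=12, ignore_index=0):
    """prediction and target are HxW integer label maps."""
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue  # skip void
        pred_c = prediction == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))
```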