Locality-Sensitive Deconvolution Networks with Gated Fusion for RGB-D Indoor Semantic Segmentation
problem: indoor semantic segmentation using RGB-D data
motivation: there is still room for improvement in two aspects: (1) recovering sharp object boundaries when upsampling the coarse prediction; (2) effectively fusing the RGB and depth modalities
methods to address the problems above: Locality-Sensitive DeconvNet; gated fusion layer
Kinect -- captures high-quality synchronized visual (RGB) and geometric (depth) data
DeconvNet -- learns to upsample the low-resolution label map of FCN into full resolution with more details
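For intuition, a minimal sketch of the basic deconvolution (transposed convolution) building block such a network stacks to upsample coarse score maps — toy shapes and class count are my assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch: one transposed-convolution layer that upsamples a
# coarse score map 2x, the building block a DeconvNet stacks repeatedly
# to reach full resolution. num_classes=40 assumes the common NYU-40
# class setting (an assumption, not stated on this line).
num_classes = 40
deconv = nn.ConvTranspose2d(num_classes, num_classes,
                            kernel_size=4, stride=2, padding=1)

coarse = torch.randn(1, num_classes, 30, 40)  # low-resolution FCN scores
fine = deconv(coarse)                         # -> (1, 40, 60, 80)
print(fine.shape)
```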
The figure above, panels (a) and (b), shows examples of the two aspects awaiting improvement, produced by a two-stream DeconvNet followed by score fusion with an equal-weight sum, as in the FCN model [19]
This paper aims to augment DeconvNet for indoor semantic segmentation with RGB-D data
Refine Boundaries for Semantic Segmentation
designing particular deep learning models for dense prediction
add one data-driven pooling layer on top of DeconvNet to smooth the predictions within every superpixel [12]
Combine RGB and Depth Data for Semantic Segmentation
three levels of fusion: early, middle, and late
FCN is used to learn robust feature representations for each pixel by aggregating multi-scale contextual cues.
LS-DeconvNet is used to restore high-resolution and precise scene details based on the coarse FCN map
a gated fusion layer is introduced to fuse the RGB and depth cues effectively for accurate scene semantic segmentation
concatenate the prediction maps of RGB and depth to learn a weighted gate array
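One natural formulation of this gate (my notation; the paper's exact parameterization may differ): with $P^{rgb}$ and $P^{d}$ the two prediction maps,

$$G = \sigma\left(W * \left[P^{rgb};\, P^{d}\right] + b\right), \qquad P = G \odot P^{rgb} + (1 - G) \odot P^{d}$$

where $[\,\cdot\,;\,\cdot\,]$ is channel-wise concatenation, $*$ convolution, $\sigma$ the sigmoid, and $\odot$ element-wise multiplication, so $G$ acts as a per-pixel, per-class weight array in $[0, 1]$.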
Locality-Sensitive Unpooling
Conventional unpooling is the inverse operation of max pooling. While unpooling helps reconstruct detailed object boundaries, its capability is largely limited by its excessive dependence on the input response map, which carries large context.
The affinity matrix is derived from the RGB-D pixels; with it, discontinuous boundary responses can be propagated to neighboring similar pixels to make up the missing details.
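A minimal PyTorch sketch of this idea (my simplification, not the authors' released code): after placing each pooled response back at its recorded max location, the response is also copied to the other pixels of the 2x2 window that lie in the same over-segment as the max location (binary affinity 1), instead of leaving them zero.

```python
import torch
import torch.nn.functional as F

def locality_sensitive_unpool(x, indices, seg):
    # x       : (N, C, h, w) pooled responses
    # indices : (N, C, h, w) max locations from the paired
    #           F.max_pool2d(..., kernel_size=2, return_indices=True)
    # seg     : (N, 1, 2h, 2w) integer over-segment id per output pixel;
    #           same id <=> binary affinity 1 (stands in for matrix A)
    N, C, h, w = x.shape
    H, W = 2 * h, 2 * w
    # look up the over-segment id of each window's max location
    seg_flat = seg.expand(N, C, H, W).reshape(N, C, H * W)
    seg_at_max = seg_flat.gather(2, indices.reshape(N, C, h * w))
    seg_at_max = seg_at_max.reshape(N, C, h, w).float()
    # broadcast each response (and its max's segment id) over the 2x2 window
    dense = F.interpolate(x, scale_factor=2, mode='nearest')
    seg_max_up = F.interpolate(seg_at_max, scale_factor=2, mode='nearest')
    # keep a response only where the output pixel has affinity 1 with the
    # max location, i.e. shares its over-segment; elsewhere output 0
    mask = (seg.expand(N, C, H, W).float() == seg_max_up).float()
    return dense * mask
```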
Locality-Sensitive Average Pooling
Conventional average pooling has the drawback of blurring object boundaries, resulting in an imprecise semantic segmentation map.
According to the affinity matrix, only similar pixels are counted in the average pooling operation.
This achieves consistent and robust feature representations for contiguous object structures.
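A PyTorch sketch of such masked averaging (my simplification; taking the window's top-left pixel as the similarity reference is my assumption):

```python
import torch
import torch.nn.functional as F

def locality_sensitive_avg_pool(x, seg, k=2):
    # x   : (N, C, H, W) feature map
    # seg : (N, 1, H, W) integer over-segment ids (equal id <=> affinity 1)
    # Within each k x k window, only pixels sharing the over-segment of a
    # reference pixel (here: top-left of the window, an assumption)
    # contribute to the average, so responses do not mix across boundaries.
    N, C, H, W = x.shape
    # unfold windows: (N, C*k*k, L) and (N, k*k, L), L = (H//k)*(W//k)
    xu = F.unfold(x, kernel_size=k, stride=k).reshape(N, C, k * k, -1)
    su = F.unfold(seg.float(), kernel_size=k, stride=k).reshape(N, 1, k * k, -1)
    mask = (su == su[:, :, :1, :]).float()   # affinity to the reference pixel
    pooled = (xu * mask).sum(2) / mask.sum(2).clamp(min=1)
    return pooled.reshape(N, C, H // k, W // k)
```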
The gated fusion layer consists of 3 layers: a concatenation layer, a convolution layer, and a sigmoid layer.
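A minimal PyTorch sketch of such a gate module, matching the concat -> conv -> sigmoid structure above (kernel size and class count are my assumptions):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of a gated fusion layer: concatenate the two prediction
    maps, convolve, squash with a sigmoid to get gate weights G, then
    take a per-pixel convex combination of the two streams."""
    def __init__(self, num_classes=40, kernel_size=3):  # assumed values
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * num_classes, num_classes, kernel_size,
                      padding=kernel_size // 2),
            nn.Sigmoid(),
        )

    def forward(self, p_rgb, p_depth):
        # learned weight array G in [0, 1], one weight per pixel and class
        g = self.gate(torch.cat([p_rgb, p_depth], dim=1))
        # adaptively weighted sum of the RGB and depth predictions
        return g * p_rgb + (1.0 - g) * p_depth
```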
Method [10] extracts low-level RGB-D features (gradients over visual and geometrical cues) for each pixel, then employs gPb-ucm [1] to generate over-segments. These over-segments can be used to compute A by checking whether pairwise pixels belong to the same over-segment (similarity 1) or not (similarity 0). Note that A is rescaled to match the resolution of the corresponding feature maps.
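A toy NumPy sketch of this binary construction (illustrative only; in practice the full HW x HW matrix is far too large, so affinities are evaluated locally within pooling windows and rescaled per feature map):

```python
import numpy as np

def affinity_from_oversegments(seg):
    """Build the binary affinity matrix A from an over-segment map
    (e.g. produced by gPb-ucm): A[p, q] = 1 iff pixels p and q fall
    in the same over-segment, else 0."""
    ids = seg.reshape(-1)                  # (H*W,) segment id per pixel
    return (ids[:, None] == ids[None, :]).astype(np.uint8)

# toy usage on a 4x4 over-segment map with three segments
seg = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [2, 2, 1, 1],
                [2, 2, 1, 1]])
A = affinity_from_oversegments(seg)
print(A.shape)  # (16, 16)
```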
datasets: two benchmark RGB-D datasets, the SUN RGB-D dataset [25] and the popular NYU-Depth v2 dataset
metrics: pixel accuracy, mean accuracy, mean IoU, and frequency-weighted IoU
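For reference, a sketch of how these four metrics are typically computed from a class confusion matrix (standard definitions from the FCN literature, not code from the paper):

```python
import numpy as np

def segmentation_metrics(conf):
    """conf: (num_classes, num_classes) confusion matrix,
    rows = ground truth, cols = prediction."""
    tp = np.diag(conf).astype(float)
    gt = conf.sum(1).astype(float)     # pixels per ground-truth class
    pred = conf.sum(0).astype(float)   # pixels per predicted class
    iou = tp / np.maximum(gt + pred - tp, 1)
    pixel_acc = tp.sum() / conf.sum()
    # classes absent from the ground truth contribute 0 here (simplification)
    mean_acc = np.mean(tp / np.maximum(gt, 1))
    mean_iou = iou.mean()
    fw_iou = (gt / conf.sum() * iou).sum()  # frequency-weighted IoU
    return pixel_acc, mean_acc, mean_iou, fw_iou
```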
ablation study: removing or replacing each component independently, or both together, for semantic segmentation on the NYU-Depth v2 dataset
The improvement is attributed to gated fusion's accurate recognition of some hard objects in the scene, such as a box on the sofa and a chair in weak light.
1) the locality-sensitive deconvolution networks, which are designed to simultaneously upsample the coarse fully convolutional maps and refine object boundaries; 2) gated fusion, which can adapt to the varying contributions of RGB and depth for better fusion of the two modalities for object recognition.
Source: https://www.cnblogs.com/Jerry-Dong/p/9726447.html