
Tuesday, January 29, 2019

DDoC #06: How to capture the Global Context "Implicitly" using CNNs?

Sometime back, I had a look at how the global context is implicitly captured in recent holistic scene understanding methods. Here's a small write-up about a few things I observed.

image credits: https://quotefancy.com
CNNs are the most frequently used technique in recent scene understanding work. The success of CNNs (applied to image classification) can mostly be attributed to the concepts of receptive field and locality. CNN-based models are inherently translation invariant due to locality, and both convolution and pooling operations progressively increase the receptive field. This helps derive abstract concepts from low-level features for better generalization. However, there is still an ongoing argument that the practical, effective receptive field is much smaller than the theoretical receptive field. Further, looking only at a small region of the input image results in the loss of global spatial context. Yet global context is quite useful for holistic scene understanding, so it is important to have a wider/broader look at the input image.

How to increase the receptive field in CNNs?

  • Adding more (deeper) layers 
  • Incorporating multi-scale scene context 
    • Inflating the size of the filter (a.k.a. dilated convolution) (see the sketch below)
    • Pooling at different scales (a.k.a. spatial pyramid pooling)
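
To make the second idea concrete, here is a tiny PyTorch-style sketch (the layer sizes and feature shapes are made up purely for illustration) of how dilation inflates a 3x3 filter's receptive field without adding parameters, and how pooling the same features at a few grid sizes, in the spirit of spatial pyramid pooling, gathers wider context:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)  # dummy feature map (batch, channels, H, W)

# Dilated (atrous) convolution: a 3x3 kernel with dilation=2 covers a 5x5
# neighbourhood, and dilation=4 covers a 9x9 neighbourhood, with the same
# number of weights as an ordinary 3x3 convolution.
conv_d1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)
conv_d2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
conv_d4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)
print(conv_d1(x).shape, conv_d2(x).shape, conv_d4(x).shape)  # all stay 32x32

# Spatial-pyramid-style pooling: summarize the same features at a few grid
# sizes, upsample the summaries back, and concatenate, so every position
# sees context aggregated at several scales.
pyramid = []
for grid in (1, 2, 4):
    p = F.adaptive_avg_pool2d(x, grid)                      # grid x grid summary
    pyramid.append(F.interpolate(p, size=x.shape[2:],
                                 mode='bilinear', align_corners=False))
multi_scale = torch.cat([x] + pyramid, dim=1)               # 64 + 3*64 channels
print(multi_scale.shape)                                     # (1, 256, 32, 32)
```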

Semantic segmentation is the most widely used technique for holistic scene understanding. Fully Convolutional Networks (FCNs) adapted CNNs (originally used for image classification) to the task of semantic segmentation. Given an input image, semantic segmentation outputs a mask that assigns a pre-determined semantic category to each pixel in the image. An FCN achieves this by downsampling image features and then rapidly upsampling them to reconstruct the segmentation mask. However, this rapid upsampling leads to a loss of contextual information: recovering pixel-level fine details from features that are too coarse (the input to the upsampling layer) is difficult.
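
Here is a minimal sketch of that encode-then-upsample pattern (a toy network with made-up layer sizes, not the actual FCN architecture): the encoder shrinks the feature map to 1/8 resolution, and a single transposed convolution jumps straight back to the input size, which is exactly the rapid upsampling step that makes recovering fine detail hard.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Toy FCN-style net: aggressive downsampling, then one rapid upsample."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.encoder = nn.Sequential(   # three stride-2 stages -> 1/8 resolution
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # One big transposed convolution jumps from 1/8 back to full resolution;
        # pixel-level detail must be reconstructed from very coarse features.
        self.upsample = nn.ConvTranspose2d(128, num_classes,
                                           kernel_size=16, stride=8, padding=4)

    def forward(self, x):
        return self.upsample(self.encoder(x))   # per-pixel class scores

mask_logits = TinyFCN()(torch.randn(1, 3, 224, 224))
print(mask_logits.shape)   # (1, 21, 224, 224): one score per class per pixel
```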

How to upsample without losing contextual information? 

  • Learn to upsample while remembering the lost context (deconvolution with unpooling; pooling is originally used to filter out noisy activations, and unpooling remembers where those activations came from). See the sketch below.
  • Use the inherent/built-in flow of semantic information at different scales in CNNs (feature pyramids).
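
A small sketch of the first idea (max pooling that remembers its indices, paired with unpooling and a learnable deconvolution); the sizes here are only illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Pooling filters out noisy activations but throws away where the maxima were;
# return_indices=True keeps that location information (the "lost context").
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)

pooled, indices = pool(x)            # (1, 64, 16, 16) + saved max locations
restored = unpool(pooled, indices)   # (1, 64, 32, 32), activations put back in place
refined = deconv(restored)           # learnable deconvolution densifies the sparse map
print(pooled.shape, restored.shape, refined.shape)
```
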
I will mention the related papers soon. :)

Wednesday, January 16, 2019

DDoC #03: Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

My Daily Dose of Creativity (DDoC): Day 03 (No, not giving up yet :P)

Why?
Most computer vision algorithms focus on what is seen in an image, e.g., is there a curb ramp in this image? If so, where is it? We also need to infer what is not there in the scene, e.g., where could a curb ramp be in this image? Can there be a curb ramp in this image?
Where could a curb ramp be in the second image (green)?


What?
Learn contextual features to predict the possibility of an object being present in the scene, even if the object itself cannot be seen clearly.

How?

  • Training: Learn contextual features (e.g., the surrounding characteristics) of a given object category (e.g., curb ramps) using a binary classifier that predicts whether an object of that category can be present in a given image. Positive examples are created by masking out the object bounding box in images; negative examples are created by masking out a similar, corresponding area in random image crops (to prevent the network from simply learning the mask dimensions). If there are other objects around a positive example, they are also considered as context, and hence multiple objects in one image are not considered separately. This forms the base of the training algorithm. Then, to mitigate the masking artifacts introduced by this classifier, a second classifier is trained without bounding-box masks; the idea is to simply ignore the object and let the network learn the context implicitly. During training, both a classification loss and a distance loss (the difference between the first, explicit-context classifier and the second, implicit-context classifier) are taken into consideration (see the sketch after this list).  
  • Inference: First, object bounding boxes are detected; pixels inside the detected boxes are marked as 0 and pixels outside them as 1. Then a heat map representing the context is generated, in which pixels with high probability are marked as 1 and pixels with low probability as 0. Finally, a pixel-wise AND operation between these two maps highlights regions where the context suggests an object but none was detected (see the second sketch below).  
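
Here is a rough sketch of how that combined training objective could look (this is only my reading of the write-up above, not the authors' code; the two small networks are hypothetical stand-ins for the explicit-context and implicit-context classifiers):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the two context classifiers described above.
explicit_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # sees masked images
implicit_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))  # sees full images

bce = nn.BCEWithLogitsLoss()

def training_step(masked_img, full_img, label, alpha=1.0):
    """label = 1 if an object of the category can be present, 0 otherwise."""
    explicit_logit = explicit_net(masked_img)   # context only (object masked out)
    implicit_logit = implicit_net(full_img)     # full image, context learned implicitly

    cls_loss = bce(implicit_logit, label)                          # classification loss
    dist_loss = (explicit_logit - implicit_logit).pow(2).mean()    # keep both predictions close
    return cls_loss + alpha * dist_loss

loss = training_step(torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64),
                     torch.randint(0, 2, (8, 1)).float())
loss.backward()
```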
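
And a sketch of the inference-time combination (again only a paraphrase of the steps above; detect_boxes and context_heatmap are hypothetical stand-ins for the object detector and the context classifier's dense output):

```python
import numpy as np

def missing_object_map(image, detect_boxes, context_heatmap, threshold=0.5):
    """Regions with strong object context but no detected object."""
    h, w = image.shape[:2]

    # 1 outside detected object boxes, 0 inside them.
    outside_boxes = np.ones((h, w), dtype=np.uint8)
    for (x1, y1, x2, y2) in detect_boxes(image):
        outside_boxes[y1:y2, x1:x2] = 0

    # Binarised context heat map: 1 where the context says "an object fits here".
    context = (context_heatmap(image) >= threshold).astype(np.uint8)

    # Pixel-wise AND: high context probability AND no detected object.
    return outside_boxes & context
```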


More information can be found in their paper.

image source: http://openaccess.thecvf.com/content_cvpr_2017/papers/Sun_Seeing_What_Is_CVPR_2017_paper.pdf