Tuesday, January 29, 2019

DDoC #06: How to capture the Global Context "Implicitly" using CNNs?

Some time back, I had a look at how the global context is implicitly captured in recent holistic scene understanding methods. Here's a small write-up about a few things I observed.

image credits: https://quotefancy.com
CNNs are the most frequently used technique in recent scene understanding work. The success of CNNs applied to image classification can mostly be attributed to the concepts of receptive field and locality. CNN-based models are inherently translation invariant due to locality. Both convolution and pooling operations progressively increase the receptive field, which helps derive abstract concepts from low-level features for better generalization. However, there are still ongoing arguments that the practical effective receptive field is much smaller than the theoretical receptive field. Further, looking only at a small region of the input image results in the loss of global spatial context. Yet global context is quite useful for holistic scene understanding, so it is important to take a wider/broader look at the input image.
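As a side note, the theoretical receptive field mentioned above follows a simple recurrence over the layer stack: each layer widens the field by (kernel − 1) times the accumulated stride. Here is a minimal sketch in plain Python (the layer configuration is illustrative, not from any particular network):

```python
# Sketch: theoretical receptive field of a stack of conv/pool layers.
# Each layer is described as a (kernel_size, stride) pair.

def receptive_field(layers):
    """Return the theoretical receptive field (in input pixels) after
    applying the given (kernel, stride) layers in order."""
    rf, jump = 1, 1           # rf: field size; jump: spacing (in input
    for k, s in layers:       # pixels) between adjacent output pixels
        rf += (k - 1) * jump
        jump *= s
    return rf

# Three 3x3 convs (stride 1) followed by one 2x2 pooling (stride 2):
print(receptive_field([(3, 1), (3, 1), (3, 1), (2, 2)]))  # -> 8
```

Note how slowly the field grows with plain stride-1 convolutions, which is exactly why the tricks listed below are needed.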

How to increase the receptive field in CNNs

  • Adding more (deeper) layers 
  • Incorporating multi-scale scene context 
    • Inflating the filter's field of view (a.k.a. dilated convolution)
    • Pooling at different scales (a.k.a. spatial pyramid pooling)
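To illustrate the dilation idea: a dilated filter reads input positions spaced `dilation` apart, so a 3-tap filter spans dilation × (3 − 1) + 1 input samples while keeping the same number of parameters. A minimal 1-D sketch in plain Python (function and variable names are mine, for illustration only):

```python
def dilated_conv1d(x, w, dilation=1):
    """'Valid' 1-D convolution whose filter taps are spaced
    `dilation` samples apart (dilation=1 is an ordinary conv)."""
    span = dilation * (len(w) - 1)          # input extent covered by w
    return [sum(w[j] * x[i + j * dilation] for j in range(len(w)))
            for i in range(len(x) - span)]

x = [1, 2, 3, 4, 5]
print(dilated_conv1d(x, [1, 0, -1], dilation=1))  # -> [-2, -2, -2]
print(dilated_conv1d(x, [1, 0, -1], dilation=2))  # -> [-4]
```

With dilation=2 the same 3-tap filter already covers 5 input samples, which is how stacked dilated convolutions enlarge the receptive field without extra parameters or downsampling.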

Semantic segmentation is the most widely used technique for holistic scene understanding. Fully Convolutional Networks (FCNs) adapted CNNs (originally used for image classification) to the task of semantic segmentation. Given an input image, semantic segmentation outputs a mask that assigns a pre-determined semantic category to each pixel in the image. An FCN achieves this by downsampling image features and then rapidly upsampling them to reconstruct the segmentation mask. However, this rapid upsampling leads to a loss of contextual information: recovering pixel-level fine details from overly coarse features (the input to the upsampling layer) is difficult.
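To make the upsampling step concrete, here is a minimal 1-D sketch of a strided transposed convolution, the usual learnable upsampling in FCN-style models. The filter values below are illustrative, not learned weights:

```python
def transposed_conv1d(x, w, stride=2):
    """Stride-s transposed convolution: each input value 'paints' a
    scaled copy of filter w into the output at stride-spaced
    positions; overlapping contributions are summed."""
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += v * wj
    return out

# Two coarse values expanded to five output positions:
print(transposed_conv1d([1, 2], [1, 1, 1]))  # -> [1.0, 1.0, 3.0, 2.0, 2.0]
```

The output is strictly an interpolation of the coarse inputs, which is why fine pixel-level detail that was discarded during downsampling cannot be recovered by this step alone.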

How to upsample without losing contextual information?

  • Learn to upsample while remembering the lost context: deconvolution (transposed convolution) with unpooling. (Pooling was originally used to filter out noisy activations.)
  • Use the semantic information flow inherently built into CNNs at different scales (feature pyramids).
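The unpooling idea in the first bullet can be sketched in a few lines: pooling records where each maximum came from, and unpooling puts the value back at exactly that location, restoring spatial precision that plain upsampling loses. A plain-Python 1-D sketch (non-overlapping windows; names are mine):

```python
def max_pool_with_indices(x, size=2):
    """Max-pool non-overlapping windows, remembering argmax positions."""
    vals, idxs = [], []
    for i in range(0, len(x) - size + 1, size):
        window = x[i:i + size]
        j = max(range(size), key=lambda t: window[t])
        vals.append(window[j])
        idxs.append(i + j)
    return vals, idxs

def max_unpool(vals, idxs, length):
    """Place each pooled value back at its remembered position; all
    other positions stay zero, preserving activation *locations*."""
    out = [0.0] * length
    for v, i in zip(vals, idxs):
        out[i] = v
    return out

x = [1, 3, 2, 0]
vals, idxs = max_pool_with_indices(x)   # vals=[3, 2], idxs=[1, 2]
print(max_unpool(vals, idxs, len(x)))   # -> [0.0, 3, 2, 0.0]
```

This pool/unpool pairing is the mechanism behind encoder-decoder segmentation architectures that pass pooling indices from the downsampling path to the upsampling path.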
I will mention the related papers soon. :)
