
Tuesday, January 29, 2019

DDoC #06: How to capture the Global Context "Implicitly" using CNNs?

Sometime back, I had a look at how the global context is implicitly captured in recent holistic scene understanding methods. Here's a small write-up about a few things I observed.

image credits: https://quotefancy.com
CNNs are the most frequently/ recently used technique for  scene understanding.  The success of CNNs (applied for image classification) can be mostly attributed to the concepts of receptive field and locality.  CNN based models are inherently translation invariant due to locality. Both convolution and pooling operations are capable of progressively increasing the receptive field. This helps to derive abstract concepts from low level features for better generalization. However, still, arguments are going on about the practical effective receptive field being much smaller than the theoretical receptive field. Further, only looking at small region of input image result in the loss of global spatial contextual information. Yet, global context is quite useful for holistic scene understanding, so it is important to have a wider/ broad look at the input image.

How to increase the receptive field in CNNs?

  • Stacking more (deeper) layers 
  • Incorporating multi-scale scene context (see the sketch after this list)
    • Inflating the filter by introducing holes (a.k.a. dilated convolution)
    • Pooling at different scales (a.k.a. spatial pyramid pooling)
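Below is a minimal PyTorch sketch of the two multi-scale ideas listed above; the channel counts, dilation rate, and pooling bins are arbitrary choices for illustration, not values from any specific paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)  # dummy feature map (batch, channels, H, W)

# Dilated (atrous) convolution: the same 3x3 kernel, but "inflated" with holes,
# so its receptive field grows without adding parameters.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
print(dilated(x).shape)  # torch.Size([1, 64, 32, 32])

# Spatial pyramid pooling: pool the feature map at several scales and
# concatenate the results, giving the later layers multi-scale context.
def spatial_pyramid_pool(feat, bins=(1, 2, 4)):
    pooled = [F.adaptive_avg_pool2d(feat, b).flatten(start_dim=1) for b in bins]
    return torch.cat(pooled, dim=1)

print(spatial_pyramid_pool(x).shape)  # 64 * (1 + 4 + 16) = 1344 features
```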

Semantic segmentation is the most widely used technique for holistic scene understanding. Fully Convolutional Networks (FCNs) adapted CNNs (originally designed for image classification) to the task of semantic segmentation. Given an input image, semantic segmentation outputs a mask that assigns a pre-determined semantic category to each pixel. FCNs achieve this by downsampling image features and then rapidly upsampling them to reconstruct the segmentation mask. However, this rapid upsampling leads to a loss of contextual information: recovering pixel-level fine details from overly coarse features (the input to the upsampling layer) is difficult.
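As a rough illustration of this bottleneck, here is a hypothetical FCN-32s-style sketch in PyTorch; the tiny stand-in backbone and class count are my own assumptions, not the original architecture. The features are downsampled 32x and then recovered in a single upsampling jump, which is exactly where fine spatial detail is lost.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        # stand-in "backbone": repeated stride-2 convs shrink the input 32x
        self.backbone = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
              for c_in, c_out in [(3, 32), (32, 64), (64, 128), (128, 256), (256, 256)]]
        )
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        coarse = self.classifier(self.backbone(x))  # class scores at 1/32 resolution
        # the "rapid upsampling": one jump back to full resolution,
        # which is where fine spatial detail gets lost
        return F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)

mask_logits = TinyFCN()(torch.randn(1, 3, 224, 224))
print(mask_logits.shape)  # torch.Size([1, 21, 224, 224])
```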

How to upsample without losing contextual information? 

  • Learn to upsample while remembering the lost context (deconvolution with unpooling; pooling is [originally] used to filter out noisy activations) (a sketch follows below)
  • Use the semantic information flow inherently available at different scales within CNNs (feature pyramids)
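Here is a minimal PyTorch sketch of the first idea (deconvolution with unpooling); the channel counts and sizes are illustrative only.

```python
import torch
import torch.nn as nn

# Max-pooling keeps the indices of the winning activations, and max-unpooling
# places values back at exactly those locations before a learned deconvolution
# (transposed convolution) fills in the remaining detail.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(64, 64, kernel_size=3, padding=1)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)            # (1, 64, 16, 16) + remembered switch locations
restored = unpool(pooled, indices)   # (1, 64, 32, 32), sparse but spatially faithful
out = deconv(restored)               # learned filling-in of the surrounding detail
print(out.shape)                     # torch.Size([1, 64, 32, 32])
```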
I will mention the related papers soon. :)

Wednesday, January 23, 2019

DDoC #05: Visual memory: What do you know about what you saw?

A random read on "Gist of a Scene" this time, by Dr. Jeremy M. Wolfe.

When it comes to remembering a scene, humans do not go through all the details of the scene; what matters is only the gist of the scene. However, what constitutes the scene gist is not yet agreed upon. Some of the findings in that research direction are as follows:

1. A change in appearance does not change the scene gist (e.g., people remember a scene of two women talking irrespective of the color of the clothes they wear); this is called "change blindness"
2. The scene gist is not just a collection of objects; the relationships between the objects in the scene also matter (e.g., milk being poured from a carton into a glass is not the same as milk being poured from a carton into the space next to a glass)
3. Scene gist involves some information about the spatial layout of the scene
4. The scene gist also involves the presence of unidentified objects (people do not see all the objects, but they know that certain objects should be there even if they are not visible)

You can find more information in his article.

Thursday, January 17, 2019

DDoC #04: Connecting Look and Feel

Today's paper is "Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials". (Figure 1: with a self pat on my back for working on this, even when I'm little sick+in a bad mood. And then I found this cool image!)

Figure 1: self pat on my back (source: http://massagebywil.com/2011/10/25/pat-yourself-on-the-back/)
Why?
Humans use visual cues to infer the material properties of objects. Further, touch is an important mode of perception for both robots and humans to effectively interact with the outside world. 

What?
Project the inputs from different modalities into a shared embedding space to associate visual and tactile information.  

How? 
Fabrics are clustered into different groups based on their physical properties, such as thickness, stiffness, stretchiness, and density, using the K-nearest neighbor algorithm. For humans, fabrics in the same cluster will have similar properties. 

Clusters of Fabrics with different properties 

Input: different modalities of an image of a fabric (depth, color, and tactile images from a touch sensor) 
Output: determine whether the different modalities come from the same fabric or from different fabrics 

Process: 
First, a low-dimensional representation (a separate embedding) of each input modality is extracted using a CNN. Then the distance between these embeddings is measured. The idea is to have a smaller distance between different modalities of the same fabric and a larger distance between modalities of different fabrics.  


So, the goal of the optimization is to minimize the distance between different modalities of the same fabric using a contrastive loss (in layman's terms, neighbors are pulled together and non-neighbors are pushed apart).  
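A minimal sketch of such a contrastive objective in PyTorch is shown below; this is my own simplified version (the function name, margin, and dimensions are assumptions), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Embeddings of two modalities of the SAME fabric are pulled together;
# embeddings of DIFFERENT fabrics are pushed apart up to a margin.
def contrastive_loss(emb_a, emb_b, same_fabric, margin=1.0):
    dist = F.pairwise_distance(emb_a, emb_b)                   # Euclidean distance per pair
    pull = same_fabric * dist.pow(2)                           # same fabric: minimize distance
    push = (1 - same_fabric) * F.relu(margin - dist).pow(2)    # different fabric: enforce margin
    return 0.5 * (pull + push).mean()

# Dummy usage: 4 pairs of 128-d embeddings; the first two pairs match, the last two do not.
emb_visual = torch.randn(4, 128)
emb_tactile = torch.randn(4, 128)
labels = torch.tensor([1., 1., 0., 0.])
print(contrastive_loss(emb_visual, emb_tactile, labels))
```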

More information can be found in their paper

Wednesday, January 16, 2019

DDoC #03: Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

My Daily Dose of Creativity (DDoC): Day 03 (No, not giving up yet :P)

Why?
Most computer vision algorithms focus on what is seen in the image, e.g., Is there a curb ramp in this image? If so, where is it? We also need to infer what is not in the scene, e.g., Where could a curb ramp be in this image? Can there be a curb ramp in this image?
Where could a curb ramp be in the second image (green)?


What?
Learn contextual features to predict the possibility of having an object in the scene, even if the object cannot be seen clearly

How?

  • Training: Learn contextual features (e.g., the surrounding characteristics) of a given object category (e.g., curb ramps) using a binary classifier that predicts whether there can be an object in the given image. Positive examples are created by masking out the object bounding box in images. Negative examples are created by masking out a similar, corresponding area in random image crops (to prevent the network from learning the masking dimensions). If there are any other objects around a positive example, they are also considered as context, and hence multiple objects in one image are not considered separately. This constitutes the base of the training algorithm. Then, in order to mitigate the masking artifacts introduced by this classifier, a second classifier is trained without bounding-box masks; the idea is to simply ignore the object and let the network implicitly learn the context. During training, both the classification loss and a distance loss (the difference between the first, explicit context classifier and the second, implicit one) are taken into consideration.  
  • Inference: First, object bounding boxes are detected; pixels inside the detected boxes are marked as 0 and pixels outside as 1. Then, a heat map representing the context is generated, where pixels with high probability are marked as 1 and pixels with low probability as 0. Finally, a pixel-wise AND operation is performed between these two representations (a sketch follows below).  
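The following NumPy sketch illustrates that inference-time combination; the threshold, image size, and variable names are my own stand-ins, not the authors' implementation.

```python
import numpy as np

# Regions already covered by detected objects are zeroed out, the context
# classifier's heat map is thresholded, and a pixel-wise AND keeps only the
# high-probability context regions where no object was actually detected.
h, w = 240, 320
context_heatmap = np.random.rand(h, w)            # stand-in for the context classifier output
outside_detections = np.ones((h, w), dtype=bool)  # 1 outside detected boxes, 0 inside
outside_detections[100:180, 120:220] = False      # pretend a curb ramp was detected here

likely_context = context_heatmap > 0.5            # 1 where context probability is high
missing_object_mask = np.logical_and(outside_detections, likely_context)
print(missing_object_mask.sum(), "pixels flagged as plausible-but-missing object regions")
```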


More information can be found in their paper

image source: http://openaccess.thecvf.com/content_cvpr_2017/papers/Sun_Seeing_What_Is_CVPR_2017_paper.pdf

Tuesday, January 15, 2019

DDoC #02: Discovering Causal Signals in Images

Why?
Modern computer vision techniques are very good at modeling the correlations between input image pixels and image features. However, if we could perceive the causal structure of a given scene, it would help us reason better about the real world.

What?
Discovering Causal Signals in Images (source: https://arxiv.org/abs/1605.08179)
A novel observational discovery technique called the "Neural Causation Coefficient" (NCC) is proposed to predict the direction of the causal relationship between a pair of random variables. NCC is then used to distinguish between object features and context features.

How?
  • Causal feature (X → Y): a real-world entity (X) that causes the presence of an object (Y), e.g., a car is there in the presence of a bridge (because it does not make sense to have a car floating over a river without a bridge) 
  • Anti-causal feature (X ← Y): a real-world entity (X) that is caused by the presence of an object (Y), e.g., a wheel is there in the presence of a car
  • Object feature: a feature that is mostly activated inside the bounding box of an object (e.g., a car)
  • Context feature: a feature that is mostly activated outside the bounding box of an object (e.g., the background) 
NCC is learned on a synthetic dataset to predict the direction (causal or anti-causal) for a given image feature-semantic category (class) pair. Then, the extracted causal and anti-causal features are used to verify whether they relate to object features and context features. In their experiments, they found that object features are mostly related to anti-causal features. They also observed that context features can relate to either causal or anti-causal features (e.g., road [context] → car vs. car → shadow [context]). 
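To make the idea concrete, here is a tiny PyTorch sketch in the spirit of NCC; it is my own simplification (layer sizes, pooling, and names are assumptions), not the architecture from the paper.

```python
import torch
import torch.nn as nn

# Each sample pair (x_i, y_i) is embedded by a shared MLP, the embeddings are
# averaged over the sample set, and a classifier predicts whether the
# relationship runs X -> Y or Y -> X.
class TinyNCC(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.point_embed = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                         nn.Linear(hidden, hidden), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, pairs):                # pairs: (batch, n_samples, 2)
        per_point = self.point_embed(pairs)  # embed every (x_i, y_i) sample
        pooled = per_point.mean(dim=1)       # average over the sample set
        return torch.sigmoid(self.classifier(pooled))  # P(X causes Y)

print(TinyNCC()(torch.randn(8, 100, 2)).shape)  # torch.Size([8, 1])
```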

Application: detect object locations in a robust manner regardless of their context 

More information can be found in their paper.

Monday, January 14, 2019

DDoC #01: Grad-CAM: Gradient based Class Activation Mapping

Daily dose of creativity (DDoC) is my attempt to learn something innovative/creative on a daily basis. Today's paper is "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization".

Why?
Just having CNNs correctly classify the given input images is not enough. We need an intuitive explanation of the reasons behind the classification decision. Does the CNN learn the right cues, or does it learn some unrelated cue due to biased information in the training data (e.g., unrelated background patterns)? Is this similar to how humans recognize objects in images?

What?
Grad-CAM for cat
We need to know which regions of an input image the CNN uses to classify it into a certain class. We also need to know which discriminative characteristics (e.g., patterns) in those regions contributed most to the classification (e.g., the stripes of a tiger cat).

How?
Grad-CAM finds the importance of a given neuron for a particular class decision. For that, it computes the gradient of the class score with respect to the feature maps of a convolutional layer. The gradients flowing back are globally average-pooled; they represent the importance (weight) of each activation map for that class label. Then, a ReLU is applied to the importance-weighted combination of the forward activation maps to obtain the image regions with a positive influence on the class of interest. The importance of each region is projected as a heat map on the original image.
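Below is a minimal PyTorch sketch of that procedure; the backbone (ResNet-18), target layer, class index, and random stand-in image are illustrative choices of mine, not details from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Hooks capture the last conv block's activations and gradients; the gradients are
# global-average-pooled into per-channel weights, the activation maps are combined
# with those weights, and a ReLU keeps only positively contributing regions.
model = models.resnet18().eval()
acts, grads = {}, {}
layer = model.layer4
layer.register_forward_hook(lambda m, i, o: acts.update(value=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

image = torch.randn(1, 3, 224, 224)   # stand-in for a preprocessed input image
score = model(image)[0, 281]          # index 281 is a cat class ("tabby") in ImageNet
score.backward()

weights = grads["value"].mean(dim=(2, 3), keepdim=True)        # GAP of gradients -> importance
cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalize to a [0, 1] heat map
print(cam.shape)                      # torch.Size([1, 1, 224, 224])
```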

More information can be found in their paper.