
Wednesday, January 16, 2019

DDoC #03: Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

My Daily Dose of Creativity (DDoC): Day 03 (No, not giving up yet :P)

Why?
Most computer vision algorithms focus on what is seen in an image, e.g., is there a curb ramp in this image? If so, where is it? But we also need to infer what is not there in the scene, e.g., where could a curb ramp be in this image? Can there even be a curb ramp in this image?
Where could a curb ramp be in the second image (green)?


What?
Learn contextual features to predict how likely it is that an object belongs in a scene, even when the object itself cannot be seen clearly.

How?

  • Training: Learn contextual features (i.e., the characteristics of the surroundings) for a given object category (e.g., curb ramps) using a binary classifier that predicts whether an object could be present in a given image. Positive examples are created by masking out the object's bounding box. Negative examples are created by masking out a similar, correspondingly sized area at a random image location, to prevent the network from learning the mask dimensions instead of the context. If other instances of the object appear around a positive example, they would themselves be treated as context, so images containing multiple objects are excluded. This forms the base of the training algorithm. Then, to mitigate the artifacts introduced by the masks, a second classifier is trained without bounding-box masks: the idea is to simply ignore the object and let the network implicitly learn the context. During training, both the classification loss and a distance loss (the difference between the predictions of the first, explicit context classifier and the second, implicit one) are taken into consideration (a minimal sketch of this loss follows the list).
  • Inference: First, object bounding boxes are detected. Pixels inside the detected boxes are marked 0 and pixels outside them are marked 1. Then, a heat map representing the context is generated, where high-probability pixels are marked 1 and low-probability pixels are marked 0. Finally, a pixel-wise AND operation is performed between these two binary maps, highlighting regions where the context suggests an object but none was detected (a sketch of this step also appears below).
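To make the two-classifier objective concrete, here is a minimal PyTorch sketch. Everything in it is an illustrative assumption rather than the paper's actual implementation: ContextNet is a tiny stand-in backbone, lam is an assumed weight on the distance term, and the exact form of the distance loss is my guess at "the difference between the two classifiers".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextNet(nn.Module):
    """Tiny stand-in backbone mapping an image to a single
    'object plausible here?' logit (hypothetical; the paper's
    actual architecture is not reproduced)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(32, 1)

    def forward(self, x):
        return self.head(self.features(x).flatten(1)).squeeze(1)

explicit_net = ContextNet()  # sees bbox-masked images
implicit_net = ContextNet()  # sees the same images, unmasked

def training_loss(masked_imgs, unmasked_imgs, labels, lam=1.0):
    """labels: float tensor of shape (N,), 1 if an object belongs
    in the scene, else 0. lam is an assumed weighting."""
    logit_exp = explicit_net(masked_imgs)
    logit_imp = implicit_net(unmasked_imgs)
    # Classification loss on the explicit (masked) context branch.
    cls_loss = F.binary_cross_entropy_with_logits(logit_exp, labels)
    # Distance loss: pull the unmasked branch's prediction toward the
    # masked branch's, so it learns context rather than the object.
    dist_loss = F.mse_loss(torch.sigmoid(logit_imp),
                           torch.sigmoid(logit_exp).detach())
    return cls_loss + lam * dist_loss
```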

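Similarly, a minimal NumPy sketch of the inference step: binarize the context heat map, zero out detected boxes, and AND the two maps. The function name missing_object_map and the prob_thresh cutoff are illustrative assumptions, not from the paper.

```python
import numpy as np

def missing_object_map(context_heatmap, boxes, prob_thresh=0.5):
    """context_heatmap: (H, W) per-pixel context probabilities.
    boxes: detected object boxes as (x1, y1, x2, y2) pixel coords.
    Returns a binary map: 1 where context says an object is
    plausible AND no object was detected."""
    h, w = context_heatmap.shape
    # Pixels inside detected boxes -> 0, outside -> 1.
    outside = np.ones((h, w), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        outside[y1:y2, x1:x2] = 0
    # High-probability context pixels -> 1, low -> 0.
    context = (context_heatmap >= prob_thresh).astype(np.uint8)
    # Pixel-wise AND of the two binary maps.
    return context & outside
```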

More information can be found in their paper.

image source: http://openaccess.thecvf.com/content_cvpr_2017/papers/Sun_Seeing_What_Is_CVPR_2017_paper.pdf
