
Tuesday, January 29, 2019

DDoC #06: How to capture the Global Context "Implicitly" using CNNs?

Some time back, I had a look at how the global context is implicitly captured in recent holistic scene understanding methods. Here's a small write-up about a few things I observed.

image credits: https://quotefancy.com
CNNs are the most frequently used technique in recent scene understanding work. The success of CNNs applied to image classification can be attributed largely to the concepts of receptive field and locality. CNN-based models are inherently translation invariant due to locality. Both convolution and pooling operations progressively increase the receptive field, which helps derive abstract concepts from low-level features for better generalization. However, there is still an ongoing argument that the practical, effective receptive field is much smaller than the theoretical receptive field. Further, looking only at a small region of the input image results in the loss of global spatial context. Yet global context is quite useful for holistic scene understanding, so it is important to have a wider look at the input image.

How to increase the receptive field in CNNs

  • Adding more, deeper layers 
  • Incorporating multi-scale scene context (both options below are sketched in code afterwards)
    • Inflating the size of the filter (a.k.a. dilated convolution)
    • Pooling at different scales (a.k.a. spatial pyramid pooling)
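To make the two options above concrete, here is a minimal PyTorch sketch (my own illustration, not taken from any particular paper): a 3x3 convolution whose dilation enlarges the receptive field without adding parameters, and pyramid-style pooling that mixes in context from several scales. The tensor sizes are arbitrary.

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)                      # a toy feature map (N, C, H, W)

# A standard 3x3 convolution looks at a 3x3 neighborhood per output pixel.
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# The same 3x3 kernel with dilation=2 covers a 5x5 neighborhood
# (holes between the kernel taps), with no extra parameters.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Spatial-pyramid-style pooling: pool at several scales, upsample back,
# and concatenate, so each pixel also sees coarse, image-level context.
pyramid = [F.interpolate(F.adaptive_avg_pool2d(x, s), size=x.shape[-2:],
                         mode='bilinear', align_corners=False) for s in (1, 2, 4)]
context = torch.cat([conv(x), dilated(x)] + pyramid, dim=1)
print(context.shape)                                # torch.Size([1, 320, 32, 32])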

Semantic segmentation is the most widely used technique for holistic scene understanding. Fully Convolutional Networks (FCNs) adapted CNNs (initially used for image classification) to the task of semantic segmentation. Given an input image, semantic segmentation outputs a mask that assigns a pre-determined semantic category to each pixel. FCNs achieve this by downsampling image features and then rapidly upsampling them to reconstruct the segmentation mask. However, this rapid upsampling leads to a loss of contextual information: recovering pixel-level fine detail from overly coarse features (the input to the upsampling layer) is difficult.

How to upsample without losing contextual information? 

  • Learn to upsample while remembering the lost context (deconvolution with unpooling; pooling is originally used to filter out noisy activations) - see the sketch below
  • Use the inherent, built-in flow of semantic information at different scales in CNNs (feature pyramids)
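As a rough illustration of the first option (a minimal PyTorch sketch under my own assumptions, not an exact architecture from the literature), pooling can keep the indices of its maxima so that unpooling puts activations back where they came from, and a transposed convolution then learns to fill in the rest:

import torch
import torch.nn as nn

x = torch.randn(1, 16, 8, 8)

# Pooling keeps the indices of the maxima ("remembering" what was thrown away)...
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
pooled, indices = pool(x)                    # (1, 16, 4, 4)

# ...so unpooling can place activations back at their original positions,
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
restored = unpool(pooled, indices)           # (1, 16, 8, 8), sparse

# and a learned deconvolution (transposed convolution) densifies the result.
deconv = nn.ConvTranspose2d(16, 16, kernel_size=3, padding=1)
print(deconv(restored).shape)                # torch.Size([1, 16, 8, 8])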
I will mention the related papers soon. :)

Wednesday, January 23, 2019

DDoC #05: Visual memory: What do you know about what you saw?

A random read on "Gist of a Scene" this time, by Dr. Jeremy M. Wolfe.

When it comes to remembering a scene, humans do not go through all the details of the scene; what matters is only the gist of the scene. However, there is no agreement yet on what constitutes the scene gist. Some of the findings in that research direction are as follows:

1. A change in appearance does not change the scene gist (e.g., people remember a scene of two women talking irrespective of the color of the clothes they wear; failing to notice such changes is called "change blindness").
2. Scene gist is not just a collection of objects; the relationships between the objects in the scene also matter (e.g., milk being poured from a carton into a glass is not the same as milk being poured from a carton into the space next to a glass).
3. Scene gist involves some information about the spatial layout of the scene.
4. Scene gist also involves the presence of unidentified objects (people do not see all the objects, but they know that certain objects should be there even if they are not visible).

You can find more information in his article.

Thursday, January 17, 2019

DDoC #04: Connecting Look and Feel

Today's paper is "Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials". (Figure 1: a self pat on my back for working on this even when I'm a little sick and in a bad mood. And then I found this cool image!)

Figure 1: self pat on my back source:
http://massagebywil.com/2011/10/25/pat-yourself-on-the-back/
Why?
Humans use visual cues to infer the material properties of objects. Further, touch is an important mode of perception that allows both robots and humans to interact effectively with the outside world. 

What?
Project the inputs from different modalities into a shared embedding space to associate visual and tactile information.  

How? 
Fabrics are clustered into groups using the K-nearest neighbor algorithm, based on physical properties such as thickness, stiffness, stretchiness and density. To humans, fabrics in the same cluster will have similar properties. 

Clusters of Fabrics with different properties 

Input:
Different modalities of an image of a fabric (depth, color and tactile images from a touch sensor) 
Output: Determine whether the different modalities come from the same fabric or from different fabrics 

Process: 
First, a low-dimensional representation (an embedding) of each input modality is extracted using a CNN. Then the distances between these embeddings are measured. The idea is that different modalities of the same fabric should have a small distance between them, while modalities of different fabrics should have a large distance.  


So, the goal of the optimization is to minimize the distance between different modalities of the same fabric using a contrastive loss (in layman's terms, neighbors are pulled together and non-neighbors are pushed apart).  
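A minimal sketch of such a contrastive loss (my own simplified version in PyTorch, not the authors' exact formulation; the margin and embedding size are arbitrary):

import torch
import torch.nn.functional as F

def contrastive_loss(emb_a, emb_b, same_fabric, margin=1.0):
    # same_fabric is 1 for matching pairs, 0 otherwise
    d = F.pairwise_distance(emb_a, emb_b)                 # distance between embeddings
    pull = same_fabric * d.pow(2)                         # same fabric: pull together
    push = (1 - same_fabric) * F.relu(margin - d).pow(2)  # different fabric: push apart up to the margin
    return (pull + push).mean()

# Toy usage: 4 pairs of 128-d embeddings (e.g., from the visual and tactile CNNs)
emb_a, emb_b = torch.randn(4, 128), torch.randn(4, 128)
labels = torch.tensor([1., 0., 1., 0.])
print(contrastive_loss(emb_a, emb_b, labels))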

More information can be found in their paper

Wednesday, January 16, 2019

DDoC #03: Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

My Daily Dose of Creativity (DDoC): Day 03 (No, not giving up yet :P)

Why?
Most computer vision algorithms focus on what is seen in an image, e.g., is there a curb ramp in this image? If so, where is it? We also need to infer what is not in the scene, e.g., where could a curb ramp be in this image? Can there be a curb ramp in this image?
Where could a curb ramp be in the second image (green)?


What?
Learn contextual features to predict the possibility of an object being present in the scene, even if the object cannot be seen clearly

How?

  • Training: Learn contextual features (e.g., the surrounding characteristics) for a given object category (e.g., curb ramps) using a binary classifier that predicts whether there can be an object in the given image or not. Positive examples are created by masking out the object bounding box in images. Negative examples are created by masking out a similar, corresponding area in random image crops (to prevent the network from learning the mask dimensions). If there are other objects around a positive example, they are also treated as context, so multiple objects in one image are not handled separately. This forms the base of the training algorithm. Then, to mitigate the masking artifacts introduced by this classifier, a second classifier is trained without bounding-box masks; the idea is to just ignore the object and let the network implicitly learn the context. During training, both the classification loss and a distance loss (the difference between the first, explicit context classifier and the second, implicit one) are taken into consideration.  
  • Inference: First, object bounding boxes are detected. Pixels inside the detected boxes are marked as 0 and pixels outside them as 1. Then a heat map representing the context is generated, where pixels with high probability are marked as 1 and those with low probability as 0. Finally, a pixel-wise AND operation is performed between these two representations (a minimal sketch of this step follows below).  
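Here is a minimal NumPy sketch of how I read that inference step (the threshold value and the box format are my own assumptions, not from the paper):

import numpy as np

def missing_object_map(context_heatmap, boxes, threshold=0.5):
    # Highlight pixels where the context says an object could be,
    # but no object was actually detected there.
    outside_boxes = np.ones(context_heatmap.shape, dtype=bool)   # 1 outside detected boxes
    for x1, y1, x2, y2 in boxes:
        outside_boxes[y1:y2, x1:x2] = False                      # 0 inside detected boxes
    likely_context = context_heatmap >= threshold                # 1 where context probability is high
    return np.logical_and(outside_boxes, likely_context)

# Toy usage: a 100x100 context heat map and one detected bounding box
heatmap = np.random.rand(100, 100)
print(missing_object_map(heatmap, boxes=[(20, 30, 60, 80)]).sum())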


More information can be found in their paper

image source: http://openaccess.thecvf.com/content_cvpr_2017/papers/Sun_Seeing_What_Is_CVPR_2017_paper.pdf

Tuesday, January 15, 2019

DDoC #02: Discovering Causal Signals in Images

Why?
Modern computer vision techniques are very good at modeling correlations between input image pixels and image features. However, if we could perceive the causal structure of a given scene, it would help us reason better about the real world.

What?
Discovering Causal Signals in Images (source: https://arxiv.org/abs/1605.08179)
A novel observational discovery technique called the "Neural Causation Coefficient (NCC)" is proposed to predict the direction of the causal relationship between a pair of random variables. NCC is then used to distinguish between object features and context features.

How?
  • Causal feature (X → Y): a real-world entity (X) that causes the presence of an object (Y), e.g., a car is there in the presence of a bridge (because it does not make sense to have a car floating over a river without a bridge) 
  • Anti-causal feature (X ← Y): a real-world entity (X) that is caused by the presence of an object (Y), e.g., a wheel is there in the presence of a car
  • Object feature: a feature that is mostly activated inside the bounding box of an object (e.g., a car)
  • Context feature: a feature that is mostly activated outside the bounding box of an object (e.g., the background) 
NCC is learned on a synthetic dataset and used to predict the direction (causal or anti-causal) of a given image feature-semantic category (class) pair. The extracted causal and anti-causal features are then used to check whether they relate to object features and context features. In their experiments, the authors found that object features are mostly related to anti-causal features. They also observed that context features can relate to either causal or anti-causal features (e.g., road [context] → car vs. car → shadow [context]). 
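As I understand the NCC architecture, each observed sample (x_i, y_i) of a variable pair is embedded by a small network, the embeddings are averaged over the samples, and a classifier outputs the probability that X causes Y. The PyTorch sketch below is only a rough illustration of that idea; the layer sizes are my own guesses.

import torch
import torch.nn as nn

class NCC(nn.Module):
    # Scores the probability that X -> Y from a bag of samples {(x_i, y_i)}.
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.classify = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                      nn.Linear(hidden, 1))

    def forward(self, pairs):                    # pairs: (batch, n_samples, 2)
        e = self.embed(pairs).mean(dim=1)        # average over the samples of each pair
        return torch.sigmoid(self.classify(e))   # P(X -> Y); low values suggest Y -> X

# Toy usage: 8 variable pairs, each observed through 100 samples
print(NCC()(torch.randn(8, 100, 2)).shape)       # torch.Size([8, 1])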

Application: detect object locations in a robust manner regardless of their context 

More information can be found in their paper

Monday, January 14, 2019

DDoC #01: Grad-CAM: Gradient based Class Activation Mapping

Daily Dose of Creativity (DDoC) is my attempt to learn something innovative/creative on a daily basis. Today's paper is "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization".

Why?
Just having CNNs correctly classify the given input images is not enough. We need an intuitive explanation of the reasons behind the classification decision. Does the CNN learn the right cues, or does it learn some unrelated cue due to biased information in the training data (e.g., unrelated background patterns)? Is its behavior similar to how humans recognize objects in images?

What?
Grad-CAM for cat
We need to know which regions of an input image the CNN uses to classify the image into a certain class. We also need to know which discriminative characteristics (e.g., patterns) in those regions contributed most to the classification (e.g., the stripes of a tiger cat).

How?
Grad-CAM finds the importance of a given neuron for a particular class decision. For that, it computes the gradient of the class score with respect to the feature maps of a convolutional layer. The gradients flowing back are globally average pooled; they represent the importance (weight) of each activation map for the particular class label. Then ReLU is applied to the importance-weighted combination of forward activation maps to derive the image regions with a positive influence on the class of interest. The importance of each region is projected as a heat map onto the original image.
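A minimal PyTorch sketch of those steps (not the authors' implementation; a randomly initialized ResNet-18 is used here just to show the mechanics):

import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()          # random weights, mechanics only

features = {}
model.layer4.register_forward_hook(lambda m, i, o: features.update(maps=o))

image = torch.randn(1, 3, 224, 224)                   # stand-in for a preprocessed image
scores = model(image)
class_idx = scores.argmax(dim=1).item()               # class of interest

# Gradient of the class score w.r.t. the convolutional feature maps
grads = torch.autograd.grad(scores[0, class_idx], features['maps'])[0]
weights = grads.mean(dim=(2, 3), keepdim=True)        # global average pooling -> importance
cam = F.relu((weights * features['maps']).sum(dim=1, keepdim=True))   # weighted sum + ReLU
cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear', align_corners=False)
print(cam.shape)                                      # torch.Size([1, 1, 224, 224]) heat map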

More information can be found in their paper.

Sunday, October 28, 2018

On Creativity and Abstractions of Neural Networks

"Are GANs just a tool for human artists? Or are human artists at
the risk of becoming tools for GANs?"
Today we had a guest lecture titled "Creativity and Abstractions of Neural Networks" by David Ha (@HardMaru), Research Scientist at Google Brain, facilitated by Michal Fabinger.

Among all the interesting topics he discussed, such as Sketch-RNN, Kanji-RNN and world models, what captivated me most were his ideas about abstraction, machine creativity and evolutionary models. What he discussed on those topics (as I understood it) is:


  • Generating images from latent vectors in autoencoders is a useful way to understand how the network forms abstract representations of the data. In world models [1], he used an RNN to predict the next latent vector, which can be thought of as an abstract representation of reality.
  • Creative machines learn and form new policies to survive or to perform better. This can be somewhat evolutionary (maybe not within the lifetime of a single agent). The agents can also adapt to different scenarios by modifying themselves (self-modifying agents).
Some other quotes and facts about human perception that (I think) have inspired his work: 

Sketch-RNN [2]:

"The function of vision is to update the internal model of the world inside our head, but what we put on a piece of paper is the internal model" ~ Harold Cohen (1928 -2016), Reflections of design and building AARON

World Models:

"The image of the world around us, which we carry in our head, is just a model. Nobody in their head imagines all the world, government or country. We have only selected concepts, and relationships between them, and we use those to represent the real system." ~ Jay Write Forrester (1918-2016), Father of system dynamics

[1] https://worldmodels.github.io/
[2] https://arxiv.org/abs/1704.03477

Sunday, July 1, 2018

Taskonomy: Disentangling Task Transfer Learning


image source: Taskonomy [1]
Common computer vision tasks, such as depth estimation and edge detection, are usually performed in isolation.

While scanning through this year’s CVPR papers, I noticed this interesting research [1] (CVPR Best Paper award winner!) that introduced a term called “Taskonomy” (Task + Taxonomy).

Taskonomy focuses on deriving the relationships between these common computer vision tasks, so that representations obtained from one task can be reused for other tasks (saving computation time and/or reducing the need for labeled data).

This is also known as ‘task transferability’.  Some interesting visualizations and more information on this research can be found here.

[1] http://taskonomy.stanford.edu/

Thursday, June 28, 2018

Look Closer to See Better

image source: wikipedia
Hearing about this recent research made me feel a little dumb, and hopefully you will feel the same too. But anyway, it's quite impressive to see the advanced tasks that machines are becoming capable of. What we usually hear is that even though recognizing a cat is a simple task for humans, it is quite a challenging task for a machine, or let's say... for a computer.

Now, try to recognize what's in this image. If I were given this task, I would just say that it's a 'bird', and hopefully you would too, unless you are a bird expert or enthusiast. Of course it's a bird, but what if your computer is smart enough to say that it's a 'Laysan albatross'? 😂 Not feeling dumb enough yet? It seems the computer is also aware of which features in which areas of its body make it a 'Laysan albatross'.

Even though there exists some promising research on region detection and fine-grained feature learning (e.g., finding which regions of this bird contain features that discriminate it from other bird species, and then learning those features, so that we can recognize the bird species in a new, previously unseen image), it still has limitations.

So this research [1] focuses on a method where the two components, namely attention-based region detection and fine-grained feature learning, strengthen or reinforce each other by giving each other feedback to perform better as a whole. The first component starts by looking at the coarse-grained features of a given image to identify which areas to pay more attention to. Then the second component further analyzes the fine-grained details of those areas to learn what features make this region unique to the species. If the second component struggles to make a confident decision about the bird species, it informs the first component that the selected region might not be very accurate (one way to express this feedback is sketched below). 
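As I understand it, part of that feedback can be expressed as an inter-scale ranking loss: zooming into the attended region should make the network more confident about the true class, and the region proposer is penalized otherwise. A rough sketch under that reading (the margin value is my own choice):

import torch
import torch.nn.functional as F

def inter_scale_ranking_loss(coarse_logits, fine_logits, target, margin=0.05):
    # Penalize cases where zooming in does NOT increase confidence in the true class,
    # i.e., the attended region was probably chosen poorly.
    p_coarse = F.softmax(coarse_logits, dim=1).gather(1, target.unsqueeze(1))
    p_fine = F.softmax(fine_logits, dim=1).gather(1, target.unsqueeze(1))
    return F.relu(p_coarse - p_fine + margin).mean()

# Toy usage: logits for 4 images over 200 bird classes at two zoom levels
coarse, fine = torch.randn(4, 200), torch.randn(4, 200)
target = torch.randint(0, 200, (4,))
print(inter_scale_ranking_loss(coarse, fine, target))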

More information about this research can be found here.

[1] J. Fu, H. Zheng and T. Mei, "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 4476-4484.


Sunday, June 24, 2018

What makes Paris look like Paris?

windows with railings in Paris
Cities have their own character; maybe that's what makes some cities more notable than others. In her award-winning memoir "Eat, Pray, Love", Elizabeth Gilbert mentions that there's a 'word' for each city. She assigns 'Sex' to Rome, 'Achieve' to New York, and 'Conform' to Stockholm (to add more cities that I have been to, how about 'Tranquility' for Kyoto, 'Elegance' for Ginza and 'Vibrant' for Shibuya?). When terrorists attacked Paris in 2015, more than 7 million people shared their support for Paris under the #PrayforParis hashtag within 10 hours. Have you ever thought about which characteristics make a city feel the way it does? Can we build a machine that can 'feel' the same way about cities as humans do? 

Maybe we are not there yet. Nevertheless, researchers from Carnegie Mellon University and Inria have taken an innovative baby step in this research direction by asking the question "What makes Paris look like Paris?" [1]. Is it the Eiffel tower that makes Paris look like Paris? How can we tell whether a given image was taken in Paris if the Eiffel tower is not present in it? 


To start with, they asked people who had been to Paris before to distinguish Paris from other cities like London or Prague. Humans could perform this task with a significant level of accuracy. To build a machine that perceives a city in the same way a human does, we first need to figure out which characteristics of Paris help humans perceive Paris as Paris. So their research focuses on automatically mining the frequently occurring patterns or characteristics (features) that make Paris geographically more discriminative than other cities. Even though there can be both local and global features, the researchers focus only on local, high-dimensional features. Hence, image patches at different resolutions, represented as HOG+color descriptors, are used for the experiments. Image patches are labeled as two sets, Paris and non-Paris (London, Prague, etc.). Initially, the non-discriminative patches, things that can occur in any city such as cars or sidewalks, are eliminated using a nearest-neighbor algorithm: if an image patch is similar to other image patches in both the Paris set and the non-Paris set, that patch is considered not discriminative, and vice versa. 
Paris Window painting
by Janis McElmurry

However, the notion of "similarity" can be quite subjective when it comes to similarity between different aspects, so the standard similarity measures used in the nearest-neighbor algorithm might not represent the similarity between elements from different cities well. Accordingly, the researchers came up with a distance (similarity) metric that is learned iteratively from the available image patches to find discriminative features. This algorithm is run on images from different cities, such as Paris and Barcelona, to find the distinctive stylistic elements of each city. A rough sketch of the basic nearest-neighbor mining step follows below.
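As a toy illustration of that mining step (leaving out the descriptor extraction and the learned metric), one can score each Paris patch by how many of its nearest neighbors in descriptor space come from Paris rather than from other cities. The descriptor size and neighbor count below are arbitrary:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def discriminativeness(paris_desc, non_paris_desc, k=20):
    # Patches whose neighbors are split across both sets (cars, sidewalks, ...)
    # score low; patches whose neighbors are mostly Parisian score high.
    all_desc = np.vstack([paris_desc, non_paris_desc])
    is_paris = np.arange(len(all_desc)) < len(paris_desc)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(all_desc)
    _, idx = nn.kneighbors(paris_desc)           # k+1 because each patch finds itself
    return is_paris[idx[:, 1:]].mean(axis=1)     # fraction of Paris neighbors

# Toy usage with random stand-ins for HOG+color patch descriptors
paris, elsewhere = np.random.rand(500, 64), np.random.rand(500, 64)
print(discriminativeness(paris, elsewhere).mean())   # around 0.5 for random descriptors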

An interesting fact about this research (well, at least for me) is that artists can use these findings as cues to better capture the style of a given place. More details about this research can be found here.

[1] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What Makes Paris Look like Paris? ACM Transactions on Graphics (SIGGRAPH 2012), August 2012, vol. 31, No. 3.

Tuesday, June 19, 2018

Panoptic Segmentation


Panoptic segmentation is a topic that was discussed during our lab seminar recently, because it could potentially improve scene understanding in autonomous vehicles using vision sensors. 

Successful approaches based on convolutional nets have previously been proposed for the semantic segmentation task. Further, methods based on object or region proposals have become popular for detecting individual objects as well. 

Image source: [1]
The idea behind panoptic segmentation [1] is to unify the tasks of semantic segmentation (studying 'stuff' such as sky and grass regions) and instance segmentation using object detectors (studying countable 'things', e.g., different instances of cars). 

The 'panoptic quality (PQ)' metric is proposed as a novel way to evaluate the approach; a small sketch of its definition follows below. More details about this can be found here and a simpler version here.  
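For reference, PQ matches predicted and ground-truth segments whose IoU exceeds 0.5 and divides the sum of matched IoUs by |TP| + 0.5|FP| + 0.5|FN|. A minimal sketch, assuming the IoUs of the matched pairs have already been computed:

def panoptic_quality(matched_ious, num_unmatched_pred, num_unmatched_gt):
    # matched_ious: IoUs of predicted/ground-truth segment pairs with IoU > 0.5 (the TPs);
    # unmatched predictions count as FPs, unmatched ground-truth segments as FNs.
    tp = len(matched_ious)
    denom = tp + 0.5 * num_unmatched_pred + 0.5 * num_unmatched_gt
    return sum(matched_ious) / denom if denom else 0.0

# Toy usage: 3 matched segments, 1 spurious prediction, 2 missed segments
print(panoptic_quality([0.9, 0.75, 0.6], num_unmatched_pred=1, num_unmatched_gt=2))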

[1] Panoptic Segmentation, Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, Piotr Dollar

Wednesday, August 2, 2017

Neuroscience inspired Computer Vision

Source: https://www.pinterest.com/explore/visual-cortex/

Having read the profound masterpiece "When Breath Becomes Air" by the neurosurgeon and writer Paul Kalanithi, I was curious about how neuroscience could contribute to AI (computer vision in particular). 

Then, I found a comprehensive review article in the journal Neuron (written by Demis Hassabis, Dharshan Kumaran, Christopher Summerfield and Matthew Botvinick) titled "Neuroscience-Inspired Artificial Intelligence". Here is a brief summary of the concepts related to computer vision that I found inspiring in that article.


Past
CNNs
  • How visual input is filtered and pooled by simple and complex cells in area V1 of the visual cortex
  • Hierarchical organization of mammalian cortical systems 
Object recognition 
  • Transforming raw visual input into an increasingly complex set of features - to achieve invariance to pose, illumination and scale
Present
Attention 
  • Visual attention shifts strategically among different objects (not all objects get equal priority) - to ignore irrelevant objects in a cluttered scene; applications include multi-object recognition, image-to-caption generation, and generative models that synthesize images 
Future 
Intuitive understanding of physical world 
  • Interpret and reason about scenes by decomposing them into individual objects and their relations 
  • Redundancy reduction (encourages the emergence of disentangled representations of independent factors such as shape and position) - to learn objectness and construct rich object models from raw inputs using deep generative models, e.g., variational autoencoders 
Efficient Learning 
  • Rapidly learn new concepts from only a handful of examples (Related with Animal learning, developmental psychology) 
  • Characters challenge - distinguishing novel instances of an unfamiliar handwritten character from one another - "learn to learn" networks
Transfer Learning
  • Generalizing or transferring knowledge gained in one context to novel, previously unseen domains (e.g., a human who can drive a car can also drive an unfamiliar vehicle) - progressive networks 
  • Neural coding using Grid codes in Mammalian entorhinal cortex - To formulate conceptual representations that code abstract, relational information among patterns of inputs (not just invariant features) 
Virtual brain analytics 
  • Increase the interpretability of AI computations; determine the response properties of units in a neural network 
  • Activity maximization - To generate synthetic images by maximizing the activity of certain classes of unit 
From AI to neuroscience
  • Enhancing the performance of CNNs has also yielded new insights into the nature of neural representations in high-level visual areas, e.g., using 30 network architectures from AI to explain the structure of the neural representations observed in the ventral visual stream of humans and monkeys

Sunday, November 20, 2016

How to extract frames in a video using ffmpeg?

You can follow the steps given below to extract all the frames of a video using the ffmpeg tool. 

  • Download ffmpeg package for your OS
     https://www.ffmpeg.org/download.html#build-mac

  • Unzip the package and move into the extracted folder 
      cd /Lion_Mountain_Lion_Mavericks_Yosemite_El-Captain_04.11.2016

  • Extract frames using the following command:

./ffmpeg -i [your input video file] -r [frame rate] [output file format]

E.g.,

For a video named test.mp4, extracting frames at 8 fps:

./ffmpeg -i test.mp4 -r 8/1 output%03d.jpeg

All the frames will be saved to the directory in which the command is executed.