
Saturday, January 16, 2021

On Intelligence: How a New Understanding of the Brain Will Lead to the Creation of Truly Intelligent Machines by Jeff Hawkins

I can't believe I've written only a single blog post in 2020. 😲 2021 is supposed to be the year that I read more books. Let's see ;) I hope what happened to my #dailydoseofcreativity in 2019 won't happen again with my new 2021 goal 🙈. A daily commitment is too overwhelming, so that was a bad idea. Let's say bi-monthly, to be on the safe side. Here's my first book. I've only read the first chapter, and I'm already curious about his next book, 'A Thousand Brains: A New Theory of Intelligence', as well.

Here are my favorite quotes so far. 💘

Many people today believe that AI is alive and well and just waiting for enough computing power to deliver on its many promises. When computers have sufficient memory and processing power, the thinking goes, AI programmers will be able to make intelligent machines. I disagree. AI suffers from a fundamental flaw in that it fails to adequately address what intelligence is or what it means to understand something. 

Turing machine: its central dogma is that the brain is just another kind of computer. It doesn't matter how you design an artificially intelligent system; it just has to produce human-like behavior.

Behaviorism: The behaviorists believed that it was not possible to know what goes on inside the brain, which they called an impenetrable black box. But one could observe and measure an animal's environments and its behaviors - what it senses and what it does, its inputs and outputs. They conceded that the brain contained reflex mechanisms that could be used to condition an animal into adopting new behaviors through rewards and punishments. But other than this, one did not need to study the brain, especially messy subjective feelings such as hunger, fear, or "what it means to understand something".

Behavior is a manifestation of intelligence, but not a central characteristic of being intelligent. 

Wednesday, September 23, 2020

ECCV 2020: My Takeaways

ECCV 2020 was my first virtual conference experience. There was a very fancy virtual conference environment that looked somewhat realistic :)

source: https://eccv.6connex.eu/

One workshop that I attended even had a fun avatar just for me, using Gather Town. I was mindlessly running here and there among the virtual booths just because it was so much fun 😂. This time I wanted to focus more on domains that I'm not familiar with, so I chose sessions accordingly. Needless to say, having so many interesting sessions was quite overwhelming (in a good way), so I first had to browse through everything and prioritize which ones to attend. Both the conference and the 'workshops & tutorials' sites were quite well organized, so it wasn't difficult to do so. Honestly, I felt that a virtual conference is more effective and efficient in many ways, if we forget about the sightseeing aspect of live conferences ;).

I mainly attended two workshops.

  • Computer vision for medical imaging: My research mainly focuses on human-like scene understanding, and a digital camera sees the world somewhat similarly to how a human would. So, it was interesting to see machine perception from a different perspective, where we can see beyond the visible spectrum. There was a nice introductory session for newcomers to medical imaging that explained very comprehensively how X-ray, ultrasound, gamma, MRI, and PET scans are captured.
    • How will AI transform healthcare? A (stroke) clinician's view
    • Challenges and pitfalls in medical image analysis 
  • Video Turing test: Toward Human-level Video Story Understanding: I attended two main sessions in this workshop. 
    • 10 questions for a theory of vision by Marco Gori: Using motion invariance [1] as a fundamental way to incorporate scale invariance, rotation invariance, deformation invariance, etc. is quite innovative. He also discussed how the "time" aspect is ignored in visual recognition, which I only partly agree with. Of course, the temporal dimension is clearly important for scene understanding. However, I feel humans can clearly recognize a single image, and even an action or activity to some extent, without having multiple frames.
    • Common sense intelligence by Yejin Choi: Common-sense reasoning is clearly a brave topic, as it is hard to define precisely, yet it is a very important aspect of realizing AI. If I say that this talk completely blew my mind, I'm not exaggerating. She discussed the gap between perception-level and cognition-level visual understanding, and inferring the dynamic state changes of the world. The related paper for this talk is [2]. So glad to see that this talk somewhat supports my view about still-image recognition (..feeling relieved lol 👻).
Highlights from the regular sessions:

A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses [3]
I attended this session because metric learning is my new-found love 😍 The core idea is that deep metric learning can be approached with both pairwise losses and cross-entropy: in both cases, minimizing the loss can be seen as maximizing the mutual information between the learned features and the labels. A rough sketch of the two routes is given below.
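Here is a minimal toy sketch (my own illustration, not the paper's formulation) of those two routes on the same embedding network: a margin-based pairwise contrastive loss, and cross-entropy through a linear classifier. All layer sizes and the margin are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
classifier = nn.Linear(32, 10)  # used only by the cross-entropy route

x = torch.randn(16, 128)            # a toy batch of 16 feature vectors
labels = torch.randint(0, 10, (16,))
z = F.normalize(embed(x), dim=1)    # L2-normalized embeddings

# Pairwise (contrastive) route: pull same-label pairs together, push others apart.
dist = torch.cdist(z, z)                                   # pairwise distances
same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float() # 1 if labels match
margin = 0.5
pairwise_loss = (same * dist.pow(2)
                 + (1 - same) * F.relu(margin - dist).pow(2)).mean()

# Cross-entropy route: a linear classifier on top of the same embeddings.
ce_loss = F.cross_entropy(classifier(z), labels)

print(pairwise_loss.item(), ce_loss.item())
```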

Grounded situation recognition [4] (AllenAI): 
This paper clearly points out the issue with image captioning when it comes to human-like model evaluation. I also feel that when humans see a scene, they understand semantic concepts rather than grammatically correct sentences. In addition to the main situation recognition task, they propose a few additional tasks such as conditional localization and grounded semantic chaining.

Women in computer vision 

Panel discussions and mentoring sessions were quite useful to keep us motivated during challenging times. I'm so grateful that role models in the computer vision field (both men and women) took some time and effort to share their experience with us. 
  • Push forward. You will find some way. 
  • If you want something, just ask for it. Be prepared for rejection. 
  • First, you should do some good work. Attention comes next. 
  • Quality over quantity (don't try to be a paper factory)
  • Don't go after low-hanging fruit (e.g., incremental research). Do something different and new. 
  • History is important in any field. 
  • Having hobbies not only helps you relax; it also helps your work.
  • Never give up :)
Industrial booths:
I got to know about Voxel51's FiftyOne tool. If you are doing object-detection-related research, this tool might come in handy to analyze your detection results deeply and easily. Loved the "confidence slider" and "uniqueness" features [5]. A rough usage sketch is below.
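For reference, here is a short sketch of how those two features can be used, based on my reading of the FiftyOne docs [5] at the time; the exact API may have changed since, so treat this as illustrative only.

```python
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone import ViewField as F

# Small sample dataset with object detections that ships with FiftyOne.
dataset = foz.load_zoo_dataset("quickstart")

# "Uniqueness" scores each sample by how distinct it is within the dataset.
fob.compute_uniqueness(dataset)

# The App exposes the confidence slider; the same filter can be written in code.
high_conf = dataset.filter_labels("predictions", F("confidence") > 0.75)

session = fo.launch_app(high_conf)  # browse the filtered detections interactively
```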

Overall, it was such a great experience and kept me fascinated during the pandemic. It was like an "academic vacation" 💃. Kudos to the organizers who successfully organized a virtual conference for the first time. 

[1] Betti, A., Gori, M. and Melacci, S., 2020. Learning visual features under motion invariance. Neural Networks.

[2] Park JS, Bhagavatula C, Mottaghi R, Farhadi A, Choi Y. VisualCOMET: Reasoning about the Dynamic Context of a Still Image.

[3] https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123510545.pdf 

[4] https://arxiv.org/abs/2003.12058 

[5] https://voxel51.com/docs/fiftyone/user_guide/brain.html

Tuesday, January 29, 2019

DDoC #06: How to capture the Global Context "Implicitly" using CNNs?

Some time back, I had a look at how the global context is implicitly captured in recent holistic scene understanding methods. Here's a small write-up about a few things I observed.

image credits: https://quotefancy.com
CNNs are the most frequently used technique for scene understanding these days. The success of CNNs (applied to image classification) can mostly be attributed to the concepts of receptive field and locality. CNN-based models are inherently translation invariant due to locality. Both convolution and pooling operations progressively increase the receptive field, which helps derive abstract concepts from low-level features for better generalization. However, there are still ongoing arguments about the practical effective receptive field being much smaller than the theoretical one. Further, looking only at a small region of the input image results in the loss of global spatial context. Yet global context is quite useful for holistic scene understanding, so it is important to take a wider look at the input image.

How to increase the receptive field in CNNs

  • Adding multiple deeper layers 
  • Incorporate multi-scale scene context 
    • Inflating the size of the filter (a.k.a. dilated convolution)
    • Pooling at different scales (a.k.a. spatial pyramid pooling); a small sketch of these options follows below
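Here is a small sketch (my own illustration, not tied to a specific paper) of the last two options: a dilated convolution that widens the receptive field without changing the spatial resolution, and a spatial pyramid pooling helper that aggregates context at several scales.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 32, 32)  # a toy feature map

# Dilated convolution: same 3x3 kernel, but the receptive field grows with dilation.
conv_plain = nn.Conv2d(64, 64, kernel_size=3, padding=1)                 # 3x3 field
conv_dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)   # 5x5 field
print(conv_plain(x).shape, conv_dilated(x).shape)  # spatial size is preserved

# Spatial pyramid pooling: pool the same map at several scales and concatenate.
def spatial_pyramid_pool(feat, bins=(1, 2, 4)):
    pooled = [F.adaptive_avg_pool2d(feat, b).flatten(1) for b in bins]
    return torch.cat(pooled, dim=1)  # fixed-length descriptor with multi-scale context

print(spatial_pyramid_pool(x).shape)  # (1, 64 * (1 + 4 + 16))
```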

Semantic segmentation is the most widely used technique for holistic scene understanding. Fully Convolutional Networks (FCNs) adapted CNNs (initially used for image classification) to the task of semantic segmentation. Given an input image, semantic segmentation outputs a mask that assigns a pre-determined semantic category to each pixel in the image. An FCN achieves this by downsampling image features, followed by a rapid upsampling procedure to reconstruct the segmentation mask. However, this rapid upsampling leads to a loss of contextual information: recovering pixel-level fine details from overly coarse features (the input to the upsampling layer) is difficult.

How to upsample without losing contextual information? 

  • Learn to upsample while remembering the lost context (deconvolution with unpooling; pooling is [originally] used to filter out noisy activations)
  • Use the semantic information flow that is inherently built into CNNs at different scales (feature pyramids); a tiny sketch of both routes follows below
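A tiny sketch (again my own illustration, not from a specific paper) of the two upsampling routes: unpooling with remembered switch locations, and a learned transposed ("deconvolution") layer.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)  # a toy feature map

# Max pooling that remembers which locations were kept ...
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
down, switches = pool(x)                      # (1, 16, 16, 16)

# ... so unpooling can place activations back where they came from.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
restored = unpool(down, switches)             # (1, 16, 32, 32)

# A transposed convolution instead *learns* how to upsample.
deconv = nn.ConvTranspose2d(16, 16, kernel_size=2, stride=2)
learned_up = deconv(down)                     # (1, 16, 32, 32)

print(restored.shape, learned_up.shape)
```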
I will mention the related papers soon. :)

Wednesday, January 23, 2019

DDoC #05: Visual memory: What do you know about what you saw?

A random read on "Gist of a Scene" this time, by Dr. Jeremy M. Wolfe.

When it comes to remembering a scene, humans do not go through all the details of the scene; what matters is only the gist of the scene. However, what constitutes the scene gist is not yet agreed upon. Some of the findings in that research direction are as follows:

1. A change in appearance does not necessarily change the scene gist (e.g., people remember a scene of two women talking irrespective of the color of the clothes they wear); failing to notice such changes is called "change blindness"
2. Scene gist is not just a collection of objects; the relationships between the objects in the scene also matter (e.g., milk being poured from a carton into a glass is not the same as a picture of milk being poured from a carton into the space next to a glass)
3. Scene gist involves some information about the spatial layout of the scene
4. Scene gist also involves the presence of unidentified objects (people do not see all the objects, but they know that certain objects should be there even if they are not visible)

You can find more information in his article.

Thursday, January 17, 2019

DDoC #04: Connecting Look and Feel

Today's paper is "Connecting Look and Feel: Associating the Visual and Tactile Properties of Physical Materials". (Figure 1: a self pat on my back for working on this even when I'm a little sick and in a bad mood. And then I found this cool image!)

Figure 1: self pat on my back source:
http://massagebywil.com/2011/10/25/pat-yourself-on-the-back/
Why?
Humans use visual cues to infer the material properties of objects. Further, touch is an important mode of perception that helps both robots and humans interact effectively with the outside world. 

What?
Project the inputs from different modalities into a shared embedding space to associate visual and tactile information.

How? 
Fabrics are clustered into different groups using the K-nearest neighbor algorithm, based on physical properties such as thickness, stiffness, stretchiness, and density. To a human, fabrics in the same cluster will have similar properties. 

Clusters of Fabrics with different properties 

Input:
Different modalities of a fabric image (depth, color, and tactile images from a touch sensor) 
Output: Determine whether the different modalities come from the same fabric or from different fabrics 

Process: 
First, a low-dimensional representation (an embedding) of each input modality is extracted using a CNN. Then the distance between these embeddings is measured. The idea is to have a smaller distance between different modalities of the same fabric and a larger distance between modalities of different fabrics.


So, the goal of the optimization is to minimize the distance between different modalities of the same fabric using a contrastive loss (in layman's terms, neighbors are pulled together and non-neighbors are pushed apart). A minimal sketch of this setup is given below.
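The following is a minimal sketch of that setup as I understand it (not the authors' code): two tiny modality-specific encoders and a margin-based contrastive loss on the joint embedding. The architectures, margin, and input sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def encoder():
    # Tiny stand-in CNN for one modality (color / depth / tactile image).
    return nn.Sequential(
        nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32),
    )

visual_enc, tactile_enc = encoder(), encoder()

visual = torch.randn(8, 1, 64, 64)               # toy batch of visual crops
tactile = torch.randn(8, 1, 64, 64)              # toy batch of touch-sensor images
same_fabric = torch.randint(0, 2, (8,)).float()  # 1 = same fabric, 0 = different

zv = F.normalize(visual_enc(visual), dim=1)
zt = F.normalize(tactile_enc(tactile), dim=1)

# Contrastive loss: pull matching pairs together, push mismatched pairs apart.
d = (zv - zt).pow(2).sum(dim=1).sqrt()
margin = 1.0
loss = (same_fabric * d.pow(2)
        + (1 - same_fabric) * F.relu(margin - d).pow(2)).mean()
print(loss.item())
```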

More information can be found in their paper

Wednesday, January 16, 2019

DDoC #03: Seeing What Is Not There: Learning Context to Determine Where Objects Are Missing

My Daily Dose of Creativity (DDoC): Day 03 (No, not giving up yet :P)

Why?
Most computer vision algorithms focus on what is seen in the image, e.g., is there a curb ramp in this image? If so, where is it? We also need to infer what is not there in the scene, e.g., where could a curb ramp be in this image? Can there be a curb ramp in this image?
Where could a curb ramp be in the second image (green)?


What?
Learn contextual features to predict the possibility of having an object in the scene, even if the object cannot be seen clearly

How?

  • Training: Learn contextual features (e.g., the surrounding characteristics) for a given object category (e.g., curb ramps) using a binary classifier that predicts whether there can be an object in the given image or not. Positive examples are created by masking out the object bounding box in images. Negative examples are created from random image crops by masking out a similar, corresponding area (to prevent the network from learning the masking dimensions). If there were other objects around a positive example, they would also be treated as context, so images with multiple objects are not used. This constitutes the base of the training algorithm. Then, in order to mitigate the artifact issues caused by the masking, another classifier is trained without bounding-box masks; the idea is to just ignore the object and let the network implicitly learn the context. During training, both the classification loss and a distance loss (the difference between the first, explicit context classifier and the second, implicit context classifier) are taken into consideration.  
  • Inference: First, object bounding boxes are detected. Pixels inside the detected boxes are marked as 0 and pixels outside as 1. Then, a heat map representing the context is generated, where pixels with high probability are marked as 1 and those with low probability as 0. After that, a pixel-wise AND operation is performed between these two representations (a small sketch of this step is given below).  
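A small NumPy sketch of that masking step as I read it; the threshold, grid size, and box format are my own placeholders for illustration.

```python
import numpy as np

H, W = 6, 8
heatmap = np.random.rand(H, W)     # context classifier output per pixel
boxes = [(1, 1, 3, 4)]             # detected object boxes as (y1, x1, y2, x2)

not_object = np.ones((H, W), dtype=np.uint8)
for y1, x1, y2, x2 in boxes:
    not_object[y1:y2, x1:x2] = 0   # pixels inside detected boxes -> 0

context = (heatmap > 0.5).astype(np.uint8)  # high-probability context -> 1

# Pixel-wise AND: likely locations for the missing object.
missing_object_mask = np.logical_and(not_object, context).astype(np.uint8)
print(missing_object_mask)
```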


More information can be found in their paper

image source: http://openaccess.thecvf.com/content_cvpr_2017/papers/Sun_Seeing_What_Is_CVPR_2017_paper.pdf

Tuesday, January 15, 2019

DDoC #02: Discovering Causal Signals in Images

Why?
Modern computer vision techniques are very good at modeling the correlations between input image pixels and image features. However, if we can perceive the causal structure of a given scene, it helps us reason better about the real world.

What?
Discovering Causal Signals in Images (source: https://arxiv.org/abs/1605.08179)
A novel observational causal discovery technique called the "Neural Causation Coefficient" (NCC) is proposed to predict the direction of the causal relationship between a pair of random variables. NCC is then used to distinguish between object features and context features.

How?
  • Causal feature (X → Y): a real-world entity (X) that causes the presence of an object (Y), e.g., a car is present because of a bridge (it does not make sense to have a car floating over a river without a bridge) 
  • Anti-causal feature (X ← Y): a real-world entity (X) that is caused by the presence of an object (Y), e.g., a wheel is there because of the car
  • Object feature: a feature that is mostly activated inside the bounding box of an object (e.g., the car)
  • Context feature: a feature that is mostly activated outside the bounding box of an object (e.g., the background) 
NCC is learned on a synthetic dataset to predict the direction (→ or ←, i.e., causal or anti-causal) of a given image-feature/semantic-category (class) pair. Then, the extracted causal and anti-causal features are used to check whether they relate to object features and context features. In their experiments, they found that object features are mostly related to anti-causal features. They also observed that context features can relate to either causal or anti-causal features (e.g., road [context] → car vs. car → shadow [context]). A rough NCC-style sketch is given below.
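Below is a rough sketch of an NCC-style classifier based on my reading of the paper's description (embed each (x, y) point, average over the sample, then classify the direction); it is not the authors' implementation, and the toy cause-effect pair is only for illustration.

```python
import torch
import torch.nn as nn

class NCC(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        # Per-point embedding, shared across all (x, y) samples of a variable pair.
        self.point = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        # Classifier head applied to the averaged embedding.
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, xy):                  # xy: (num_points, 2) for one variable pair
        return self.head(self.point(xy).mean(dim=0))  # e.g., >0: X→Y, <0: Y→X

# Toy synthetic pair where X causes Y (Y = X^2 + noise).
x = torch.randn(500, 1)
y = x ** 2 + 0.1 * torch.randn(500, 1)
score = NCC()(torch.cat([x, y], dim=1))
print(score)  # untrained output; training would use many labeled synthetic pairs
```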

Application: detect object locations in a robust manner regardless of their context 

More information can be found in their paper.

Monday, January 14, 2019

DDoC #01: Grad-CAM: Gradient based Class Activation Mapping

Daily Dose of Creativity (DDoC) is my attempt to learn something innovative/creative on a daily basis. Today's paper is "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization".

Why?
Just having CNNs correctly classify the given input images is not enough. We need an intuitive explanation of the reasons behind the classification decision. Does the CNN learn the right cues, or does it learn some unrelated cue due to biased information in the training data (e.g., unrelated background patterns)? Is this similar to how humans recognize objects in images?

What?
Grad-CAM for cat
We need to know which regions of an input image the CNN uses to classify it into a certain class. We also need to know which discriminative characteristics (e.g., patterns) in those regions contributed most to the classification (e.g., the stripes of a tiger cat).

How?
Grad-CAM finds the importance of a given neuron for a particular class decision. For that, it computes the gradient of the class score with respect to the feature maps of a convolutional layer. The gradients flowing back are globally average-pooled; these values represent the importance (weight) of each activation map for the class. Then a ReLU is applied to the weighted (by importance) combination of the forward activation maps to obtain the image regions with a positive influence on the class of interest. The result is projected as a heat map onto the original image. A compact sketch is given below.
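Here is a compact Grad-CAM sketch in PyTorch following the description above. The backbone, layer choice, weights argument, and random input are my own assumptions for illustration (a recent torchvision is assumed), not the authors' exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1").eval()

feats, grads = {}, {}
layer = model.layer4  # last convolutional block
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)        # stand-in for a preprocessed image
scores = model(img)
cls = scores.argmax(dim=1)
scores[0, cls.item()].backward()         # gradient of the chosen class score

weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # global-average-pooled grads
cam = F.relu((weights * feats["a"]).sum(dim=1))        # weighted sum + ReLU
cam = F.interpolate(cam.unsqueeze(1), size=img.shape[-2:], mode="bilinear")
print(cam.shape)  # (1, 1, 224, 224) heat map to overlay on the original image
```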

More information can be found in their paper.

Sunday, October 28, 2018

On Creativity and Abstractions of Neural Networks

"Are GANs just a tool for human artists? Or are human artists at
the risk of becoming tools for GANs?"
Today we had a guest lecture titled "Creativity and Abstractions of Neural Networks" by David Ha (@HardMaru), Research Scientist at Google Brain, facilitated by Michal Fabinger.

Among all the interesting topics he discussed, such as Sketch-RNN, Kanji-RNN, and world models, what captivated me most were his ideas about abstraction, machine creativity, and evolutionary models. What was discussed on those topics (as I understood it) is:


  • Generating images from latent vectors in autoencoders is a useful way to understand how the network forms abstract representations of the data. In world models [1], he used an RNN to predict the next latent vector, which can be thought of as an abstract representation of reality (a toy sketch follows this list).
  • Creative machines learn and form new policies to survive or to perform better. This can be somewhat evolutionary (maybe not within the lifetime of one agent). The agents can also adapt to different scenarios by modifying themselves (self-modifying agents).
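As a toy illustration of that first point (my own sketch, not David Ha's code [1]): an encoder compresses each frame to a latent vector, and an RNN predicts the next latent vector from the current latent and the action. All sizes here are arbitrary.

```python
import torch
import torch.nn as nn

latent_dim, action_dim = 32, 3

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))
rnn = nn.LSTM(latent_dim + action_dim, 128, batch_first=True)
to_next_z = nn.Linear(128, latent_dim)

frames = torch.randn(1, 10, 64, 64)     # a toy sequence of 10 grayscale frames
actions = torch.randn(1, 10, action_dim)

z = encoder(frames.view(10, -1)).view(1, 10, latent_dim)   # abstract states
h, _ = rnn(torch.cat([z, actions], dim=-1))
z_pred = to_next_z(h)                    # predicted next latent at each step

# Train by matching z_pred[:, :-1] against z[:, 1:] (predicting the next state).
loss = ((z_pred[:, :-1] - z[:, 1:]) ** 2).mean()
print(loss.item())
```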
Some other quotes and facts about human perception that (I think) have inspired his work: 

Sketch-RNN [2]:

"The function of vision is to update the internal model of the world inside our head, but what we put on a piece of paper is the internal model" ~ Harold Cohen (1928 -2016), Reflections of design and building AARON

World Models:

"The image of the world around us, which we carry in our head, is just a model. Nobody in their head imagines all the world, government or country. We have only selected concepts, and relationships between them, and we use those to represent the real system." ~ Jay Write Forrester (1918-2016), Father of system dynamics

[1] https://worldmodels.github.io/
[2] https://arxiv.org/abs/1704.03477

Wednesday, July 4, 2018

teamLab: Blurring the Boundaries between Art and Science

My Lizard Painting
Yesterday, we visited the MORI Building Digital Art Museum: teamLab Borderless (this name is quite long and too hard to remember in the right order :P), which opened recently in Odaiba... making Odaiba, or Tokyo for that matter, even greater! 

Even though some exhibits looked a bit trivial at the beginning (I felt that the exhibition ticket was somewhat overpriced, even at a discounted price and regardless of the fact that we did not pay for it), a second thought after further reading made me feel overwhelmed, impressed, and fascinated by the extent of innovation, creativity, and philosophical thought they have put into each piece of art.

This museum gives a great feel for how digital involvement can nicely complement traditional forms of art and overcome their inherent limitations. The museum is built on a few great concepts. One concept that highly captivated my curious mind (... well, curious about perception in all its forms) is their notion of ultra-subjective space. Comparing that concept with the Western perspective in paintings and ancient Japanese spatial recognition made the idea even more compelling.

If you are planning to visit this museum, I highly recommend that you understand those concepts before you go, in addition to the other "things you should read before you visit", to make your museum experience even better.

On the other hand, some activities were quite fun too. Look at how I painted a cute lizard and how it came alive a little later with all those "natural lizard-like moves"!