
Saturday, January 16, 2021

On Intelligence: How a New Understanding of the Brain Will Lead to the Creation of Truly Intelligent Machines by Jeff Hawkins

I can't believe I've written only a single blog post in 2020. 😲 2021 is supposed to be the year that I read more books. Let's see ;) I hope what happened to my #dailydoseofcreativity in 2019 won't happen again with my new year 2021 goal 🙈. A daily commitment is too overwhelming, so that was a bad idea. Let's say bi-monthly, to be on the safe side. Here's my first book; I've only read the first chapter and I'm already curious about his next book on the 'Thousand Brains Theory of Intelligence' as well.

Here are my favorite quotes so far. 💘

Many people today believe that AI is alive and well and just waiting for enough computing power to deliver on its many promises. When computers have sufficient memory and processing power, the thinking goes, AI programmers will be able to make intelligent machines. I disagree. AI suffers from a fundamental flaw in that it fails to adequately address what intelligence is or what it means to understand something. 

Turing machine - its central dogma: the brain is just another kind of computer. It doesn't matter how you design an artificially intelligent system; it just has to produce human-like behavior.

Behaviorism: The behaviorists believed that it was not possible to know what goes on inside the brain, which they called an impenetrable black box. But one could observe and measure an animal's environments and its behaviors - what it senses and what it does, its inputs and outputs. They conceded that the brain contained reflex mechanisms that could be used to condition an animal into adopting new behaviors through rewards and punishments. But other than this, one did not need to study the brain, especially messy subjective feelings such as hunger, fear, or "what it means to understand something".

Behavior is a manifestation of intelligence, but not a central characteristic of being intelligent. 

Sunday, October 28, 2018

On Creativity and Abstractions of Neural Networks

"Are GANs just a tool for human artists? Or are human artists at
the risk of becoming tools for GANs?"
Today we had a guest lecture titled "Creativity and Abstractions of Neural Networks" by David Ha (@HardMaru), Research Scientist at Google Brain, facilitated by Michal Fabinger.

Among all the interesting topics he discussed, such as Sketch-RNN, Kanji-RNN, and world models, what captivated me most were his ideas about abstraction, machine creativity, and evolutionary models. What he discussed on those topics (as I understood it) is:


  • Generating images from latent vectors in autoencoders is a useful way to understand how a network forms abstract representations of its data. In world models [1], he used an RNN to predict the next latent vector, which can be thought of as an abstract representation of reality (a rough sketch is given right after this list).
  • Creative machines learn and form new policies to survive or to perform better. This can be somewhat evolutionary (maybe not within the lifetime of a single agent). The agents can also adapt to different scenarios by modifying themselves (self-modifying agents).
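As a minimal, hypothetical sketch of that world-model idea (plain Java with stub methods; not David Ha's actual code): an encoder compresses each observation into a small latent vector z, and a recurrent model predicts the next latent vector from the current one and the action taken, so a controller can act on the abstract representation instead of raw pixels.

// Hypothetical world-model sketch: encode() and predictNextLatent() are stand-ins
// for the VAE encoder and the recurrent dynamics model described in [1].
public class WorldModelSketch {

    // compress a raw frame into a small latent vector (stub)
    static double[] encode(double[] frame) { return new double[] {0.1, -0.3, 0.7}; }

    // predict the next latent vector from the current latent, RNN state and action (stub)
    static double[] predictNextLatent(double[] z, double[] rnnState, int action) {
        return z.clone(); // a real implementation would run an RNN here
    }

    public static void main(String[] args) {
        double[] frame = new double[64 * 64 * 3];   // raw observation
        double[] z = encode(frame);                 // abstract representation of the frame
        double[] rnnState = new double[256];
        double[] zNext = predictNextLatent(z, rnnState, 1);
        // A controller can now be trained on (z, rnnState) inside this learned model
        // instead of on raw pixels.
        System.out.println("predicted next latent has " + zNext.length + " dimensions");
    }
}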
Here are some other quotes about human perception that (I think) have inspired his work.

Sketch-RNN [2]:

"The function of vision is to update the internal model of the world inside our head, but what we put on a piece of paper is the internal model" ~ Harold Cohen (1928 -2016), Reflections of design and building AARON

World Models:

"The image of the world around us, which we carry in our head, is just a model. Nobody in their head imagines all the world, government or country. We have only selected concepts, and relationships between them, and we use those to represent the real system." ~ Jay Write Forrester (1918-2016), Father of system dynamics

[1] https://worldmodels.github.io/
[2] https://arxiv.org/abs/1704.03477

Tuesday, July 3, 2018

Network Dissection to Divulge the Hidden Semantics of CNN

Needless to say, deep convolutional neural networks (CNNs) have gained immense popularity nowadays due to their ability to classify or recognize scenes and objects with reasonable accuracy. However, we already know that CNNs can be fooled by adversarial attacks: a given image that was accurately recognized by a CNN can be altered in a way that a human still recognizes it easily, yet the CNN fails to do so [1]. So the natural question arises: are they genuinely learning about objects and scenes the way we humans do?

Dissection
Researchers from MIT have recently conducted experiments along these lines, since what happens in the hidden layers of CNNs still remains largely a mystery [2]. Their experiments aim to find out whether individual hidden units align with human-interpretable concepts such as objects in a scene or parts of an object, e.g., a lamp (object detector unit) in scene recognition, or a bicycle wheel (part detector unit) in object detection. If so, they need a way to quantify this emergent 'interpretability'. It's interesting that neuroscientists perform a similar task to uncover the behavior of biological neurons too.

The researchers also conducted experiments to find which factors (e.g., the axis alignment of the representation, training techniques) influence the interpretability of those hidden units. They found that interpretability is axis-dependent: if a layer's representation is rotated to a different basis, the individual units are no longer as interpretable, even though the network's discriminative power stays the same. Further, different training techniques such as dropout or batch normalization also have an impact on interpretability.
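As I understand it, the quantification in [2] boils down to an intersection-over-union (IoU) score between a unit's thresholded activation map and a human-annotated concept mask, accumulated over a labeled dataset. A minimal sketch of that score (my own plain-Java illustration, toy data only):

// Rough IoU-style interpretability score, in the spirit of Network Dissection [2]:
// binarize a unit's activation map and compare it with a concept segmentation mask.
public class InterpretabilitySketch {

    // IoU between a thresholded activation mask and a concept mask of equal size
    static double iou(boolean[] unitMask, boolean[] conceptMask) {
        int intersection = 0, union = 0;
        for (int i = 0; i < unitMask.length; i++) {
            if (unitMask[i] && conceptMask[i]) intersection++;
            if (unitMask[i] || conceptMask[i]) union++;
        }
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        // toy 2x2 "image": the unit fires on three pixels, the concept covers two of them
        boolean[] unitMask = {true, true, true, false};
        boolean[] conceptMask = {true, true, false, false};
        double score = iou(unitMask, conceptMask);
        // In the paper, a unit counts as a detector for a concept when a score like this
        // exceeds a small threshold across the whole annotated dataset.
        System.out.printf("IoU = %.2f%n", score); // prints IoU = 0.67
    }
}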

You can find more details on this research here.

[1] https://kaushalya.github.io/DL-models-resistant-to-adversarial-attacks/
[2] D. Bau*, B. Zhou*, A. Khosla, A. Oliva, and A. Torralba. "Network Dissection: Quantifying Interpretability of Deep Visual Representations." Computer Vision and Pattern Recognition (CVPR), 2017. Oral.

Thursday, June 28, 2018

Look Closer to See Better

image source: wikipedia
Hearing about this recent research made me feel a little dumb, and hopefully you will feel the same too. But anyway, it's quite impressive to see the advanced tasks that machines are becoming capable of. What we usually hear is that even though recognizing a cat is a simple task for humans, it is quite a challenging task for a machine, or let's say... for a computer.

Now, try to recognize what's in this image. If I were given this task, I would have just said it's a 'bird', and hopefully you would too, unless you are a bird expert or enthusiast. Of course it's a bird, but what if your computer is smart enough to say it's a 'Laysan albatross'? 😂 Not feeling dumb enough yet? It seems the computer is also aware of which features, in which areas of its body, make it a 'Laysan albatross'.

Even though there exists some promising research on region detection and fine-grained feature learning (e.g., find which regions of this bird contain features that discriminate it from other bird species, and then learn those features so that we can recognize the species in a new, previously unseen image), these methods still have some limitations.

So this research [1] focuses on a method where two components, namely attention-based region detection and fine-grained feature learning, strengthen or reinforce each other by giving each other feedback so that they perform better as a whole. The first component starts by looking at the coarse-grained features of a given image to identify which areas to pay more attention to. Then the second component further analyzes the fine-grained details of those areas to learn which features make them unique to this species. If the second component struggles to make a confident decision about the bird species, it informs the first component that the selected region might not be very accurate. A rough sketch of this loop is given below.
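To make the feedback loop concrete, here is a minimal, hypothetical sketch (plain Java with stub methods; not the authors' code): a region-proposal step picks where to look, a classifier looks closer, and during training the confidence at the finer scale is pushed to beat the coarser one.

// Hypothetical sketch of the alternating "look closer" loop in RA-CNN [1].
// proposeRegion, cropAndZoom and classify are stand-ins for the real networks.
public class LookCloserSketch {

    // attention proposal: pick a square region (cx, cy, half-width) from coarse features (stub)
    static double[] proposeRegion(double[] image) { return new double[] {0.5, 0.5, 0.25}; }

    // crop the proposed region and zoom it back up to full resolution (stub)
    static double[] cropAndZoom(double[] image, double[] region) { return image; }

    // fine-grained classifier: returns the confidence of the predicted species (stub)
    static double classify(double[] patch) { return 0.6; }

    public static void main(String[] args) {
        double[] current = new double[224 * 224];   // the input bird image
        double previousConfidence = classify(current);
        for (int scale = 1; scale <= 2; scale++) {
            double[] region = proposeRegion(current);       // component 1: where to look
            double[] finer = cropAndZoom(current, region);   // zoom into that region
            double confidence = classify(finer);             // component 2: look closer
            // During training, a ranking loss penalizes confidence <= previousConfidence,
            // which is the feedback that tells component 1 its region was not good enough.
            System.out.printf("scale %d: %.2f -> %.2f%n", scale, previousConfidence, confidence);
            previousConfidence = confidence;
            current = finer;
        }
    }
}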

More information about this research can be found here.

[1] J. Fu, H. Zheng and T. Mei, "Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, 2017, pp. 4476-4484.


Sunday, June 24, 2018

What makes Paris look like Paris?

windows with railings in Paris
Cities have their own character. Maybe that's what makes some cities more notable than others. In her award-winning memoir "Eat, Pray, Love", Elizabeth Gilbert mentions that there's a 'word' for each city. She assigns 'Sex' to Rome, 'Achieve' to New York, and 'Conform' to Stockholm (to add more cities that I have been to, how about 'Tranquility' for Kyoto, 'Elegance' for Ginza, and 'Vibrant' for Shibuya?). When terrorists attacked Paris in 2015, more than 7 million people shared their support for Paris under the #PrayForParis hashtag within 10 hours. Have you ever thought about what characteristics make a city feel the way it does? Can we build a machine that can 'feel' the same way about cities as humans do?

Maybe we are not there yet. Nevertheless, researchers from Carnegie Mellon University and Inria have taken an innovative baby step in this research direction by asking the question "What makes Paris look like Paris?" [1]. Is it the Eiffel Tower that makes Paris look like Paris? How can we tell whether a given image was taken in Paris if the Eiffel Tower is not present in it?


To start with, they asked people who had been to Paris before to distinguish Paris from other cities like London or Prague. Humans could achieve this task with a significant level of accuracy. In order to make a machine that can perceive a city the same way a human does, we first need to figure out which characteristics of Paris help humans perceive Paris as Paris. So, their research focuses on automatically mining the frequently occurring patterns or characteristics (features) that make Paris geographically discriminative from other cities. Even though there can be both local and global features, the researchers have focused only on local, high-dimensional features. Hence, image patches at different resolutions, represented as HOG+color descriptors, are used for the experiments. Image patches are labeled as two sets, namely Paris and non-Paris (London, Prague, etc.). Initially, the non-discriminative patches, things that can occur in any city such as cars or sidewalks, are eliminated using a nearest-neighbor algorithm: if an image patch is similar to other image patches in 'both' the Paris set and the non-Paris set, then that patch is considered non-discriminative, and vice versa.
Paris Window painting
by Janis McElmurry

However, the notion of "similarity" can be quite subjective when it comes to comparing different aspects. So the standard similarity measures used in the nearest-neighbor algorithm might not represent the similarity between elements from different cities well. Accordingly, the researchers came up with a distance (similarity) metric that is learned, or adapted, to find discriminative features from the available image patches in an iterative manner. This algorithm is run with images from different cities such as Paris and Barcelona to find the distinctive stylistic elements of each city. A rough sketch of the nearest-neighbor filtering step is given below.
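To make the filtering step concrete, here is a rough sketch (my own illustration in plain Java with toy descriptors, not the authors' code): a candidate patch is kept as discriminative only if most of its nearest neighbors, measured on its descriptor, come from the Paris set.

// Rough sketch of nearest-neighbor filtering of discriminative patches,
// in the spirit of [1]: keep a patch only if its neighborhood is mostly Paris.
import java.util.Arrays;
import java.util.Comparator;

public class DiscriminativePatchSketch {

    // squared Euclidean distance between two patch descriptors (e.g., HOG+color)
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // fraction of the k nearest neighbors of `query` that are labeled Paris
    static double parisFraction(double[] query, double[][] patches, boolean[] isParis, int k) {
        Integer[] idx = new Integer[patches.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> dist(query, patches[i])));
        int parisCount = 0;
        for (int n = 0; n < k; n++) if (isParis[idx[n]]) parisCount++;
        return (double) parisCount / k;
    }

    public static void main(String[] args) {
        // toy descriptors: two "Paris-style" patches, two generic ones (cars, sidewalks)
        double[][] patches = {{1.0, 0.9}, {0.9, 1.0}, {0.1, 0.1}, {0.2, 0.0}};
        boolean[] isParis = {true, true, false, false};
        double[] candidate = {0.95, 0.95};          // a candidate Paris patch
        double fraction = parisFraction(candidate, patches, isParis, 2);
        // keep the patch only if its neighborhood is dominated by Paris patches
        System.out.println("Paris fraction = " + fraction + (fraction > 0.7 ? " -> keep" : " -> discard"));
    }
}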

An interesting fact about this research (well, at least for me) is that artists can use these findings as useful cues to better capture the style of a given place. More details about this research can be found here.

[1] Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A. Efros. What Makes Paris Look like Paris? ACM Transactions on Graphics (SIGGRAPH 2012), August 2012, vol. 31, No. 3.

Wednesday, August 2, 2017

Neuroscience inspired Computer Vision

Source: https://www.pinterest.com/explore/visual-cortex/

Having read the profound masterpiece "When Breath Becomes Air" by the neurosurgeon and writer Paul Kalanithi, I was curious about how neuroscience could contribute to AI (computer vision in particular).

Then I found a comprehensive review article in the journal Neuron (written by Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick) titled "Neuroscience-Inspired Artificial Intelligence". Here is a brief excerpt of the concepts I found inspiring in that article, related to computer vision.


Past
CNNs
  • How visual input is filtered and pooled into simple and complex cells in area V1 of the visual cortex
  • Hierarchical organization of mammalian cortical systems
Object recognition
  • Transforming raw visual input into an increasingly complex set of features - to achieve invariance to pose, illumination, and scale
Present
Attention
  • Visual attention shifts strategically among different objects (not equal priority for all objects) - to ignore irrelevant objects in a cluttered scene, multi-object recognition, image-to-caption generation, generative models that synthesize images
Future
Intuitive understanding of the physical world
  • Interpret and reason about scenes by decomposing them into individual objects and their relations
  • Redundancy reduction (encourages the emergence of disentangled representations of independent factors such as shape and position) - to learn objectness and construct rich object models from raw inputs using deep generative models, e.g., the variational autoencoder
Efficient learning
  • Rapidly learn new concepts from only a handful of examples (related to animal learning and developmental psychology)
  • Characters challenge - distinguish novel instances of an unfamiliar handwritten character from one another - "learn to learn" networks
Transfer learning
  • Generalizing, or transferring, knowledge gained in one context to novel, previously unseen domains (e.g., a human who can drive a car can drive an unfamiliar vehicle) - progressive networks
  • Neural coding using grid codes in the mammalian entorhinal cortex - to formulate conceptual representations that code abstract, relational information among patterns of inputs (not just invariant features)
Virtual brain analytics
  • Increase the interpretability of AI computations; determine the response properties of units in a neural network
  • Activity maximization - generate synthetic images by maximizing the activity of certain classes of units
From AI to neuroscience
  • Enhancing the performance of CNNs has also yielded new insights into the nature of neural representations in high-level visual areas, e.g., using 30 network architectures from AI to explain the structure of the neural representations observed in the ventral visual stream of humans and monkeys.

Wednesday, May 6, 2015

Review of state of the art named entity recognition methods for different languages

1. Introduction

Named Entity Recognition (NER) is an important method for extracting structured information from unstructured text and organizing it in a semantically accurate form for further inference and decision making.

NER has been a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization, and automatic question answering.

Due to diverse language characteristics, NER can be considered a language- and domain-specific task. For languages such as English and German, named entity recognition has been an easier task compared to Asian languages, thanks to beneficial orthographic features (e.g., nouns begin with a capital letter).

However, for most languages, named entity recognition has been a challenging task due to the lack of annotated corpora, complex morphological characteristics, and homography.

Linguistic issues for South Asian languages include their agglutinative nature, lack of capitalization, ambiguity, low POS tagging accuracy, lack of good morphological analyzers, lack of name dictionaries, multiple ways of representing acronyms, free word order, and spelling variation [2][4][9].

In NER tasks, the frequently detected entities are Person, Location, Organization, Time, Currency, Percentage, Phone number, and ISBN.

2. Different methods for NER

The current approaches for NER task can be categorized in to machine learning, rule based and hybrid methods.

2.1.  Machine learning based methods

Machine learning/statistical methods require large amounts of annotated data. They are less expensive than rule-based methods when it comes to maintenance, as less expert knowledge is required. However, extending coverage to new names is a costly task for machine learning, as re-training is required.

Machine learning techniques treat the NER task as a sequence tagging problem.
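For example, with the commonly used BIO labeling scheme (my own illustration, not taken from the surveyed papers), each token gets exactly one tag, and the tagger's job is to predict the whole tag sequence:

Enrique  B-PER
was      O
born     O
in       O
Spain    B-LOC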

2.1.2. Algorithms for NER

            2.1.2.1. Supervised learning methods

  • Conditional Random Fields (CRF) [2], [3], [6], [7], [9], [11], [13], [15]: discriminative, undirected graphical models with a first-order Markov independence assumption that model the conditional probability of the label sequence. More efficient than HMMs for the non-independent, diverse, overlapping features of highly inflective languages. Framework: CRF++ [10]
  • Hidden Markov Models (HMM) [9]
  • Maximum Entropy (MaxEnt) [1]
  • Maximum Entropy Markov Model (MEMM)
  • Support Vector Machines (SVM) [5]


            2.1.2.2. Semi-supervised learning methods

2.2. Rule based methods

Rule identification has to be done manually by linguists and requires language-specific knowledge. These methods include lexicalized grammars, gazetteer lists, and lists of trigger words [2].

The rules generated for one language cannot be directly transferred to another language. Also, rule-based methods do not perform well in ambiguous/uncertain situations.

There can be positive and negative rules.

In [1], 36 rules are defined for the time, measure, and number classes. The rules contain corresponding entries for each language so that they act in a language-independent manner. In addition, semi-automatic extraction of context patterns is used to refine the accuracy.

[2] and [3] have used rule-based methods to find nested tags in order to improve recall.

Regular expressions have been utilized in [4][5] to identify person names and organization names. In [7], dictionaries are used to check whether part of a word is present in the dictionary.

A rule-based NER engine is created in [14] using a whitelist representing a dictionary of names, together with a grammar in the form of regular expressions. Here, a heuristic disambiguation technique is applied to pick the correct choice when an ambiguous situation arises.

2.3. Hybrid methods

It is noted in [1], [2] that hybrid systems have generally been more effective for the NER task, with proven results.

In [1], a hybrid solution is suggested for NER, consisting of a baseline NER system built on a MaxEnt model. To increase performance, language-specific rules and gazetteers are used. Further, a set of rules is applied to detect nested entities (e.g., district and town entities nested inside a location entity). The supported languages are Hindi, Bengali, Oriya, Telugu, and Urdu.

An important finding of [1] suggests that if the available training set is small, then using rule-based methods can improve the f-measure.

[2] has suggested a machine learning based approach using Conditional Random Fields (CRFs) with feature induction, and heuristic rules as a post-processing mechanism, for NER in South Asian languages.

Here, the tags which the CRF has labeled as O (Other) are reconsidered against the given rules, and if the confidence level exceeds a given threshold (e.g., 0.15), the suggested tag is taken as the named entity instead of O. This approach improved recall by 7% while causing a slight decrease in precision (3%).

[3] has used a hybrid approach for NER with CRFs, language rules, and gazetteer lists.

A CRF model is used together with rule-based methods in [4] for the Telugu language.

In [8], a 3-stage approach is suggested for the NER task, namely the use of an NE dictionary, rules for named entities, and left-right co-occurrence statistics. In the 3rd step, n-gram based named entity detection is performed. This is a supervised method that relies on the co-occurrence of left and right words.

A CRF and HMM based hybrid approach is suggested in [9] for NER in Indian languages. It concludes that exploiting two statistical models gives better results than using only one approach.

[16] has used a hybrid approach with 2 main steps. First, a set of constraints is generated for each type, such as Person, Location, and Organization. These are compiled by linguists and represented as finite state automata (FSA) to generate the most likely candidates. Then these candidates are assigned class probabilities, and a generative class model is created based on them. Transliteration is used to identify foreign names.

2.4. Referencing gazetteer lists

This is the simplest and fastest method of named entity recognition. However, since named entities are numerous and constantly evolving, this approach by itself has not been sufficient for an effective NER task. Nevertheless, in [5] it is found that incorporating gazetteer lists can significantly improve performance.

Gazetteer lists have been created in [1] using transliteration for month names, days, common locations, first names, middle names, last names, etc.

[2] has used gazetteer lists of measures (kilogram, lakh), numerals and quantifiers (first, second), and time expressions (date, month, minutes, hours).

2.5 Other methods

A phonetic matching technique is harnessed in [12] for NER in Indian languages, on the basis of the similar-sounding property of names. They have used Stanford NER as the reference entity database and have built a Hindi named entity database using a phonetic matcher.

In [17], external resources such as Wikipedia infobox features are used to infer entity names, along with a word clustering algorithm that partitions words into classes based on their co-occurrence statistics in a large corpus.

3. Feature Selection

When it comes to feature selection, the available word and tag context plays a major role. Many systems seem to use binary features, which represent the presence or absence of a given property of a word.

Static words (previous and next words), context lists (frequent words in a given window for a particular class, e.g., for the Location class: city, going to), the dynamic NE tag (the NE tag of the previous word), first word, contains-digit and numerical-character features, affixes (word suffix, word prefix), root information of the word, and the Part of Speech (POS) tag are used as features in [1] with the MaxEnt model.

It is highlighted in [1] that a window of (w-2, w+2) gives the best results. Further, it is evident in [1] that using a complex feature set does not guarantee better results.
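As an illustration (my own sketch in CRF++ template syntax, not taken from [1]), such a (w-2, w+2) window can be expressed as unigram features over the word column:

# unigram features over a (w-2, w+2) window of the word column (column 0)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]

# bigram (tag transition) feature
B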

[2] has used language-independent features such as a window of words (window size 5), statistical suffixes for person and location entities (extracted as lists and used as binary features), prefixes (to handle the agglutinative nature/postpositions), start of sentence, and presence of digits.

Prefix and suffix information is used as features in [3], as Indian languages are highly inflected (window size 5). In addition, previous word tags, rare words (the most frequent words in the language), and POS tags are used. Here, the Oriya, Urdu, and Telugu languages have shown poorer performance compared to Hindi and Bengali due to weaker language features.

In [4], a "majority tag" is used as an additional feature, which uses contextual and frequency information of other tags that are literally similar in order to label an untagged token.

In the experimental results of [5], it is highlighted that a [-3, +2] window size gives the optimal results, and increasing the window size decreased the f-measure.

4. Recognizing different entity types


  • Person name
    Method: look-up procedures, analysing the local lexical context, looking at parts of the sequence of candidate words (name components). Features: POS tags, capitalization, decimal digits, bag of words, left and right context, token length.
    Challenges: name variations (the same person referred to by different names), reuse of name parts, morphological variants, prefixes, transliteration differences; a person name can also be an ordinary proper noun.
  • Organization
    Method: use organization-specific candidate words.
    Challenges: various ways of representing abbreviations.
  • Place
    Method: using gazetteers and trigger words (e.g., New York City).
    Challenges: homographs of common names, historical variants, exonyms (foreign variants), endonyms (local variants).


5. Summary and Conclusion  

Promising directions include CRF-based and hybrid systems, chains of named entity recognizers, and rule-based methods as a post-processing mechanism.

6. References

[1] A Hybrid Approach for Named Entity Recognition in Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[2] Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[3] Language Independent Named Entity Recognition in Indian Language: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[4] Named Entity Recognition for Telugu: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[5] Bengali Named Entity Recognition using Support Vector Machine: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[6] Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[7] A Character n-gram Based Approach for Improved Recall in Indian Language NER: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[8] An Experiment on Automatic Detection of Named Entities in Bangla: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[9] A Hybrid Named Entity Recognition System for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[10] CRF++: Yet Another CRF toolkit: http://crfpp.googlecode.com/svn/trunk/doc/index.html

[11] Named Entity Recognition for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[12] Named Entity Recognition for Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[13] Experiments in Telugu NER: A Conditional Random Field Approach: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[14] NERA: Named Entity Recognition for Arabic: Journal of the American Society for Information Science and Technology: Volume 60 Issue 8, August 2009 Pages 1652-1663

[15] Integrated Machine Learning Techniques for Arabic Named Entity Recognition: International Journal of Computer Science Issues (IJCSI) . Jul2010, Vol. 7 Issue 4, p27-36. 10p. 2 Charts, 11 Graphs.

[16] Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach: Microsoft Research - China

[17] A Named Entity Labeler for German: exploiting Wikipedia and distributional clusters




Saturday, February 22, 2014

Named Entity Recognition using Conditional Random Fields (CRF)

Named Entity Recognition

Named Entity Recognition (NER) is an important method for extracting structured information from unstructured text and organizing it in a semantically accurate form for further inference and decision making.

NER has been a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization, and automatic question answering.

In NER tasks, the frequently detected entities are Person, Location, Organization, Time, Currency, Percentage, Phone number, and ISBN.

E.g., when translating Sinhala text to English text, we need to figure out which tokens are person names or locations, so that we can avoid the overhead of finding a corresponding English meaning for them. This is also helpful in question answering scenarios such as "Where was Enrique born?"

Different methods such as rule-based systems, statistical methods, and gazetteers have been used for the NER task; however, statistical approaches have been more prominent, and the other methods are used to refine the results as a post-processing mechanism.

In computational statistics, NER has been identified as a sequence labeling task, and Conditional Random Fields (CRFs) have been successfully used to implement it.

In this article I will use CRF++ to explain how to implement a named entity recognizer using a simple example.

Consider the following input sentence:
"Enrique was born in Spain"

Now, by looking at this sentence, any human can understand that Spain is a Location. But machines are unable to do so without prior learning.

So, to teach the computer, we need to identify a set of features that link what we observe in this sentence with the class we want to predict, which in this case is "Location". How can we do that?

Considering the token/word "Spain" by itself is not sufficient to decide that it is a location in a generic manner. So we consider its "context" as well, which includes the previous/next word, its POS tag, the previous NE tag, etc., to infer the NE tag for the token "Spain".

Feature Template and Training Dataset

In this example, I will use the "previous word" as the feature. So, we define it in the feature template as given below:

# Unigram
U00:%x[-1,0]

# Bigram
B

U00 is a unique id to identify the feature.

I will explain %x[row, column] using the following sentence, which we are going to use to train the model.
I live in Colombo

First, we need to define the sentence according to the following format (training.data):
I O
live O
in O
Colombo Location

current word: Colombo
row -1: the previous word, "in"
column 0: the first column (here I have given only one column, but new columns are added when we define more features such as the POS tag)
In the above training file, the last column holds the answers (NE tags) that we give to the model.

So, this feature tells the model that after the word "in", it is "likely" to find a "Location". An extended version with an extra POS-tag column is sketched below.
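As an aside (my own illustrative extension, not part of the original example), if we add a POS-tag column, the training file gets three columns: word, POS tag, and NE tag:

I PRP O
live VBP O
in IN O
Colombo NNP Location

and the template can then reference both the previous word (column 0) and the current POS tag (column 1):

# previous word and current POS tag
U01:%x[-1,0]
U02:%x[0,1]

# Bigram
B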
Now we train the model:

crf_learn template training.data model

The model file is generated from the feature template and the training data file.

Inference

Now we need to find out whether the following sentence contains any important entities such as a Location.
"Enrique was born in Spain"

We need to format the input file according to the same format (test.data):
Enrique
was
born
in
Spain

Now we use the following command to test the model.

crf_test  -m model test.data

The outcome would be the following:

Enrique O
was O
born O
in O
Spain Location

Likewise, the model will give predictions on the entities present in the input files, based on the given features and the available training data.

Note: Check the Unicode compatibility for different languages. E.g., for Sinhala Language it's UTF-7.

Coming up next...
  • Probabilistic Graphical Models
  • Conditional Probability
  • Finite State Automata
  • First-order Markov independence assumption

Source code:
https://bitbucket.org/jaywith/sinhala-named-entity-recognition

Jayani Withanawasam

Saturday, January 11, 2014

Improve your Maths skills for artificial intelligence

I want to pursue "Artificial Intelligence" as my future career, so I was thinking of a way to improve my maths skills. Then I found this awesome site!

https://www.khanacademy.org

Khan Academy will first give you a maths pretest to assess your skills. Based on that, you earn some points. Then you get to work through the other organized sets of questions one by one. If you get stuck on a question, they will give you a hint. If that does not work, there is a video demo for you to learn the area related to that question.

I find it a very effective way to learn, so try it out!

Thank you so much Khan Academy, Good work!!!! :)

Thursday, December 19, 2013

Topic Modeling: Infer topics for documents using Latent Dirichlet Allocation (LDA)

Introduction to Latent Dirichlet Allocation (LDA)


In the LDA model, you first need to create a vocabulary with a probabilistic term distribution over each topic, using a set of training documents.

In a simple scenario, assume there are 2 documents in the training set and their content has the following unique, important terms. (The important terms are extracted using TF vectors, as mentioned later.)

Document 1: "car", "hybrid", "Toyota"
Document 2: "birds", "parrot", "Sri Lanka"

Using the above terms, LDA creates a vocabulary with a probabilistic term distribution over each topic, as given below. We specify that we want to form 2 topics from this training content.

Topic 1: car: 0.7,  hybrid: 0.1, Toyota: 0.1, birds: 0.02, parrot: 0.03, Sri Lanka: 0.05

Topic 1: Term-Topic distribution

Topic 2: car: 0.05,  hybrid: 0.03, Toyota: 0.02, birds: 0.4, parrot: 0.4, Sri Lanka: 0.1

Topic 2: Term-Topic distribution

The topic model is created based on the above training data; it will later be used for inference.

For a new document, you need to infer the probabilistic topic distribution over the document. Assume the document content is as follows:

Document 3: "Toyota", "Prius", "Hybrid", "For sale", "2003"

For the above document, the probabilistic topic distribution over the document will (roughly!) be something like this:

Topic 1: 0.99, Topic 2: 0.01

Topic distribution over the new document


So, we can use the high-probability terms of the inferred topics (e.g., car, hybrid) as metadata for the document, which can be used in different applications such as search indexing, document clustering, business analytics, etc.

Pre-processing 


  • Preparing input TF vectors

To bring out the important words within a document, we normally use TF-IDF vectors. However, in LDA, TF vectors are used instead of TF-IDF vectors, in order to capture the co-occurrence or correlation between words.

(In the vector space model [VSM] it is assumed that occurrences of words are independent of each other, but this assumption is wrong in many cases! n-gram generation is one solution to this problem.)
    • Convert input documents to SequenceFile format

A sequence file is a flat file consisting of binary key/value pairs. This is the input/output file format for map-reduce jobs in Hadoop, which is the underlying framework that Mahout runs on.
        Configuration conf = new Configuration();
        HadoopUtil.delete(conf, new Path(infoDirectory));
        SequenceFilesFromDirectory sfd = new SequenceFilesFromDirectory();

        // input: directory contains number of text documents
        // output: the directory where the sequence files will be created
        String[] para = { "-i", targetInputDirectoryPath, "-o", sequenceFileDirectoryPath };
        sfd.run(para);
      • Convert sequence files to TF vectors

    Configuration conf = new Configuration();

    Tokenization and Analyzing


    During tokenization, the document content is split into a set of terms/tokens. Different analyzers may use different tokenizers. Stemming and stop-word removal can be done and customized at this stage. Please note that both stemming and stop words are language dependent.

    You can specify your own analyzer if you want, specifying how you want the terms to be extracted. It has to extend the Lucene Analyzer class.

    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_43);

    DocumentProcessor.tokenizeDocuments(new Path(sequenceFileinputDirectoryPath + "/" + "part-m-00000"), analyzer.getClass().asSubclass(Analyzer.class),
                    new Path(infoDirectory + "/" + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER), conf);
            analyzer.close();

    There are a couple of important parameters for generating TF vectors.

    In Mahout, the DictionaryVectorizer class is used for TF weighting and n-gram collocation.

    // Minimum frequency of the term in the entire collection to be considered as part of the dictionary file. Terms with lesser frequencies are ignored.
            int minSupport = 5;

    // Maximum size of n-grams to be selected. For more information, visit:  ngram collocation in Mahout
            int maxNGramSize = 2;


    // Minimum log likelihood ratio (This is related to ngram collocation. Read more here.)
    // This works only when maxNGramSize > 1 (less significant ngrams get a lower score here)
            float minLLRValue = 50;


    // Parameters for Hadoop map reduce operations
            int reduceTasks = 1;
            int chunkSize = 200;
            boolean sequentialAccessOutput = true;

        DictionaryVectorizer.createTermFrequencyVectors(new Path(infoDirectory + DocumentProcessor.TOKENIZED_DOCUMENT_OUTPUT_FOLDER),
                    new Path(infoDirectory), DictionaryVectorizer.DOCUMENT_VECTOR_OUTPUT_FOLDER, conf, minSupport, maxNGramSize, minLLRValue,
                    -1.0f, false, reduceTasks, chunkSize, sequentialAccessOutput, true);

    Once the TF vectors are generated for each training document, the model can be created.

    Training

    • Generate term distribution for each topic and generate topic distribution for each training document 

      (Read about the CVB algorithm in mahout here.)
    CVB0Driver cvbDriver = new CVB0Driver();

    I will explain the parameters and how you need to assign them values. Before that, you need to read the training dictionary into memory as given below:

    Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(
                    dictionaryFilePath), conf);
            Text key = new Text();
            IntWritable val = new IntWritable();
            ArrayList<String> dictLst = new ArrayList<>();
            while (reader.next(key,val)) {
                System.out.println(key.toString()+" "+val.toString());
                dictLst.add(key.toString());
            }
            String[] dictionary = new String[dictLst.size()];
            dictionary = dictLst.toArray(dictionary);


    Then, you have to convert the vector representation of the documents to a matrix, like this:
            RowIdJob rowidjob = new RowIdJob();
            String[] para = { "-i", inputVectorPath, "-o",
                    TRAINING_DOCS_OUTPUTMATRIX_PATH };
            rowidjob.run(para);

    Now I will explain each parameter and the factors you should consider when deciding its value.

    // Input path to the above created matrix using TF vectors
    Path inputPath = new Path(TRAINING_DOCS_OUTPUTMATRIX_PATH + "/matrix");

    // Path to save the model (Note: You may need this during inferring new documents)
    Path topicModelOutputPath = new Path(TRAINING_MODEL_PATH);

    // Number of topics (#important!). A lower value results in broader topics and a higher value may result in niche topics. The optimal value for this parameter can vary depending on the given use case. A large number of topics may cause the system to slow down.
    int numTopics = 2;

    // Number of terms in the training dictionary. Here's the method to read that:
    private static int getNumTerms(Configuration conf, Path dictionaryPath) throws IOException {
        FileSystem fs = dictionaryPath.getFileSystem(conf);
        Text key = new Text();
        IntWritable value = new IntWritable();
        int maxTermId = -1;
        for (FileStatus stat : fs.globStatus(dictionaryPath)) {
          SequenceFile.Reader reader = new SequenceFile.Reader(fs, stat.getPath(), conf);
          while (reader.next(key, value)) {
            maxTermId = Math.max(maxTermId, value.get());
          }
          reader.close();
        }
       
        return maxTermId + 1;
      }
          
    int numTerms = getNumTerms(conf, new Path(TRAINING_DOCS_ROOT_PATH + "dictionary.file-0"));

    // Smoothing parameters for the p(topic|document) prior: these values control how the term-topic likelihood is calculated for each document
            double alpha = 0.0001;
            double eta = 0.0001;
            int maxIterations = 10;
            int iterationBlockSize = 10;
            double convergenceDelta = 0;
            Path dictionaryPath = new Path(TRAINING_DOCS_ROOT_PATH + "dictionary.file-0");

    // Final output path for probabilistic topic distribution training documents
            Path docTopicOutputPath = new Path(TRAINING_DOCS_TOPIC_OUTPUT_PATH);

    // Temporary output path for saving models in each iteration
            Path topicModelStateTempPath = new Path(TRAINING_MODEL_TEMP_PATH);

            long randomSeed = 1;

    // Perplexity is a measurement of how well a probability distribution or probability model predicts a sample. LDA is a generative model: you start with a known model and try to explain the data by refining the parameters to fit the model to the data. These values can be used to evaluate performance.
            boolean backfillPerplexity = false;

            int numReduceTasks = 1;
            int maxItersPerDoc = 10;
            int numUpdateThreads = 1;
            int numTrainThreads = 4;
            float testFraction = 0;

            cvbDriver.run(conf, inputPath, topicModelOutputPath,
                    numTopics, numTerms, alpha, eta, maxIterations, iterationBlockSize,
                    convergenceDelta, dictionaryPath, docTopicOutputPath, topicModelStateTempPath,
                    randomSeed, testFraction, numTrainThreads, numUpdateThreads,
                    maxItersPerDoc, numReduceTasks, backfillPerplexity);

    Once this step is completed, the training phase of topic modeling is over. Now let's see how to infer topics for new documents using the trained model.
    • Topic Inference for new document

    To infer the topic distribution for a new document, you need to follow the same steps for the new document that I have mentioned earlier:
      • Pre-processing - stop word removal
      • Convert the document to sequence file format
      • Convert the content in the sequence file to TF vectors
    There is an important step here (even I missed this step the first time and got wrong results as the outcome :( ).

    We need to map the new document's dictionary to the training documents' dictionary and identify the common terms that appear in both. Then a TF vector needs to be created for the new document with the cardinality of the training documents' dictionary. This is how you should do that:

            //Get the model dictionary file
                    HashMap<String, Integer> modelDictionary = new HashMap<>();
                    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path("reuters-dir/dictionary.file-0"), conf);
                    Text keyModelDict = new Text();
                    IntWritable valModelDict = new IntWritable();
                    int cardinality = 0;
                    while(reader.next(keyModelDict, valModelDict)){
                        cardinality++;
                        modelDictionary.put(keyModelDict.toString(), Integer.parseInt(valModelDict.toString()));
                    }   
                   
                    RandomAccessSparseVector newDocVector = new RandomAccessSparseVector(cardinality);
                   
                    reader.close();
                   
            //Get the new document dictionary file
                    ArrayList<String> newDocDictionaryWords = new ArrayList<>();
                    reader = new SequenceFile.Reader(fs, new Path("reuters-test-dir/dictionary.file-0"), conf);
                    Text keyNewDict = new Text();
                    IntWritable newVal = new IntWritable();
                    while(reader.next(keyNewDict,newVal)){
                        System.out.println("Key: "+keyNewDict.toString()+" Val: "+newVal);
                        newDocDictionaryWords.add(keyNewDict.toString());
                    }
                   
                    //Get the document frequency count of the new vector
                    HashMap<String, Double> newDocTermFreq = new HashMap<>();
                    reader = new SequenceFile.Reader(fs, new Path("reuters-test-dir/wordcount/ngrams/part-r-00000"), conf);
                    Text keyTFNew = new Text();
                    DoubleWritable valTFNew = new DoubleWritable();
                    while(reader.next(keyTFNew, valTFNew)){
                        newDocTermFreq.put(keyTFNew.toString(), Double.parseDouble(valTFNew.toString()));
                    }
                   
                    //perform the process of term frequency vector creation
                    for (String string : newDocDictionaryWords) {
                        if(modelDictionary.containsKey(string)){
                            int index = modelDictionary.get(string);
                            double tf = newDocTermFreq.get(string);
                            newDocVector.set(index, tf);
                        }
                    }
                    System.out.println(newDocVector.asFormatString());

      • Read the model (Term distribution for each topic) 
     // Dictionary is the training dictionary

        double alpha = 0.0001; // default: doc-topic smoothing
        double eta = 0.0001; // default: term-topic smoothing
        double modelWeight = 1f;

    TopicModel model = new TopicModel(conf, eta, alpha, dictionary, 1, modelWeight, new Path(TRAINING_MODEL_PATH));
      • Infer topic distribution for the new document
    The final result, which is the probabilistic topic distribution over the new document, will be stored in this vector.
    If you have a prior guess as to what the topic distribution should be, you can start with it here instead of the uniform prior:

            Vector docTopics = new DenseVector(new double[model.getNumTopics()]).assign(1.0/model.getNumTopics());

    An empty matrix holding intermediate data - the term-topic likelihoods for each term in the new document will be stored here:

            Matrix docTopicModel = new SparseRowMatrix(model.getNumTopics(), newDocVector.size());

     int maxIters = 100;
            for(int i = 0; i < maxIters; i++) {
                model.trainDocTopicModel(newDocVector, docTopics, docTopicModel);
            }
        model.stop();

    To be continued...

    References: Mahout In Action, Wikipedia