Tuesday, September 15, 2015

Issues when setting up DeepLearning4J in Mac OSX

I got the following errors when trying out a DeepLearning4J example with Deep Belief Nets (DBNs) on Mac OS X 10.10 (Yosemite). Jblas is said to be already available in Mac OS X, but I still got some errors related to it.

Jblas is a prerequisite for setting up DeepLearning4J.


DeepLearning4J uses ND4J to enable scientific computing with N-dimensional arrays for Java. ND4J runs on several backend linear algebra libraries (with CPU or GPU execution support). Jblas is one Java backend used in DeepLearning4J for the required matrix operations.
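
For reference, here is a minimal sketch that exercises the configured backend through ND4J with a simple matrix multiplication (the class name and values are illustrative; it assumes ND4J 0.4-rc0 with the nd4j-jblas backend on the classpath):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class Nd4jBackendCheck {
  public static void main(String[] args) {
    // Creating an array fails with NoAvailableBackendException
    // if no ND4J backend (e.g., nd4j-jblas) is on the classpath
    INDArray a = Nd4j.create(new double[] {1, 2, 3, 4}, new int[] {2, 2});
    INDArray b = Nd4j.ones(2, 2);

    // Matrix multiplication is delegated to the backend (Jblas here)
    INDArray c = a.mmul(b);
    System.out.println(c);
  }
}

If this runs without the exceptions below, the backend is set up correctly.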

NoAvailableBackendException ND4J 

Solution: Add the following dependency to the pom.xml:

<dependency>
  <groupId>org.nd4j</groupId>
  <artifactId>nd4j-jblas</artifactId>
  <version>0.4-rc0</version>
</dependency>

java.lang.ClassNotFoundException: org.jblas.NativeBlas 

Solution: Add the following dependency to the pom.xml:

<dependency>
  <groupId>org.jblas</groupId>
  <artifactId>jblas</artifactId>
  <version>1.2.4</version>
</dependency>

Saturday, August 15, 2015

Latent Dirichlet Allocation (LDA) with Apache Spark MLlib

Latent Dirichlet allocation is a scalable machine learning algorithm for topic annotation or topic modelling. It is available in Apache Spark MLlib. I will not explain the internals of the algorithm in detail here.

Please visit the following link for more information about LDA algorithm.
http://jayaniwithanawasam.blogspot.com/2013/12/infer-topics-for-documents-using-latent.html

Here’s the code for the LDA algorithm in Spark MLlib (Java):

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.DistributedLDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class lda {

  public static void main(String[] args) {

    // Spark configuration details
    SparkConf conf = new SparkConf().setAppName("LDA");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load and parse the data (sample_lda_data.txt ships with the Spark installation)
    // Word count vectors (columns: terms [vocabulary], rows: documents)
    String path = "data/mllib/sample_lda_data.txt";

    // Read data
    // Creates an RDD with each line as an element
    // E.g., 1 2 6 0 2 3 1 1 0 0 3
    JavaRDD<String> data = sc.textFile(path);

    // map is a transformation that passes each element through a function
    // and returns a new RDD representing the results
    // Prepares the input as a numerical representation
    JavaRDD<Vector> parsedData = data.map(
        new Function<String, Vector>() {
          public Vector call(String s) {
            String[] sarray = s.trim().split(" ");
            double[] values = new double[sarray.length];
            for (int i = 0; i < sarray.length; i++)
              values[i] = Double.parseDouble(sarray[i]);
            return Vectors.dense(values);
          }
        }
    );

    // Index documents with unique IDs
    // The transformation 'zipWithIndex' provides a stable indexing, numbering each element in its original order
    JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
        new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
          public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
            return doc_id.swap();
          }
        }
    ));
    corpus.cache();

    // Cluster the documents into three topics using LDA
    // number of topics = 3
    DistributedLDAModel ldaModel = new LDA().setK(3).run(corpus);

    // Topic and its term distribution
    // columns = 3 topics / rows = terms (vocabulary)
    System.out.println("Topic-Term distribution: \n" + ldaModel.topicsMatrix());

    // Document and its topic distribution
    // [(doc ID: [topic 1, topic 2, topic 3]), (doc ID: ...]
    JavaRDD<Tuple2<Object, Vector>> topicDist = ldaModel.topicDistributions().toJavaRDD();
    System.out.println("Document-Topic distribution: \n" + topicDist.collect());

    sc.close();
  }
}

Output:

Topic-Term distribution: a terms-by-topics matrix of term weights for each of the three topics

Document-Topic distribution: (document ID, topic proportion vector) pairs for each document

Market Basket Analysis with Apache Spark MLlib FP-Growth

Market Basket Analysis 

source: http://www.noahdatatech.com/solutions/predictive-analytics/

Market basket analysis identifies items in a supermarket that customers are likely to buy together.
e.g., customers who bought pampers also bought beer

      
This is important for supermarkets, both to arrange their items in a customer-convenient manner and to come up with promotions that take item affinity into consideration.

Frequent Item set Mining and Association Rule Learning  


Frequent item set mining is a sub-area of data mining that focuses on identifying frequently co-occurring items. Once the frequent item sets are ready, we can come up with rules to derive associations between items.
e.g., frequent item set = {pampers, beer, milk}, association rule = {pampers, milk ---> beer}

There are two popular approaches for frequent item set mining and association rule learning, as given below:

Apriori algorithm 
FP-Growth algorithm

To explain the above algorithms, let us consider an example with 4 customers making 4 transactions in a supermarket, containing 7 items in total, as given below:

    Transaction 1: Jana’s purchase: egg, beer, pampers, milk
    Transaction 2: Abi’s purchase: carrot, milk, pampers, beer
    Transaction 3: Mahesha’s purchase: perfume, tissues, carrot
    Transaction 4: Jayani’s purchase: perfume, pampers, beer

    Item index
    1: egg, 2: beer, 3: pampers, 4: carrot, 5: milk, 6: perfume, 7: tissues

Using Apriori algorithm


The Apriori algorithm identifies frequent item sets by starting with individual items and extending the item set by one item at a time. This is known as the candidate generation step.
The algorithm relies on the property that any subset of a frequent item set is also frequent.

Transaction: Items
1: 1, 2, 3, 5
2: 4, 5, 3, 2
3: 6, 7, 4
4: 6, 3, 2

Minimum Support 


Minimum support is used to prune the associations that are less frequent.

support(item) = number of transactions in which the item occurs / total number of transactions

For example, let's say we define the minimum support as 0.5.
The support for egg is 1/4 = 0.25 (0.25 < 0.5), so it is eliminated. The support for beer is 3/4 = 0.75 (0.75 > 0.5), so it is considered for further processing.
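
To make the support calculation concrete, here is a small plain Java sketch (not Spark; the class name is illustrative) that computes single-item support for the four example transactions above, using the item indices from the item index list:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SupportCalculation {
  public static void main(String[] args) {
    // The four example transactions, using the item indices above
    List<List<Integer>> transactions = Arrays.asList(
        Arrays.asList(1, 2, 3, 5),
        Arrays.asList(4, 5, 3, 2),
        Arrays.asList(6, 7, 4),
        Arrays.asList(6, 3, 2));

    double minSupport = 0.5;

    // Count in how many transactions each item occurs
    Map<Integer, Integer> counts = new HashMap<Integer, Integer>();
    for (List<Integer> transaction : transactions) {
      for (Integer item : transaction) {
        Integer count = counts.get(item);
        counts.put(item, count == null ? 1 : count + 1);
      }
    }

    // support(item) = count(item) / number of transactions
    for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
      double support = (double) entry.getValue() / transactions.size();
      System.out.println("item " + entry.getKey() + ": " + support
          + (support < minSupport ? " : eliminated" : ""));
    }
  }
}

This reproduces the table below: items 1 (egg) and 7 (tissues) fall under the 0.5 threshold and are eliminated.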

Calculation of support for all items

size of the candidate itemset = 1

itemset: support
1: 0.25: eliminated
2: 0.75
3: 0.75
4: 0.5
5: 0.5
6: 0.5
7: 0.25: eliminated

remaining items: 2, 3, 4, 5, 6

extend candidate itemset by 1
size of the items = 2

itemset: support
2, 3: 0.75
2, 4: 0.25: eliminated
2, 5: 0.5
2, 6: 0.25: eliminated
3, 4: 0.25: eliminated
3, 5: 0.5
3, 6: 0.25: eliminated
4, 5: 0.25: eliminated
4, 6: 0.25: eliminated
5, 6: 0.25: eliminated

remaining items: {2,3},{ 2, 5}, {3, 5}

extend candidate itemset by 1
size of the items = 3

2, 3, 5: 0.5

Using FP-Growth algorithm


In the FP-Growth algorithm, frequent patterns are mined using a tree-based approach (construction of a Frequent Pattern Tree, or FP-tree).
The FP-Growth algorithm has been shown to execute much faster than the Apriori algorithm.

Calculate the support for frequent items and sort them in decreasing order of frequency, as given below:

item: frequency
1: 1 - eliminated
2: 3
3: 3
4: 2
5: 2
6: 2
7: 1 - eliminated

Decreasing order of the frequency
2 (3), 3 (3), 4 (2), 5 (2), 6 (2)

Construction of FP-Tree

A) Transaction 1
 1, 2, 3, 5 > 2 (1), 3 (1), 5 (1)

B) Transaction 2
4, 5, 3, 2 > 2 (2), 3 (2), 4 (1), 5 (1)

C) Transaction 3
6, 7, 4 > 4 (1), 6 (1)

D) Transaction 4
6, 3, 2 > 2 (3), 3 (3), 6 (1)

Once the FP-tree is constructed, frequent item sets are calculated using a depth-first strategy along with a divide-and-conquer mechanism.
This makes the algorithm computationally more efficient and parallelizable (using map-reduce).

Code Example with Apache Spark MLlib

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.fpm.FPGrowth;
import org.apache.spark.mllib.fpm.FPGrowthModel;

import com.google.common.base.Joiner;

public class fpgrowth {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("Market Basket Analysis");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Items
        String item1 = "egg";
        String item2 = "beer";
        String item3 = "pampers";
        String item4 = "carrot";
        String item5 = "milk";
        String item6 = "perfume";
        String item7 = "tissues";

        // Transactions
        List<String> transaction1 = new ArrayList<String>();
        transaction1.add(item1);
        transaction1.add(item2);
        transaction1.add(item3);
        transaction1.add(item5);

        List<String> transaction2 = new ArrayList<String>();
        transaction2.add(item4);
        transaction2.add(item5);
        transaction2.add(item3);
        transaction2.add(item2);

        List<String> transaction3 = new ArrayList<String>();
        transaction3.add(item6);
        transaction3.add(item7);
        transaction3.add(item4);

        List<String> transaction4 = new ArrayList<String>();
        transaction4.add(item6);
        transaction4.add(item3);
        transaction4.add(item2);

        List<List<String>> transactions = new ArrayList<List<String>>();
        transactions.add(transaction1);
        transactions.add(transaction2);
        transactions.add(transaction3);
        transactions.add(transaction4);

        // Make the transaction collection parallel with Spark
        JavaRDD<List<String>> transactionsRDD = sc.parallelize(transactions);

        // Set configurations for FP-Growth
        FPGrowth fpg = new FPGrowth()
          .setMinSupport(0.5)
          .setNumPartitions(10);

        // Generate the model
        FPGrowthModel<String> model = fpg.run(transactionsRDD);

        // Display frequently co-occurring item sets and their frequencies
        for (FPGrowth.FreqItemset<String> itemset : model.freqItemsets().toJavaRDD().collect()) {
           System.out.println("[" + Joiner.on(",").join(itemset.javaItems()) + "], " + itemset.freq());
        }
        sc.close();
    }
}

Saturday, July 25, 2015

Content Recommendations with PredictionIO

I’m currently working on content recommendations using PredictionIO, mainly with its SimilarProduct engine and Recommendation engine (given below):
  • https://docs.prediction.io/templates/similarproduct/quickstart/
  • https://docs.prediction.io/templates/recommendation/quickstart/
Before going into details about content recommendations, let’s get an understanding of the underlying frameworks, technologies and languages used.

PredictionIO   

PredictionIO (https://prediction.io/) is an open source machine learning server which mainly provides infrastructure support for diverse machine learning algorithms. It provides built-in REST APIs/SDKs to access machine learning algorithms in an operational machine learning environment.

PredictionIO has abstracted away the key steps in machine learning, such as:
  • reading data from a data source (datasource)
  • preparing data according to the required input format (datapreparator)
  • training the model (train)
  • testing the model and querying (test)
   
It has provided a framework approach to use, implement or customise algorithms according to application needs.

PredictionIO is mostly based on Spark cluster computing framework.

Spark RDDs


Most of the algorithms are implemented in Scala, which is a JVM-based, functional programming language.
The algorithms run on top of Spark, which is capable of distributing data as RDDs across a cluster of computers to parallelize the computations performed on them.

RDD (Resilient Distributed Dataset) is the core abstraction of Spark: a dataset partitioned across the cluster so that in-memory operations can be performed on top of it. Unlike the Hadoop MapReduce programming paradigm, Spark RDDs are capable of caching intermediate results in memory (not storage), which makes them significantly faster.

Consequently, this approach is ideal for iterative machine learning algorithms such as K-means, gradient descent and ALS (Alternating Least Squares). Further, RDDs have an optimised fault tolerance mechanism in a distributed setting: lost RDDs are recomputed using the instructions given as a DAG (Directed Acyclic Graph) of transformations and actions. This eliminates the need for data replication.
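
As a minimal illustration (the input path is hypothetical), the Java sketch below builds an RDD, applies a lazy transformation and caches the intermediate result so that the two subsequent actions do not recompute the DAG:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class RddCacheExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("RDD example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Each line of the (hypothetical) input file becomes one element,
    // partitioned across the cluster
    JavaRDD<String> lines = sc.textFile("data/sample.txt");

    // Transformations are lazy; they only extend the DAG of instructions
    JavaRDD<Integer> lengths = lines.map(new Function<String, Integer>() {
      public Integer call(String s) {
        return s.length();
      }
    });

    // Keep the intermediate result in memory for reuse by both actions below
    lengths.cache();

    System.out.println("Lines: " + lengths.count());
    System.out.println("Characters: " + lengths.reduce(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) {
        return a + b;
      }
    }));

    sc.close();
  }
}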

Content Recommendations using ALS in MLlib


Content recommendation is a key area that PredictionIO focuses on. Currently, ALS (Alternating Least Squares) is supported in Spark MLlib, the scalable machine learning library.
MLlib is built on Apache Spark for large scale data processing and is a standard component of Spark.

why MLlib? - http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf

A key goal of using machine learning in any application is to find hidden patterns or insights in large datasets.

When it comes to recommendations, if you take the movie recommendation example, there can be latent (hidden) factors such as movie actors, genre and theme that affect user ratings and user preferences for movies. The idea in the ALS algorithm is to uncover these latent factors using matrix factorization from linear algebra.

A set of user features (e.g., John likes Keanu Reeves movies) and product features (e.g., Keanu Reeves acts in the Matrix movies), which can be the implicit reasons for user preferences, is estimated mathematically from known user preferences for given items, in order to predict unseen/unknown preferences for a given user.

The following link contains a good introduction with mathematical details about matrix factorisation. You can try out Python example as well.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/

The following link gives a good explanation of the scalable in-memory matrix computations of ALS in Spark MLlib.
http://www.slideshare.net/MrChrisJohnson/collaborative-filtering-with-spark
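
To make this concrete, here is a rough sketch using Spark MLlib's Java API (the input path and parameter values are illustrative; each input line is assumed to be "userId,itemId,rating"):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

public class AlsExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("ALS example");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Known user preferences: each line is "userId,itemId,rating"
    JavaRDD<String> lines = sc.textFile("data/mllib/als/test.data");
    JavaRDD<Rating> ratings = lines.map(new Function<String, Rating>() {
      public Rating call(String line) {
        String[] parts = line.split(",");
        return new Rating(Integer.parseInt(parts[0]),
            Integer.parseInt(parts[1]),
            Double.parseDouble(parts[2]));
      }
    });

    // Factorize the user-item matrix: rank = 10 latent factors,
    // 10 iterations, regularization parameter lambda = 0.01
    MatrixFactorizationModel model = ALS.train(JavaRDD.toRDD(ratings), 10, 10, 0.01);

    // Predict an unseen/unknown preference, e.g. user 1 for item 2
    System.out.println("Predicted rating: " + model.predict(1, 2));

    sc.close();
  }
}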

Given below are the key stages that your data will go through to provide useful item recommendations to users in PredictionIO.

Datasource

HBase is the default backend for the event store; input is provided as a REST query [item/ user/ ratings].
Metadata is stored in Elasticsearch.
One example event imported:
{"eventId":"fBAiPFEGKPOdFsTrGK7VwwAAAU6RUP-Tl37BvsDRuHs","event":"rate","entityType":"user","entityId":"admin","targetEntityType":"item","targetEntityId":"6","properties":{"rating":5},"eventTime":"2015-07-15T10:44:41.491Z","creationTime":"2015-07-15T10:44:41.491Z"}

Datapreparator 

Prepares the input data for the ML algorithm in the required format, which in this case is users, items and user ratings.

Train

Once trained, the model is saved in the local file system.
Training parameters for the ALS algorithm can be configured in the engine.json file as given below:
"algorithms": [

    {

      "name": "als",

      "params": {

        "rank": 10,

        "numIterations": 20,

        "lambda": 0.01,

        "seed": 3

      }

    }

  ]
  • rank - number of latent factors in the model
  • lambda - regularization parameter in ALS (to avoid overfitting)
  • numIterations - number of iterations

Querying (Test)

Once the model is trained, recommendations for unseen user preferences can be requested via a REST service.
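
For example, a deployed engine can be queried over HTTP. The sketch below is a minimal Java client assuming the default engine server endpoint (http://localhost:8000/queries.json) and the query fields used by the recommendation template (a user ID and the number of items to recommend); adjust these for your deployment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RecommendationQuery {
  public static void main(String[] args) throws Exception {
    // Default engine server endpoint; change the host/port to match your deployment
    URL url = new URL("http://localhost:8000/queries.json");
    HttpURLConnection con = (HttpURLConnection) url.openConnection();
    con.setRequestMethod("POST");
    con.setRequestProperty("Content-Type", "application/json");
    con.setDoOutput(true);

    // Ask for 4 recommendations for the user with ID "1"
    String query = "{\"user\": \"1\", \"num\": 4}";
    OutputStream os = con.getOutputStream();
    os.write(query.getBytes("UTF-8"));
    os.close();

    // The response is a JSON list of recommended items with scores
    BufferedReader in = new BufferedReader(
        new InputStreamReader(con.getInputStream(), "UTF-8"));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
  }
}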

How to set up Apache Spark (Java) - MLlib in Eclipse?

Apache Spark version: 1.3.0

Download the required pre-built version of Apache Spark from the following link:
http://spark.apache.org/downloads.html

Create Maven project in Eclipse
File > New > Maven Project

Add the following dependencies in pom.xml:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.10</artifactId>
    <version>1.3.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.3.0</version>
    <scope>provided</scope>
  </dependency>
</dependencies>

We set the scope as “provided” because those dependencies are already available in the Spark server.

Create a new class and add your Java source code for the required MLlib algorithm.

Run as > Maven Build… > package

Verify that the .jar file is created in the ‘target' folder of the Maven project.

Change to the location of the Spark installation you downloaded and unpacked and try the following command:
./bin/spark-submit --class <main class> --master local[2] <path to jar>
E.g.,

./bin/spark-submit --class fpgrowth --master local[2] /Users/XXX/target/uber-TestMLlib-0.0.1-SNAPSHOT.jar

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib

Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.mllib.clustering.LDA.run(Lorg/apache/spark/api/java/JavaPairRDD;)Lorg/apache/spark/mllib/clustering/DistributedLDAModel

Cause: the Spark version used at compile time (from the Maven repository) was different from the runtime Spark version on the Spark server class path (Spark installation directory/lib).

Wednesday, May 6, 2015

Review of state of the art named entity recognition methods for different languages

1. Introduction

Named Entity Recognition (NER) is a significant method for extracting structured information from unstructured text and organising it in a semantically accurate form for further inference and decision making.

NER has been a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization and automatic question answering.

Due to diverse language characteristics, NER can be considered a language/domain-specific task. For languages such as English and German, named entity recognition has been an easier task compared to Asian languages, due to beneficial orthographical features (e.g., nouns begin with a capital letter).

However, for most languages named entity recognition has been a challenging task due to the lack of annotated corpora, complex morphological characteristics and homography.

Linguistic issues for South Asian languages include their agglutinative nature, no capitalization, ambiguity, low POS tagging accuracy, lack of good morphological analyzers, lack of name dictionaries, multiple ways of representing acronyms, free word order and spelling variation [2][4][9].

In NER tasks, frequently detected entities are Person, Location, Organization, Time, Currency, Percentage, Phone number and ISBN.

2. Different methods for NER

The current approaches for the NER task can be categorized into machine learning based, rule based and hybrid methods.

2.1.  Machine learning based methods

Machine learning/statistical methods require large annotated datasets. They are less expensive than rule based methods in terms of maintenance, as less expert knowledge is required. However, extending them with new names is a costly task, as re-training is required.

Machine learning techniques treat the NER task as a sequence tagging problem.

2.1.2. Algorithms for NER

            2.1.2.1. Supervised learning methods

Algorithm: Description

Conditional Random Fields (CRF) [2], [3], [6], [7], [9], [11], [13], [15]: Discriminative, undirected graphical models with a first-order Markov independence assumption, modelling the conditional probability of the labeled sequence. More efficient than HMM for the non-independent, diverse, overlapping features of highly inflective languages. Framework: CRF++ [10]

Hidden Markov Models (HMM) [9]

Maximum Entropy (MaxEnt) [1]

Maximum Entropy Markov Model (MEMM)

Support Vector Machines (SVM) [5]

            2.1.2.2. Semi-supervised learning methods

2.2. Rule based methods

Rule identification has to be done manually by linguists and requires language-specific knowledge. These methods include lexicalized grammars, gazetteer lists and lists of trigger words. [2]

The rules generated for one language cannot be directly transferred to another language. Also, rule based methods do not perform well in ambiguous/uncertain situations.

There can be positive and negative rules.

In [1], 36 rules are defined for the time, measure and number classes. The rules contain corresponding entries for each language so that they act in a language-independent manner. In addition, semi-automatic extraction of context patterns is used to refine the accuracy.

[2], [3] have used rule based methods to find nested tags to improve recall.

Regular expressions have been utilized in [4][5] to identify person names and organization names. In [7], dictionaries are used to check whether part of a word is present in the dictionary.

A rule based NER engine is created in [14] using a white list representing a dictionary of names and a grammar in the form of regular expressions. Here, a heuristic disambiguation technique is applied to get the correct choice when an ambiguous situation arises.

2.3. Hybrid methods

It is specified in [1], [2] that hybrid systems have generally been more effective in the NER task, with proven results.

In [1], a hybrid solution is suggested for NER which consists of a baseline NER system with a MaxEnt model. To increase the performance, language-specific rules and gazetteers are used. Further, a set of rules has been applied to detect nested entities (e.g., district and town entities nested within a location entity). The supported languages are Hindi, Bengali, Oriya, Telugu and Urdu.

An important finding of [1] suggests that if the available training set is small, then using rule based methods can improve the f-measure.

[2] has suggested a machine learning based approach using Conditional Random Fields (CRFs) with feature induction, and heuristic rules as a post-processing mechanism, for NER in South Asian languages.

Here, the tags which the CRF has categorized as O (Other) are reconsidered for adherence to the given rules, and if the confidence level exceeds a given threshold (e.g., 0.15) then the suggested tag is taken as the named entity instead of O. This approach improved recall by 7% while causing a slight decrease in precision (3%).

[3] has used a hybrid approach for NER with CRFs, language rules and gazetteer lists.

A CRF model is used with rule based methods in [4] for the Telugu language.

In [8], a 3-stage approach is suggested for the NER task, namely the use of an NE dictionary, rules for named entities and left-right co-occurrence statistics. In the third stage, n-gram based named entity detection is performed. This is a supervised method that relies on the co-occurrence of left and right words.

A CRF and HMM based hybrid approach is suggested in [9] for NER in Indian languages. It is concluded that exploiting two statistical models gives better results than using only one approach.

[16] has used a hybrid approach with two main steps. First, a set of constraints is generated for each type such as Person, Location and Organization. These are compiled by linguists and represented as FSAs to generate the most likely candidates. Then these candidates are assigned class probabilities and a generative class model is created based on this. Transliteration is used to identify foreign names.

2.4. Referencing gazetteer lists

This is the simplest and fastest method of named entity recognition. However, since named entities are numerous and constantly evolving, this approach on its own has not been sufficient for an effective NER task. Nevertheless, in [5] it is found that incorporating a gazetteer list can significantly improve the performance.

Gazetteer lists have been created in [1] using transliteration for month names, days, common locations, first names, middle names, last names etc.

[2] has used gazetteer lists of measures (kilogram, lakhs), numerals and quantifiers (first, second) and time expressions (date, month, minutes, hours).

2.5. Other methods

A phonetic matching technique is harnessed in [12] for NER in Indian languages, on the basis of the similar-sounding property. They have used Stanford NER as the reference entity database and have built a Hindi named entity database using a phonetic matcher.

In [17], external resources such as Wikipedia infobox features are used to infer entity names, along with a word clustering algorithm that partitions words into classes based on their co-occurrence statistics in a large corpus.

3. Feature Selection

When it comes to feature selection, the available word and tag context plays a major role. Many systems seem to use binary features which represent the presence or absence of a given property of a word.

Static words (previous and next words), context lists (frequent words in a given window for a particular class, e.g., Location class: city, going to), dynamic NE tags (the NE tag of the previous word), first word, contains-digit and numerical characters, affixes (word suffix, word prefix), the root form of the word and the Part of Speech (POS) tag are used as features in [1] with the MaxEnt model.

It is highlighted in [1] that a window of (w-2, w+2) gives the best results. Further, it is evident in [1] that usage of a complex feature set does not guarantee better results.

[2] has used language-independent features such as a window of words (window size 5), statistical suffixes for person and location entities (extracted as lists and used as binary features), prefixes (to handle the agglutinative nature/postpositions), start of sentence and presence of digits.

Prefix and suffix information is used as features in [3], as Indian languages are highly inflected (window size 5). In addition, previous word tags, rare words (most frequent words in the language) and POS tags are used. Here, the Oriya, Urdu and Telugu languages have shown poor performance compared to Hindi and Bengali due to poor language features.

In [4], a “majority tag” is used as an additional feature, which uses contextual and frequency information of other tags that are literally similar, to label an unnamed tag.

In the experimental results of [5], it is highlighted that a [-3, +2] window size gives the optimal results, and increasing the window size decreased the f-measure.

4. Recognizing different entity types


Entity type: Person name
Method: look-up procedures, analysing the local lexical context, looking at parts of sequences of candidate words (name components). Features: POS tags, capitalization, decimal digits, bag of words, left and right context, token length
Challenges: name variations (the same person referred to by different names) - reuse of name parts, morphological variants, prefixes etc., transliteration differences; a person name can also be a proper noun

Entity type: Organization
Method: use organization-specific candidate words
Challenges: various ways of representing abbreviations

Entity type: Place
Method: using gazetteers and trigger words (e.g., New York City)
Challenges: homographic with common names, historical variants, exonyms (foreign variants), endonyms (local variants)

5. Summary and Conclusion

CRF based / hybrid / chains of named entity recognizers / rule based methods as a post-processing mechanism

6. References

[1] A Hybrid Approach for Named Entity Recognition in Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[2] Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[3] Language Independent Named Entity Recognition in Indian Language: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[4] Named Entity Recognition for Telugu: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[5] Bengali Named Entity Recognition using Support Vector Machine: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[6] Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[7] A Character n-gram Based Approach for Improved Recall in Indian Language NER: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[8] An Experiment on Automatic Detection of Named Entities in Bangla: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[9] A Hybrid Named Entity Recognition System for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[10] CRF++: Yet Another CRF toolkit: http://crfpp.googlecode.com/svn/trunk/doc/index.html

[11] Named Entity Recognition for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[12] Named Entity Recognition for Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[13] Experiments in Telugu NER: A Conditional Random Field Approach: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[14] NERA: Named Entity Recognition for Arabic: Journal of the American Society for Information Science and Technology: Volume 60 Issue 8, August 2009 Pages 1652-1663

[15] Integrated Machine Learning Techniques for Arabic Named Entity Recognition: International Journal of Computer Science Issues (IJCSI) . Jul2010, Vol. 7 Issue 4, p27-36. 10p. 2 Charts, 11 Graphs.

[16] Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach: Microsoft Research - China

[17] A Named Entity Labeler for German: exploiting Wikipedia and distributional clusters




Friday, May 1, 2015

Warning: Connection timeout. Retrying... in Vagrant

I got this error when trying to SSH into a Vagrant instance.

slave01: SSH auth method: private key
    slave01: Warning: Connection timeout. Retrying...
    slave01: Warning: Connection timeout. Retrying...
    slave01: Warning: Connection timeout. Retrying...
    (repeated)

This is how I resolved that.

Check the running VMs using the command below:
vboxmanage list runningvms

"Cluster_master_1425452823416_28694" {5b4505cb-2b48-45c1-a941-8e4b2d36f29b}
"Cluster_slave01_1425452852606_67792" {740aa005-6666-43e0-9631-a2f2d6e4caa5}
"Cluster_slave02_1425452882392_41607" {888cf959-1978-4474-86ef-ddfaaf3491f5}

Provide the required keyboard input (Enter in this case; scancode 1c) using the following command:
vboxmanage controlvm Cluster_slave01_1425452852606_67792 keyboardputscancode 1c

Progress state: NS_ERROR_FAILURE in Vagrant

I got the following error when trying to start a Vagrant instance.

There was an error while executing `VBoxManage`, a CLI used by Vagrant

for controlling VirtualBox. The command and stderr is shown below.

Command: ["hostonlyif", "create"]

Stderr: 0%...

Progress state: NS_ERROR_FAILURE

VBoxManage: error: Failed to create the host-only adapter

VBoxManage: error: VBoxNetAdpCtl: Error while adding new interface: failed to open /dev/vboxnetctl: No such file or directory

VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component HostNetworkInterface, interface IHostNetworkInterface

VBoxManage: error: Context: "int handleCreate(HandlerArg*, int, int*)" at line 68 of file VBoxManageHostonly.cpp

I resolved this with the following action.

I had set up a Vagrant cluster, and one VM that was not running under Vagrant was open and running in VirtualBox; the issue might be due to that as well.

Restart VirtualBox using the following command:

sudo /Library/StartupItems/VirtualBox/VirtualBox restart

Monday, March 30, 2015

org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Connection refused

I got the following error (in the Hadoop user logs) while trying to run a Mahout map reduce job on Hadoop (fully distributed mode):

2015-03-25 08:31:52,858 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From slave01.net/127.0.1.1 to slave01.net:60926 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1472)
    at org.apache.hadoop.ipc.Client.call(Client.java:1399)
    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:244)
    at com.sun.proxy.$Proxy7.getTask(Unknown Source)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:132)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:607)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:705)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
    at org.apache.hadoop.ipc.Client.call(Client.java:1438)

I could solve this issue by replacing the 127.0.1.1 host name mapping with the permanent IP (in /etc/hosts), as given below:
33.33.33.10      master

Sunday, March 15, 2015

java.io.IOException: Incompatible clusterIDs in /home/user/hadoop/data

I encountered this issue when I added a new data node to an already created Hadoop cluster.

Problem:
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to master/33.33.33.10:9000. Exiting.
java.io.IOException: Incompatible clusterIDs in /home/huser/hadoop/data: namenode clusterID = CID-8019e6e9-73d7-409c-a241-b57e9534e6fe; datanode clusterID = CID-bcc9c537-54dc-4329-bf63-448037976f75
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:646)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:320)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:403)
    at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:422)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1311)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1276)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:220)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:828)


Solution:
The issue seems to be due to a version/metadata mismatch. I followed the steps given below to solve it:
  1. Delete the directories listed under the dfs.datanode.data.dir/ dfs.namenode.name.dir configuration in hdfs-site.xml
  2. Delete the tmp/hadoop-hduser directory
  3. Re-format the name node using the following command:
./hdfs namenode -format




Issue when starting Hadoop cluster


Problem:
have: ssh: Could not resolve hostname have: Name or service not known

warning:: ssh: Could not resolve hostname warning:: Name or service not known

guard: ssh: Could not resolve hostname guard: Name or service not known

VM: ssh: Could not resolve hostname VM: Name or service not known

you: ssh: Could not resolve hostname you: Name or service not known

You: ssh: Could not resolve hostname You: Name or service not known

fix: ssh: Could not resolve hostname fix: Name or service not known

Client: ssh: Could not resolve hostname Client: Name or service not known

'execstack: ssh: Could not resolve hostname 'execstack: Name or service not known

might: ssh: Could not resolve hostname might: Name or service not known

HotSpot(TM): ssh: Could not resolve hostname HotSpot(TM): Name or service not known

',: ssh: Could not resolve hostname ',: Name or service not known

VM: ssh: Could not resolve hostname VM: Name or service not known

or: ssh: Could not resolve hostname or: Name or service not known

disabled: ssh: Could not resolve hostname disabled: Name or service not known

loaded: ssh: Could not resolve hostname loaded: Name or service not known

recommended: ssh: Could not resolve hostname recommended: Name or service not known

which: ssh: Could not resolve hostname which: Name or service not known

fix: ssh: Could not resolve hostname fix: Name or service not known

now.: ssh: Could not resolve hostname now.: Name or service not known
the: ssh: Could not resolve hostname the: Name or service not known

that: ssh: Could not resolve hostname that: Name or service not known

guard.: ssh: Could not resolve hostname guard.: Name or service not known

will: ssh: Could not resolve hostname will: Name or service not known

have: ssh: Could not resolve hostname have: Name or service not known

library: ssh: Could not resolve hostname library: Name or service not known

library: ssh: Could not resolve hostname library: Name or service not known

stack: ssh: Could not resolve hostname stack: Name or service not known

The: ssh: Could not resolve hostname The: Name or service not known

try: ssh: Could not resolve hostname try: Name or service not known

the: ssh: Could not resolve hostname the: Name or service not known

link: ssh: Could not resolve hostname link: Name or service not known

highly: ssh: Could not resolve hostname highly: Name or service not known

It's: ssh: Could not resolve hostname It's: Name or service not known

with: ssh: Could not resolve hostname with: Name or service not known

stack: ssh: Could not resolve hostname stack: Name or service not known

Java: ssh: Could not resolve hostname Java: Name or service not known

with: ssh: Could not resolve hostname with: Name or service not known

it: ssh: Could not resolve hostname it: Name or service not known

noexecstack'.: ssh: Could not resolve hostname noexecstack'.: Name or service not known

'-z: ssh: Could not resolve hostname '-z: Name or service not known


Solution:

Add the following to the .bashrc file:

export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"




java.io.IOException: Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.

Issue: one data node in the Hadoop cluster dies after starting

Problem:
java.io.IOException: Incorrect configuration: namenode address dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured.
    at org.apache.hadoop.hdfs.DFSUtil.getNNServiceRpcAddressesForCluster(DFSUtil.java:866)
    at org.apache.hadoop.hdfs.server.datanode.BlockPoolManager.refreshNamenodes(BlockPoolManager.java:155)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1074)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:415)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2268)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2155)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2202)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:2378)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:2402)
2015-03-16 05:34:17,953 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2015-03-16 05:34:17,959 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
 

Solution:
Applied the following configuration in the core-site.xml of the data node:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>

Thursday, March 12, 2015

How to set up a private network in Vagrant?

Follow the steps given below to set up a private network in Vagrant.
  • Specify the node names (node01, node02 etc.) and their static IPs (any preferred IPs) in the Vagrant configuration file as given below:

Vagrant.configure("2") do |config|
  config.vm.provision "shell", inline: "echo Hello"

  config.vm.define "master" do |master|
    master.vm.box = "hashicorp/precise32"
    master.vm.network :private_network, ip: "33.33.33.10"
  end

  config.vm.define “node01" do | node01|
    node01.vm.box = "hashicorp/precise32"
   node01.vm.network :private_network, ip: "33.33.33.11"
  end

  config.vm.define "node02" do |node02|
    node02.vm.box = "hashicorp/precise32"
    node02.vm.network :private_network, ip: "33.33.33.12"
  end
end

  • Initialise and start the new Vagrant instances

vagrant up

  • You can ssh to each instance by their names
Example:
vagrant ssh node01

Tuesday, March 10, 2015

Issues with Vi editor

  1. Create .vimrc in your home directory
  2. Insert the following content and save:
set nocompatible
set backspace=2

If you use 'sudo vi filename', the above options won't work.

In that case, you should use:
sudoedit filename

How to scp Vagrant?

From VM to local:

vagrant scp (name of the Vagrant environment:path inside the box) (local folder path, relative to the directory from which Vagrant is loaded)

vagrant scp default:/home/vagrant/jayani jayani

From local to VM:
vagrant scp (local folder path, relative to the directory from which Vagrant is loaded) (name of the Vagrant environment:path inside the box)

vagrant scp jayani /home/vagrant

If you have multiple Vagrant environments, precede the path with the name of the Vagrant environment:
vagrant scp master:jayani /home/vagrant

How to find the name of the Vagrant environment?

execute the following command:
vagrant global-status

id       name    provider   state   directory                                   
---------------------------------------------------------------------------------
894ce03  default virtualbox running /Users/jwithanawasam/
cbb0745  master  virtualbox running /Users/jwithanawasam/
c296e12  slave01 virtualbox running /Users/jwithanawasam/
3c6e261  slave02 virtualbox running /Users/jwithanawasam/

VBoxManage: error: Failed to create the host-only adapter VBoxManage: error: VBoxNetAdpCtl: Error while adding new interface: failed to open

After configuring the network settings as given below and running the vagrant up or vagrant reload command, I got the following error. I have mentioned how I resolved it here.

master.vm.network :private_network, ip: "33.33.33.10"

Error:

There was an error while executing `VBoxManage`, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.

Command: ["hostonlyif", "create"]

Stderr: 0%...
Progress state: NS_ERROR_FAILURE
VBoxManage: error: Failed to create the host-only adapter
VBoxManage: error: VBoxNetAdpCtl: Error while adding new interface: failed to open /dev/vboxnetctl: No such file or directory

VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component HostNetworkInterface, interface IHostNetworkInterface
VBoxManage: error: Context: "int handleCreate(HandlerArg*, int, int*)" at line 68 of file VBoxManageHostonly.cpp

Solution:

  1. First power off all the VMs running in VirtualBox
  2. Then run the following command (for Mac):
sudo /Library/StartupItems/VirtualBox/VirtualBox restart
  3. Then start the required VMs

Tuesday, February 10, 2015

java.io.IOException: No FileSystem for scheme: HDFS

To solve the above issue, add the following property to hadoop-2.6.0/etc/hadoop/core-site.xml:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>