Dev 007: February 2014

Saturday, February 22, 2014

Named Entity Recognition using Conditional Random Fields (CRF)

Named Entity Recognition

Named Entity Recognition (NER) is a significant method for extracting structured information from unstructured text and organizing it in a semantically accurate form for further inference and decision making.

NER is a key pre-processing step for many natural language processing applications such as information extraction, machine translation, information retrieval, topic detection, text summarization, and automatic question answering.

In NER tasks, the most frequently detected entities are Person, Location, Organization, Time, Currency, Percentage, Phone number, and ISBN.

E.g., when translating Sinhala text to English, we need to identify the person names and locations in the text, so that we can avoid the overhead of looking up English meanings for them. This is also helpful in question answering scenarios such as "Where was Enrique born?"

Different methods such as rule-based systems, statistical models, and gazetteers have been used for the NER task. However, statistical approaches have been more prominent, and the other methods are used to refine the results as a post-processing mechanism.

In computational statistics, NER is treated as a sequence labeling task, and Conditional Random Fields (CRF) have been used successfully to implement it.

In this article I will use CRF++ to explain how to implement a named entity recognizer using a simple example.

Consider the following input sentence:
"Enrique was born in Spain"

Now, by looking at this sentence any human can understand that Spain is a Location. But machines are unable to do so without previous learning.

So, to teach the computer, we need to identify a set of features that link what we observe in this sentence with the class we want to predict, which in this case is "Location". How can we do that?

Considering the token/word "Spain" by itself is not sufficient to decide that it is a location in a generic manner. So, we also consider its "context", which includes the previous/next word, its POS tag, the previous NE tag, etc., to infer the NE tag for the token "Spain".

Feature Template and Training dataset

In this example, I will use "previous word" as the feature. So, we will define this in feature template as given below:

# Unigram
U00:%x[-1,0]

# Bigram
B

U00 is a unique ID that identifies the feature.

I will explain %x[row, column] using the following sentence, which we are going to use to train the model.
I live in Colombo

First, we need to write the sentence in the following format, one token per line. (train.data)
I O
live O
in O
Colombo Location

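Suppose the current word is "Colombo". Then, in %x[-1,0]:
row -1: the token one row above the current one, which is "in"
column 0: the first column, which is the word itself (Here, I have given only one feature column. But new columns are added when we define more features such as the POS tag.)
In the above training file, the last column holds the answers (the NE tags) that we give to the model.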

So, this feature tells the model that after the word "in", it is "likely" to find a "Location".
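
To make the %x[row, column] lookup concrete, here is a small Java sketch that expands U00:%x[-1,0] for every token of the training sentence. It only illustrates the idea and is not how CRF++ is implemented internally. The class and method names are made up for this post, and the "_B-1" boundary marker used for the first token is an assumption about what CRF++ substitutes when the row falls outside the sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: mimics how a template macro such as %x[-1,0]
// is expanded for each token. Not CRF++'s actual implementation.
final class TemplateExpansionSketch {

    // Hypothetical helper: value at (current row + rowOffset, column), or a
    // boundary placeholder (an assumption) when that row is outside the sentence.
    static String expand(String[][] rows, int current, int rowOffset, int column) {
        int target = current + rowOffset;
        if (target < 0) {
            return "_B" + target;                          // e.g. "_B-1" before the first token
        }
        if (target >= rows.length) {
            return "_B+" + (target - rows.length + 1);     // e.g. "_B+1" after the last token
        }
        return rows[target][column];
    }

    // Builds the feature string "U00:<previous word>" for every token.
    static List<String> previousWordFeatures(String[][] rows) {
        List<String> features = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            features.add("U00:" + expand(rows, i, -1, 0));
        }
        return features;
    }

    public static void main(String[] args) {
        // Column 0 is the word, column 1 is the NE tag (the answer column).
        String[][] sentence = {
            {"I", "O"}, {"live", "O"}, {"in", "O"}, {"Colombo", "Location"}
        };
        List<String> features = previousWordFeatures(sentence);
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i][0] + " (" + sentence[i][1] + ") -> " + features.get(i));
        }
    }
}
```

For the token "Colombo" the expanded feature is "U00:in", and it is paired with the answer "Location". That pairing is what the model learns a weight for.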
Now we train the model:

crf_learn template train.data model

The model file is generated from the feature template and the training data file.

Inference

Now we need to know if the following sentence has any important entities such as Location.
"Enrique was born in Spain"

We need to format the input file according to the same format, but without the answer column. (test.data)
Enrique
was
born
in
Spain

Now we use the following command to test the model.

crf_test -m model test.data

Outcome would be the following:

Enrique O
was O
born O
in O
Spain Location

Likewise, the model will give predictions on entities present in the input files based on the given features and available training data.
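
If you want to use the predictions in a program, the tagged output can be read back line by line. The following Java sketch assumes the output has been captured as a list of lines in the format shown above (token first, predicted tag in the last column, blank lines between sentences). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: picks out the tokens that were tagged with a given label.
final class TaggedOutputReader {

    // Returns the first column of every line whose last column equals the tag.
    static List<String> tokensWithTag(List<String> lines, String tag) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines separate sentences
            }
            String[] columns = trimmed.split("\\s+");
            if (columns.length >= 2 && columns[columns.length - 1].equals(tag)) {
                result.add(columns[0]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> output = Arrays.asList(
            "Enrique O", "was O", "born O", "in O", "Spain Location");
        System.out.println(tokensWithTag(output, "Location"));
    }
}
```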

Note: Check the Unicode encoding of your data files for different languages. E.g., for the Sinhala language, save the files as UTF-8.

Coming up next...

Probabilistic Graphical Models
Conditional Probability
Finite State Automata
First-order Markov independence assumption

Source code:
https://bitbucket.org/jaywith/sinhala-named-entity-recognition

Jayani Withanawasam

Sunday, February 2, 2014

Better approach to load resources using relative paths in Java

FileInputStream (Absolute path)

To load a resource file such as x.properties for program use, the first thing we would consider is specifying the absolute file path as given below:

InputStream input = new FileInputStream("/Users/jwithanawasam/some_dir/src/main/resources/config.properties");

However, whenever we move the project to another location, this path has to be changed, which is not acceptable.

FileInputStream (Relative path)

So, the next option would be to use the relative file path as given below, instead of giving absolute file path:

InputStream input = new FileInputStream("src/main/resources/config.properties");

This approach seems to solve the above-mentioned concern.

However, the problem with this is that the relative path depends on the current working directory from which the JVM is started. In this scenario, it is "/Users/jwithanawasam/some_dir". But in a different deployment setting this may change, which means the relative path has to be changed accordingly. Moreover, we as developers do not have much control over the JVM's current working directory.

In either of the above cases, a wrong path results in a java.io.FileNotFoundException, which is a familiar exception for most Java developers.
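
The dependency on the working directory is easy to see for yourself. The following sketch prints the JVM's working directory (the "user.dir" system property) and the absolute path that a relative path resolves to. The class and method names are made up for this example. Run it from two different directories and the resolved path changes.

```java
import java.io.File;

// Illustrative sketch: shows that a relative path is resolved against
// the directory the JVM was started from.
final class WorkingDirDemo {

    // Resolves a relative path against the JVM's current working directory.
    static String resolveAgainstWorkingDir(String relativePath) {
        return new File(System.getProperty("user.dir"), relativePath).getAbsolutePath();
    }

    public static void main(String[] args) {
        String relative = "src/main/resources/config.properties";
        File file = new File(relative);
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Resolved to:       " + file.getAbsolutePath());
        System.out.println("Exists:            " + file.exists());
    }
}
```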

class.getResourceAsStream

At runtime, the JVM checks the class path to locate any user-defined classes and packages. (In Maven, build artifacts and dependencies are stored under the path given for the M2_REPO class path variable. E.g., /Users/jwithanawasam/.m2/repository) The .jar file, which is the deployable unit of the project, will be located here.

The JVM uses a class loader to load the Java libraries specified in the class path.

So, the best thing we can do is to load the resource through the class loader, specifying a path relative to the class path. Here, the specified relative path will work irrespective of the actual disk location where the package is deployed.

The following call reads the file using the class loader.

InputStream input = Test.class.getResourceAsStream("/config.properties");

Usually, in Java projects, resources such as configuration files and images are located under src/main/resources/. So, if we add a resource directly inside this folder, then during packaging the file will be placed at the root of the .jar file. The leading "/" in the call above means the path is taken from the root of the class path. Without it, the path is resolved relative to the package of the class.

We can verify this by extracting the contents of the .jar file with the following command:

jar xf someproject.jar

If you place the resource in a sub folder, then you have to specify its path relative to src/main/resources/.

So, using this approach we can load resources using relative paths in a manner that is independent of the hard disk location. Once we package the application, it is ready to be deployed anywhere, as it is, without the overhead of having to validate resource file paths, thus improving the portability of the application.
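
Putting this together, a minimal sketch of loading a properties file from the class path could look like the following. The class name ClasspathConfig and the decision to return an empty Optional when the resource is missing are my own choices for illustration. Note that getResourceAsStream returns null when the resource is not found, rather than throwing an exception, so the null check matters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

// Illustrative sketch: loads a .properties file relative to the class path.
final class ClasspathConfig {

    // Returns the loaded properties, or empty if the resource is not on the class path.
    static Optional<Properties> load(String resourcePath) {
        try (InputStream input = ClasspathConfig.class.getResourceAsStream(resourcePath)) {
            if (input == null) {
                return Optional.empty(); // resource not found on the class path
            }
            Properties properties = new Properties();
            properties.load(input);
            return Optional.of(properties);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "/config.properties" is looked up at the root of the class path.
        Optional<Properties> config = load("/config.properties");
        if (config.isPresent()) {
            System.out.println("Loaded keys: " + config.get().stringPropertyNames());
        } else {
            System.out.println("config.properties was not found on the class path");
        }
    }
}
```

The try-with-resources statement closes the stream after loading, which the one-line call shown earlier leaves to the caller.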

ServletContext.getResourceAsStream for web applications

For web applications, use the following method:

ServletContext context = getServletContext();
InputStream is = context.getResourceAsStream("/filename.txt");

Here, the file path is taken relative to your web application folder (the unzipped version of the .war file).

E.g., mywebapplication.war (unzipped) will have a hierarchy similar to the following.

mywebapplication
    META-INF
    WEB-INF
        classes
    filename.txt

So, "/" means the root of this web application folder.

This method allows servlet containers to make a resource available to a servlet from any location, without using a class loader.
