
Saturday, February 22, 2014

Named Entity Recognition using Conditional Random Fields (CRF)

Named Entity Recognition

Named Entity Recognition (NER) is a significant method for extracting structured information from unstructured text and organizing that information in a semantically accurate form for further inference and decision making.

NER is a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization, and automatic question answering.

In NER tasks, the frequently detected entities are Person, Location, Organization, Time, Currency, Percentage, Phone number, and ISBN.

For example, when translating Sinhala text to English, we need to identify the person names and locations in the text so that we can avoid the overhead of looking up English meanings for them. NER is also helpful in question answering scenarios such as "Where was Enrique born?"

Different methods, such as rule-based systems, statistical models, and gazetteers, have been used for the NER task. However, statistical approaches have been more prominent, and the other methods are used to refine the results as a post-processing mechanism.

In computational statistics, NER is treated as a sequence labeling task, and Conditional Random Fields (CRF) have been used successfully to implement it.

In this article I will use CRF++ to explain how to implement a named entity recognizer using a simple example.

Consider the following input sentence:
"Enrique was born in Spain"

Now, by looking at this sentence any human can understand that Spain is a Location. But machines are unable to do so without previous learning.

So, to teach the computer, we need to identify a set of features that link what we observe in this sentence to the class we want to predict, which in this case is "Location". How can we do that?

Considering the token/word "Spain" by itself is not sufficient to decide, in a generic manner, that it is a location. So we also consider its "context", which includes the previous/next word, its POS tag, the previous NE tag, etc., to infer the NE tag for the token "Spain".
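The idea of "context" can be sketched in a few lines of Python. This is only an illustration, not part of CRF++: the function name, the boundary markers <BOS>/<EOS>, and the hand-assigned POS tags are all assumptions made for the example. The previous NE tag is left out because it is only known while the model is labeling the sequence.

```python
def context_features(tokens, pos_tags, i):
    """Collect simple context features for the token at position i."""
    return {
        "word": tokens[i],
        # <BOS>/<EOS> are hypothetical markers for "no previous/next word"
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
        "pos": pos_tags[i],
    }

tokens = ["Enrique", "was", "born", "in", "Spain"]
pos_tags = ["NNP", "VBD", "VBN", "IN", "NNP"]  # assigned by hand for illustration

features = context_features(tokens, pos_tags, 4)  # features for "Spain"
print(features)
```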

Feature Template and Training Dataset

In this example, I will use the "previous word" as the feature. We define this in the feature template as given below:

# Unigram
U00:%x[-1,0]

# Bigram
B

U00 is a unique ID that identifies the feature.

I will explain %x[row, column] using the following sentence, which we are going to use to train the model.
I live in Colombo

First, we need to write the sentence in the following format, one token per line. (train.data)
I O
live O
in O
Colombo Location

Current word: Colombo
-1 (row): the previous word, "in"
0 (column): the first column (Here, I have given only one feature column. More columns are added when we define more features, such as the POS tag.)
In the above training file, the last column holds the answers (NE tags) that we give the model to learn from.

So, this feature tells the model that after the word "in", it is "likely" to find a "Location".
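To make the %x[row, column] macro concrete, here is a small Python sketch that expands the template line against the training rows. It only mimics the lookup for illustration; it is not CRF++ code. The function name and the <PAD> placeholder are assumptions (CRF++ uses its own boundary strings when the row falls outside the sentence).

```python
import re

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand(template, rows, pos):
    """Replace each %x[row,col] macro with the value it points to."""
    def lookup(match):
        row = pos + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))        # column is an absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"  # hypothetical placeholder for rows outside the sentence
    return MACRO.sub(lookup, template)

rows = [["I", "O"], ["live", "O"], ["in", "O"], ["Colombo", "Location"]]

feature = expand("U00:%x[-1,0]", rows, 3)  # current word: Colombo
print(feature)  # U00:in
```

For the current word "Colombo", the macro points one row up and at column 0, so the feature becomes "U00:in", which the model sees together with the answer "Location".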
Now we train the model:

crf_learn template train.data model

The model file is generated from the feature template and the training data file.
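If you prefer to prepare the files from a script, the steps above could look roughly like the following Python sketch. The file names and the crf_learn argument order are taken from the command above; the helper name to_crfpp is hypothetical, and the blank line written after each sentence is how CRF++ separates sentences. The training command is only run if crf_learn is found on the PATH.

```python
import shutil
import subprocess
from pathlib import Path

TEMPLATE = "# Unigram\nU00:%x[-1,0]\n\n# Bigram\nB\n"
SENTENCES = [[("I", "O"), ("live", "O"), ("in", "O"), ("Colombo", "Location")]]

def to_crfpp(sentences):
    """One 'token tag' pair per line, with a blank line after each sentence."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n\n"

train_text = to_crfpp(SENTENCES)
Path("template").write_text(TEMPLATE, encoding="utf-8")
Path("train.data").write_text(train_text, encoding="utf-8")

# Same command as in the article, expressed as an argument list.
command = ["crf_learn", "template", "train.data", "model"]
if shutil.which("crf_learn"):  # skip silently when CRF++ is not installed
    result = subprocess.run(command)
    print("crf_learn exit code:", result.returncode)
```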

Inference

Now we need to know whether the following sentence contains any important entities, such as a Location.
"Enrique was born in Spain"

We need to write the input file in the same format, this time without the answer column. (test.data)
Enrique
was
born
in
Spain

Now we use the following command to test the model.

crf_test -m model test.data

Outcome would be the following:

Enrique O
was O
born O
in O
Spain Location

Likewise, the model will give predictions on entities present in the input files based on the given features and available training data.
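To pull the detected entities out of this output programmatically, a short Python sketch like the one below may help. It assumes the output has been captured as text, that columns are separated by whitespace, and that the last column is the predicted tag; the function name parse_output is hypothetical.

```python
def parse_output(text):
    """Turn crf_test output into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        columns = line.split()
        if columns:
            # first column is the token, last column is the predicted tag
            pairs.append((columns[0], columns[-1]))
    return pairs

output = """Enrique O
was O
born O
in O
Spain Location
"""

# Keep only the tokens that were tagged as something other than O.
entities = [(tok, tag) for tok, tag in parse_output(output) if tag != "O"]
print(entities)  # [('Spain', 'Location')]
```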

Note: Check the character encoding of your files when working with different languages. E.g., for the Sinhala language, save them as UTF-8.

Coming up next...
  • Probabilistic Graphical Models
  • Conditional Probability
  • Finite State Automata
  • First-order Markov independence assumption

Source code:
https://bitbucket.org/jaywith/sinhala-named-entity-recognition

Jayani Withanawasam
