Dev 007: Review of state of the art named entity recognition methods for different languages

1. Introduction

Named Entity Recognition (NER) is a significant method for extracting structured information from unstructured text and organising it in a semantically accurate form for further inference and decision making.

NER is a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization and automatic question answering.

Because language characteristics differ widely, NER is effectively a language- and domain-specific task. For languages such as English and German, named entity recognition has been an easier task than for Asian languages because of helpful orthographic features (e.g., names begin with a capital letter).

However, for most languages named entity recognition remains a challenging task due to the lack of annotated corpora, complex morphological characteristics and homography.

Linguistic issues for South Asian languages include their agglutinative nature, absence of capitalization, ambiguity, low POS tagging accuracy, lack of good morphological analyzers, lack of name dictionaries, multiple ways of representing acronyms, free word order and spelling variation [2][4][9].

The entities most frequently detected in NER tasks are Person, Location, Organization, Time, Currency, Percentage, Phone number and ISBN.

2. Different methods for NER

Current approaches to the NER task can be categorized into machine learning, rule based and hybrid methods.

2.1. Machine learning based methods

Machine learning (statistical) methods require large amounts of annotated data. They are less expensive to maintain than rule based methods because less expert knowledge is required. However, extending a model to cover new names is costly, since re-training is required.

Machine learning techniques treat the NER task as a sequence tagging problem.
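
To make the sequence tagging view concrete, here is a minimal Python sketch that turns entity spans into one label per token. The BIO labelling scheme, the function name `spans_to_bio` and the sample sentence are assumptions chosen for illustration; the reviewed papers may use different tag sets.

```python
from typing import List, Tuple

def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, type) token spans (end exclusive) into BIO labels."""
    labels = ["O"] * len(tokens)          # every token starts as Other
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"      # remaining tokens of the entity
    return labels

# Illustrative sentence with a two-token person name and a location.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
spans = [(0, 2, "PER"), (5, 6, "LOC")]
bio_labels = spans_to_bio(tokens, spans)
```

A sequence tagger is then trained to predict one such label for each token, using the token and its context as input.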

2.1.2. Algorithms for NER

2.1.2.1. Supervised learning methods

Algorithm	Description
Conditional Random Fields (CRF) [2], [3], [6], [7], [9], [11], [13], [15]	Discriminative, undirected graphical model with a first-order Markov independence assumption; models the conditional probability of the label sequence. More efficient than HMM for the non-independent, diverse, overlapping features of highly inflective languages. Framework: CRF++ [10]
Hidden Markov Models (HMM) [9]
Maximum Entropy (MaxEnt) [1]
Maximum Entropy Markov Model (MEMM)
Support Vector Machines (SVM) [5]
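
For reference, the conditional probability that a linear-chain CRF assigns to a label sequence can be written as below. This is the standard textbook form, not a formula taken from the reviewed papers; the symbols (feature functions f_k, weights \lambda_k, normaliser Z(x)) are the usual notation and are introduced here only for illustration.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Each feature function may inspect the whole input x, which is why overlapping, non-independent features are easy to combine; the dependence on only y_{t-1} and y_t is the first-order Markov assumption.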

2.1.2.2. Semi-supervised learning methods

2.2. Rule based methods

Rule identification has to be done manually by linguists and requires language specific knowledge. These methods include lexicalized grammars, gazetteer lists and lists of trigger words [2].

The rules generated for one language cannot be directly transferred to another language. Also, rule based methods do not perform well in ambiguous/ uncertain situations.

Rules can be positive (indicating that something is a named entity) or negative (indicating that it is not).

In [1], 36 rules are defined for the time, measure and number classes. The rules contain corresponding entries for each language so that they act in a language independent manner. In addition, semi-automatic extraction of context patterns is used to improve accuracy.

[2] and [3] use rule based methods to find nested tags in order to improve recall.

Regular expressions are used in [4] and [5] to identify person names and organization names. In [7], dictionaries are used to check whether part of a word is present in the dictionary.

In [14], a rule based NER engine is built from a white list, representing a dictionary of names, and a grammar in the form of regular expressions. A heuristic disambiguation technique is applied to select the correct choice when an ambiguous situation arises.
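
The combination of a white list and a regular-expression grammar can be sketched as follows. This is a toy illustration only, not the engine from [14]: the white list entries, the honorific pattern, the function name `rule_based_ner` and the English sample sentence are all assumptions, and no disambiguation step is included.

```python
import re
from typing import List, Tuple

# Hypothetical white list: a tiny dictionary of known names and their types.
WHITE_LIST = {"Cairo": "LOCATION", "Nile": "LOCATION"}

# Hypothetical grammar rule: an honorific followed by a capitalised word.
PERSON_PATTERN = re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+([A-Z][a-z]+)")

def rule_based_ner(text: str) -> List[Tuple[str, str]]:
    """Return (entity, type) pairs found by the grammar rule and the white list."""
    entities = []
    # Apply the regular-expression rule first.
    for match in PERSON_PATTERN.finditer(text):
        entities.append((match.group(1), "PERSON"))
    # Then look every word up in the white list.
    for word in re.findall(r"\w+", text):
        if word in WHITE_LIST:
            entities.append((word, WHITE_LIST[word]))
    return entities

found = rule_based_ner("Dr. Hassan travelled to Cairo.")
```

A real system for a language without capitalization would need different patterns, for example trigger words before or after the candidate name.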

2.3. Hybrid methods

According to [1] and [2], hybrid systems have generally been more effective for NER, with proven results.

In [1], a hybrid solution is suggested for NER which consists of a baseline NER system built on a MaxEnt model. To increase performance, language specific rules and gazetteers are used. In addition, a set of rules is applied to detect nested entities (e.g., district and town entities nested inside a location entity). The supported languages are Hindi, Bengali, Oriya, Telugu and Urdu.

An important finding of [1] is that when the available training set is small, using rule based methods can improve the f-measure.

[2] suggests a machine learning based approach for NER in South Asian languages that uses Conditional Random Fields (CRFs) with feature induction, with heuristics based rules as a post processing mechanism.

Here, tokens that the CRF has tagged as O (Other) are re-checked against the given rules; if the confidence level exceeds a given threshold (e.g., 0.15), the suggested tag is used as the named entity tag instead of O. This approach improved recall by 7% while causing a slight decrease in precision (3%).
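
The post processing step can be sketched roughly as below. How the confidence is computed is not specified in this review, so the sketch assumes that each rule returns its own score; the suffix rule, its score of 0.6, and the names `suffix_rule` and `relabel_other` are hypothetical. Only the 0.15 threshold comes from the text above.

```python
from typing import Callable, List, Optional, Tuple

# A rule maps a token to (suggested_tag, confidence), or None if it does not fire.
Rule = Callable[[str], Optional[Tuple[str, float]]]

def suffix_rule(token: str) -> Optional[Tuple[str, float]]:
    # Hypothetical rule: tokens ending in "pur" are likely place names.
    if token.lower().endswith("pur"):
        return ("LOCATION", 0.6)
    return None

def relabel_other(tokens: List[str], tags: List[str], rules: List[Rule],
                  threshold: float = 0.15) -> List[str]:
    """Re-check tokens tagged O and replace the tag when a rule is confident enough."""
    result = list(tags)
    for i, (token, tag) in enumerate(zip(tokens, tags)):
        if tag != "O":
            continue                      # tags chosen by the CRF are left untouched
        best = None
        for rule in rules:
            suggestion = rule(token)
            if suggestion is not None and suggestion[1] > threshold:
                if best is None or suggestion[1] > best[1]:
                    best = suggestion     # keep the most confident suggestion
        if best is not None:
            result[i] = best[0]
    return result

tokens = ["He", "visited", "Nagpur", "yesterday"]
crf_tags = ["O", "O", "O", "O"]           # assumed CRF output that missed the location
new_tags = relabel_other(tokens, crf_tags, [suffix_rule])
```

Because only O tags are revisited, this kind of step can add entities but never remove them, which is consistent with a gain in recall at some cost in precision.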

[3] uses a hybrid approach for NER that combines CRFs, language rules and gazetteer lists.

In [4], a CRF model is combined with rule based methods for the Telugu language.

In [8], a three-stage approach is suggested for the NER task: use of an NE dictionary, rules for named entities, and left-right co-occurrence statistics. In the third stage, n-gram based named entity detection is performed. This is a supervised method that relies on the co-occurrence of left and right words.

A hybrid approach based on CRF and HMM is suggested in [9] for NER in Indian languages. It concludes that combining the two statistical models gives better results than using either one alone.

[16] uses a hybrid approach with two main steps. First, a set of constraints is generated for each type, such as Person, Location and Organization. These are compiled by linguists and represented as finite-state automata (FSA) to generate the most likely candidates. Then each candidate is assigned a class probability, and a generative class model is built on this basis. Transliteration is used to identify foreign names.

2.4. Referencing gazetteer lists

This is the simplest and fastest method of named entity recognition. However, since named entities are numerous and constantly evolving, this approach alone is not sufficient for effective NER. Nevertheless, [5] found that incorporating gazetteer lists can significantly improve performance.

In [1], gazetteer lists were created using transliteration for month names, days, common locations, first names, middle names, last names etc.

[2] uses gazetteer lists of measures (kilogram, lakhs), numerals and quantifiers (first, second) and time expressions (date, month, minutes, hours).
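
A gazetteer lookup of the kind described in this section can be sketched as a longest-match search over token n-grams. The entries, class names and the function name `gazetteer_tag` below are made up for illustration and are not taken from [1], [2] or [5].

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: lower-cased token tuples mapped to an entity class.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("new", "delhi"): "LOCATION",
    ("delhi",): "LOCATION",
    ("january",): "MONTH",
    ("kilogram",): "MEASURE",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def gazetteer_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (phrase, class) pairs, preferring the longest entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                matches.append((" ".join(tokens[i:i + n]), GAZETTEER[key]))
                i += n                    # skip past the matched phrase
                break
        else:
            i += 1                        # no entry starts at this token
    return matches

hits = gazetteer_tag(["She", "reached", "New", "Delhi", "in", "January"])
```

In a statistical system the same lookup is typically exposed as a binary feature (token is in a given list or not) rather than used as the final decision.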

2.5. Other methods

A phonetic matching technique is used in [12] for NER in Indian languages, based on the similar-sounding property of names. The authors used Stanford NER as the reference entity database and built a Hindi named entity database using a phonetic matcher.
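
The idea of matching names by sound can be illustrated with a simplified Soundex-style key. This is a stand-in technique, not the matcher used in [12]: it assumes both names are already written in Latin script, and the consonant classes and the function name `phonetic_key` are illustrative choices.

```python
# Consonant classes in the style of Soundex; vowels and h, w, y get no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit

def phonetic_key(name: str) -> str:
    """Return a 4-character key: first letter plus up to three consonant codes."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:         # skip uncoded letters and adjacent repeats
            digits.append(code)
        prev = code
    return (letters[0].upper() + "".join(digits) + "000")[:4]

def sounds_alike(a: str, b: str) -> bool:
    """Two names match when their phonetic keys are identical."""
    return phonetic_key(a) == phonetic_key(b)

same = sounds_alike("Robert", "Rupert")
```

Names whose keys agree are treated as candidates for the same entity, which lets entries from a reference database be matched against differently spelled forms.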

In [17], external resources such as Wikipedia infobox features are used to infer entity names, along with a word clustering algorithm that partitions words into classes based on their co-occurrence statistics in a large corpus.

3. Feature Selection

In feature selection, the available word and tag context plays a major role. Many systems use binary features, which represent the presence or absence of a given property of a word.

The features used in [1] with the MaxEnt model are: static words (previous and next words), context lists (frequent words in a given window for a particular class, e.g., "city" and "going to" for the Location class), dynamic NE tag (NE tag of the previous word), first word, contains digit, numerical characters, affixes (word suffix, word prefix), root information of the word and Part of Speech (POS) tag.

[1] reports that a window of (w-2, w+2) gives the best results. It also shows that using a more complex feature set does not guarantee better results.
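
A rough sketch of how such word-level features might be extracted for one token is given below. The feature names, the 3-character affix length, the `<PAD>` marker and the function name `window_features` are assumptions for illustration; only the (w-2, w+2) window and the general feature types come from the text above.

```python
from typing import Dict, List

def window_features(tokens: List[str], i: int, size: int = 2) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i] using a symmetric context window."""
    word = tokens[i]
    features: Dict[str, object] = {
        "word": word.lower(),
        "is_first_word": i == 0,                        # binary feature
        "contains_digit": any(ch.isdigit() for ch in word),
        "prefix3": word[:3],                            # affix features
        "suffix3": word[-3:],
    }
    # Context words from w-size to w+size, excluding the current word.
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        features[f"w{offset:+d}"] = tokens[j].lower() if inside else "<PAD>"
    return features

tokens = ["Flights", "to", "Colombo", "leave", "at", "9am"]
feats = window_features(tokens, 2)        # features for "Colombo"
```

Dictionaries of this shape can be fed to most sequence labelling toolkits after conversion to the toolkit's own feature format; POS tags and the previous NE tag would be added in the same way.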

[2] uses language independent features such as a window of words (window size 5), statistical suffixes for person and location entities (extracted as lists and used as binary features), prefixes (to handle agglutination and postpositions), start of sentence and presence of digits.

Prefix and suffix information is used as features in [3] because Indian languages are highly inflected (window size 5). In addition, previous word tags, rare word (most frequent words in language) and POS tags are used. Oriya, Urdu and Telugu showed poor performance compared with Hindi and Bengali due to poor language features.

In [4], a “majority tag” is used as an additional feature; it uses contextual and frequency information of other tags that are literally similar in order to label an unlabelled token.

The experimental results of [5] show that a window of [-3, +2] gives the best results and that increasing the window size decreases the f-measure.

4. Recognizing different entity types

Entity type	Method	Challenges
Person name	Look-up procedure; analysis of local lexical context; examining part of the sequence of candidate words (name components). Features: POS tags, capitalization, decimal digits, bag of words, left and right context, token length	Name variations (the same person referred to by different names): reuse of name parts, morphological variants, prefixes etc.; transliteration differences; a person name can be a proper noun
Organization	Use of organization specific candidate words	Various ways of representing abbreviations
Place	Use of gazetteers and trigger words (e.g., New York City)	Homography with common names, historical variants, exonyms (foreign variants), endonyms (local variants)

5. Summary and Conclusion

The approaches that recur across the reviewed work are CRF based models, hybrid systems, chains of named entity recognizers, and rule based methods used as a post processing mechanism.

6. References

[1] A Hybrid Approach for Named Entity Recognition in Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[2] Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[3] Language Independent Named Entity Recognition in Indian Language: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[4] Named Entity Recognition for Telugu: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[5] Bengali Named Entity Recognition using Support Vector Machine: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[6] Domain Focused Named Entity Recognizer for Tamil Using Conditional Random Fields: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[7] A Character n-gram Based Approach for Improved Recall in Indian Language NER: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[8] An Experiment on Automatic Detection of Named Entities in Bangla: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[9] A Hybrid Named Entity Recognition System for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[10] CRF++: Yet Another CRF toolkit: http://crfpp.googlecode.com/svn/trunk/doc/index.html

[11] Named Entity Recognition for South Asian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[12] Named Entity Recognition for Indian Languages: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[13] Experiments in Telugu NER: A Conditional Random Field Approach: NER for South and South East Asian Languages: IJCNLP-08 Workshop: 2008

[14] NERA: Named Entity Recognition for Arabic: Journal of the American Society for Information Science and Technology: Volume 60 Issue 8, August 2009 Pages 1652-1663

[15] Integrated Machine Learning Techniques for Arabic Named Entity Recognition: International Journal of Computer Science Issues (IJCSI): Vol. 7 Issue 4, July 2010, Pages 27-36

[16] Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach: Microsoft Research - China

[17] A Named Entity Labeler for German: exploiting Wikipedia and distributional clusters

Wednesday, May 6, 2015
