1. Introduction
Named Entity Recognition (NER) is a significant method for extracting
structured information from unstructured text and organising it in a
semantically accurate form for further inference and decision making.
NER is a key pre-processing step for most natural language processing
applications, such as information extraction, machine translation, information
retrieval, topic detection, text summarization and automatic question
answering.
Because language characteristics are so diverse, NER can be regarded as a
language/domain specific task. For languages such as English and German, named
entity recognition is an easier task than for Asian languages because of
helpful orthographic features (e.g., proper nouns begin with a capital letter).
However, for most languages named entity recognition remains a challenging
task due to the lack of annotated corpora, complex morphological
characteristics and homography.
Linguistic issues for South Asian languages include their agglutinative
nature, the absence of capitalization, ambiguity, low POS tagging accuracy, the
lack of good morphological analyzers, the lack of name dictionaries, multiple
ways of representing acronyms, free word order and spelling variation
[2][4][9].
The entities most frequently detected in NER tasks are Person, Location,
Organization, Time, Currency, Percentage, Phone number and ISBN.
2. Different methods for NER
The current approaches to the NER task can be categorized into machine
learning, rule based and hybrid methods.
2.1. Machine learning based methods
Machine learning/statistical methods require a large amount of annotated data.
They are less expensive to maintain than rule based methods, as less expert
knowledge is required. However, adding new names is costly for machine learning
methods, as re-training is required.
Machine learning techniques treat the NER task as a sequence tagging problem.
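To illustrate the sequence tagging view, here is a minimal sketch. It assumes the widely used BIO labelling scheme (B- begins an entity, I- continues it, O is outside any entity), which the systems surveyed here do not necessarily use. The function name `to_bio` and the example sentence are hypothetical.

```python
def to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    spans: list of (start, end, label) over token indices, `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


# Hypothetical example sentence and annotations.
tokens = ["Sachin", "Tendulkar", "was", "born", "in", "Mumbai"]
spans = [(0, 2, "PERSON"), (5, 6, "LOCATION")]
tags = to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A sequence tagger is then trained to predict one such tag for every token, given features of the token and its neighbours.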
2.1.2. Algorithms for NER
2.1.2.1. Supervised learning methods
Algorithm | Description
Conditional Random Fields (CRF) [2], [3], [6], [7], [9], [11], [13], [15] | Discriminative, undirected graphical model; first order Markov independence assumption; models the conditional probability of the labeled sequence. More efficient than HMM for the non-independent, diverse, overlapping features of highly inflective languages. Framework: CRF++ [10]
Hidden Markov Models (HMM) [9] |
Maximum Entropy (MaxEnt) [1] |
Maximum Entropy Markov Model (MEMM) |
Support Vector Machines (SVM) [5] |
2.1.2.2. Semi-supervised learning methods
2.2. Rule based methods
Rules have to be identified manually by linguists, which requires language
specific knowledge. These methods include lexicalized grammars, gazetteer lists
and lists of trigger words [2].
The rules generated for one language cannot be directly transferred to another
language. Also, rule based methods do not perform well in ambiguous/uncertain
situations.
There can be
positive and negative rules.
In [1], 36 rules are defined for the time, measure and number classes. The
rules contain corresponding entries for each language so that they act in a
language independent manner. In addition, semi-automatic extraction of context
patterns is used to improve accuracy.
[2] and [3] have used rule based methods to find nested tags in order to
improve recall.
Regular expressions have been used in [4], [5] to identify person names and
organization names. In [7], dictionaries are used to check whether part of a
word is present in the dictionary.
In [14], a rule based NER engine is created using a white list, which
represents a dictionary of names, and a grammar in the form of regular
expressions. A heuristic disambiguation technique is applied to select the
correct choice when an ambiguous situation arises.
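A white list combined with regular expression rules can be sketched as follows. This is only an illustration under stated assumptions: the white list entries, the two patterns and the English example sentence are hypothetical and are not taken from [14], and the heuristic disambiguation step is not implemented.

```python
import re

# Hypothetical white list: a tiny stand-in for a real dictionary of names.
WHITE_LIST = {"Colombo": "LOCATION", "Kandy": "LOCATION"}

# Hypothetical grammar rules expressed as regular expressions.
RULES = [
    # A title followed by one or more capitalized words.
    ("PERSON",
     re.compile(r"\b(?:Mr|Mrs|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")),
    # Capitalized words ending in an organization trigger word.
    ("ORGANIZATION",
     re.compile(r"\b((?:[A-Z][a-z]+\s+)+(?:University|Bank|Ltd))\b")),
]


def rule_based_ner(text):
    """Return (entity text, label) pairs found by the rules and the white list."""
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.group(1), label))
    for name, label in WHITE_LIST.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            entities.append((name, label))
    return entities


found = rule_based_ner("Dr. Nimal Perera joined Ruhuna University in Colombo.")
print(found)
```

Patterns like these rely on capitalization and so would not carry over to languages without it, which is one reason rules cannot be transferred directly between languages.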
2.3. Hybrid methods
It is stated in [1], [2] that hybrid systems have generally been more
effective in the task of NER, with proven results.
In [1], a hybrid solution is suggested for NER which consists of a baseline NER
system built on a MaxEnt model. To increase performance, language specific
rules and gazetteers are used. Further, a set of rules is applied to detect
nested entities (e.g., district and town entities nested within a location
entity). The supported languages are Hindi, Bengali, Oriya, Telugu and Urdu.
An important finding of [1] is that, if the available training set is small,
using rule based methods can improve the f-measure.
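For reference, the f-measure is the harmonic mean of precision and recall. The sketch below assumes exact-match scoring at the entity level, where an entity counts as correct only if both its span and its label match; the surveyed papers may score differently. The function name and the example entities are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (start, end, label)."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)          # exact span and label matches
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold annotations and system output.
gold = {(0, 2, "PERSON"), (5, 6, "LOCATION"),
        (8, 9, "ORGANIZATION"), (11, 12, "PERSON")}
pred = {(0, 2, "PERSON"), (5, 6, "LOCATION"), (8, 9, "PERSON")}
p, r, f = precision_recall_f1(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here 2 of the 3 predicted entities are correct (precision 2/3) and 2 of the 4 gold entities are found (recall 1/2), giving an f-measure of 4/7.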
[2] has suggested a machine learning based approach using Conditional Random
Fields (CRFs) with feature induction, and heuristics based rules as a post
processing mechanism, for NER in South Asian languages.
Here, the tokens which the CRF has categorized as O (Other) are reconsidered
for adherence to the given rules, and if the confidence level exceeds a given
threshold (e.g., 0.15) then the suggested tag is taken as the named entity tag
instead of O. This approach has improved recall by 7% while causing a slight
decrease in precision (3%).
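The post processing step described above can be sketched as follows. How [2] computes the confidence of a suggested tag is not described here, so the scores in `rule_suggestions` are hypothetical inputs, as are the function name and the example sentence.

```python
THRESHOLD = 0.15  # example threshold quoted in the text


def reconsider_other_tags(crf_tags, rule_suggestions, threshold=THRESHOLD):
    """Replace O tags with a rule-suggested tag when its confidence is high enough.

    rule_suggestions: dict mapping token index -> (tag, confidence).
    """
    final = list(crf_tags)
    for i, tag in enumerate(crf_tags):
        if tag != "O":
            continue  # only tokens the CRF labelled O are revisited
        suggestion = rule_suggestions.get(i)
        if suggestion is not None and suggestion[1] > threshold:
            final[i] = suggestion[0]
    return final


# Hypothetical CRF output for: Ravi visited Hyderabad yesterday
crf_tags = ["PERSON", "O", "O", "O"]
rule_suggestions = {
    0: ("ORGANIZATION", 0.90),  # ignored: the CRF already assigned a tag
    2: ("LOCATION", 0.40),      # above the threshold: O becomes LOCATION
    3: ("TIME", 0.10),          # below the threshold: stays O
}
final_tags = reconsider_other_tags(crf_tags, rule_suggestions)
print(final_tags)
```

Because only O tags are ever overwritten, the step can add entities but never remove them, which is consistent with the reported gain in recall at some cost in precision.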
[3] has used a hybrid approach for NER with CRFs, language rules and gazetteer
lists.
A CRF model is used together with rule based methods in [4] for the Telugu
language.
In [8], a 3 stage approach is suggested for the NER task: use of an NE
dictionary, rules for named entities, and left-right co-occurrence statistics.
In the 3rd stage, n-gram based named entity detection is performed. This
approach is a supervised method that relies on the co-occurrence of left and
right words.
A CRF and HMM based hybrid approach is suggested in [9] for NER in Indian
languages. It is concluded that exploiting 2 statistical models gives better
results than using only one.
[16] has used a hybrid approach with 2 main steps. First, a set of constraints
is generated for each type, such as Person, Location and Organization. These
are compiled by linguists and represented as finite state automata (FSA) to
generate the most likely candidates. Then, these candidates are assigned class
probabilities and a generative class model is created based on this.
Transliteration is used to identify foreign names.
2.4. Referencing gazetteer lists
This is the simplest and fastest method of named entity recognition. However,
since named entities are numerous and constantly evolving, this approach on its
own is not sufficient for effective NER. Nevertheless, [5] found that
incorporating gazetteer lists can significantly improve performance.
Gazetteer lists have been created in [1] using transliteration for month
names, days, common locations, first names, middle names, last names etc.
[2] has used gazetteer lists of measures (kilogram, lakh), numerals and
quantifiers (first, second) and time expressions (date, month, minutes, hours).
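A gazetteer lookup can be sketched as a longest-match dictionary search over the tokens. The entries, labels and function name below are hypothetical and are not taken from the cited systems; lowercasing before lookup is also an assumption of this sketch.

```python
# Hypothetical gazetteer: lowercase token tuples mapped to entity labels.
GAZETTEER = {
    ("new", "delhi"): "LOCATION",
    ("delhi",): "LOCATION",
    ("kilogram",): "MEASURE",
    ("january",): "TIME",
}
MAX_LEN = max(len(entry) for entry in GAZETTEER)


def gazetteer_tag(tokens):
    """Tag tokens by longest match against the gazetteer; unmatched tokens get O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so "New Delhi" wins over "Delhi".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in GAZETTEER:
                tags[i:i + n] = [GAZETTEER[key]] * n
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


gaz_tags = gazetteer_tag(["She", "reached", "New", "Delhi", "in", "January"])
print(gaz_tags)
```

The lookup is fast, but any name missing from the lists is silently left as O, which is why gazetteers tend to be used as one component or feature and not as a complete NER system.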
2.5. Other methods
A phonetic matching technique is used in [12] for NER in Indian languages, on
the basis of similar sounding names. They have used Stanford NER as the
reference entity database and have built a Hindi named entity database using a
phonetic matcher.
In [17], external resources such as Wikipedia infobox features are used to
infer entity names, along with a word clustering algorithm that partitions
words into classes based on their co-occurrence statistics in a large corpus.
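The idea of matching by sound can be illustrated with a deliberately crude phonetic key. This is not the matcher used in [12]: the consonant groups, the function names and the use of Latin-script transliterations are all assumptions made for the sake of a short example.

```python
# Hypothetical groups of similar-sounding consonants, each mapped to one code.
CODE = {}
for group, code in (("ckqg", "k"), ("dt", "t"), ("sz", "s"), ("bp", "p")):
    for ch in group:
        CODE[ch] = code


def phonetic_key(name):
    """Crude key: drop vowels and h/w/y, merge similar consonants and doubles."""
    key = []
    prev = None
    for ch in name.lower():
        if not ch.isalpha() or ch in "aeiouhwy":
            prev = None  # a dropped letter separates consonants
            continue
        code = CODE.get(ch, ch)
        if code != prev:  # collapse directly repeated sounds, e.g. "ll"
            key.append(code)
        prev = code
    return "".join(key)


def build_index(reference_names):
    """Group reference entity names by their phonetic key."""
    index = {}
    for name in reference_names:
        index.setdefault(phonetic_key(name), []).append(name)
    return index


index = build_index(["Delhi", "Mumbai", "Chennai"])
matches = index.get(phonetic_key("Dilli"), [])
print(matches)
```

A key this coarse will produce false matches on a real entity database; it is only meant to show how two spellings of the same name can be mapped to a shared key.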
3. Feature Selection
When it comes to feature selection, the available word and tag context plays a
major role. Many systems use binary features, which represent the presence or
absence of a given property of a word.
In [1], the following are used as features with the MaxEnt model: static words
(previous and next words), context lists (frequent words in a given window for
a particular class, e.g., Location class: city, going to), dynamic NE tag (NE
tag of the previous word), first word, contains digit, numerical characters,
affixes (word suffix, word prefix), root information of the word and Part of
Speech (POS) tag.
It is highlighted in [1] that a window of (w-2, w+2) gives the best results.
Further, [1] shows that using a more complex feature set does not guarantee
better results.
[2] has used language independent features such as a window of words (window
size 5), statistical suffixes for person and location entities (extracted as
lists and used as binary features), prefixes (to handle the agglutinative
nature/postpositions), start of sentence and presence of digits.
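Features of this kind can be sketched as a function that builds a feature dictionary for one token. The feature names, the 3-character prefix and the suffix list are hypothetical choices for illustration; a window size of 5 is read here as the current word plus two words on each side.

```python
def token_features(tokens, i, suffixes, window=2):
    """Build language independent features for tokens[i]."""
    word = tokens[i]
    feats = {
        "word": word,
        "is_first": i == 0,                                   # start of sentence
        "has_digit": any(ch.isdigit() for ch in word),        # presence of digits
        "has_known_suffix": any(word.endswith(s) for s in suffixes),  # binary
        "prefix3": word[:3],                                  # fixed-length prefix
    }
    # Surrounding words: offsets -window..+window, skipping the word itself.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        feats["w%+d" % offset] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    return feats


# Hypothetical sentence and suffix list.
sentence = ["Ramesh", "lives", "in", "Hyderabad", "since", "1999"]
location_suffixes = ("bad", "pur")
feats = token_features(sentence, 3, location_suffixes)
print(feats)
```

Dictionaries like this, one per token, are the kind of input that CRF and MaxEnt toolkits typically accept, after conversion to the toolkit's own feature format.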
Prefix and suffix information is used as features in [3], as Indian languages
are highly inflected (window size 5). In addition, previous word tags, rare
word (most frequent words in the language) and POS tags are used. Here, Oriya,
Urdu and Telugu have shown poor performance compared to Hindi and Bengali due
to poor language features.
In [4], a “majority tag” is used as an additional feature; it uses contextual
and frequency information of other tags that are literally similar to label an
unnamed tag.
The experimental results of [5] highlight that a [-3, +2] window gives the
optimal results, and that increasing the window size decreases the f-measure.
4. Recognizing different entity types
Entity type | Method | Challenges
Person name | Look-up procedure; analysing the local lexical context; looking at part of the sequence of candidate words (name components). Features: POS tags, capitalization, decimal digits, bag of words, left and right context, token length | Name variations (same person referred to by different names): reuse of name parts, morphological variants, prefixes etc.; transliteration differences. A person name can be a proper noun
Organization | Use organization specific candidate words | Various ways of representing abbreviations
Place | Using gazetteers and trigger words (e.g., New York City) | Homographic with common names, historical variants, exonyms (foreign variants), endonyms (local variants)
5. Summary and Conclusion
The approaches surveyed are CRF based methods, hybrid methods, chains of named
entity recognizers, and rule based methods used as a post processing mechanism.
6. References
[1] A Hybrid
Approach for Named Entity Recognition in Indian Languages: NER for South and
South East Asian Languages: IJCNLP-08 Workshop: 2008
[2] Aggregating
Machine Learning and Rule Based Heuristics for Named Entity Recognition: NER
for South and South East Asian Languages: IJCNLP-08 Workshop: 2008
[3] Language
Independent Named Entity Recognition in Indian Language: NER for South and South
East Asian Languages: IJCNLP-08 Workshop: 2008
[4] Named Entity
Recognition for Telugu: NER for South and South East Asian Languages: IJCNLP-08
Workshop: 2008
[5] Bengali
Named Entity Recognition using Support Vector Machine: NER for South and South
East Asian Languages: IJCNLP-08 Workshop: 2008
[6] Domain
Focused Named Entity Recognizer for Tamil Using Conditional Random Fields: NER
for South and South East Asian Languages: IJCNLP-08 Workshop: 2008
[7] A Character
n-gram Based Approach for Improved Recall in Indian Language NER: NER for South
and South East Asian Languages: IJCNLP-08 Workshop: 2008
[8] An
Experiment on Automatic Detection of Named Entities in Bangla: NER for South
and South East Asian Languages: IJCNLP-08 Workshop: 2008
[9] A Hybrid
Named Entity Recognition System for South Asian Languages: NER for South and
South East Asian Languages: IJCNLP-08 Workshop: 2008
[10] CRF++: Yet
Another CRF toolkit: http://crfpp.googlecode.com/svn/trunk/doc/index.html
[11] Named
Entity Recognition for South Asian Languages: NER for South and South East
Asian Languages: IJCNLP-08 Workshop: 2008
[12] Named
Entity Recognition for Indian Languages: NER for South and South East Asian
Languages: IJCNLP-08 Workshop: 2008
[13] Experiments
in Telugu NER: A Conditional Random Field Approach: NER for South and South
East Asian Languages: IJCNLP-08 Workshop: 2008
[14] NERA: Named
Entity Recognition for Arabic: Journal of the American Society for Information
Science and Technology: Volume 60 Issue 8, August 2009 Pages 1652-1663
[15] Integrated Machine Learning Techniques for Arabic Named
Entity Recognition: International Journal of Computer Science Issues (IJCSI):
Volume 7 Issue 4, July 2010 Pages 27-36
[16] Chinese Word Segmentation and Named Entity Recognition: A
Pragmatic Approach: Microsoft Research - China
[17] A Named Entity Labeler for German: exploiting Wikipedia
and distributional clusters