Research

Sinhala Named Entity Recognition - In progress... (LTRL:UCSC research initiative)


Named Entity Recognition (NER) is a significant method for extracting structured information from unstructured text and organising it in a semantically accurate form for further inference and decision making.

NER has been a key pre-processing step for most natural language processing applications, such as information extraction, machine translation, information retrieval, topic detection, text summarization and automatic question answering.

Because language characteristics are so diverse, NER can be considered a language- and domain-specific task. For languages such as English and German, named entity recognition has been an easier task than for Asian languages because of helpful orthographic features (e.g., proper nouns begin with a capital letter).

I have reviewed state-of-the-art mechanisms for Named Entity Recognition in different languages. I am currently working on a Named Entity Recognition solution for the Sinhala language using a hybrid of statistical and rule based methods. The main focus is to apply Conditional Random Fields (CRF) to the Named Entity Recognition task and to come up with a rich feature template for the Sinhala language.
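As a rough illustration of what a feature template captures, the sketch below builds per-token features with a small context window in Python. The function names `token_features` and `sentence_features`, the chosen features, and the placeholder tokens are assumptions made for illustration; they are not the project's actual CRF++ template or data.

```python
def token_features(tokens, i, window=1):
    """Build a feature dict for tokens[i], including neighbours within `window`."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "token": token,
        "prefix2": token[:2],    # leading characters (assumed useful, not verified for Sinhala)
        "suffix2": token[-2:],   # trailing characters, e.g. case markers
        "length": len(token),
        "is_digit": token.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Add the surrounding tokens, keyed by their relative offset (-1, +1, ...).
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"token[{offset:+d}]"] = tokens[j]
    return features


def sentence_features(tokens, window=1):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i, window) for i in range(len(tokens))]


# Placeholder tokens, not real Sinhala data.
sample = ["kamal", "colombo", "2015"]
feats = sentence_features(sample)
```

Feature dictionaries of this kind are what a CRF toolkit consumes, one per token, alongside the entity label for that token. Because Sinhala has no capitalisation cue, features such as affixes and neighbouring words would have to carry more of the signal.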

Tools and/ or Technologies: CRF++, Machine Learning, Natural Language Processing

Related blog posts:
Named Entity Recognition using Conditional Random Fields

Source code:
https://bitbucket.org/jaywith/sinhala-named-entity-recognition

Sensefy: Machine Learning: Topic Annotation (Zaizi R&D)


Topic modelling is a machine learning technique that annotates documents with automatically generated topics, with minimal human effort and no training of software.

Applications:
  • Document organization (Tag recommendations, extended taxonomy/ folksonomy, content indexing, meta content, semantic categories)
  • Document retrieval (Contextual search filtering, Enhanced faceted search, search recommendations)
  • Document analysis (Text analytics)

Related blog posts:
Infer topics for documents using LDA
Difference between topic modelling and clustering

Content Extraction and Context Inference based Information Retrieval Framework (BEng. final year research project - IIT/ University of Westminster)


At present, most information retrieval mechanisms consider only exact matches of textual metadata such as topics, manual tags and descriptions. These methods do not yet retrieve information at the level of relevance that human intuition expects. The main reason is that the relevance of the content and the context of the available data is not assessed in a unified manner. Extracting semantic content and inferring knowledge from low-level features has always been a major challenge owing to the well-known semantic gap. This research project strives to overcome this difficulty with a framework based on machine learning and knowledge representation, in which the right information can be retrieved regardless of content format or contextual discrepancies. Given that information can be embedded in any content format, the proposed framework analyses the content and provides a set of content and context descriptors that can be used in any information retrieval application.

Demo:
http://www.youtube.com/watch?v=T8jM74LT0BE

Tools and/ or Technologies: Image and video processing, Machine Learning, Knowledge Representation, Semantic networks, EmguCV, WordNet

Related blog posts:
Content Extraction and Context Inference based Information Retrieval Framework

Source code:
https://bitbucket.org/jaywith/content-extraction-and-context-inference-based-information

Automated Database Standards Checker (Pearson: Database Development Initiative)


Designed and developed a unified framework that automates the database code review process, evaluating customized coding standards through a rule based mechanism built on a SQL language parser.
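The rule based idea can be sketched in a few lines of Python. This is a simplified stand-in rather than the project's implementation: it matches regular expressions line by line instead of walking a parse tree from an ANTLR-generated SQL parser, and the rule names, patterns and the `check_sql` function are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    message: str


# Hypothetical coding standards, chosen only to illustrate the mechanism.
RULES = [
    Rule("no-select-star",
         re.compile(r"\bSELECT\s+\*", re.IGNORECASE),
         "List columns explicitly instead of SELECT *."),
    Rule("no-nolock-hint",
         re.compile(r"\bWITH\s*\(\s*NOLOCK\s*\)", re.IGNORECASE),
         "Avoid the NOLOCK table hint."),
]


def check_sql(source, rules=RULES):
    """Return (line_number, rule_name, message) for every rule violation."""
    violations = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for rule in rules:
            if rule.pattern.search(line):
                violations.append((line_number, rule.name, rule.message))
    return violations


script = "SELECT *\nFROM Orders WITH (NOLOCK)\nWHERE Id = 1;"
report = check_sql(script)
for line_number, name, message in report:
    print(f"line {line_number}: [{name}] {message}")
```

Line-based patterns like these will also fire inside comments and string literals and cannot see constructs that span several lines, which is the kind of limitation a real SQL parser avoids.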

Tools and/ or Technologies: ANTLR, C#, SQL Server 2008, Natural Language Processing (NLP), Parsing

Related blog posts:
Using ANTLR for language parsing