Named Entity Recognition (NER) Problem – Not Yet Solved

Automated recognition of named entities, a fundamental problem in Computational Linguistics, benefits various language processing applications, e.g., question answering and information extraction. For example, the ability to accurately identify entities (e.g., PERSON (PER), LOCATION (LOC), ORGANIZATION (ORG), and MISCELLANEOUS (MISC)) is a prerequisite for identifying semantic relations between these entities. Spiderbook is a startup whose objective is to automatically extract vital information about businesses, their employees, and all the interrelationships between them in order to help salespeople sell. These interrelationships are of types such as competition, partnership, and acquisition. Spiderbook’s Natural Language Processing (NLP) team is perfecting its ability to extract such relationships effectively from a massive number of documents on the Web. The team depends on an NER system to identify all possible companies/organizations in text before it can acquire the relationships between them.

[Figure: Amazon VS Amazon]

Considering the importance of the NER problem, several research efforts have been made in academia to derive better solutions [1, 2, 3]. Two notable systems proposed by NLP researchers are the Illinois NER and the Stanford NER [1, 2]. Both systems are capable of identifying PER, LOC, ORG, and MISC entities in text. On the CoNLL03 evaluation data, a collection of Reuters 1996 news articles, the Illinois NER achieved a 90.80% F1 score, 3.94% higher than the Stanford NER. Both the Illinois and Stanford NER systems produce their outputs on the sequence of words in a sentence. For example, on the sentence “Barack Hussein Obama is the 44th and current President of the United States” both NER systems produce the following output:

<PER>Barack Hussein Obama</PER> is the 44th and current President of the <LOC>United States</LOC>.
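
As a minimal sketch of how such per-token tags can be obtained programmatically, the snippet below runs the Stanford NER through NLTK’s wrapper. It assumes a local installation of the Stanford NER (which requires Java); the model and jar paths are placeholders for that local install, not part of either paper.

    # Minimal sketch: tagging one sentence with the Stanford NER via NLTK.
    # The model and jar paths below are assumptions about a local installation.
    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(
        'classifiers/english.conll.4class.distsim.crf.ser.gz',  # assumed local path
        'stanford-ner.jar')                                     # assumed local path

    sentence = ('Barack Hussein Obama is the 44th and current '
                'President of the United States')
    print(tagger.tag(sentence.split()))
    # Roughly: Barack/PERSON Hussein/PERSON Obama/PERSON ... United/LOCATION
    # States/LOCATION, with the remaining tokens tagged O.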

In order to determine whether a word wi is an entity or part of an entity, both NER systems rely upon a number of linguistic features, e.g., the system’s outputs on the two preceding words and whether the word wi is capitalized. An interesting feature utilized by both systems is the non-local feature for each entity. This feature ensures that all words representing the same entity in a document are tagged with the same output. For example, the words “United States”, “USA”, “US”, “States”, and “the United States of America” should all result in the same output, i.e., LOC. For enhanced performance, Ratinov and Roth (2009) [1] incorporated external knowledge in the Illinois NER in the form of highly precise gazetteers with wide coverage. These gazetteers contain 1.5 million entities, with names of people, locations, and organizations. In addition, they also used a clustering algorithm to assign the same output to similar entities. For example, Apple Inc. and IBM are similar entities and should be given the same output (i.e., ORG).
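
To make the token-level features concrete, here is a small, hypothetical sketch that computes a capitalization feature and the predicted tags of the two preceding words. It is an illustration of the idea, not the actual feature extractor of the Illinois or Stanford NER.

    # Hypothetical sketch of token-level NER features: capitalization plus the
    # system's outputs (predicted tags) on the two preceding words.
    def token_features(words, position, previous_tags):
        word = words[position]
        return {
            'word': word.lower(),
            'is_capitalized': word[0].isupper(),
            'prev_tag_1': previous_tags[position - 1] if position >= 1 else 'START',
            'prev_tag_2': previous_tags[position - 2] if position >= 2 else 'START',
        }

    words = 'Barack Hussein Obama is the 44th President'.split()
    tags = ['PER', 'PER']  # tags predicted so far for the first two words
    print(token_features(words, 2, tags))
    # {'word': 'obama', 'is_capitalized': True, 'prev_tag_1': 'PER', 'prev_tag_2': 'PER'}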

Although the Illinois NER performs with a very high F1 score on the CoNLL03 data set, for Spiderbook’s purposes it also needs to perform well on web pages, since Spiderbook’s system extracts business relationships between companies or organizations collected from web pages. Through error analysis of Spiderbook’s system, we noticed that around 28% of the errors in the extracted relationships are due to erroneous outputs by the NER systems. Table 1 shows some examples of errors made by the Illinois and Stanford NER systems that affect the performance of Spiderbook’s relation extraction system.


Table 1. Outputs of Illinois NER and Stanford NER

In example 1 above, there are two organizations, Proctor & Gamble and Terra Technology, in a supplier relationship with each other (i.e., Terra Technology supplies its tool to Proctor & Gamble). This relationship is missed because the NER systems fail to identify that Proctor & Gamble and Terra Technology are different companies. In fact, example 1 has a simple Subject-Verb-Object structure, which goes unnoticed by both NER systems because all the words in the sentence start with uppercase letters. Upon fixing the case of the verb “adopts”, both systems recognize that Proctor & Gamble and Terra Technology are different companies (see example 2 of Table 1). These two examples reveal that a module to correct the case of words in a sentence is critically needed to improve the NER output. Another error made by the NER systems is the assignment of the ORG label to the entire expression “Terra Technology’s Demand Forecasting Tool”. The apostrophe in example 2 shows that Terra Technology is a company and the Demand Forecasting Tool is a product of this company; an NER system needs to understand how possessive apostrophes delimit entity boundaries. Example 3 of Table 1 depicts a similar problem where both NER systems fail to detect that the expression “v.” reveals competition between two entities and thus that “Allstate Fire and Casualty Insurance Company” and “Indemnity” are different entities.
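
As a first approximation, the case-correction module suggested above could be a simple truecasing step. The sketch below is a hypothetical heuristic (not a component of either NER system) that lowercases tokens which almost always appear in lowercase in a large corpus; the ratios shown are toy values for illustration.

    # Hypothetical truecasing heuristic: lowercase a token in an all-capitalized
    # sentence if corpus statistics show it almost always appears in lowercase.
    # The counts below are toy values; a real module would be estimated from a
    # large body of well-edited text.
    lowercase_ratio = {
        'adopts': 0.999,   # verbs and function words are rarely capitalized
        'proctor': 0.02,   # parts of company names are usually capitalized
        'gamble': 0.40,
        'terra': 0.05,
    }

    def truecase(tokens, threshold=0.95):
        fixed = []
        for token in tokens:
            ratio = lowercase_ratio.get(token.lower(), 0.0)
            fixed.append(token.lower() if ratio >= threshold else token)
        return fixed

    print(truecase('Proctor & Gamble Adopts Terra Technology'.split()))
    # ['Proctor', '&', 'Gamble', 'adopts', 'Terra', 'Technology']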


Although both NER systems use gazetteers to collect information about the names of entities, these systems apparently do not know that Telx in example 4 is a telecommunications company. Thus, Spiderbook needs a more encompassing gazetteer in order to acquire information about all companies.
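
A wider-coverage gazetteer can be exposed to the tagger as a simple membership feature. The sketch below is a hypothetical illustration of such a lookup; the gazetteer entries are examples only, not Spiderbook’s actual resource.

    # Hypothetical gazetteer feature: does a short span starting at the current
    # token match a known organization name? Toy gazetteer for illustration.
    org_gazetteer = {'telx', 'proctor & gamble', 'terra technology', 'apple vacations'}

    def gazetteer_match(tokens, start, max_span=3):
        """Return the length of the longest gazetteer entry starting at `start`, or 0."""
        for length in range(min(max_span, len(tokens) - start), 0, -1):
            span = ' '.join(tokens[start:start + length]).lower()
            if span in org_gazetteer:
                return length
        return 0

    tokens = 'Telx expands its data center footprint'.split()
    print(gazetteer_match(tokens, 0))   # 1 -> "Telx" is a known organization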


In example 5, the company Omni Travels has a partnership relationship with the companies Apple Vacations, NCCL, and SKA-Arabia. But the Illinois NER failed to detect the relationships of Omni Travels with Apple Vacations and SKA-Arabia, because Apple Vacations and SKA-Arabia were not identified as companies/organizations. NER systems need to be smarter in order to detect that all three companies appear in a list structure, i.e., “entity 1, entity 2 and entity 3”, and should therefore all be assigned the same output, i.e., ORG.
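
One simple post-processing idea, sketched below under the assumption that NER tags are already available for the coordinated phrases, is to propagate the ORG label across items of a “A, B and C” list when at least one item is already tagged ORG. This is a hypothetical heuristic for illustration, not a feature of either system.

    # Hypothetical post-processing heuristic: if phrases appear in a coordinated
    # list ("A, B and C") and at least one of them is already tagged ORG,
    # propagate ORG to the other items.
    def propagate_org(list_items, tags):
        """list_items: phrases from a coordination; tags: their current NER tags."""
        if 'ORG' in tags:
            return ['ORG'] * len(list_items)
        return tags

    items = ['Apple Vacations', 'NCCL', 'SKA-Arabia']
    tags = ['O', 'ORG', 'O']            # only NCCL was recognized
    print(propagate_org(items, tags))   # ['ORG', 'ORG', 'ORG']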


The above analysis of the outputs of the NER systems reveals that NLP researchers need to pay attention to correcting the case of words, understanding the linguistic structure of sentences, and increasing the size of gazetteers in order to improve performance in entity recognition. Spiderbook’s system could achieve better business relation extraction if the above-mentioned issues with NER systems were addressed.


Mehwish Riaz
NLP Scientist – Spiderbook

[1] L. Ratinov and D. Roth. Design Challenges and Misconceptions in Named Entity Recognition. CoNLL 2009.

[2] J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005.

[3] R. K. Ando and T. Zhang. A High-Performance Semi-Supervised Learning Method for Text Chunking. ACL 2005.
