Category Archives: Data

Named Entity Recognition (NER) Problem – Not Yet Solved

Automated recognition of named entities, a fundamental problem in Computational Linguistics, benefits many language processing applications, e.g., question answering and information extraction. For example, the ability to accurately identify entities (e.g., PERSON (PER), LOCATION (LOC), ORGANIZATION (ORG), and MISCELLANEOUS (MISC)) is a prerequisite for identifying semantic relations between those entities. Spiderbook is a startup whose objective is to automatically extract vital information about businesses, their employees, and all the interrelationships between them in order to help salespeople sell. These interrelationships include competition, partnership, acquisition, and so on. Spiderbook’s Natural Language Processing (NLP) team is perfecting its ability to extract such relationships effectively from a massive number of documents on the Web, and it depends on an NER system to identify all possible companies/organizations in text before acquiring the above-mentioned relationships between them.

 

Amazon VS Amazon

Considering the importance of the NER problem, several research efforts in academia have sought a better solution to it [1, 2, 3]. Two notable systems proposed by NLP researchers are the Illinois NER and the Stanford NER [1, 2]. Both systems identify PER, LOC, ORG, and MISC entities in text. On the CoNLL03 evaluation data, a collection of Reuters 1996 news articles, the Illinois NER has been shown to achieve an F1 score 3.94% higher than the Stanford NER, reaching 90.80%. Both the Illinois and Stanford NER systems produce their output over the sequence of words in a sentence. For example, on the sentence “Barack Hussein Obama is the 44th and current President of the United States,” both NER systems produce the following output:

<PER>Barack Hussein Obama</PER> is the 44th and current President of the <LOC>United States</LOC>.

In order to determine whether a word wi is an entity or part of an entity, both NER systems rely on a number of linguistic features, e.g., the system’s outputs on the two preceding words and whether wi is capitalized. An interesting feature utilized by both systems is the non-local feature for each entity, which ensures that all words representing the same entity in a document are tagged with the same output. For example, “United States”, “USA”, “US”, “States”, and “the United States of America” should all yield the same output, i.e., LOC. For enhanced performance, Ratinov and Roth (2009) [1] incorporated external knowledge into the Illinois NER in the form of highly precise gazetteers with wide coverage; these gazetteers contain 1.5 million entries, with names of people, locations, and organizations. In addition, they used a clustering algorithm to assign the same output to similar entities. For example, Apple Inc. and IBM are similar entities and should be assigned the same output (i.e., ORG).
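
As a rough illustration of the per-token features described above, the sketch below builds a feature dictionary for a word wi from its casing, the two preceding predictions, and a gazetteer lookup. The feature names and the toy gazetteer are illustrative assumptions; this is not the Illinois or Stanford implementation.

    # Minimal sketch of per-token NER features (illustrative only;
    # not the Illinois or Stanford feature set).
    def token_features(words, i, prev_tags, gazetteer):
        """Return a feature dict for word wi given the two preceding predictions."""
        w = words[i]
        return {
            "word.lower": w.lower(),
            "word.is_capitalized": w[:1].isupper(),
            "word.is_all_caps": w.isupper(),
            "prev_tag_1": prev_tags[-1] if len(prev_tags) >= 1 else "BOS",
            "prev_tag_2": prev_tags[-2] if len(prev_tags) >= 2 else "BOS",
            "in_gazetteer": w.lower() in gazetteer,  # external knowledge, cf. [1]
        }

    if __name__ == "__main__":
        sentence = ("Barack Hussein Obama is the 44th and current "
                    "President of the United States").split()
        gazetteer = {"barack", "obama", "united", "states"}  # toy stand-in for 1.5M-entry lists
        print(token_features(sentence, 0, [], gazetteer))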

Although the Illinois NER achieves a very high F1 score on the CoNLL03 data set, it also needs to perform well on web pages. Spiderbook’s system extracts business relationships between companies or organizations collected from web pages, and error analysis of that system shows that around 28% of the errors in the extracted relationships are due to erroneous outputs from the NER systems. Table 1 shows some examples of errors made by the Illinois and Stanford NER systems, which hurt the performance of Spiderbook’s relation extraction.

 

Table 1. Outputs of Illinois NER and Stanford NER

In example 1 above, there are two organizations, Proctor & Gamble and Terra Technology, in a supplier relationship with each other (i.e., Terra Technology supplies its tool to Proctor & Gamble). This relationship is missed because the NER systems fail to recognize that Proctor & Gamble and Terra Technology are different companies. In fact, example 1 has a simple Subject-Verb-Object structure, which goes unnoticed by both NER systems because all the words in the sentence start with uppercase letters. Once the case of the verb “adopts” is fixed, both systems recognize that Proctor & Gamble and Terra Technology are different companies (see example 2 of Table 1). These two examples reveal that a module for restoring the correct case of words in a sentence is critically needed to improve NER output. Another error made by the NER systems is the assignment of the ORG label to the whole expression “Terra Technology’s Demand Forecasting Tool”; the possessive apostrophe in example 2 shows that Terra Technology is a company and the Demand Forecasting Tool is its product, so an NER system needs to understand how apostrophes work. Example 3 of Table 1 depicts a similar problem, where both NER systems fail to detect that the expression “v.” reveals competition between two entities and thus that “Allstate Fire and Casualty Insurance Company” and “Indemnity” are different entities.
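
Because both of the first two examples hinge on restoring word case, here is a minimal sketch of what such a truecasing module might look like. The frequency-lexicon heuristic and the toy corpus are assumptions for illustration, not a component of either NER system.

    # Minimal truecasing sketch, assuming access to a case-frequency lexicon
    # built from well-edited text: each word is rewritten to its most
    # frequently observed casing.
    from collections import Counter, defaultdict

    def build_case_lexicon(corpus_sentences):
        counts = defaultdict(Counter)
        for sent in corpus_sentences:
            for w in sent.split():
                counts[w.lower()][w] += 1
        return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}

    def truecase(sentence, lexicon):
        return " ".join(lexicon.get(w.lower(), w) for w in sentence.split())

    if __name__ == "__main__":
        corpus = ["The company adopts a new tool", "Procter & Gamble is a company"]
        lexicon = build_case_lexicon(corpus)
        # "Adopts" is lowered because the corpus only saw it lowercased.
        print(truecase("Proctor & Gamble Adopts Terra Technology Tool", lexicon))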

 

Although both NER systems use gazetteers to collect information about the names of entities, they apparently do not know that Telx in example 4 is a telecommunications company. Spiderbook therefore needs a more encompassing gazetteer in order to acquire information about all companies.

 

In example 5, the company Omni Travels has a partnership relationship with the companies Apple Vacations, NCCL, and SKA-Arabia. The Illinois NER, however, failed to detect the relationships of Omni Travels with Apple Vacations and SKA-Arabia, because neither was identified as a company/organization. NER systems need to be smarter and detect that all three companies appear in a list structure (i.e., “entity 1, entity 2 and entity 3”) and should therefore all be assigned the same output, i.e., ORG.
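
One simple way to exploit this list structure is sketched below, under the assumption that the coordinated span and per-item labels are already available: propagate the ORG label from any recognized item to its siblings. The regular expression and tagging format are illustrative assumptions, not how the Illinois NER actually works.

    # Minimal sketch of the list-structure heuristic described above: if one
    # item in a coordinated "X, Y and Z" sequence is tagged ORG, propagate
    # ORG to its siblings.
    import re

    LIST_PATTERN = re.compile(r"(.+?),\s*(.+?)\s+and\s+(.+)")

    def propagate_org(span_text, tags):
        """tags maps each list item to its current NER label (or None)."""
        match = LIST_PATTERN.match(span_text)
        if not match:
            return tags
        items = [item.strip() for item in match.groups()]
        if any(tags.get(item) == "ORG" for item in items):
            for item in items:
                tags[item] = "ORG"
        return tags

    if __name__ == "__main__":
        span = "Apple Vacations, NCCL and SKA-Arabia"
        tags = {"Apple Vacations": None, "NCCL": "ORG", "SKA-Arabia": None}
        print(propagate_org(span, tags))  # all three items end up as ORG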

 

The above analysis of NER outputs reveals that NLP researchers need to work on restoring the correct case of words, understanding the linguistic structure of sentences, and increasing the size of gazetteers in order to improve entity recognition. Addressing these issues with NER systems would allow Spiderbook’s system to achieve better extraction of business relations.

 

Mehwish Riaz
NLP Scientist – Spiderbook
 

[1] L. Ratinov and D. Roth. Design Challenges and Misconceptions in Named Entity Recognition. CoNLL 2009.

[2] J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. ACL 2005.

[3] R. K. Ando and T. Zhang. A High-Performance Semi-Supervised Learning Method for Text Chunking. ACL 2005.


Gathering The Most Meaningful Data For Your Company

KillerStartups logo

Gathering data for sales is all about creating a better relationship with your customers and increasing sales. You can establish trust and a personal connection by checking out their social profiles, understand their needs by looking at their job posts, keep up on the latest developments by subscribing to Google Alerts, and understand their business priorities and risks by skimming their SEC filings.

 

The biggest thing to remember, though, is not to fall into the “Big Data Black Hole,” where nothing escapes and data isn’t useful.

There is often more data available about your customers and their companies than a salesperson can (or should) look at. Data can be distilled in different ways, but everyone needs access to a minimal data set (a rough sketch of such a record follows the list) that includes:

  • Type of industry
  • Amount of revenue
  • Employee count
  • Location
  • Key management
  • Interactions
  • Network
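
For teams that store this information themselves, the record below is one rough sketch of such a minimal data set; the field names and types are illustrative assumptions, not a Spiderbook schema.

    # Illustrative sketch of a minimal account record covering the data
    # points listed above (hypothetical field names, not a Spiderbook schema).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class AccountProfile:
        industry: str                 # type of industry
        annual_revenue_usd: float     # amount of revenue
        employee_count: int
        location: str
        key_management: List[str] = field(default_factory=list)
        interactions: List[str] = field(default_factory=list)  # e.g., emails, meetings
        network: List[str] = field(default_factory=list)       # related companies/contacts

    if __name__ == "__main__":
        acme = AccountProfile(
            industry="Telecommunications",
            annual_revenue_usd=25_000_000,
            employee_count=180,
            location="San Francisco, CA",
            key_management=["Jane Doe (CEO)"],
        )
        print(acme)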

Beyond this minimal set, the data you need really depends on the product you’re selling or the type of business relationship you want to establish. What’s more, the data you gather doesn’t have to be about leads; there’s a lot of other useful information out there. For example, where did your customers hear about you first? Which social media sites should you focus on?


 


Welcome to the Most Innovative Business for Data Solutions


We’re excited to announce that Spiderbook has just won the DataWeek + API World 2014 award for Most Innovative Business Data Solution! Spiderbook was selected by crowd vote, where thousands of DevNetwork community members voted on the top Data + API technologies of 2014.

“This year’s DataWeek + API World crowd vote was our most active awards voting yet, with over 100 nominated technologies! What makes many award recipients this year stand out is the number of IT / infrastructure tools that are now available to developers or executives “as-a-service”. This shows how revolutionary Infrastructure-as-a-Service will be” – Geoff Domoracki, Founder of DataWeek + API World

As DataWeek + API World award recipients, our team will be attending and participating in this year’s conference & expo. We’re offering 50 free OPEN passes (for the Expo, Keynotes, and Open Talks) to our community; register here:

https://dataweek14.eventbrite.com/?discount=dw14-award-guest

About DataWeek + API World 2014

DataWeek + API World 2014 Conference & Expo (Sept 13-17) is San Francisco’s largest Data + API conference of 2014, where you can attend 100+ talks led by executives and interact with 200+ new data & API technologies. DataWeek + API World includes speakers from Google, IBM, LinkedIn, The Economist, ReadWrite, HP, Dun & Bradstreet, Leap Motion, Visual.ly, Oauth.io, and hundreds more, covering topics across Big Data, Data Science-as-a-Service, API Design, Data Visualization, Connected Cars, and the Internet of Things.
