Lifting the Hood on NLTK's NE Chunker
Python has a wonderful open-source library for natural language processing (NLP) called the Natural Language Toolkit (NLTK). This article is written for NLTK v3.0.5. NLTK gives you many features out of the box. For example, you can obtain the part-of-speech (POS) tags of the words in a sentence with the following code:
import nltk
sentence = 'This is a sentence.'
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)
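Assuming the required tokenizer and tagger models have been downloaded via nltk.download(), running this should print something like:

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]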
The focus of this writing is NLTK's NE (named entity) chunker, which I will abbreviate as NEC. A named entity is something like Wal-Mart, Virginia, or Barack Obama.
What a named entity is not is something like store, walked, or saw. A chunk is defined as a substring of the text that cannot overlap another chunk. Consider the following sentence:
President Barack Obama is the 44th President of the United States of America.
In this example, you'll never end up receiving both a chunk Obama and a chunk Barack Obama. That can't happen, because here they would overlap. So the process of NE chunking is identifying the chunks in a text which are NEs. Let's demonstrate NLTK's NEC in action:
tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')
tagged_tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged_tokens)
print(tree)

tokens = nltk.word_tokenize('I hate Apple products.')
tagged_tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged_tokens)
print(tree)
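Printing the two trees should give you something along these lines (the exact POS tags may vary with your tagger models; the point is that Apple ends up inside a NE subtree only in the first sentence):

(S
  I/PRP am/VBP very/RB excited/VBN about/IN the/DT next/JJ
  generation/NN of/IN (GPE Apple/NNP) products/NNS ./.)
(S I/PRP hate/VBP Apple/NNP products/NNS ./.)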
These examples were chosen for a reason. Notice that in both examples the word Apple was POS tagged as NNP (the UPenn TreeBank II tag for a singular proper noun), and in both it begins with an upper-case letter. So why did the NEC tag Apple as a NE in the first example but not in the second? And why did Apple get tagged as a GPE (geo-political entity)?
Machine Learning
NLTK's NEC works by using a supervised machine learning algorithm known as a MaxEnt classifier. A MaxEnt classifier gets its name from maximum entropy: for a discrete probability distribution, entropy is maximized when the distribution is uniform. A MaxEnt classifier is equivalent to logistic regression; the difference is theoretical, because the MaxEnt derivation assumes maximum entropy and derives the sigmoid function, whereas the logistic regression derivation assumes the sigmoid function from the start [J. Mount].

This machine learning model uses data from a corpus that has been manually annotated for NEs. A person, called an annotator, reads sentence after sentence and manually marks where the NEs occur in the text. This is, of course, a very tedious task, so it is no wonder that most annotated corpora are not distributed for free. In fact, NLTK does not provide the corpora the NEC was trained on (it was trained on data from ACE, the Automatic Content Extraction program). What the authors did provide, however, is a pickle file (a Python serialized object) trained on that data. This pickle file is a freeze-dried instance of the statistics needed by the MaxEnt classifier.
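To make the connection with logistic regression concrete, here is a toy sketch of how a MaxEnt model turns weighted features into label probabilities via a softmax. This is not NLTK's internal code; the feature names, labels, and weights are made up purely for illustration:

import math

# Hypothetical per-label weights for two made-up features.
# In a real MaxEnt model these weights are learned from the annotated corpus.
weights = {
    'B-GPE': {'shape=capitalized': 1.2, 'pos=NNP': 0.8},
    'O':     {'shape=capitalized': 0.1, 'pos=NNP': 0.3},
}

def maxent_probabilities(features, weights):
    # Score each label by summing the weights of the active features,
    # then normalize with a softmax to get a probability distribution.
    scores = {label: sum(w.get(f, 0.0) for f in features)
              for label, w in weights.items()}
    total = sum(math.exp(s) for s in scores.values())
    return {label: math.exp(s) / total for label, s in scores.items()}

print(maxent_probabilities({'shape=capitalized', 'pos=NNP'}, weights))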
A note I'd like to add is that NLTK does provide NE-annotated data, found in corpora/ieer. However, unless that data is a good representation of the data you want to classify, I wouldn't recommend using it. Also, you will have to write your own feature extractor for it, because the IEER format is different from the ACE format.
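If you want to take a look at that data yourself, it can be loaded along these lines (assuming the ieer corpus has been fetched with nltk.download('ieer'); the file id below is one of the bundled New York Times samples):

from nltk.corpus import ieer

# Each parsed document exposes its text as a Tree containing NE subtrees.
docs = ieer.parsed_docs('NYT_19980315')
print(docs[0].text)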
Features
A big part of building a good supervised ML model is identifying which features will work best. A feature is something you compute statistics on. In NE recognition, for example, one feature could be whether the word contains an upper-case letter (note: for Twitter data, this could be a bad feature). So what features are used in NLTK's NEC? I've listed them below (a simplified sketch of a feature extractor along these lines follows the list):
- The shape of the word (e.g., does it contain numbers? does it begin with a capital letter?)
- The length of the word
- The first three letters of the word
- The last three letters of the word
- The POS tag of the word
- The word itself
- Does the word exist in an English dictionary?
- The tag of the word that precedes this word (i.e., was the previous word identified as a NE)
- The POS tag of the preceding word
- The POS tag of the following word
- The word that precedes this word
- The word that follows this word
- The word combined with the POS tag of the following word
- The POS tag of the word combined with the tag of the preceding word
- The shape of the word combined with the tag of the preceding word
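To get a feel for what these features look like in practice, here is a simplified, illustrative feature extractor in the same spirit. This is not NLTK's actual implementation: the feature names and the shape logic are my own approximations, and features that depend on the previous NE tag or an English dictionary are omitted for brevity:

def simple_ne_features(tagged_tokens, index):
    # tagged_tokens is a list of (word, pos) pairs; index is the position
    # of the word we want features for.
    word, pos = tagged_tokens[index]

    def shape(w):
        # A crude approximation of a word-shape feature.
        if w.isdigit():
            return 'number'
        elif w[0].isupper():
            return 'upcase'
        elif any(c.isdigit() for c in w):
            return 'mixedcase'
        return 'lowcase'

    prev_word, prev_pos = (('<START>', '<START>') if index == 0
                           else tagged_tokens[index - 1])
    next_word, next_pos = (('<END>', '<END>') if index == len(tagged_tokens) - 1
                           else tagged_tokens[index + 1])

    return {
        'shape':        shape(word),
        'wordlen':      len(word),
        'prefix3':      word[:3].lower(),
        'suffix3':      word[-3:].lower(),
        'pos':          pos,
        'word':         word.lower(),
        'prevword':     prev_word.lower(),
        'nextword':     next_word.lower(),
        'prevpos':      prev_pos,
        'nextpos':      next_pos,
        'word+nextpos': word.lower() + '+' + next_pos,
    }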
Lifting the Hood
As you can see, the list is long, and it is hard to guess intuitively how the NEC will behave in different situations. To lift the hood, I've written some code using methods available in NLTK to gain insight into why the NEC performs the way it does on different sentences:
# Load the serialized NEChunkParser object
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The MaxEnt classifier
maxEnt = chunker._tagger.classifier()

def maxEnt_report():
    maxEnt = chunker._tagger.classifier()
    print("These are the labels used by the NLTK's NEC:\n")
    print(maxEnt.labels())
    print("These are the most informative features found in the ACE corpora:\n")
    maxEnt.show_most_informative_features()

def ne_report(sentence, report_all=False):
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = nltk.pos_tag(tokens)
    tags = []
    for i in range(len(tagged_tokens)):
        featureset = chunker._tagger.feature_detector(tagged_tokens, i, tags)
        tag = chunker._tagger.choose_tag(tagged_tokens, i, tags)
        if tag != 'O' or report_all:
            print("\nExplanation of why the word '" + tagged_tokens[i][0] + "' was tagged:")
            maxEnt.explain(featureset)
        tags.append(tag)
The first function, maxEnt_report(), just displays information specific to the MaxEnt classifier. Here's how it works. If you execute:
maxEnt_report()
The first part of the output lists the labels used by NLTK's NEC. The 'B-' and 'I-' prefixes and the 'O' label require some explaining. They come from a form of tagging known as IOB (inside, outside, begin) tagging. When a chunk begins, its first word is prefixed with 'B-' to indicate that this word is the beginning of a chunk. The next word, if it belongs to the same chunk, is prefixed with 'I-' to indicate that it is part of the chunk but not its beginning. If a word does not belong to any chunk, it is labeled 'O', meaning it is outside. The purpose of this notation is to satisfy the definition of a chunk.
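To make the notation concrete, here is how the sentence from the beginning of the article would be annotated in IOB form (using the ACE label names; the exact chunk boundaries a trained model produces may differ):

President/O Barack/B-PERSON Obama/I-PERSON is/O the/O 44th/O President/O of/O the/O
United/B-GPE States/I-GPE of/I-GPE America/I-GPE ./O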
Here's how the next function, ne_report(), works:
ne_report('I am very excited about the next generation of Apple products.')
It outputs explanations only for words that were tagged as a NE (if you want output for all words, regardless of whether they are NEs, set the second argument to True). I want to point out something here: the probabilities listed in the final row do not add up to one (they add up to roughly 0.89). This is because the output only displays the top four candidate labels.
Try running this on the sentence:
I hate Apple products
and see if you can identify the features which caused Apple to be missed as a NE.
References
[J. Mount] J. Mount, The Equivalence of Logistic Regression and Maximum Entropy Models.