# Lifting the Hood on NLTK's NE Chunker

by Matt Johnson - Mon 23 May 2016

Python has a wonderful open-source library for NLP (natural language processing): the Natural Language Toolkit (NLTK). This article is written for NLTK v3.0.5. NLTK gives you many features out of the box. For example, you can obtain part-of-speech (POS) tags for the words in a sentence with the following code:

In [1]:
import nltk
sentence = 'This is a sentence.'
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]


The focus of this article is the NLTK's NE (named entity) chunker, which I will abbreviate as NEC. A named entity is something like Wal-Mart, Virginia, or Barack Obama.

A named entity is not something like store, walked, or saw. A chunk is a substring of text that cannot overlap another chunk. Consider the following sentence:

President Barack Obama is the 44th President of the United States of America.

In this example, you will never receive both Obama and Barack Obama as chunks, because they would overlap. NE chunking, then, is the process of identifying chunks in text that are NEs. Let's see the NLTK's NEC in action:

In [2]:
tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')
tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tokens)
tree

Out[2]:
[Figure: parse tree in which Apple is chunked as a GPE]

In [3]:
tokens = nltk.word_tokenize('I hate Apple products.')
tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tokens)
tree

Out[3]:
[Figure: parse tree in which Apple is not chunked as a NE]

These examples were chosen for a reason. Notice that in both examples, the word Apple was POS tagged as NNP (the UPenn Treebank II tag for a proper, singular noun), and in both it begins with an upper-case letter. So why did the NEC tag Apple as a NE in the first example, but not the second? Also, why did Apple get tagged as a GPE (geo-political entity)?

# Machine Learning

The NLTK's NEC works by using a supervised machine learning algorithm known as a MaxEnt classifier. A MaxEnt classifier gets its name from maximum entropy. For a discrete probability distribution, maximum entropy is obtained when the distribution is uniform. A MaxEnt classifier is equivalent to (multinomial) logistic regression; the difference lies in the derivation. In the MaxEnt derivation, you assume maximum entropy and derive the sigmoid function, whereas in the logistic regression derivation, you assume the sigmoid function from the start [J. Mount].
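To make the entropy claim concrete, here is a minimal sketch in plain Python. It is not part of the NLTK; the function name `entropy` and the three example distributions are made up for illustration. It computes Shannon entropy in bits and compares a uniform distribution over four labels with two less uniform ones.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution given as a list of probabilities."""
    # Terms with p == 0 contribute nothing, so they are skipped to avoid log(0).
    return sum(-p * math.log(p, 2) for p in probs if p > 0)

# Three hypothetical distributions over four labels.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print("%-8s entropy = %.3f bits" % (name, entropy(dist)))
```

The uniform case reaches 2 bits, the most any distribution over four outcomes can have, while the distribution that is certain of one label has zero entropy.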

This machine learning model uses data from a corpus that has been manually annotated for NEs. A person, called an annotator, reads sentence after sentence and manually marks where the NEs are found in the text. This is, of course, a very tedious task, so it is no wonder that most annotated corpora are not distributed for free. In fact, the NLTK does not provide the corpora the NEC was trained on (it was trained on data from ACE, Automatic Content Extraction). What the authors did provide, however, is a pickle file (a serialized Python object) trained on this data. This pickle file is a freeze-dried instance of the statistics needed for the MaxEnt classifier.

One note: the NLTK does provide NE-annotated data in corpora/ieer. However, unless that data is a good representation of the data you want to classify, I wouldn't recommend using it. You would also have to write your own feature extractor for it, because the IEER format is different from the ACE format.

# Features

A key task in building a good supervised ML model is identifying which features will work best. A feature is something you will compute statistics on. In NER (named entity recognition), for example, one feature could be whether the word contains an upper-case letter (note: for Twitter data, this could be a bad feature). So what features are used in the NLTK's NEC? I've listed them below:

• The shape of the word (e.g., does it contain numbers? does it begin with a capital letter?)

• The length of the word

• The first three letters of the word

• The last three letters of the word

• The POS tag of the word

• The word itself

• Does the word exist in an English dictionary?

• The tag of the word that precedes this word (i.e., was the previous word identified as a NE)

• The POS tag of the preceding word

• The POS tag of the following word

• The word that precedes this word

• The word that follows this word

• The word combined with the POS tag of the following word

• The POS tag of the word combined with the tag of the preceding word

• The shape of the word combined with the tag of the preceding word
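The list above can be sketched as a single function that turns one token into a feature dictionary. This is a simplified illustration, not the NLTK's own feature detector: `word_shape`, `extract_features`, the hand-written POS tags, and the tiny `wordlist` are all assumptions made for the example. The feature names are borrowed from the classifier output shown later in this article.

```python
import re

def word_shape(word):
    """Coarse shape of a token (a simplified, hypothetical categorization)."""
    if re.match(r'[0-9]+(\.[0-9]+)?$', word):
        return 'number'
    if re.match(r'\W+$', word):
        return 'punct'
    if word.istitle():
        return 'upcase'
    if word.islower():
        return 'downcase'
    return 'mixedcase'

def extract_features(tagged_tokens, index, history, wordlist):
    """Build a feature dict for the token at `index`.

    tagged_tokens: list of (word, pos) pairs
    history: NE tags already assigned to tokens 0..index-1
    wordlist: set of dictionary words (a stand-in for a real English dictionary)
    """
    word, pos = tagged_tokens[index]
    if index > 0:
        prevword, prevpos = tagged_tokens[index - 1]
        prevtag = history[index - 1]
    else:
        prevword = prevpos = prevtag = None
    if index + 1 < len(tagged_tokens):
        nextword, nextpos = tagged_tokens[index + 1]
    else:
        nextword = nextpos = None
    shape = word_shape(word)
    return {
        'shape': shape,
        'wordlen': len(word),
        'prefix3': word[:3].lower(),
        'suffix3': word[-3:].lower(),
        'pos': pos,
        'word': word,
        'en-wordlist': word in wordlist,
        'prevtag': prevtag,
        'prevpos': prevpos,
        'nextpos': nextpos,
        'prevword': prevword,
        'nextword': nextword,
        'word+nextpos': '%s+%s' % (word, nextpos),
        'pos+prevtag': '%s+%s' % (pos, prevtag),
        'shape+prevtag': '%s+%s' % (shape, prevtag),
    }

# Hypothetical input: POS tags written by hand, plus a tiny stand-in dictionary.
tagged = [('I', 'PRP'), ('hate', 'VBP'), ('Apple', 'NNP'), ('products', 'NNS'), ('.', '.')]
features = extract_features(tagged, 2, ['O', 'O'], wordlist={'I', 'hate', 'products'})
for name in sorted(features):
    print('%-14s %r' % (name, features[name]))
```

Note that several features depend on `history`, the tags already chosen for earlier words, which is why the tagger has to work through a sentence from left to right.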

# Lifting the Hood

As you can see, the list is long, and it is hard to guess intuitively how the NEC will behave in different situations. To lift the hood on this, I've written some code using methods available in the NLTK to gain insight into why the NEC performs the way it does on different sentences:

In [4]:
# Loads the serialized NEChunkParser object.
# NOTE: the loading line is missing from this cell as published. The resource
# path below is an assumption: it is the pickle that nltk.ne_chunk loads in NLTK 3.0.x.
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The MaxEnt classifier
maxEnt = chunker._tagger.classifier()

def maxEnt_report():
    maxEnt = chunker._tagger.classifier()
    print("These are the labels used by the NLTK's NEC:\n")
    print(maxEnt.labels())
    print("These are the most informative features found in the ACE corpora:\n")
    # show_most_informative_features() prints its own table and returns None,
    # so the outer print() also emits a trailing 'None'.
    print(maxEnt.show_most_informative_features())

def ne_report(sentence, report_all=False):
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = nltk.pos_tag(tokens)
    tags = []  # IOB tags chosen so far; the tagger uses them as history
    for i in range(len(tagged_tokens)):
        featureset = chunker._tagger.feature_detector(tagged_tokens, i, tags)
        tag = chunker._tagger.choose_tag(tagged_tokens, i, tags)
        if tag != 'O' or report_all:
            print("\nExplanation on the why the word '" + tagged_tokens[i][0] + "' was tagged:")
            maxEnt.explain(featureset)
        tags.append(tag)


The first function, maxEnt_report(), just displays information specific to the MaxEnt classifier. Here's how it works. If you execute:

In [5]:
maxEnt_report()

These are the labels used by the NLTK's NEC:

['I-GSP', 'B-LOCATION', 'B-GPE', 'I-ORGANIZATION', 'I-PERSON', 'O', 'I-FACILITY', 'I-LOCATION', 'B-PERSON', 'B-FACILITY', 'B-GSP', 'B-ORGANIZATION', 'I-GPE']
These are the most informative features found in the ACE corpora:

10.125 bias==True and label is 'O'
6.631 suffix3=='day' and label is 'O'
-6.207 bias==True and label is 'I-GSP'
5.628 prevtag=='O' and label is 'O'
-4.740 shape=='upcase' and label is 'O'
4.106 shape+prevtag=='<function shape at 0x8bde0d4>+O' and label is 'O'
-3.994 shape=='mixedcase' and label is 'O'
3.992 pos+prevtag=='NNP+B-PERSON' and label is 'I-PERSON'
3.890 prevtag=='I-ORGANIZATION' and label is 'I-ORGANIZATION'
3.879 shape+prevtag=='<function shape at 0x8bde0d4>+I-ORGANIZATION' and label is 'I-ORGANIZATION'
None


The first part of the output lists the labels used in the NLTK's NEC. The 'I-', 'O', and 'B-' prefixes require some explaining. These come from a form of tagging known as IOB (inside, outside, and begin) tagging. When a chunk begins, the first word is prefixed with a 'B' to indicate this word is the beginning of a chunk. The next word, if it belongs to the same chunk, is prefixed with 'I' to indicate it's part of the chunk, but not the beginning. If a word does not belong to a chunk, it is labeled as 'O', meaning it is outside. The purpose of this notation is to satisfy the definition of a chunk: each word carries exactly one label, so chunks cannot overlap, and the 'B' prefix keeps two adjacent chunks from being read as one longer chunk.
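As a rough illustration of how IOB labels map back to chunks, the sketch below groups a tagged sentence into (label, words) pairs. The helper `iob_to_chunks` is hypothetical and the tags are written by hand; they are not actual output from the NLTK's NEC.

```python
def iob_to_chunks(tagged_words):
    """Group (word, IOB tag) pairs into (label, [words]) chunks."""
    chunks = []
    current = None  # the chunk being built, or None when outside any chunk
    for word, tag in tagged_words:
        if tag == 'O':
            current = None
            continue
        prefix, label = tag.split('-', 1)
        if prefix == 'B' or current is None or current[0] != label:
            # Start a new chunk (this also covers a stray 'I-' tag with nothing to continue).
            current = (label, [word])
            chunks.append(current)
        else:
            # An 'I-' tag with a matching label extends the open chunk.
            current[1].append(word)
    return chunks

# Hand-written tags for illustration only.
example = [
    ('President', 'O'), ('Barack', 'B-PERSON'), ('Obama', 'I-PERSON'),
    ('visited', 'O'), ('Virginia', 'B-GPE'), ('.', 'O'),
]
print(iob_to_chunks(example))
# prints [('PERSON', ['Barack', 'Obama']), ('GPE', ['Virginia'])]
```

Because every word receives exactly one tag, Obama can only ever appear inside the Barack Obama chunk, never as a second, overlapping chunk.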

Here's how the second function, ne_report(), works:

In [6]:
ne_report('I am very excited about the next generation of Apple products.')

Explanation on the why the word 'Apple' was tagged:
Feature                                            B-GPE       O B-ORGAN   B-GSP
--------------------------------------------------------------------------------
prevtag=='O' (1)                                   3.767
shape=='upcase' (1)                                2.701
pos+prevtag=='NNP+O' (1)                           2.254
en-wordlist==False (1)                             2.095
label is 'B-GPE' (1)                              -2.005
bias==True (1)                                    -1.975
prevword=='of' (1)                                 0.742
pos=='NNP' (1)                                     0.681
nextpos=='nns' (1)                                 0.661
prevpos=='IN' (1)                                  0.311
wordlen==5 (1)                                     0.113
nextword=='products' (1)                           0.060
bias==True (1)                                            10.125
prevtag=='O' (1)                                           5.628
shape=='upcase' (1)                                       -4.740
prevpos=='IN' (1)                                         -1.668
label is 'O' (1)                                          -1.075
pos=='NNP' (1)                                            -1.024
suffix3=='ple' (1)                                         0.797
en-wordlist==False (1)                                     0.698
wordlen==5 (1)                                            -0.449
prevword=='of' (1)                                        -0.217
nextpos=='nns' (1)                                         0.104
prefix3=='app' (1)                                         0.089
pos+prevtag=='NNP+O' (1)                                   0.011
nextword=='products' (1)                                   0.005
prevtag=='O' (1)                                                   3.389
pos+prevtag=='NNP+O' (1)                                           1.725
bias==True (1)                                                     0.955
en-wordlist==False (1)                                             0.837
label is 'B-ORGANIZATION' (1)                                      0.718
nextpos=='nns' (1)                                                 0.365
wordlen==5 (1)                                                    -0.351
pos=='NNP' (1)                                                     0.174
prevpos=='IN' (1)                                                 -0.139
prevword=='of' (1)                                                 0.131
prefix3=='app' (1)                                                -0.126
shape=='upcase' (1)                                               -0.084
suffix3=='ple' (1)                                                -0.077
prevtag=='O' (1)                                                           2.925
pos+prevtag=='NNP+O' (1)                                                   2.213
shape=='upcase' (1)                                                        0.929
en-wordlist==False (1)                                                     0.891
bias==True (1)                                                            -0.592
label is 'B-GSP' (1)                                                      -0.565
prevpos=='IN' (1)                                                          0.410
nextpos=='nns' (1)                                                         0.399
pos=='NNP' (1)                                                             0.393
prevword=='of' (1)                                                         0.184
wordlen==5 (1)                                                             0.177
---------------------------------------------------------------------------------
TOTAL:                                             9.406   8.283   7.515   7.366
PROBS:                                             0.453   0.208   0.122   0.110


It outputs explanations only for words that were tagged as a NE (if you want output for all words, regardless of whether each is a NE, set the second argument to True). I want to point out something here. The probabilities listed in the final row do not add up to one (they add up to ~0.89). This is because the output displays only the top four candidate labels.
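The arithmetic behind the last two rows can be checked by hand. The sketch below copies numbers from the explanation of Apple and makes one assumption that is not stated in the output itself: that each TOTAL is a base-2 log score, so that 2 ** total is an unnormalized probability. The printed numbers are consistent with that reading, but treat it as an inference, not as documented behavior.

```python
# Per-feature weights copied from the B-GPE column of the explanation.
b_gpe_weights = [3.767, 2.701, 2.254, 2.095, -2.005, -1.975,
                 0.742, 0.681, 0.661, 0.311, 0.113, 0.060]
b_gpe_total = sum(b_gpe_weights)
print("Sum of B-GPE weights: %.3f (the TOTAL row says 9.406)" % b_gpe_total)

# TOTAL and PROBS rows copied from the explanation.
labels = ['B-GPE', 'O', 'B-ORGANIZATION', 'B-GSP']
totals = [9.406, 8.283, 7.515, 7.366]
probs = [0.453, 0.208, 0.122, 0.110]

shown_mass = sum(probs)
print("Probability mass of the four labels shown: %.3f" % shown_mass)

# Assumption: scores are base-2 logs, so 2 ** total is an unnormalized probability.
unnormalized = [2 ** t for t in totals]
norm = sum(unnormalized)
for label, u, p in zip(labels, unnormalized, probs):
    # Renormalize both sides over the four labels shown so they are comparable.
    print("%-15s from totals: %.3f   from PROBS: %.3f" % (label, u / norm, p / shown_mass))
```

The two columns agree to within rounding, which suggests that each TOTAL is simply the sum of the feature weights for that label and that the probabilities come from exponentiating and normalizing those totals across all thirteen labels, not only the four displayed.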

Try to run this on the sentence:

I hate Apple products

and see if you can identify the features that caused Apple not to be tagged.