Lifting the Hood on NLTK's NE Chunker

by Matt Johnson - Mon 23 May 2016
Tags: #machine learning #NLP

Python has a wonderful open-source library for performing NLP (natural language processing) on text: the Natural Language Toolkit (NLTK). This article is written for NLTK v3.0.5. NLTK gives you many features out of the box. For example, you can obtain the part-of-speech (POS) tags of the words in a sentence with the following code:

In [1]:
import nltk
sentence = 'This is a sentence.'
tokens = nltk.word_tokenize(sentence)
tagged_tokens = nltk.pos_tag(tokens)
print(tagged_tokens)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]

The focus of this article is the NLTK's NE (named entity) chunker, which I will abbreviate as NEC. A named entity is something like Wal-Mart, Virginia, or Barack Obama.

A named entity is not something like store, walked, or saw. A chunk is a substring of text that cannot overlap another chunk. Consider the following sentence:

President Barack Obama is the 44th President of the United States of America.

In this example, you'll never receive both Obama and Barack Obama as chunks, because they would overlap. So NE chunking is the process of identifying chunks in text that are NEs. Let's see the NLTK's NEC in action:

In [2]:
tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')
tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tokens)
In [3]:
tokens = nltk.word_tokenize('I hate Apple products.')
tokens = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tokens)

These examples were chosen for a reason. Notice that in both examples, the word Apple was POS tagged as NNP (the UPenn Treebank II tag for a proper, singular noun). Also, in both examples the word begins with an upper-case letter. So why did the NEC tag Apple as a NE in the first example, but not in the second? And why did Apple get tagged as a GPE (geo-political entity)?

Machine Learning

The NLTK's NEC works by using a supervised machine learning algorithm known as a MaxEnt classifier. A MaxEnt classifier gets its name from maximum entropy. For a discrete probability distribution, entropy is at its maximum when the distribution is uniform. A MaxEnt classifier is equivalent to logistic regression; the difference lies in the derivation. In the MaxEnt derivation, you choose the maximum-entropy distribution that is consistent with the training data and derive the sigmoid function. In the logistic regression derivation, you assume the sigmoid function from the start [J. Mount].
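To make the two claims above concrete, here is a small sketch in plain Python (no NLTK needed). The helper names entropy, softmax, and sigmoid are my own, and the distributions and score are made-up numbers. It shows that a uniform distribution has higher entropy than a skewed one, and that with only two labels the normalized-exponential form collapses to the sigmoid.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log(p, 2) for p in probs if p > 0)

def softmax(scores):
    """Turn real-valued scores into probabilities that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
print(entropy(uniform))  # about 2.0 bits, the most four outcomes allow
print(entropy(skewed))   # strictly less than the uniform case

# With two labels and scores (score, 0), softmax reduces to the sigmoid.
score = 1.3  # arbitrary illustrative value
print(softmax([score, 0.0])[0], sigmoid(score))
```

The last line prints the same number twice, which is the sense in which the two models coincide in the binary case.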

This machine learning model uses data from a corpus that has been manually annotated for NEs. A person, called an annotator, reads sentence after sentence and manually marks where the NEs are found in the text. This is, of course, a very tedious task, so it is no wonder that most annotated corpora are not distributed for free. In fact, the NLTK does not provide you with the corpus its NEC was trained on (it was trained on data from ACE, Automatic Content Extraction). What the authors did provide, however, is a pickle file (a serialized Python object) trained on this data. This pickle file is a freeze-dried instance of the statistics needed for the MaxEnt classifier.

One note: the NLTK does provide NE-annotated data, found in corpora/ieer. However, unless that data is a good representation of the data you want to classify, I wouldn't recommend using it. You will also have to write your own feature extractor for it, because the IEER format is different from the ACE format.


A key part of building a good supervised ML model is identifying which features will work best. A feature is something you will compute statistics on. In NER (named entity recognition), for example, one feature could be whether the word contains an upper-case letter (note: for Twitter data, this could be a bad feature). So what features are used in the NLTK's NEC? I've listed them below:

  • The shape of the word (e.g., does it contain numbers? does it begin with a capital letter?)

  • The length of the word

  • The first three letters of the word

  • The last three letters of the word

  • The POS tag of the word

  • The word itself

  • Does the word exist in an English dictionary?

  • The tag of the word that precedes this word (i.e., was the previous word identified as a NE)

  • The POS tag of the preceding word

  • The POS tag of the following word

  • The word that precedes this word

  • The word that follows this word

  • The word combined with the POS tag of the following word

  • The POS tag of the word combined with the tag of the preceding word

  • The shape of the word combined with the tag of the preceding word
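To give a feel for what a feature extractor over this list might look like, here is a rough sketch in plain Python. It is not the NLTK's implementation: the functions word_shape and extract_features, the shape categories, and the empty default word list are all assumptions of mine. The feature names loosely follow the ones that show up in the classifier output later in this article.

```python
import re

def word_shape(word):
    """Coarse description of a word's form (hypothetical categories)."""
    if re.match(r'^[0-9]+(\.[0-9]+)?$', word):
        return 'number'
    if re.match(r'^\W+$', word):
        return 'punct'
    if re.match(r'^[A-Z][a-z]+$', word):
        return 'upcase'
    if re.match(r'^[a-z]+$', word):
        return 'downcase'
    if re.match(r'^\w+$', word):
        return 'mixedcase'
    return 'other'

def extract_features(tagged_tokens, index, history, wordlist=frozenset()):
    """Build a feature dict for tagged_tokens[index].

    tagged_tokens: list of (word, pos) pairs
    history: IOB tags already assigned to the tokens before index
    wordlist: lowercase English words (a stand-in for a real dictionary)
    """
    word, pos = tagged_tokens[index]
    if index > 0:
        prevword, prevpos = tagged_tokens[index - 1]
        prevtag = history[index - 1]
    else:
        prevword = prevpos = prevtag = None
    if index + 1 < len(tagged_tokens):
        nextword, nextpos = tagged_tokens[index + 1]
    else:
        nextword = nextpos = None
    shape = word_shape(word)
    return {
        'shape': shape,
        'wordlen': len(word),
        'prefix3': word[:3].lower(),
        'suffix3': word[-3:].lower(),
        'pos': pos,
        'word': word,
        'en-wordlist': word.lower() in wordlist,
        'prevtag': prevtag,
        'prevpos': prevpos,
        'nextpos': nextpos,
        'prevword': prevword,
        'nextword': nextword,
        'word+nextpos': '%s+%s' % (word.lower(), nextpos),
        'pos+prevtag': '%s+%s' % (pos, prevtag),
        'shape+prevtag': '%s+%s' % (shape, prevtag),
    }

# Hand-tagged fragment for illustration; 'O' means "not part of a NE".
tagged = [('generation', 'NN'), ('of', 'IN'), ('Apple', 'NNP'), ('products', 'NNS')]
features = extract_features(tagged, 2, ['O', 'O'])
print(features['shape'], features['suffix3'], features['pos+prevtag'])
```

Note that several features depend on the tag already chosen for the previous word, which is why the tagger has to work through a sentence left to right.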

Lifting the Hood

As you can see, the list is long, and it is hard to guess intuitively how the NEC will behave in different situations. To lift the hood on this, I've written some code using methods available in the NLTK to gain insight into why the NEC performs the way it does on different sentences:

In [4]:
# Loads the serialized NEChunkParser object
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The MaxEnt classifier
maxEnt = chunker._tagger.classifier()

def maxEnt_report():
    maxEnt = chunker._tagger.classifier()
    print("These are the labels used by the NLTK's NEC:\n")
    print(maxEnt.labels())
    print("\nThese are the most informative features found in the ACE corpora:\n")
    maxEnt.show_most_informative_features()

def ne_report(sentence, report_all=False):
    tokens = nltk.word_tokenize(sentence)
    tagged_tokens = nltk.pos_tag(tokens)
    tags = []  # IOB tags chosen so far; the tagger uses them as history
    for i in range(len(tagged_tokens)):
        tag = chunker._tagger.choose_tag(tagged_tokens, i, tags)
        if tag != 'O' or report_all:
            print("\nExplanation on the why the word '" + tagged_tokens[i][0] + "' was tagged:")
            featureset = chunker._tagger.feature_detector(tagged_tokens, i, tags)
            # Print the per-label feature weights behind this decision
            maxEnt.explain(featureset)
        tags.append(tag)

The first function, maxEnt_report(), just displays information specific to the MaxEnt classifier. Here's how it works. If you execute:

In [5]:
maxEnt_report()
These are the labels used by the NLTK's NEC:

These are the most informative features found in the ACE corpora:

  10.125 bias==True and label is 'O'
   6.631 suffix3=='day' and label is 'O'
  -6.207 bias==True and label is 'I-GSP'
   5.628 prevtag=='O' and label is 'O'
  -4.740 shape=='upcase' and label is 'O'
   4.106 shape+prevtag=='<function shape at 0x8bde0d4>+O' and label is 'O'
  -3.994 shape=='mixedcase' and label is 'O'
   3.992 pos+prevtag=='NNP+B-PERSON' and label is 'I-PERSON'
   3.890 prevtag=='I-ORGANIZATION' and label is 'I-ORGANIZATION'
   3.879 shape+prevtag=='<function shape at 0x8bde0d4>+I-ORGANIZATION' and label is 'I-ORGANIZATION'

The first paragraph reports the labels used in the NLTK's NEC. The 'B-', 'I-', and 'O' prefixes require some explaining. They come from a form of tagging known as IOB (inside, outside, begin) tagging. When a chunk begins, its first word is prefixed with 'B-' to indicate that this word is the beginning of a chunk. The next word, if it belongs to the same chunk, is prefixed with 'I-' to indicate that it is inside the chunk, but not its beginning. If a word does not belong to a chunk, it is labeled 'O', meaning it is outside. This notation satisfies the definition of a chunk: every word carries exactly one tag, so chunks cannot overlap, and the 'B-' prefix marks where a new chunk starts even when it directly follows another one.
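As a quick illustration of how IOB tags map back to chunks, here is a minimal sketch in plain Python. The function iob_to_chunks is my own helper, not part of the NLTK, and the tags below are written by hand for the example rather than produced by the NEC.

```python
def iob_to_chunks(words, tags):
    """Group words into (label, [words]) chunks from their IOB tags."""
    chunks = []
    current = None  # the chunk being built, or None when outside a chunk
    for word, tag in zip(words, tags):
        if tag == 'O':
            current = None
            continue
        prefix, label = tag.split('-', 1)
        # Start a new chunk on 'B-', or on a stray 'I-' with no open chunk
        if prefix == 'B' or current is None or current[0] != label:
            current = (label, [word])
            chunks.append(current)
        else:
            current[1].append(word)
    return chunks

words = ['President', 'Barack', 'Obama', 'visited', 'Virginia', '.']
tags = ['O', 'B-PERSON', 'I-PERSON', 'O', 'B-GPE', 'O']
print(iob_to_chunks(words, tags))
# [('PERSON', ['Barack', 'Obama']), ('GPE', ['Virginia'])]
```

Because each word has exactly one tag, a word can only ever land in one chunk, so Obama and Barack Obama can never both come out of the same sentence.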

Here's how the next method works:

In [6]:
ne_report('I am very excited about the next generation of Apple products.')
Explanation on the why the word 'Apple' was tagged:
  Feature                                            B-GPE       O B-ORGAN   B-GSP
  prevtag=='O' (1)                                   3.767
  shape=='upcase' (1)                                2.701
  pos+prevtag=='NNP+O' (1)                           2.254
  en-wordlist==False (1)                             2.095
  label is 'B-GPE' (1)                              -2.005
  bias==True (1)                                    -1.975
  prevword=='of' (1)                                 0.742
  pos=='NNP' (1)                                     0.681
  nextpos=='nns' (1)                                 0.661
  prevpos=='IN' (1)                                  0.311
  wordlen==5 (1)                                     0.113
  nextword=='products' (1)                           0.060
  bias==True (1)                                            10.125
  prevtag=='O' (1)                                           5.628
  shape=='upcase' (1)                                       -4.740
  prevpos=='IN' (1)                                         -1.668
  label is 'O' (1)                                          -1.075
  pos=='NNP' (1)                                            -1.024
  suffix3=='ple' (1)                                         0.797
  en-wordlist==False (1)                                     0.698
  wordlen==5 (1)                                            -0.449
  prevword=='of' (1)                                        -0.217
  nextpos=='nns' (1)                                         0.104
  prefix3=='app' (1)                                         0.089
  pos+prevtag=='NNP+O' (1)                                   0.011
  nextword=='products' (1)                                   0.005
  prevtag=='O' (1)                                                   3.389
  pos+prevtag=='NNP+O' (1)                                           1.725
  bias==True (1)                                                     0.955
  en-wordlist==False (1)                                             0.837
  label is 'B-ORGANIZATION' (1)                                      0.718
  nextpos=='nns' (1)                                                 0.365
  wordlen==5 (1)                                                    -0.351
  pos=='NNP' (1)                                                     0.174
  prevpos=='IN' (1)                                                 -0.139
  prevword=='of' (1)                                                 0.131
  prefix3=='app' (1)                                                -0.126
  shape=='upcase' (1)                                               -0.084
  suffix3=='ple' (1)                                                -0.077
  prevtag=='O' (1)                                                           2.925
  pos+prevtag=='NNP+O' (1)                                                   2.213
  shape=='upcase' (1)                                                        0.929
  en-wordlist==False (1)                                                     0.891
  bias==True (1)                                                            -0.592
  label is 'B-GSP' (1)                                                      -0.565
  prevpos=='IN' (1)                                                          0.410
  nextpos=='nns' (1)                                                         0.399
  pos=='NNP' (1)                                                             0.393
  prevword=='of' (1)                                                         0.184
  wordlen==5 (1)                                                             0.177
  TOTAL:                                             9.406   8.283   7.515   7.366
  PROBS:                                             0.453   0.208   0.122   0.110

It outputs explanations only for words that were tagged as a NE (if you want output for all words, regardless of whether they are NEs, set the second argument to True). I want to point out something here. The probabilities listed in the final row do not add up to one (they add up to ~0.89). This is because the output only displays the top four candidate labels.
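You can check this arithmetic yourself with the numbers copied from the TOTAL and PROBS rows above. The sketch below also compares ratios of probabilities with ratios of 2 raised to the totals. That the scores are base-2 log weights is my reading of these numbers, not something the output states, so treat it as an assumption.

```python
# Values copied from the TOTAL and PROBS rows of the table above.
# The 'B-ORGAN' column header is the truncated 'B-ORGANIZATION' label.
totals = {'B-GPE': 9.406, 'O': 8.283, 'B-ORGANIZATION': 7.515, 'B-GSP': 7.366}
probs = {'B-GPE': 0.453, 'O': 0.208, 'B-ORGANIZATION': 0.122, 'B-GSP': 0.110}

shown = sum(probs.values())
print('mass shown: %.3f' % shown)                           # mass shown: 0.893
print('mass left for other labels: %.3f' % (1.0 - shown))   # 0.107

# If probability is proportional to 2 ** total (an assumption), the two
# ratio columns printed below should agree up to rounding.
for label in ['O', 'B-ORGANIZATION', 'B-GSP']:
    from_probs = probs['B-GPE'] / probs[label]
    from_totals = 2 ** (totals['B-GPE'] - totals[label])
    print('%-15s %.2f %.2f' % (label, from_probs, from_totals))
```

The remaining ~0.107 of probability mass is spread over the labels that did not make the top four.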

Try to run this on the sentence:

I hate Apple products

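and see if you can identify the features which caused Apple to miss being tagged. Because ne_report() only reports words tagged as a NE by default, set the second argument to True to see the explanation for Apple.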