Lifting the Hood on NLTK's NE Chunker
Python has a wonderful open-source library for performing NLP (natural language processing) on text. This library is called the Natural Language Toolkit (NLTK). This article is written for NLTK v3.0.5. NLTK gives you many features out-of-the-box. For example, you can obtain parts-of-speech (POS) tags of words in a sentence with the following code:
import nltk sentence = 'This is a sentence.' tokens = nltk.word_tokenize(sentence) tagged_tokens = nltk.pos_tag(tokens) print(tagged_tokens)
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sentence', 'NN'), ('.', '.')]
The focus of this writing, is the NLTK's NE (named entity) chunker, which I will abbreviate as a NEC. A named entity is something like Wal-Mart, Virginia, or Barack Obama. </p>
What a named entity is not, is something like store, walked, or saw. The definition of a chunk is a substring of text which cannot overlap another chunk. Consider the following sentence:
President Barack Obama is the 44th President of the United States of America.
In this example, you'll never end up receiving a chunk of Obama and Barack Obama. Can't happen--because in this example they would overlap. So the process of NE chunking, is identifying chunks in text which are NEs. Let's demonstrate the NLTK's NEC in action:
tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.') tokens = nltk.pos_tag(tokens) tree = nltk.ne_chunk(tokens) tree
tokens = nltk.word_tokenize('I hate Apple products.') tokens = nltk.pos_tag(tokens) tree = nltk.ne_chunk(tokens) tree
These examples are chosen for a reason. Notice in both examples, the word Apple was POS tagged as NNP (NNP is the UPenn TreeBank II for a proper, singular noun). Also, both words begin with upper-case letters. So why did the NEC tag Apple as a NE in the first example, but not the second? Also, why did Apple get tagged as a GPE (geo-political entity)?
Machine LearningThe NLTK's NEC works by using a supervised machine learning algorithm known as a MaxEnt classifier. A MaxEnt classifier gets its name from maximum entropy. For a discrete probability distribution, maximum entropy is obtained when the distribution is uniform. A MaxEnt classifier is logistic regression. The difference is theoretical, because in the MaxEnt derivation, you assume maximum entropy and derive the sigmoid function. In the logistic regression derivation, you assume the sigmoid function. [J. Mount].
This machine learning model uses data from a corpus that has been manually annotated for NEs. A person, called an annotator, will read sentence after sentence and manually mark where the NEs are found in text. This is of course, a very tedious task. It is no wonder that most annotated corpora are not distributed for free. In fact, the NLTK does not provide you with the corpora it trained the NEC on (it was trained on data from ACE--Automatic Content Extraction). What the authors did provide, however, was a pickle file (a python serialized object) trained on this data. This pickle file, is a freeze-dried instance of the statistics needed for the MaxEnt classifier.
A note I'd like to add, is that the NLTK does provide NE annotated data found in corpora/ieer. However, unless that data is a good representation of the data you want to classify on, I wouldn't recommend using it. Also, you will have to write your own feature extractor for this, because the format in IEER is different than ACE.
FeaturesThe task of building a good supervised ML is identifying which features will work best. A feature is something you will compute statistics on. In NER, for example, one feature could be whether the word contains an upper-case letter (note: for Twitter data, this could be a bad feature). So what features are used in NLTK's NEC? I've listed them below:
- The shape of the word (e.g., does it contain numbers? does it begin with a capital letter?)
- The length of the word
- The first three letters of the word
- The last three letters of the word
- The POS tag of the word
- The word itself
- Does the word exist in an English dictionary?
- The tag of the word that precedes this word (i.e., was the previous word identified as a NE)
- The POS tag of the preceding word
- The POS tag of the following word
- The word that precedes this word
- The word that follows this word
- The word combined with the POS tag of the following word
- The POS tag of the word combined with the tag of the preceding word
The shape of the word combined with the tag of the preceding word
Lifting the HoodAs you can see, the list is long and will be hard to intuitively guess how the NEC will behave in different situations. To lift the hood on this, I've written some code using methods available in the NLTK to gain insight on why the NEC performs the way it does on different sentences:
# Loads the serialized NEChunkParser object chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle') # The MaxEnt classifier maxEnt = chunker._tagger.classifier() def maxEnt_report(): maxEnt = chunker._tagger.classifier() print("These are the labels used by the NLTK\'s NEC:\n") print(maxEnt.labels()) print("These are the most informative features found in the ACE corpora:\n") print(maxEnt.show_most_informative_features()) def ne_report(sentence, report_all=False): tokens = nltk.word_tokenize(sentence) tagged_tokens = nltk.pos_tag(tokens) tags =  for i in xrange(len(tagged_tokens)): featureset = chunker._tagger.feature_detector(tagged_tokens, i, tags) tag = chunker._tagger.choose_tag(tagged_tokens, i, tags) if tag != 'O' or report_all: print '\nExplanation on the why the word \'' + tagged_tokens[i] + '\' was tagged:' featureset = chunker._tagger.feature_detector(tagged_tokens, i, tags) maxEnt.explain(featureset) tags.append(tag)
The first function,
maxEnt_report(), just displays information specific to the MaxEnt classifier. Here's how it works. If you execute:
These are the labels used by the NLTK's NEC: ['I-GSP', 'B-LOCATION', 'B-GPE', 'I-ORGANIZATION', 'I-PERSON', 'O', 'I-FACILITY', 'I-LOCATION', 'B-PERSON', 'B-FACILITY', 'B-GSP', 'B-ORGANIZATION', 'I-GPE'] These are the most informative features found in the ACE corpora: 10.125 bias==True and label is 'O' 6.631 suffix3=='day' and label is 'O' -6.207 bias==True and label is 'I-GSP' 5.628 prevtag=='O' and label is 'O' -4.740 shape=='upcase' and label is 'O' 4.106 shape+prevtag=='<function shape at 0x8bde0d4>+O' and label is 'O' -3.994 shape=='mixedcase' and label is 'O' 3.992 pos+prevtag=='NNP+B-PERSON' and label is 'I-PERSON' 3.890 prevtag=='I-ORGANIZATION' and label is 'I-ORGANIZATION' 3.879 shape+prevtag=='<function shape at 0x8bde0d4>+I-ORGANIZATION' and label is 'I-ORGANIZATION' None
The first paragraph reports the labels used in the NLTK's NEC. The 'I-', 'O', and 'B-' prefixes require some explaining. These come from a form of tagging known as IOB (inside, outside, and begin) tagging. When a chunk begins, the first word is prefixed with a 'B' to indicate this word is the beginning of a chunk. The next word, if it belongs to the same chunk, would be prefixed with 'I' to indicate it's part of the chunk, but not the beginning. If a word does not belong to a chunk, it is labeled as 'O', meaning it is outside. The purpose of this notation is to satisfy the definition of a chunk.
Here's how the next method works:
ne_report('I am very excited about the next generation of Apple products.')
Explanation on the why the word 'Apple' was tagged: Feature B-GPE O B-ORGAN B-GSP -------------------------------------------------------------------------------- prevtag=='O' (1) 3.767 shape=='upcase' (1) 2.701 pos+prevtag=='NNP+O' (1) 2.254 en-wordlist==False (1) 2.095 label is 'B-GPE' (1) -2.005 bias==True (1) -1.975 prevword=='of' (1) 0.742 pos=='NNP' (1) 0.681 nextpos=='nns' (1) 0.661 prevpos=='IN' (1) 0.311 wordlen==5 (1) 0.113 nextword=='products' (1) 0.060 bias==True (1) 10.125 prevtag=='O' (1) 5.628 shape=='upcase' (1) -4.740 prevpos=='IN' (1) -1.668 label is 'O' (1) -1.075 pos=='NNP' (1) -1.024 suffix3=='ple' (1) 0.797 en-wordlist==False (1) 0.698 wordlen==5 (1) -0.449 prevword=='of' (1) -0.217 nextpos=='nns' (1) 0.104 prefix3=='app' (1) 0.089 pos+prevtag=='NNP+O' (1) 0.011 nextword=='products' (1) 0.005 prevtag=='O' (1) 3.389 pos+prevtag=='NNP+O' (1) 1.725 bias==True (1) 0.955 en-wordlist==False (1) 0.837 label is 'B-ORGANIZATION' (1) 0.718 nextpos=='nns' (1) 0.365 wordlen==5 (1) -0.351 pos=='NNP' (1) 0.174 prevpos=='IN' (1) -0.139 prevword=='of' (1) 0.131 prefix3=='app' (1) -0.126 shape=='upcase' (1) -0.084 suffix3=='ple' (1) -0.077 prevtag=='O' (1) 2.925 pos+prevtag=='NNP+O' (1) 2.213 shape=='upcase' (1) 0.929 en-wordlist==False (1) 0.891 bias==True (1) -0.592 label is 'B-GSP' (1) -0.565 prevpos=='IN' (1) 0.410 nextpos=='nns' (1) 0.399 pos=='NNP' (1) 0.393 prevword=='of' (1) 0.184 wordlen==5 (1) 0.177 --------------------------------------------------------------------------------- TOTAL: 9.406 8.283 7.515 7.366 PROBS: 0.453 0.208 0.122 0.110
It outputs only explanations for words that we tagged as a NE (if you want output for all words, regardless of whether it is a NE, set the second argument to True) I want to point out something here. The probabilities listed in the final row do not add up to one (they add up to ~0.89). This is because the output only displays the top four candidate labels.
Try to run this on the sentence:
I hate Apple products
and see if you can identify the features which caused it to miss being tagged.