Bernoulli Naive Bayes Classifier

by Matt Johnson - Tue 07 June 2016
Tags: #machine learning #NLP

From this article we learned the fundamentals of how a naive Bayes classifier works. Our task is to calculate the following:

$\text{log}\,p(Y) + \sum_{j = 1}^{P} \; \text{log} \, p(X_j \, | \, Y)$

As pointed out, the different naive Bayes algorithms result from different choices of how to model $p(Y)$ and $p(X|Y)$. This article teaches you how to implement a naive Bayes classifier when we model $p(X|Y)$ as a Bernoulli distribution.

Bernoulli Naive Bayes

For the Bernoulli naive Bayes classifier, we let $X = \{0,1\}$.

Then, we let $p(X|Y)$ be modeled as Bernoulli distribution:

$p(X|Y) = \theta^{X} (1 - \theta)^{1-X}$

We have to model a Bernoulli distribution for each class and each feature, so our terms look like:

$p(X_j \, | \, Y = y_k) = \theta_{kj}^{X_j} (1 - \theta_{kj})^{1-X_j}$

From this, we have $KP$ parameters to esimate. Also, what about $p(Y)$? Well, the outcome can only be one of K outcomes. This is like rolling a die and is a generalization of the Bernoulli distribution. This distribution, is known as a categorical probability distribution (some refer to it as a multinoulli distribution). We can model this as a categorical probability:

$p(Y = y_k) = \text{Cat}(Y = y_k | \, \vec{\theta}) = \Pi_i^{K} \theta_{i=1}^{[i=k]}$

Above, I have used Iverson bracket notation. So that $[i=k]$ is 1 when $i = k$, otherwise, it is 0.

So now, we have $KP$ $\theta$s parameters to estimate and $K-1$ parameters to esimate for $P(Y)$. So now, how we estimate these parameters from data?

Calculating Likelihood Function for $\theta$

Well, for a given class and a given feature, we have to estimate $\theta_{kj}$. Let's go back to the Bernoulli distribution:

$\text{Bern}(X|\theta) = \text{p}(X|\theta) = \theta^X (1 - \theta)^{1-X}$

Suppose we have data $D = \{X_1, X_2, X_3, ..., X_N\}$. What is the probability of observing this sequence? Well, assuming each observation is independent of the other...

$\text{p}(D|\theta) = \theta^X_1 (1 - \theta)^{1-X_1}\theta^{X_2} (1 - \theta)^{1-X_2}...(1 - \theta)^{1-X_N}$

We can gather like terms:

$\text{p}(D|\theta) = \theta^{\sum X_j} (1 - \theta)^{\sum 1-X_j}$

But this can further simplified if we let $m$ represent the number of times $X_j = 1$. Also, $N$ is the number of training examples. Thus:

$\text{p}(D|\theta) = \theta^{m} (1 - \theta)^{N-m}$

MLE for $\theta$

Now we have an equation for the likelihood of $\theta$:

$\text{p}(D|\theta) = \theta^{m} (1 - \theta)^{N-m}$

There are lots of possible $\theta$s we can choose. But we want the one that agrees with our data the most. Which one is that? Well, it would be the $\theta$ that maximizes $\text{p}(D|\theta)$. To find that $\theta$, we can use calculus:

$\frac{\partial}{\partial \theta} \text{p}(D|\theta) = 0$

Using the product rule, we get:

$ \frac{\partial}{\partial \theta} \big( \theta^{m} \big) (1 - \theta)^{N-m} + \big( \theta^{m} \big) \frac{\partial}{\partial \theta} \big[ (1 - \theta)^{N-m} \big] = 0$

Calculating the partial derivative of the first term and application of the chain rule for the second term yields:

$ m \theta^{m-1} (1 - \theta)^{N-m} - \theta^{m} (N-m)(1 - \theta)^{N-m-1} = 0$

Dividing both sides by $\theta^{m - 1}$ and $\theta^{N - m - 1}$ yields:

$ m(1 - \theta) - \theta(N-m) = 0$


$ m - m\theta - N \theta - m \theta = 0$

$m\theta$ cancels, solving for $\theta$ we get:

$\theta = \frac{m}{N}$

But we do this for a given class, so we get:

$\theta_{kj} = \frac{m_{kj}}{N_k}$

Serously? This is simply the fraction of examples, for a given class, that contain the particular feature. This method is known as maximum likelihood estimation or MLE for short. Also, taking the log of the likelihood function first makes the calculus easier. Try it for yourself.

Calculating Likelihood Function for $\vec{\theta}$

Now we have to estimate $\theta$ for our categorical distribution:

$\text{Cat}(Y = y_k | \, \vec{\theta}) = \Pi_i^{K} \theta_{i}^{[i=k]}$

Suppose we have data $D = \{Y_1, Y_2, Y_3,...,Y_N\}$. What is the probability of observing this sequence? Well, assuming each observation is independent of the other, we get:

$\text{p}(D \, | \, \vec{\theta}) = \theta_{1}^{m_1} \theta_{2}^{m_2}...\theta_{K}^{m_K}$

where, $m_1 + m_2 + ... + \, m_K = N$

MLE for $\vec{\theta}$

Now we have an equation for the likelihood of $\vec{\theta}$

$\text{p}(D \, | \, \vec{\theta}) = \theta_{1}^{m_1}\theta_{2}^{m_2}...\theta_{K}^{m_K}$

We'll use MLE again. Taking the log of this likelihood makes the math easier and does not affect the result:

$\text{log} \, \text{p}(D \, | \, \vec{\theta}) = m_1 \text{log} \theta_{1} + \, m_2 \text{log} \theta_{2} \, + ... + \, m_K \text{log} \theta_{K} \,$

Our problem is to maximize $\vec{\theta}$:

$\text{log} \, \text{p}(D \, | \, \vec{\theta}) = \sum_i^{K} m_i \text{log} \theta_{i}$

subject to the constraint

$\sum_{i=1}^{K} \theta_i = 1$

This is a perfect application for a Langrange multiplier. The langrange function becomes:

$L = \sum_{i=1}^{K} m_i \text{log}_i + \lambda (1 - \sum_{i=1}^K \theta_i)$

Taking the partial derivative with respect to $\theta_{k}$ and $\lambda$, and setting equations to 0 yields:

$\frac{\partial L}{\partial \theta_k} = \frac{m_k}{\theta_k} - \lambda = 0$

$\frac{\partial L}{\partial \lambda} = (1 - \sum_{i=1}^{K} \theta_i) = 0$

If we take the sum of our first result:

$\sum m_k = N = \lambda \sum_i \theta_i = \lambda$

Thus, for a particular $\theta_k$, we have:

$\frac{m_k}{\theta_k} - N = 0$

Solving for ${\theta_k}$ gives:

$\theta_k = \frac{m_k}{N}$

So the MLE esimate simply means calculating the fraction of training examples for a given class.

Putting it All Together

All we have to do now is, is get some training examples, and compute the fraction of times a particular feature occurs in each example for a given class. Also, we will compute the fraction of times a particular class occurs across all training examples.

Once we have these parameters estimated, we can extract features from a new example and calculate:

$\text{log} \, \vec{\theta}_k + \sum_{j=1}^{K} \text{log} \, \theta_{kj}$

for each class. We then take the class with the highest score

Zero Count Problem

Almost there, but we have one last issue to address. During training, we may encounter a feature that shows up in one class, but not in the other. If this occurrs, should we assign a zero probability? Well there are reasons we shouldn't. First, if we did, our log sum score would diverge to negative infinity. This means, is that for a MAP estimate, we would pretty much never pick this class. Next, and more importantly, there seems something strange about assigning a zero probablity estimate. Do you believe with 100% confidence that this feature should never appear under the given class? If the answer is no, and you wish to put this belief in, then you are a Bayesian as opposed to a frequentist!

To encode this belief into our model, we must introduce a prior. Let's work out the math for this:

$p(\theta | D) \propto p(\theta)p(D | \theta)$

Here, we are trying to calculate the posterior probability of our parameter. And from this, we will take a MAP estimate instead of MLE. The prior commonly used for a Bernoulli distribution is the beta distribution:

$p(\theta | a, b) = \frac{\theta^{a - 1}(1 - \theta)^{\beta - 1}}{\beta(a,b)}$

The parameters a and b are hyperparameters. We will manually choose these shortly (this is the controversial part of Bayesian statistics). We already know $p(D | \theta)$. We can plug these equations in:

$p(\theta | D) \propto \theta^{a - 1}(1 - \theta)^{\beta - 1} \theta^{n_1}(1 - \theta)^{n_2}$

I purposely did not carry over the normalization constant $\beta(a,b)$. We will reintroduce this at the end to form a valid probability distribution. Combining things:

$p(\theta | D) \propto \theta^{n_1 + a - 1}(1 - \theta)^{n_2 + \beta - 1}$

Normalization yields (you can just 'look' at the beta distribution, and figure out what numbers to plug into the normalization beta function constant)

$p(\theta | D) = \frac{\theta^{n_1 + a - 1}(1 - \theta)^{n_2 + \beta - 1}}{\beta(a + n_1, b + n_2)}$

Now, we take the MAP of this. Which is taking the mode. The mode of this beta distribution is:

$\hat{\theta}_{\text{map}} = \frac{\alpha + n_1 - 1}{\alpha + \beta + n_1 + n_2 - 2}$

Now, $n_1 + n_2 = N$, that is, the number of documents for a given class. Now, if we strategically pick $a = b = 2$ we get the following:

$\hat{\theta}_{\text{map}} = \frac{n_1 + 1}{N + 2}$

Thus, for each feature, we calculate the fraction of documents that are of the class (i.e., N). Then, we calculate the fraction of documents in the given class that contain the feature (i.e., n). Note that if N=0 we get an estimate of 0.5. Thus, before we observe any data, we estimate that each feature has a 50% chance of occurring.

Training Data

The link contains data we can learn from. It is from the LingSpam dataset and contains documents labeled as 'spam' or 'nonspam'. The code in the next section implements the Bernoulli naive Bayes classifier. We train on a subset of the data and then test on data we did not train on. We then calculate the fraction of cases we get wrong. This is called the error rate. There are of course, fancier techniques. But this simple measure will suffice to get the point across.

Finally, the Code

We now construct the class for a Bernoulli naive Bayes classifier. This code is specific to classifying textual documents, but as long as your features can be converted to binary, it can be whatever you like!

In [1]:
"""Builds a Bernoulli naive Bayes classifier

from math import log
import glob
from collections import Counter

def get_features(text):
    """Extracts features from text

        text (str): A blob of unstructured text
    return set([w.lower() for w in text.split(" ")])

class BernoulliNBTextClassifier(object):

    def __init__(self):
        self._log_priors = None
        self._cond_probs = None
        self.features = None

    def train(self, documents, labels):
        """Train a Bernoulli naive Bayes classifier

            documents (list): Each element in this list
                is a blog of text
            labels (list): The ground truth label for
                each document

        """Compute log( P(Y) )
        label_counts = Counter(labels)
        N = float(sum(label_counts.values()))
        self._log_priors = {k: log(v/N) for k, v in label_counts.iteritems()}

        """Feature extraction
        # Extract features from each document
        X = [set(get_features(d)) for d in documents]

        # Get all features
        self.features = set([f for features in X for f in features])

        """Compute log( P(X|Y) )

           Use Laplace smoothing
           n1 + 1 / (n1 + n2 + 2)
        self._cond_probs = {l: {f: 0. for f in self.features} for l in self._log_priors}

        # Step through each document
        for x, l in zip(X, labels):
            for f in x:
                self._cond_probs[l][f] += 1.

        # Now, compute log probs
        for l in self._cond_probs:
            N = label_counts[l]
            self._cond_probs[l] = {f: (v + 1.) / (N + 2.) for f, v in self._cond_probs[l].iteritems()}

    def predict(self, text):
        """Make a prediction from text

        # Extract features
        x = get_features(text)

        pred_class = None
        max_ = float("-inf")

        # Perform MAP estimation
        for l in self._log_priors:
            log_sum = self._log_priors[l]
            for f in self.features:
                prob = self._cond_probs[l][f]
                log_sum += log(prob if f in x else 1. - prob)
            if log_sum > max_:
                max_ = log_sum
                pred_class = l

        return pred_class

def get_labeled_data(type_):
    """Get data from:
        Create a folder named 'emails' with content from ex6DataEmails in
        same directory as this script
    examples = []
    labels = []

    file_names = glob.glob('./emails/spam-{0}/*.txt'.format(type_))
    for n in file_names:
        f = open(n)

    file_names = glob.glob('./emails/nonspam-{0}/*.txt'.format(type_))
    for n in file_names:
        f = open(n)

    return examples, labels
In [4]:
train_docs, train_labels = get_labeled_data('train')
test_docs, test_labels = get_labeled_data('test')

# Train model
print('Number of training examples: {0}'.format(len(train_labels)))
print('Number of test examples: {0}'.format(len(test_labels)))
print('Training model...')
nb = BernoulliNBTextClassifier()
nb.train(train_docs, train_labels)
print('Training complete!')
print('Number of features found: {0}'.format(len(nb.features)))

# Simple error test metric
print('Testing model...')
f = lambda doc, l: 1. if nb.predict(doc) != l else 0.
num_missed = sum([f(doc, l) for doc, l in zip(test_docs, test_labels)])

N = len(test_labels) * 1.
error_rate = round(100. * (num_missed / N), 3)

print('Error rate of {0}% ({1}/{2})'.format(error_rate, int(num_missed), int(N)))
Number of training examples: 700
Number of test examples: 260
Training model...
Training complete!
Number of features found: 19100
Testing model...
Error rate of 4.231% (11/260)


We went over all of the mathematics and code needed to build a simple naive Bayes classifier. There are several ways we can improve. First, performing some type of feature selection is appropiate. We might also entertain the idea of modeling our features differently. That is, not using binary features, but perhaps, maybe modeling the counts of words in a document. These things aside, I think these last two articles at least laid the foundation to build on.

Thanks for reading!