Multinomial Naive Bayes Classifier

by Matt Johnson - Sun 26 June 2016
Tags: #machine learning #NLP

From this article we introduced the naive Bayes classifier. There are many types of naive Bayes classifiers--depending on how we model $p(\vec{x}|Y)$. In this article, we model $p(\vec{x}|Y)$ as a multinomial distribution. This gives rise to the multinomial naive Bayes classifier.

We wish to calculate $p(Y|\vec{x})$, where $Y$ is a discrete random variable that encodes the different possible classes for a given feature vector $\vec{x}$. By Bayes' rule:

$p(Y|\vec{x}) \propto p(Y)p(\vec{x}|Y)$

$p(Y)$ can be modeled as a categorical distribution, and in this article, we showed how this is done.

Here, we model $\vec{x}$ as a vector with $P$ components. Where each component is the count of a particular feature for a given sample. Thus, each component is a random variable $X = \{0,1,2,..,M_i\}$, where $M_i$ is total number of draws (not features) for a given sample (e.g., the number of words in a document). Let $f_j$ be the number of times the jth feature occurs for a given sample.

Suppose, for a given document, we perform $M_i$ draws from our bag of features with replacement and each draw is independent. What is the probability we'd observe a given count vector $\vec{x}$? Well, the probability of observing the $j$th feature on the first draw is a categorical distribution (for discrete data):

$\text{Cat}(x_j | \, \vec{\theta}) = \Pi_i^{P} \theta_{i=1}^{[i=j]}$

This distribution applies for the next draw as well. Thus, generally, we get:

$\Pi_{j=1}^{P} \theta_{j}^{f_j}$

However, $\vec{x}$ correponds to many different ways these draws features can be arranged. We need to count the total number of ways we can observe this outcome. To account for all of these ways, we multiply by the following:

$\frac{M_i!}{f_1! f_2! ... f_P!}$

Putting this together, we arrive at the multinomial distribution

$\text{Mult}(\vec{x} \, | \vec{\theta}) = p(\vec{x} \, | \vec{\theta}) = \frac{M_i!}{ \Pi_{j=1}^{P} f_j!} \Pi_{j=1}^{P} \theta_{j}^{f_j}$

This is a likelihood function of $\vec{\theta}$ and as probably guessed, we can use MLE to estimate it.

Calculating Likelihood Function for $\vec{\theta}$

Suppose, for a given class, we have the following data:

$D = \{ \vec{x}^{(1)}, \vec{x}^{(2)}, ... \vec{x}^{(N)} \}$.

The probability of observing these data are:

$p(D|\vec{\theta}) = \Pi_{i=1}^{N} \Pi_{j=1}^{P} \frac{M_i!}{f_{i1}! ... f_{ip}!} \theta_{j}^{f_{ij}}$

Taking the log, we obtain the log likelihood function:

$\text{log} \, (p) = \sum_{i=1}^{N} \sum_{j=1}^{P} f_{ij} \, \text{log}(\theta_j) + \text{log}(M_i) - \text{log}(f_{i1}!...f_{ip}!)$

subject to the constraint

$\sum_{i=1}^{P} \theta_i = 1$

Langrange Multiplier

We can use a Langrange multiplier to solve this problem. We ignore terms that do not depend on $\vec{\theta}$ or $\lambda$ since when we take the partial derivative with respect to either, they will disappear:

$L = \sum_{i=1}^{N} \sum_{j=1}^{P} f_{ij}\text{log}(\theta_{j}) + \lambda (1 - \sum_{j=1}^{P} \theta_{j})$

Taking the partial derivative with respect to $\theta_{k}$ yields:

$\frac{\partial L}{\partial \theta_k} = \sum_{i=1}^{N} \frac{f_{ik}}{\theta_k} - \lambda = 0$


$\sum_i^{N} f_{ik} = \theta_k \lambda$

To solve for $\lambda$, we can sum both sides:

$\sum_j^{P} \sum_i^{N} f_{ij} = \lambda \sum_j^{P} \theta_j$


$ \lambda = \sum_j^{P} \sum_i^{N} f_{ij}$

Putting this back gives:

$\theta_k = \frac{\sum_i^{N} f_{ik}}{\sum_j^{P} \sum_i^{N} f_{ij} }$

If we denote $M_{k}$ as the total number of draws for a given class, and $M_{kj}$ as the total number of draws for a given class and feature $j$. We can rewrite:

$\theta_k = \frac{M_{kj}}{M_{k}}$

Putting it All Together

Now what we have an esimate for $\theta_k$, we can put our equations together:

$\text{argmax}_k \, p(y_k) \frac{M!}{ \Pi_{j=1}^{P} f_j!} \Pi_{j=1}^{P} \theta_{j}^{f_j}$

If we are only interested in the mode of the posterior, the terms invariant across different classes don't matter. Thus:

$\text{argmax}_k \, p(y_k) \Pi_{j=1}^{P} \theta_{j}^{f_j}$

We take the log to prevent numerical underflow:

$\text{argmax}_k \, \text{log}(\theta_k) \, + \sum_j f_j \, \text{log}(\theta_j) $

for each class. We then take the class with the highest score