Classification is a common task in machine learning. The goal is to get a computer to classify something automatically. Examples include: classifying an image of a tumor as cancerous or non-cancerous, or a bank transaction as compromised or not compromised.

In this article, we develop the mathematics to implement a particular type of classifier known as a **naive Bayes classifier**. This is a **supervised** algorithm, thus, it needs training data to work.

Let's begin.

# Overview

Naive Bayes is a probabilistic model commonly used in classification problems. The algorithm is **supervised**. The *naive* part of naives bayes comes from one of its hallmark assumptions:

*Features are assumed to be conditionally independent*

Note, the assumption is conditionally independent and *not* independent. The distinction is subtle, but important to clarify. It means:

$P(X_1, \, X_2 |Y) = P(X_1 \, | \, Y)P(X_2 \, | \, Y)$.

As a concrete example, suppose we have the following training data:

```
```

**Is_Hot,Color,Beverage**

0,White,Coffee

0,White,Coffee

1,Black,Coffee

1,Black,Coffee

0,Green,Soda

1,Black,BlackTea

1,Black,BlackTea

Let $Y$ be the beverage (e.g., coffee, soda, black tea). Let $X_1$ be if the beverage is hot, and $X_2$ be the color of the beverage. From these data we calculate:

$P(X_1 = \text{hot}, X_2 = \text{Black} \, | Y = \text{coffee}) = 0.5$

Now, let's calculate *using* conditional independence. We get:

$
P( X_1 = \text{hot}, \, X_2 = \text{Black} | \, Y = \text{coffee}) = $

$P( X_1 = \text{hot} \, | \, Y = \text{coffee}) P(X_2 = \text{Black} \, | \, Y = \text{coffee}) = $

$0.5 \times 0.5 = 0.25$

Notice how the answers disagree. Thus, from these data, our conditional independence is not justified. And if you think about it, if you know the beverage is coffee and you are told it is hot, doesn't is seem reasonable that your belief it is black go up? (especially if most cold coffees are served white--say with creamer) But this tradeoff avoids a nasty problem: that the number of parameters we must estimate scales exponentially with the number of features!

# Mathematical Setup¶

Let's construct a column vector $\vec{x}$ that represents $P$ features. An example of this column vector might be:

$\vec{x} = (X_1 \, X_2 \, X_3 ... X_P)^T$

In this vector, each $X$ is a random variable.

We let the random variable $Y$ represent the class. It can be one of $K$ discrete states. That is, $Y = \{y_1, y_2, ..., y_k\}$. Now, suppose we have $N$ training examples:

$D = \{(\vec{x_1}, y_1), (\vec{x_2}, y_2), ..., (\vec{x_N}, y_N) \}$

From these data, we wish to calculate:

$p(Y \,| \, \vec{x})$

If we try all possible $Y$ states, we can pick the one that maximizes the probability, this is known as maximum a posterior or "MAP" for short. Our belief moves from a prior belief to a posterior belief:

$p(Y) \rightarrow p(Y|\vec{x})$

# Calculating $p(Y|\vec{x})$¶

By the definition of a conditional probability, we write:

$p(Y|\vec{x}) = \frac{p(\vec{x},Y)}{p(\vec{x})}$

We need to fully calculate the *right hand side* of this equation. If you think about it, calculating $p(\vec{x})$ is not necessary if we are only interested in the mode of $p(Y|\vec{x})$. But, by not modeling $p(\vec{x})$, we lose the ability to calculate the probability distribution of $p(Y|\vec{x})$. We now get:

$p(Y|\vec{x}) \propto p(\vec{x},Y)$

## Chain Rule

This leaves us to calculate $p(\vec{x},Y)$. We can rewrite this using the chain rule:

$p(X_1, X_2, ..., X_P,Y) = p(X_1 | X_2...X_P, Y) p(X_2 | X_3...X_P, Y) p(X_P|Y)p(Y)$

(recall $\vec{x}$ is a vector that holds a bunch of random variables. Above, we have expanded it as: $\vec{x} = X_1,X_2,...X_P$)

The chain rule is exact. The problem, however, is the number of parameters we must estimate. If we assume $X$ is binary, the first term in the product requires $2^{P-1}K$ parameters. The next term requires $2^{P-2}K$ parameters. In all, we have:

$K 2^{P} - 1$

parameters to estimate. That scales exponentially with the number of features! Now comes the *naive* part. We assume features are conditionally independent. This assumption allows the following factorization:

$p(X_1 | X_2...X_P, Y) = p(X_1 \, | \, Y)$

Thus,

$p(X_1, X_2, ..., X_P, Y) = p(X_1|Y)p(X_2|Y) p(X_p|Y)p(Y)$

or more compactly:

$p(X,Y) = p(Y) \; \Pi_{j=1}^{P} \; p(X_j \, | \, Y)$

Thus,

$p(Y \, | \, \vec{x}) \propto p(Y) \; \Pi_{j = 1}^{p} \; p(X_j \, | \, Y)$

From this, we try all possible values of $Y$ and take the one that scores the highest. For a tie, we can randomly pick from the tied outcomes. How many parameters do we now have to esimate? Well, there are $K$ classes, and for each class, we have $P$ features. Thus in total we have $K-1 + KP$ parameters. Assuming we had 4 classes and 25 features, we went from 134,217,727 to only 103 parameters to esimate!

# Numerical Detail

A numerical detail, is that we may be victim to numerical underflow when $p$ is large:

$p(Y \, | \, \vec{x}) \propto p(Y) \; \Pi_{j = 1}^{p} \; p(X_j \, | \, Y)$

As we keep mutiplying terms (remember, terms are less than or equal to 1), the running number gets smaller and smaller and smaller.... A trick is to take the log:

$\text{log}\,p(Y) + \sum_{j = 1}^{p} \; \text{log} \, p(X_j \, | \, Y)$

In fact, we don't have to transform it back since we are only interested in the mode. The log preserves order since it is a monotonic function, that is, $\text{log}(x) \leq \text{log}(y)$ for $x \leq y$.

# Naive Bayes Models

Now, we have to model $p(Y)$ and each $p(X_j | Y)$ term. The choice of how model these probabilities leads to the different implementations of naive Bayes classifiers.

# Bernoulli Naive Bayes Classifier

One choice, is to let $X = \{0,1\}$, with this choice, $p(X|Y)$ should be modeled as a Bernoulli distribution. Read the article here learn how to build a Bernoulli naive Bayes classifier.