Collaborative Filtering
This article is a brief introduction to building and evaluating a simple recommender system. This is a rich topic with many models to choose from, and each model carries its own set of assumptions. Let's introduce one of the basic problems:
We are given a user-item matrix in which each user may or may not have rated each movie, and we'd like to predict the missing ratings. Why? If we could accurately predict each missing rating, we could recommend movies a user hasn't rated (by sorting and/or weighting the predicted ratings of their unrated items). Let's start with a toy example:
import pandas as pd
import numpy as np
R = np.array([[1., 1., 3., 4., 3.],
[1., 0., 3., 5., 4.],
[1., 0., 4., 4., 5.],
[5., 5., 2., 1., 0.],
[5., 3., 1., 0., 0.],
[4., 4., 0., 1., 0.],
[5., 1., 1., np.nan, np.nan]])
df_movie_ratings = pd.DataFrame(R, index=['Anna', 'Alice', 'Julie', 'Rob', 'Matt', 'Mike', 'Andy'],
                                columns=['Indiana Jones', 'Jurassic Park', 'Love Actually', 'You\'ve Got Mail', 'The Notebook'])
df_movie_ratings
Here, we have 7 users and ratings for 5 movies, but for Andy we are missing two ratings. If we HAD to pick one of the two movies to recommend (You've Got Mail vs. The Notebook), which one should it be? To decide, we want to predict Andy's rating for "You've Got Mail". Here's a first idea: find the users that have rated "You've Got Mail" and take the mean of their ratings.
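As a quick sketch of this naive baseline on the toy DataFrame above (pandas skips NaNs when averaging, so this is a one-liner):
# Naive prediction: mean rating of "You've Got Mail" over the users who rated it
df_movie_ratings["You've Got Mail"].mean()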
BUT, we can make our prediction a little smarter. Instead, we can take a weighted average, where the weights are determined by how similar each user is to our target user (Andy in this case). And we'll only do this for the top K most similar users (even with this approach, the top K could still include users that aren't really similar). Let $r_{ij}$ denote the rating of the $i$th user on the $j$th item, and let $\hat{r}_{ij}$ denote the predicted rating. Let $f_s(i,k)$ be the similarity between user $i$ and user $k$, and let $N(i)$ represent the set of $K$ users that are most similar to user $i$. Then, mathematically, we estimate the missing rating by:
$\hat{r}_{ij} = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot r_{kj}$
We are free to choose our similarity metric $f_s$; one example is the Pearson correlation. One thing to watch out for: what if our target user tends to be a conservative rater while their neighbors rate liberally? Imagine one user's ratings clustered around 2.5 while another's cluster around 3, even though their tastes agree. To make the comparison fair, we should adjust the ratings for each user such that they all have the same average rating (say 0). Thus, we adjust our calculation:
$\hat{r}_{ij} - \mu_i = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot (r_{kj} - \mu_k)$
or
$\hat{r}_{ij} = \mu_i + \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot (r_{kj} - \mu_k)$
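Before scaling up, here is a minimal sketch of this mean-centered formula on the toy matrix above. The choice of k=3 and of pandas' pairwise Pearson correlation are mine for illustration, not requirements of the method:
# User-user similarities (pairwise Pearson over commonly rated movies)
sim = df_movie_ratings.T.corr(method='pearson')
# Per-user mean rating (NaNs are skipped)
mu = df_movie_ratings.mean(axis=1)
target, item, k = 'Andy', "You've Got Mail", 3
# Users who actually rated the item (Andy is excluded since his rating is NaN)
raters = df_movie_ratings[item].dropna().index
# Top-k most similar raters; their similarities act as the weights
neighbors = sim.loc[target, raters].nlargest(k)
w = neighbors.values
# Mean-centered neighbor ratings for the item
r = df_movie_ratings.loc[neighbors.index, item] - mu[neighbors.index]
# Predicted rating: target's mean plus the weighted average deviation
r_hat = mu[target] + r.values.dot(w) / w.sum()
print(r_hat)  # a low predicted rating, consistent with Andy's most similar users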
There are some additional points that have to be considered:
- Do we just take the K nearest neighbors? Always taking the top K doesn't guarantee those K neighbors are actually similar.
- If we instead require a minimum similarity, what cutoff should we use? If we are too stringent here, we won't be able to leverage any collaborative filtering.
- How many items should a neighbor have in common with the target user? We need at least two common items to calculate a Pearson correlation, and the estimate becomes more reliable as the number of common items grows.
Test Data
GroupLens Research is a research lab at the University of Minnesota. They specialize in recommender systems and provide several datasets for us to practice with; they also developed a Coursera specialization (https://www.coursera.org/specializations/recommender-systems). One of their projects is a web application named "MovieLens", where people rate movies and receive recommendations. What's great for us is that this gives us real data to play with (https://grouplens.org/datasets/movielens/). Let's take the small MovieLens dataset: roughly 100,000 ratings by 700 users for 9,000 movies.
df_movies = pd.read_csv('../ml-latest-small/movies.csv')
df_movies.head()
df_ratings = pd.read_csv('../ml-latest-small/ratings.csv')
df_ratings.head()
df_links = pd.read_csv('../ml-latest-small/links.csv')
df_links.head()
To get this into a user-item matrix, I borrowed the pivot idea from https://beckernick.github.io/matrix-factorization-recommender/. Let's do this:
R_df = df_ratings.pivot(index='userId', columns='movieId', values='rating')
print(R_df.shape)
R_df.head()
So this is 671 users and 9066 movies. How many ratings?
# How many non-nan entries?
np.count_nonzero(~np.isnan(R_df))
We have over 100,000 ratings. Nice. Next, we want to divide up our data into a training/test set.
Training / Test Data
For this article, we won't use cross-validation; instead, we'll simply randomly divide our known ratings into 80% for training and 20% for testing. We'll need a couple of functions to do this.
def get_non_nan_cell_locations(df):
# Index positions of cells
cells = list(zip(*np.where(df.notnull())))
# Row/Column labels of cells
v = []
for i,j in cells:
row_label = df.index[i]
column_label = df.columns.values[j]
v.append((row_label, column_label))
return v
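As a quick check, running this on the toy matrix should return its 33 rated cells (7 users times 5 movies, minus Andy's 2 missing ratings):
toy_cells = get_non_nan_cell_locations(df_movie_ratings)
len(toy_cells)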
# For reproducibility
np.random.seed(42)
# List of (i,j) where i,j is the
# row and column label
cells = get_non_nan_cell_locations(R_df)
# Make a random shuffle
np.random.shuffle(cells)
# Take first 20% locations. DO NOT use for training.
cell_test_locations = cells[:int(0.2 * len(cells))]
# Get the actual (held-out) ratings for each of these cells
y_test = np.array([R_df.loc[i, j] for i, j in cell_test_locations])
# Create a training set
# such that each test location
# is set to np.nan
R_df_train = R_df.copy()
for c in cell_test_locations:
R_df_train.loc[c] = np.nan
Let's look at our training dataframe:
R_df_train.head()
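A quick sanity check: the held-out cells should now be NaN in the training matrix, so the total rating count should drop by exactly the size of the test set:
# Every test cell was blanked out, nothing more, nothing less
n_train = np.count_nonzero(~np.isnan(R_df_train))
n_total = np.count_nonzero(~np.isnan(R_df))
assert n_total - n_train == len(cell_test_locations)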
Performance Metric
A performance metric is really important: it captures what we are trying to optimize for. For example, if you're recommending products, you'd probably want your metric to reward items that increase profits as well as items the user would actually want (among many other considerations). For this article, we use RMSE (root mean squared error) as a simple evaluation metric:
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2}$
def get_rmse(y_hat, y):
    # Root mean squared error between predictions and actual ratings
    return np.sqrt(np.mean(np.power(y - y_hat, 2)))
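A tiny sanity check: with errors of 0 and 2 we expect $\sqrt{(0 + 4)/2} = \sqrt{2} \approx 1.414$:
get_rmse(np.array([1., 2.]), np.array([1., 4.]))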
Now, let's try a simple neighbor model using Pearson similarity. First, we'll compute the similarity matrix across all users.
R_sim = R_df_train.T.corr(method='pearson')
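One optional tweak, relevant to the "items in common" point above: pandas' corr accepts a min_periods argument, so we could refuse to trust correlations computed from too few co-rated movies (the threshold of 5 below is arbitrary; pairs with fewer overlaps simply become NaN). We'll stick with the plain similarity matrix for the experiments below.
# Stricter variant: require at least 5 co-rated movies per user pair
R_sim_strict = R_df_train.T.corr(method='pearson', min_periods=5)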
Next, we'll compute the mean rating of each user. This will be useful in demonstrating how mean centering can provide benefits.
R_mean = R_df_train.mean(axis=1)
R_sim.head()
def get_pred_using_pearson(R, cell_indices, k=5, p=0.4, R_sim=None, R_mean=None, use_mean_centering=True):
R_ = R.copy()
# Calculate similarity matrix
# and rely on cython for this
if R_sim is None:
R_sim = R.T.corr()
    if R_mean is None:
        R_mean = R.mean(axis=1)
if use_mean_centering:
R_ = R_.sub(R_mean, axis=0)
pred_cell_values = np.empty((len(cell_indices),))
    # Map each item to the users that have rated it
cols = set([j for i,j in cell_indices])
users_with_j_rating = {j: R[j][R[j].notnull()].index for j in cols}
# For each cell we wish to predict
for i, (target_user, j) in enumerate(cell_indices):
        # Find users that have a rating
        # for this item "j". Note our
        # target user won't be included
        # in this series (their rating is NaN in training)
users = users_with_j_rating[j]
# Similarities for target_user to all other users
target_user_sim = R_sim.loc[target_user]
# Reduce this similarity series by users
# that have rated this item "j"
target_user_sim = target_user_sim.loc[users]
        # Now, find the K nearest neighbors; their similarities define the weights
nn_series = target_user_sim.nlargest(k)
        # Ensure the similarity threshold is satisfied
nn_series = nn_series[nn_series > p]
# Weight vector
w = nn_series.values
# Can't locate neighbors, predict using mean
if len(w) == 0:
pred_cell_values[i] = R_mean.loc[target_user]
else:
# Predicted values from users
            r_vals = np.array([R_.loc[u, j] for u in nn_series.index])
pred_val = r_vals.dot(w) / w.sum()
if use_mean_centering:
pred_cell_values[i] = pred_val + R_mean.loc[target_user]
else:
pred_cell_values[i] = pred_val
return pred_cell_values
# Take 10 closest neighbors
k=10
# Ensure there is at least a similarity of 0.6
p=0.6
y_hat_pearson_no_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=False)
y_hat_pearson_no_centering
# Calculate the error
pearson_no_centering_metric = get_rmse(y_hat_pearson_no_centering, y_test)
pearson_no_centering_metric
# Take 10 closest neighbors
k=10
# Ensure there is a similarity of at least 0.6
p=0.6
y_hat_pearson_with_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=True)
y_hat_pearson_with_centering
# Calculate the error
pearson_with_centering_metric = get_rmse(y_hat_pearson_with_centering, y_test)
pearson_with_centering_metric
OK, the results suggest that mean centering helps. Remember, these are point estimates (no cross-validation), but they suggest we are moving in the right direction.
To make sure we are not fooling ourselves, let's create two other models: one predicts ratings at random, and the other just uses the user's mean rating as the prediction.
Random Classifier
def get_missing_values_random(R_df_train, cell_test_locations):
return np.random.randint(low=1, high=6, size=(len(cell_test_locations),))
y_hat_random = get_missing_values_random(R_df_train, cell_test_locations)
random_metric = get_rmse(y_hat_random, y_test)
random_metric
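A small caveat: MovieLens ratings run from 0.5 to 5.0 in half-star steps, so a random baseline that samples from that grid (sketched below with a hypothetical helper) would arguably be fairer than drawing integers 1 through 5:
# Sample uniformly from the actual half-star rating grid
def get_missing_values_random_half_stars(cell_test_locations):
    return np.random.choice(np.arange(0.5, 5.5, 0.5), size=len(cell_test_locations))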
Mean Classifier
def get_missing_values_mean(R_mean, cell_test_locations):
return np.array([R_mean.loc[i] for i,j in cell_test_locations])
y_hat_mean = get_missing_values_mean(R_mean, cell_test_locations)
mean_metric = get_rmse(y_hat_mean, y_test)
mean_metric
Of the techniques tried so far, simply taking the user's mean rating provides the best result.