Collaborative Filtering
This article is a brief introduction to building and evaluating a simple recommender system. This is a rich topic with many models to choose from, and each model carries its own set of assumptions. Let's introduce one of the basic problems:
We are given a user-item matrix in which each user may or may not have rated each movie, and we'd like to predict the missing ratings. Why? If we could accurately predict each missing rating, we could recommend movies a user hasn't rated (by sorting and/or weighting the predicted ratings of their unrated items). Let's start with a toy example:
import pandas as pd
import numpy as np
R = np.array([[1., 1., 3., 4., 3.],
[1., 0., 3., 5., 4.],
[1., 0., 4., 4., 5.],
[5., 5., 2., 1., 0.],
[5., 3., 1., 0., 0.],
[4., 4., 0., 1., 0.],
[5., 1., 1., np.nan, np.nan]])
df_movie_ratings = pd.DataFrame(R, index=['Anna', 'Alice', 'Julie', 'Rob', 'Matt', 'Mike', 'Andy'],
                                columns=['Indiana Jones', 'Jurassic Park', 'Love Actually', 'You\'ve Got Mail', 'The Notebook'])
df_movie_ratings
Here, we have 7 users and ratings for 5 movies, but for Andy we are missing two ratings. If we HAD to pick one of the two movies to recommend (You've Got Mail vs. The Notebook), which one should it be? To decide, we want to predict Andy's rating for "You've Got Mail". Here's a first idea: find the users that have rated "You've Got Mail" and take the mean of their ratings.
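As a quick sketch of this naive baseline on the toy DataFrame above (pandas skips NaNs when averaging, so this is a one-liner):
# Naive prediction: mean rating of "You've Got Mail" over the users who rated it
df_movie_ratings["You've Got Mail"].mean()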
BUT, we can make our prediction a little smarter. Instead, we can take a weighted average, where the weights are determined by how similar each user is to our target user (Andy in this case). And we'll only do this for the top K most similar users (even with this approach, the top K could still include users that aren't really similar). Let $r_{ij}$ denote the rating of the $i$th user on the $j$th item, and let $\hat{r}_{ij}$ denote the predicted rating. Let $f_s(i,k)$ be the similarity between user $i$ and user $k$, and let $N(i)$ represent the set of $K$ users that are most similar to user $i$. Then, mathematically, we estimate the missing rating by:
$\hat{r}_{ij} = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot r_{kj}$
We are free to choose our similarity metric $f_s$; one example is the Pearson correlation. One thing to watch out for: what if our target user tends to be a conservative rater while their neighbors rate liberally? Imagine one user's ratings clustered around 2.5 while another's cluster around 3, even though their tastes agree. To make the comparison fair, we should adjust the ratings for each user such that they all have the same average rating (say 0). Thus, we adjust our calculation:
$\hat{r}_{ij} - \mu_i = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot (r_{kj} - \mu_k)$
or
$\hat{r}_{ij} = \mu_i + \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) \cdot (r_{kj} - \mu_k)$
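Before scaling up, here is a minimal sketch of this mean-centered formula on the toy matrix above. The choice of k=3 and of pandas' pairwise Pearson correlation are mine for illustration, not requirements of the method:
# User-user similarities (pairwise Pearson over commonly rated movies)
sim = df_movie_ratings.T.corr(method='pearson')
# Per-user mean rating (NaNs are skipped)
mu = df_movie_ratings.mean(axis=1)
target, item, k = 'Andy', "You've Got Mail", 3
# Users who actually rated the item (Andy is excluded since his rating is NaN)
raters = df_movie_ratings[item].dropna().index
# Top-k most similar raters; their similarities act as the weights
neighbors = sim.loc[target, raters].nlargest(k)
w = neighbors.values
# Mean-centered neighbor ratings for the item
r = df_movie_ratings.loc[neighbors.index, item] - mu[neighbors.index]
# Predicted rating: target's mean plus the weighted average deviation
r_hat = mu[target] + r.values.dot(w) / w.sum()
print(r_hat)  # a low predicted rating, consistent with Andy's most similar users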
There are some additional points that have to be considered:
- Do we just take the K nearest neighbors? Always taking the top K doesn't guarantee those K neighbors are actually similar.
- If we instead require a minimum similarity, what cutoff should we use? If we are too stringent here, we won't be able to leverage any collaborative filtering.
- How many items should a neighbor have in common with the target user? We need at least two common items to calculate a Pearson correlation, and the estimate becomes more reliable as the number of common items grows.
Test Data
GroupLens Research is a research lab at the University of Minnesota. They specialize in recommender systems and provide several datasets for us to practice with; they also developed a Coursera specialization (https://www.coursera.org/specializations/recommender-systems). One of their projects is a web application named "MovieLens", where people rate movies and receive recommendations. What's great for us is that this gives us real data to play with (https://grouplens.org/datasets/movielens/). Let's take the small MovieLens dataset: roughly 100,000 ratings by 700 users for 9,000 movies.
df_movies = pd.read_csv('../ml-latest-small/movies.csv')
df_movies.head()
df_ratings = pd.read_csv('../ml-latest-small/ratings.csv')
df_ratings.head()
df_links = pd.read_csv('../ml-latest-small/links.csv')
df_links.head()
To get this into a user-item matrix, I borrowed the pivot idea from https://beckernick.github.io/matrix-factorization-recommender/. Let's do this:
R_df = df_ratings.pivot(index='userId', columns='movieId', values='rating')
print(R_df.shape)
R_df.head()
So this is 671 users and 9066 movies. How many ratings?
# How many non-nan entries?
np.count_nonzero(~np.isnan(R_df))
We have over 100,000 ratings. Nice. Next, we want to divide up our data into a training/test set.
Training / Test Data
For this article, we won't use cross-validation; instead, we'll simply randomly divide our known ratings into 80% for training and 20% for testing. We'll need a couple of functions to do this.
def get_non_nan_cell_locations(df):
# Index positions of cells
cells = list(zip(*np.where(df.notnull())))
# Row/Column labels of cells
v = []
for i,j in cells:
row_label = df.index[i]
column_label = df.columns.values[j]
v.append((row_label, column_label))
return v
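As a quick check, running this on the toy matrix should return its 33 rated cells (7 users times 5 movies, minus Andy's 2 missing ratings):
toy_cells = get_non_nan_cell_locations(df_movie_ratings)
len(toy_cells)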
# For reproducibility
np.random.seed(42)
# List of (i,j) where i,j is the
# row and column label
cells = get_non_nan_cell_locations(R_df)
# Make a random shuffle
np.random.shuffle(cells)
# Take first 20% locations. DO NOT use for training.
cell_test_locations = cells[:int(0.2 * len(cells))]
# Get the actual (held-out) ratings for each of these cells
y_test = np.array([R_df.loc[i, j] for i, j in cell_test_locations])
# Create a training set
# such that each test location
# is set to np.nan
R_df_train = R_df.copy()
for c in cell_test_locations:
R_df_train.loc[c] = np.nan
Let's look at our training dataframe:
R_df_train.head()
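A quick sanity check: the held-out cells should now be NaN in the training matrix, so the total rating count should drop by exactly the size of the test set:
# Every test cell was blanked out, nothing more, nothing less
n_train = np.count_nonzero(~np.isnan(R_df_train))
n_total = np.count_nonzero(~np.isnan(R_df))
assert n_total - n_train == len(cell_test_locations)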
Performance Metric
A performance metric is really important: it captures what we are trying to optimize for. For example, if you're recommending products, you'd probably want your metric to reward items that increase profits as well as items the user would actually want (among many other considerations). For this article, we use RMSE (root mean squared error) as a simple evaluation metric:
$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^N (\hat{y}_i - y_i)^2}$
def get_rmse(y_hat, y):
    # Root mean squared error between predictions and actual ratings
    return np.sqrt(np.mean(np.power(y - y_hat, 2)))
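A tiny sanity check: with errors of 0 and 2 we expect $\sqrt{(0 + 4)/2} = \sqrt{2} \approx 1.414$:
get_rmse(np.array([1., 2.]), np.array([1., 4.]))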
Now, let's try a simple neighbor model using Pearson similarity. First, we'll compute the similarity matrix across all users.
R_sim = R_df_train.T.corr(method='pearson')
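One optional tweak, relevant to the "items in common" point above: pandas' corr accepts a min_periods argument, so we could refuse to trust correlations computed from too few co-rated movies (the threshold of 5 below is arbitrary; pairs with fewer overlaps simply become NaN). We'll stick with the plain similarity matrix for the experiments below.
# Stricter variant: require at least 5 co-rated movies per user pair
R_sim_strict = R_df_train.T.corr(method='pearson', min_periods=5)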
Next, we'll compute the mean rating of each user. This will be useful in demonstrating how mean centering can provide benefits.
R_mean = R_df_train.mean(axis=1)
R_sim.head()
def get_pred_using_pearson(R, cell_indices, k=5, p=0.4, R_sim=None, R_mean=None, use_mean_centering=True):
R_ = R.copy()
# Calculate similarity matrix
# and rely on cython for this
if R_sim is None:
R_sim = R.T.corr()
    if R_mean is None:
        R_mean = R.mean(axis=1)
if use_mean_centering:
R_ = R_.sub(R_mean, axis=0)
pred_cell_values = np.empty((len(cell_indices),))
    # Map each item to the users that have rated it
cols = set([j for i,j in cell_indices])
users_with_j_rating = {j: R[j][R[j].notnull()].index for j in cols}
# For each cell we wish to predict
for i, (target_user, j) in enumerate(cell_indices):
        # Find users that have a rating
        # for this item "j". Note our
        # target user won't be included
        # in this series (their rating is NaN in training)
users = users_with_j_rating[j]
# Similarities for target_user to all other users
target_user_sim = R_sim.loc[target_user]
# Reduce this similarity series by users
# that have rated this item "j"
target_user_sim = target_user_sim.loc[users]
        # Now, find the K nearest neighbors; their similarities define the weights
nn_series = target_user_sim.nlargest(k)
        # Ensure the similarity threshold is satisfied
nn_series = nn_series[nn_series > p]
# Weight vector
w = nn_series.values
# Can't locate neighbors, predict using mean
if len(w) == 0:
pred_cell_values[i] = R_mean.loc[target_user]
else:
# Predicted values from users
            r_vals = np.array([R_.loc[u, j] for u in nn_series.index])
pred_val = r_vals.dot(w) / w.sum()
if use_mean_centering:
pred_cell_values[i] = pred_val + R_mean.loc[target_user]
else:
pred_cell_values[i] = pred_val
return pred_cell_values
# Take 10 closest neighbors
k=10
# Ensure there is at least a similarity of 0.6
p=0.6
y_hat_pearson_no_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=False)
y_hat_pearson_no_centering
# Calculate the error
pearson_no_centering_metric = get_rmse(y_hat_pearson_no_centering, y_test)
pearson_no_centering_metric
# Take 10 closest neighbors
k=10
# Ensure there is a similarity of at least 0.6
p=0.6
y_hat_pearson_with_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=True)
y_hat_pearson_with_centering
# Calculate the error
pearson_with_centering_metric = get_rmse(y_hat_pearson_with_centering, y_test)
pearson_with_centering_metric
OK, the results suggest that mean centering helps. Remember, these are point estimates (no cross-validation), but they suggest we are moving in the right direction.
To make sure we are not fooling ourselves, let's create two other models: one predicts ratings at random, and the other just uses the user's mean rating as the prediction.
Random Classifier
def get_missing_values_random(R_df_train, cell_test_locations):
return np.random.randint(low=1, high=6, size=(len(cell_test_locations),))
y_hat_random = get_missing_values_random(R_df_train, cell_test_locations)
random_metric = get_rmse(y_hat_random, y_test)
random_metric
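A small caveat: MovieLens ratings run from 0.5 to 5.0 in half-star steps, so a random baseline that samples from that grid (sketched below with a hypothetical helper) would arguably be fairer than drawing integers 1 through 5:
# Sample uniformly from the actual half-star rating grid
def get_missing_values_random_half_stars(cell_test_locations):
    return np.random.choice(np.arange(0.5, 5.5, 0.5), size=len(cell_test_locations))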
Mean Classifier
def get_missing_values_mean(R_mean, cell_test_locations):
return np.array([R_mean.loc[i] for i,j in cell_test_locations])
y_hat_mean = get_missing_values_mean(R_mean, cell_test_locations)
mean_metric = get_rmse(y_hat_mean, y_test)
mean_metric
Of the techniques tried so far, simply taking the user's mean rating provides the best result.