Collaborative Filtering

by Matt Johnson - Thu 15 February 2018
Tags: #machine learning

This article is a brief intro into building and evaluating a simple recommender system. This is a rich topic, and there are lots of models to choose from. Each model has its own set of assumptions of course. Let's introduce one of the basic problems:

The problem is that we are given a user-item matrix where each user may or may not have rated each movie. We'd like to predict the missing ratings. Why? Well, if we could accurately predict each missing rating for the user, we could recommend movies they haven't rated (by sorting and/or weighting the predicted rating of their unrated items). Let's start with a toy example:

In [6]:
import pandas as pd
import numpy as np
In [7]:
R = np.array([[1., 1., 3., 4., 3.],
              [1., 0., 3., 5., 4.],
              [1., 0., 4., 4., 5.],
              [5., 5., 2., 1., 0.],
              [5., 3., 1., 0., 0.],
              [4., 4., 0., 1., 0.],
              [5., 1., 1., np.nan, np.nan]])

df_movie_ratings = pd.DataFrame(R, index=['Anna', 'Alice', 'Julie', 'Rob', 'Matt', 'Mike', 'Andy'],
                     columns=['Indiana Jones', 'Jurrasic Park', 'Love Actually', 'You\'ve Got Mail', 'The Notebook'])
df_movie_ratings
Out[7]:
Indiana Jones Jurrasic Park Love Actually You've Got Mail The Notebook
Anna 1.0 1.0 3.0 4.0 3.0
Alice 1.0 0.0 3.0 5.0 4.0
Julie 1.0 0.0 4.0 4.0 5.0
Rob 5.0 5.0 2.0 1.0 0.0
Matt 5.0 3.0 1.0 0.0 0.0
Mike 4.0 4.0 0.0 1.0 0.0
Andy 5.0 1.0 1.0 NaN NaN

Here, we have 7 users and ratings for 5 movies. But for Andy, we are missing two ratings. If we HAD to pick which of the two movies to recommend (You've Got Mail vs The Notebook), which one should we recommend? So now, we want to predict the rating for "You've got mail". Here's an idea, we can find users that have rated "You've got mail" and take the mean rating of these.

BUT, we can make our prediction a little smarter. We can instead, take a weighted average. Such that the weights are calculated by how similar a user is to our target user (Andy in this case). And, we'll only do this for top K most similar users (even with this approach, the top K could still include users that aren't really similar). Let $r_{ij}$ denote the rating of the ith user on the jth item. We'll denote $\hat{r_{ij}}$ as the predicted rating. Let $f_s(i,k)$ by the similarity metric between user i and user k. And, let $N(i)$ represent the set of $k$ users that are most similar to user i. Then, mathematically, we estimate the missing rating by:

$\hat{r_{ij}} = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) * r_{kj}$

We are free to choose our similarity metric $f_s$. An example could be a Pearson correlation. One thing to observe here, is what if our target user tends to be a conservative rater, but their neighbors tend to rate items liberally? Basically, we can imagine a cluster of points centered around a rating of 2.5 for some, but say 3 for others. To enforce a more normalized comparison, we should adust the rating for each user such that they all have the same average rating (say 0). Thus, we adjust our calculation:

$\hat{r_{ij}} - \mu_i = \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) * (r_{kj} - \mu_k)$

or

$\hat{r_{ij}} = \mu_i + \frac{1}{\sum_{k \in N(i)} f_s(i,k)} \sum_{k \in N(i)} f_s(i,k) * (r_{kj} - \mu_k)$

There are some additional points that have to be considered:

  • Do we just take the K nearest neighbors? Always taking top K doesn't guarantee these K neighbors are necessarily similar.
  • How to guarantee similarity, what is the cutoff we should use? If we are too stringent here, we won't be able to leverage any collaborative filtering.
  • How many items in common should a user have in common with a target user? We need at least two items to calculate a Pearson correlation, but our statistical estimate is more reliable as number of data samples increases.

Test Data

The GroupLens Research is a research lab at the University of Minnesota. They specialize in recommender systems and provide some datasets for us to practice with. These same people also developed a cousera (https://www.coursera.org/specializations/recommender-systems). One of their projects, was to create a web application named "MovieLens". People can provide their ratings and be recommended movies to watch. What's great for us, is that we have real data to play with (https://grouplens.org/datasets/movielens/). Let's take the small MovieLens dataset. This has 100,000 ratings by 700 users for 9,000 movies.

In [8]:
df_movies = pd.read_csv('../ml-latest-small/movies.csv')
df_movies.head()
Out[8]:
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
In [9]:
df_ratings = pd.read_csv('../ml-latest-small/ratings.csv')
df_ratings.head()
Out[9]:
userId movieId rating timestamp
0 1 31 2.5 1260759144
1 1 1029 3.0 1260759179
2 1 1061 3.0 1260759182
3 1 1129 2.0 1260759185
4 1 1172 4.0 1260759205
In [10]:
df_links = pd.read_csv('../ml-latest-small/links.csv')
df_links.head()
Out[10]:
movieId imdbId tmdbId
0 1 114709 862.0
1 2 113497 8844.0
2 3 113228 15602.0
3 4 114885 31357.0
4 5 113041 11862.0

To get this into a user-item matrix, I leveraged the idea to pivot from (https://beckernick.github.io/matrix-factorization-recommender/). Let's do this:

In [11]:
R_df = df_ratings.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(np.nan)
print(R_df.shape)
R_df.head()
(671, 9066)
Out[11]:
movieId 1 2 3 4 5 6 7 8 9 10 ... 161084 161155 161594 161830 161918 161944 162376 162542 162672 163949
userId
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 9066 columns

So this is 671 users and 9066 movies. How many ratings?

In [12]:
# How many non-nan entries?
np.count_nonzero(~np.isnan(R_df))
Out[12]:
100004

We have over 100,000 ratings. Nice. Next, we want to divide up our data into a training/test set.

Training / Test Data

For this article, we won't leverage cross-validation, but instead, we'll just simply randomly divide up our known ratings into 80% for training, and 20% for testing. We'll need a couple of functions to do this.

In [13]:
def get_non_nan_cell_locations(df):
    # Index positions of cells
    cells = list(zip(*np.where(df.notnull())))
    
    # Row/Column labels of cells
    v = []
    for i,j in cells:
        row_label = df.index[i]
        column_label = df.columns.values[j]
        v.append((row_label, column_label))
    return v
In [14]:
# For reproducibility
np.random.seed(42)

# List of (i,j) where i,j is the
# row and column label
cells = get_non_nan_cell_locations(R_df)

# Make a random shuffle
np.random.shuffle(cells)

# Take first 20% locations. DO NOT use for training.
cell_test_locations = cells[:int(0.2 * len(cells))]

# Get the predicted ratings of each of these cells
y_test = np.array([R_df.loc[i][j] for i,j in cell_test_locations])

# Create a training set
# such that each test location
# is set to np.nan
R_df_train = R_df.copy()
for c in cell_test_locations:
    R_df_train.loc[c] = np.nan

Let's look at our training dataframe:

In [15]:
R_df_train.head()
Out[15]:
movieId 1 2 3 4 5 6 7 8 9 10 ... 161084 161155 161594 161830 161918 161944 162376 162542 162672 163949
userId
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 9066 columns

Performance Metric

A performance metric is really, really important. It captures what we are trying to optimize for. For example, if recommending products, you'd probably want your metric to include optimizing for items that will increase profits in addition to products the user would actually want (in addition to lots of other things). For this article, we use RMSE (root mean squared error) as a simple evaluation metric:

$RMSE = \sqrt{\frac{1}{N}\sum_i^N (\hat{y_i} - y_i)^2}$

In [16]:
def get_rsme(y_hat, y):
    return np.mean(np.sqrt(np.power((y[:len(y_hat)] - y_hat), 2))) 

Now, let's try a simple neighbor model using Pearson similarity. First, we'll compute the similarity matrix across all users.

In [17]:
R_sim = R_df_train.T.corr(method='pearson')

Next, we'll compute the mean rating of each user. This will be useful in demonstrating how mean centering can provide benefits.

In [18]:
R_mean = R_df_train.mean(axis=1)
In [19]:
R_sim.head()
Out[19]:
userId 1 2 3 4 5 6 7 8 9 10 ... 662 663 664 665 666 667 668 669 670 671
userId
1 1.000000 NaN NaN 0.068752 NaN NaN -0.577350 NaN NaN NaN ... NaN NaN NaN 0.500000 NaN NaN NaN NaN NaN NaN
2 NaN 1.000000 0.275839 0.099015 NaN NaN 0.487950 0.408248 0.133631 -1.0 ... -0.023620 -1.0 0.239485 -0.386695 -0.212878 -0.351668 NaN NaN -0.636364 0.298807
3 NaN 0.275839 1.000000 NaN -0.080845 -1.000000 -0.145693 -0.148936 -0.816497 NaN ... -0.720577 NaN 0.499761 -0.134715 0.698771 NaN NaN NaN -1.000000 NaN
4 0.068752 0.099015 NaN 1.000000 -0.009627 0.613941 0.196699 0.577704 NaN NaN ... 0.127000 NaN 0.165783 0.449601 -0.022076 0.683130 0.57735 0.415227 -0.333333 0.307570
5 NaN NaN -0.080845 -0.009627 1.000000 0.333333 -0.790569 -0.353726 0.310087 1.0 ... 0.408937 NaN -0.146061 0.000000 -0.288675 -0.577350 NaN NaN -1.000000 -0.301753

5 rows × 671 columns

In [20]:
def get_pred_using_pearson(R, cell_indices, k=5, p=0.4, R_sim=None, R_mean=None, use_mean_centering=True):
    
    R_ = R.copy()
    
    # Calculate similarity matrix
    # and rely on cython for this
    if R_sim is None:
        R_sim = R.T.corr()
        
    if R_mean is None:
        R_mean = R_df_train.mean(axis=1)
    
    if use_mean_centering:
        R_ = R_.sub(R_mean, axis=0)
        
    pred_cell_values = np.empty((len(cell_indices),))
    
    # map item to users that have a prediction value
    cols = set([j for i,j in cell_indices])
    users_with_j_rating = {j: R[j][R[j].notnull()].index for j in cols}
    
    # For each cell we wish to predict
    for i, (target_user, j) in enumerate(cell_indices):
        
        # Find users that have a prediction
        # for this item "j". Note our
        # target user won't be included in
        # in this series
        users = users_with_j_rating[j]
        
        # Similarities for target_user to all other users
        target_user_sim = R_sim.loc[target_user]
        
        # Reduce this similarity series by users
        # that have rated this item "j"
        target_user_sim = target_user_sim.loc[users]
            
        # Now, find K nearest neighbors, defines weight
        nn_series = target_user_sim.nlargest(k)
        
        # Ensure threshold is satisifed
        nn_series = nn_series[nn_series > p]
        
        # Weight vector
        w = nn_series.values
        
        # Can't locate neighbors, predict using mean
        if len(w) == 0:
            pred_cell_values[i] = R_mean.loc[target_user]
        else:
            # Predicted values from users
            r_vals = np.array([R_.loc[u][j] for u in nn_series.index])
             
            pred_val = r_vals.dot(w) / w.sum()
            
            if use_mean_centering:
                pred_cell_values[i] = pred_val + R_mean.loc[target_user]
            else:
                pred_cell_values[i] = pred_val
        
    return pred_cell_values
In [21]:
# Take 10 closest neighbors
k=10

# Ensure there is at least a similarity of 0.6
p=0.6

y_hat_pearson_no_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=False)
y_hat_pearson_no_centering
Out[21]:
array([3.5       , 3.28961749, 3.55909091, ..., 3.67222222, 3.22815534,
       2.84615385])
In [209]:
# Caculate the error
pearson_no_centering_metric = get_rsme(y_hat_pearson_no_centering, y_test)
pearson_no_centering_metric
Out[209]:
0.8114231052297071
In [22]:
# Take 10 closest neighbors
k=10

# Ensure there is a similarity of at least 0.6
p=0.6

y_hat_pearson_with_centering = get_pred_using_pearson(R_df_train, cell_test_locations, k, p, R_sim, R_mean, use_mean_centering=True)
y_hat_pearson_with_centering
Out[22]:
array([3.98443377, 3.28961749, 3.55909091, ..., 3.67222222, 3.22815534,
       2.84615385])
In [23]:
# Caculate the error
pearson_with_centering_metric = get_rsme(y_hat_pearson_with_centering, y_test)
pearson_with_centering_metric
Out[23]:
0.7821459799193352

OK, results suggest that mean centering definitely helps. Remember, these are point estimates (no cross validation), but the results definitely suggest we are moving in the right direction.

To make sure we are not fooling ourselves, let's create two other models. One model randomly classifiers ratings, the other, just uses the mean as the prediction

Random classifier

In [24]:
def get_missing_values_random(R_df_train, cell_test_locations):
    return np.random.randint(low=1, high=6, size=(len(cell_test_locations),)) 
In [25]:
y_hat_random = get_missing_values_random(R_df_train, cell_test_locations)
random_metric = get_rsme(y_hat_random, y_test)
random_metric
Out[25]:
1.499375

Mean Classifier

In [26]:
def get_missing_values_mean(R_mean, cell_test_locations):
    return np.array([R_mean.loc[i] for i,j in cell_test_locations])
In [27]:
y_hat_mean = get_missing_values_mean(R_mean, cell_test_locations)
mean_metric = get_rsme(y_hat_mean, y_test)
mean_metric
Out[27]:
0.7489118307578099

Given the techniques tried so far, simply taking the mean provides the best result.

Comments