General context

Think about Netflix: we may have watched lots of movies that are science fiction, action, horror, and so on. Netflix may not know these particular properties of the films we watched, but it can see that other people who watched the same movies also tended to watch other movies that we have not watched yet. Using this, Netflix can recommend movies that we have not seen before but that are relevant to what we liked.

This approach is called collaborative filtering. Its key foundational idea is that of latent factors: underlying properties, learned from the data, that determine what kinds of movies a user wants to watch.

Data set

Of course, we do not have access to Netflix's entire dataset of movie watching history, but there is a great dataset we can use, called MovieLens, which contains tens of millions of movie rankings.

from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)

The movie information is structured as a table whose columns are, respectively, user, movie, rating, and timestamp. The file has no header row, so we need to pass the column names when reading it with pandas.

ratings = pd.read_csv(path/'u.data',delimiter='\t', header=None, names=['user','movie','rating','timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

To present this in a more human-friendly way, the figure below shows the same data cross-tabulated, with one row per user and one column per movie. The empty cells in that table are the values we would like our model to fill in based on the other information.

Crosstab of movies and users
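
As a rough illustration (not needed for training), such a crosstab could be built from the ratings table we just loaded with pandas:

crosstab = ratings.pivot_table(index='user', columns='movie', values='rating')
crosstab.iloc[:5, :5]   # mostly NaN: these are the gaps we want the model to fill in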

Basically, our objective is to recommend movies to the people who might like them. To express, for each movie, how strongly it matches each category, we use factors ranging between -1 and 1. For example, to represent the movie The Last Skywalker on the categories science-fiction, action, and old movies, we could use an array.

last_skywalker = np.array([0.98,0.9,-0.9])

Then we can score each user's interest in each category with an array as well:

user1 = np.array([0.8,0.6,-0.4])

Then we calculate the match between the two, which is simply their dot product:

(user1*last_skywalker).sum()
1.6840000000000002
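
That works out to 0.98×0.8 + 0.9×0.6 + (−0.9)×(−0.4) = 0.784 + 0.54 + 0.36 = 1.684, a fairly high match, so this user would probably enjoy The Last Skywalker.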

Since we don't know in advance what the latent factors are, and we don't know how to score them for each user and movie, we have to learn them.

Learning the Latent factors

Step 1 of this approach is to randomly initialize some parameters. These parameters will be the latent factors for each user and movie. For illustrative purposes, we will use 5 factors.

Step 2 is to calculate our predictions by simply taking the dot product of each movie's factors with each user's factors. This gives a high value when, say, a particular user likes action movies and the movie's latent factors indicate a lot of action.

Step 3 is to calculate the loss between our predictions and the ratings we actually have; here we will use the mean squared error.

With this in place, we can optimize our parameters using SGD so as to minimize the loss, as sketched below.

Latent factors with crosstab
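
Here is a minimal sketch of these three steps plus the SGD update in plain PyTorch, using a handful of made-up (user, movie, rating) triples; all the names, sizes, and the learning rate below are purely illustrative.

import torch

toy_n_users, toy_n_movies, n_factors = 10, 8, 5
toy_user_factors  = torch.randn(toy_n_users,  n_factors, requires_grad=True)  # step 1: random init
toy_movie_factors = torch.randn(toy_n_movies, n_factors, requires_grad=True)

toy_users   = torch.tensor([0, 1, 2])       # made-up user indices
toy_movies  = torch.tensor([3, 0, 7])       # made-up movie indices
toy_ratings = torch.tensor([4., 3., 5.])    # made-up ratings

for _ in range(100):
    # step 2: predictions are the dot products of the matching user/movie factors
    preds = (toy_user_factors[toy_users] * toy_movie_factors[toy_movies]).sum(dim=1)
    # step 3: mean squared error between predictions and the ratings we actually have
    loss = ((preds - toy_ratings)**2).mean()
    loss.backward()
    with torch.no_grad():                   # plain SGD step
        for p in (toy_user_factors, toy_movie_factors):
            p -= 0.1 * p.grad
            p.grad.zero_()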

Creating the DataLoaders

movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1', usecols=(0,1), names=('movie','title'), header=None)
movies.head()
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

We will merge the movies table with our ratings to get the titles:

ratings = ratings.merge(movies)

ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)

CollabDataLoaders by default takes the first column for the user, the second column for the item, and the third for the ratings. Since we want to use the titles as items, we pass item_name='title':

dls = CollabDataLoaders.from_df(ratings,item_name='title',bs=64)

dls.show_batch()
user title rating
0 679 Santa Clause, The (1994) 3
1 1 Exotica (1994) 4
2 259 Apocalypse Now (1979) 5
3 450 Courage Under Fire (1996) 4
4 774 True Lies (1994) 1
5 533 Leaving Las Vegas (1995) 1
6 561 Star Trek: The Wrath of Khan (1982) 3
7 683 Father of the Bride (1950) 3
8 417 So I Married an Axe Murderer (1993) 3
9 424 English Patient, The (1996) 4

Then, with PyTorch, we represent our user and movie latent factor tables as matrices:

n_users = len(dls.classes['user'])
n_movies = len(dls.classes['title'])

n_factors=5
user_factors = torch.randn(n_users,n_factors)
movie_factors = torch.randn(n_movies,n_factors)

To find the factors for a particular user or movie, we just look up its index in the corresponding matrix. This lookup can also be seen as a matrix product: replace the index with a one-hot-encoded vector and multiply.

one_hot_3 = one_hot(3,n_users).float()
# latent factors of user 3
user_factors.t() @ one_hot_3
tensor([ 1.0129, -0.1466, -0.3618,  1.1011, -0.4564])
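
The matrix product with the one-hot vector gives exactly the same values as indexing the matrix directly, which is all an embedding lookup really does:

user_factors[3]   # same result as the matrix product above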

Collaborative Filtering from Scratch

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:,0])    # x[:,0] holds the user indices of the batch
        movies = self.movie_factors(x[:,1])  # x[:,1] holds the movie indices of the batch
        return (users*movies).sum(dim=1)

In this class, forward is a special PyTorch method name: it is the method PyTorch calls when the module is applied to a batch of inputs.

Then, we will create a Learner to optimize the parameters. We will use the plain Learner class here:

model = DotProduct(n_users,n_movies,50)
learn = Learner(dls,model,loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3)
epoch train_loss valid_loss time
0 1.339669 1.277186 00:09
1 1.108406 1.090037 00:09
2 0.967013 0.980453 00:08
3 0.858148 0.892111 00:09
4 0.790688 0.873726 00:09

To make the model slightly better, we can force those predictions to be between 0 and 5. For that, we apply sigmoid_range, as in the previous post.

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        return sigmoid_range((users*movies).sum(dim=1), *self.y_range)
model = DotProduct(n_users,n_movies,50)
learn = Learner(dls,model,loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3)
epoch train_loss valid_loss time
0 1.018326 0.995652 00:09
1 0.892207 0.899135 00:08
2 0.662484 0.868134 00:08
3 0.471682 0.875005 00:08
4 0.354474 0.880524 00:08

We will now add a bias term for each user and each movie, to capture the fact that some users tend to give higher ratings overall and some movies are simply liked more than others, and see what happens.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.y_range = y_range
        self.user_bias = Embedding(n_users, 1)
        self.movie_bias = Embedding(n_movies, 1)

    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users*movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])   # add per-user and per-movie bias
        return sigmoid_range(res, *self.y_range)
model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls,model,loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3)
epoch train_loss valid_loss time
0 0.967305 0.991018 00:09
1 0.853391 0.898885 00:09
2 0.670772 0.872404 00:09
3 0.468066 0.882217 00:08
4 0.354991 0.886515 00:08

Instead of getting better, the result gets worse, because the model overfits very quickly. We need a way to train for more epochs while avoiding overfitting. To do that, we will use a regularization technique called weight decay.

Weight decay

One possible way to reduce overfitting is to reduce the capacity of the model, which is basically how much room it has to find answers. Weight decay, or L2 regularization, consists of adding the sum of all the squared weights to the loss function. To minimize this augmented loss, the optimizer is pushed to keep the weights small, and small weights mean the function the model computes changes less sharply. This discourages the kind of overfitting that shows up as very sharp changes fitted to individual training points.

The downside of limiting the weights is that we restrict the space of solutions the model can explore, but in exchange it generalizes better.

loss_with_wd = loss + wd * (parameters**2).sum()
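
In practice, frameworks don't compute this big sum explicitly. Since the gradient of wd * (parameters**2).sum() with respect to the parameters is 2 * wd * parameters, the same effect can be obtained by adding that quantity to the gradients, schematically:

parameters.grad += wd * 2 * parameters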

model = DotProductBias(n_users,n_movies,50)

learn = Learner(dls,model,loss_func=MSELossFlat())
learn.fit_one_cycle(5,5e-3,wd=0.1)
epoch train_loss valid_loss time
0 1.036622 1.010733 00:09
1 0.927190 0.930999 00:08
2 0.801266 0.869415 00:08
3 0.650525 0.836668 00:08
4 0.576153 0.834967 00:09

With weight decay, the training loss ends up higher than before, but the validation loss is slightly lower. In other words, the regularization helps the model generalize.

Creating our own Embedding module

Previously, we talked about the embedding layer, which is just a shortcut: instead of multiplying by a one-hot-encoded matrix, it indexes directly into the array. We can create our own embedding layer.

class T(Module):
    def __init__(self): self.a = nn.Parameter(torch.ones(3))

By wrapping a tensor in nn.Parameter, PyTorch knows it is a parameter to be learned: it requires gradients and is returned by the module's parameters() method.

L(T().parameters())
(#1) [Parameter containing:
tensor([1., 1., 1.], requires_grad=True)]

Let's create a function that returns a randomly initialized tensor wrapped as a parameter:

def create_params(size):
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
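
A quick sanity check of this helper (illustrative only): the result is an nn.Parameter, so it requires gradients and will be collected by the module's parameters():

p = create_params([3, 2])
p.shape, p.requires_grad   # torch.Size([3, 2]), True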

Let's use this to create DotProductBias again, but without Embedding:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = create_params([n_users, n_factors])
        self.user_bias = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias = create_params([n_movies])
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors[x[:,0]]
        movies = self.movie_factors[x[:,1]]
        res = (users*movies).sum(dim=1)
        res += self.user_bias[x[:,0]] + self.movie_bias[x[:,1]]
        return sigmoid_range(res, *self.y_range)

Then we train it again and get results in the same range as before, which confirms that the Embedding class was just a convenience and nothing more.

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.950444 0.946037 00:12
1 0.863254 0.873889 00:11
2 0.721644 0.831558 00:11
3 0.584056 0.819517 00:11
4 0.491901 0.821385 00:11

Using fastai.collab

The exact structure above can be created using fastai's collab_learner:

learn = collab_learner(dls,n_factors=50,y_range=(0,5.5))
learn.fit_one_cycle(5,5e-3,wd=0.1)
epoch train_loss valid_loss time
0 0.956470 0.959027 00:10
1 0.871686 0.870266 00:10
2 0.729089 0.828011 00:08
3 0.596814 0.816535 00:10
4 0.490293 0.816821 00:10

Then, we can print the model to see the names of its layers:

learn.model
EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

Now we have successfully trained a model.
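
As a quick, purely illustrative check of the trained model (the user id and title below are just taken from the rows printed earlier, and the exact test-set API may differ slightly between fastai versions), we can ask it for a predicted rating:

df = pd.DataFrame({'user': [196], 'title': ['Kolya (1996)']})
dl = dls.test_dl(df)
preds, _ = learn.get_preds(dl=dl)   # predicted rating for this user/movie pair
preds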

Deep Learning for Collaborative Filtering

To turn our architecture into a deep learning model, the first step is to take the results of the embedding lookup and concatenate those activations together. This gives us a matrix which we can then pass through linear layers and nonlinearities in the usual way.

Since we'll be concatenating the embeddings, rather than taking their dot product, the two embedding matrices can have different sizes (i.e., different numbers of latent factors). fastai has a function get_emb_sz that returns recommended sizes for embedding matrices for your data, based on a heuristic that fast.ai has found tends to work well in practice:
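
For reference, the heuristic behind get_emb_sz is roughly the following rule of thumb, which grows the embedding size slowly with the number of categories and caps it at 600:

def emb_sz_rule(n_cat):
    # a rough version of fastai's rule of thumb for embedding sizes
    return min(600, round(1.6 * n_cat**0.56))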

embs = get_emb_sz(dls)
class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range
        
    def forward(self, x):
        embs = self.user_factors(x[:,0]),self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
model = CollabNN(*embs)

CollabNN creates our Embedding layers in the same way as the previous classes in this post, except that we now use the sizes from embs. self.layers is identical to the mini neural net we created previously for MNIST. Then, in forward, we apply the embeddings, concatenate the results, and pass this through the mini neural net. Finally, we apply sigmoid_range as we have in previous models.

learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
epoch train_loss valid_loss time
0 0.945726 0.957030 00:16
1 0.873854 0.894093 00:10
2 0.881610 0.876000 00:10
3 0.812847 0.863269 00:10
4 0.778358 0.865262 00:10

fastai can build this kind of neural-network model for us as well: passing use_nn=True to collab_learner creates it, and layers lets us specify the size of each hidden layer.

learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100,50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
epoch train_loss valid_loss time
0 0.963573 0.969183 00:12
1 0.907858 0.930464 00:12
2 0.892568 0.885867 00:11
3 0.818842 0.858451 00:11
4 0.735407 0.864158 00:11