Subreddit User Recommendation System based on Netflix's Algorithm
17 Dec 2019
Reading time ~13 minutes
User recommendation systems are widely used in e-commerce and entertainment businesses to improve customer satisfaction and generate revenue. The most commonly used method for this is collaborative filtering. Companies such as Netflix, for example, use this method to recommend movies that match their users’ taste. Inspired by Netflix’s world-class recommendation algorithm, in this project I am going to build a recommendation system that helps Reddit users find subreddits they might enjoy. Find my full Python code here
Introduction
Before we begin, let me first introduce you to Reddit. What is Reddit? Reddit is a website that provides news aggregation, web content rating, and discussion. It allows members to submit content to the site such as text, images, or links, which other members can interact with by leaving comments or voting up or down. One main component of Reddit is the subreddit. Subreddits are specific online communities within Reddit that serve as topics or categories organizing all the posts associated with them.
Despite being a very popular website with over 300 million users as of 2018, Reddit still has no good way to discover subreddits. Currently, users have to manually search for popular subreddits or upvote a subreddit so that it is archived in their subreddit history.
Getting The Data
To get the dataset, I used PRAW, the Reddit API wrapper for Python. The main requirements are your client_id, client_secret, and user_agent. Using the API, I collected a dataset of ~15K unique redditors (Reddit users) and ~29K unique subreddits in the following format:
- the redditors’ username
- their subreddit submissions (or comments)
- the timestamp (UTC) of when they created the submission
Because scraping the data took a long time, I exported the final output to a CSV file. My PRAW code is also available on my GitHub.
import praw

reddit = praw.Reddit(client_id='your client id',
                     client_secret='your client secret',
                     user_agent='project name')

r = reddit.subreddit('all').top(limit=15000)

redditor = []
subreddit_name = []
utc = []

i = 0
for submission in r:
    redditor.append(submission.author)
    subreddit_name.append(submission.subreddit)
    utc.append(submission.created_utc)
    i += 1  # simple progress counter
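The export step mentioned above isn’t shown in the snippet; a minimal sketch of it might look like this (the DataFrame construction and file name here are my own choices for illustration, not the exact code from the script):

import pandas as pd

# Collect the scraped fields into a DataFrame and save it to CSV,
# so the slow PRAW scrape only has to run once
reddit_df = pd.DataFrame({'username': [str(u) for u in redditor],
                          'subreddit': [str(s) for s in subreddit_name],
                          'utc': utc})
reddit_df.to_csv('reddit_submissions.csv', index=False)
reddit_df.head()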
| | username | subreddit | utc |
|---|---|---|---|
| 0 | kabanossi | photoshopbattles | 1.482748e+09 |
| 1 | kabanossi | GetMotivated | 1.482748e+09 |
| 2 | kabanossi | vmware | 1.482748e+09 |
| 3 | kabanossi | carporn | 1.482748e+09 |
| 4 | kabanossi | DIY | 1.482747e+09 |
Checking the Data
The data contains ~15K redditors and ~29K subreddits.
unique reddittor: 15000
unique subreddit: 29281
total data entry: (9391244, 3)
Are there null values from our API dataset?
username False
subreddit False
utc False
dtype: bool
Collaborative Filtering and Singular Value Decomposition
Now that we have the dataset, let’s start building the algorithm. For this project, I used the Singular Value Decomposition (SVD) technique to create the subreddit recommendations.
SVD is a well-known and very well-studied matrix factorization method. The winning team of the Netflix Prize competition in 2009 used a combination of advanced SVD-based matrix factorization models to produce movie recommendations, with an RMSE improvement of ~8% over Netflix’s recommendation system at the time.
Before we begin, let’s look at the concept behind it. Collaborative filtering is a method of predicting a rating for a user-item pair based on the history of ratings given by that user and given to that item.
“Collaborative filtering captures the underlying pattern of interests of like-minded users and uses the choices and preferences of similar users to suggest new items.”
SVD is a matrix factorization technique that reduces the number of features of a dataset by reducing its dimensionality. The factorization is done on the user-item ratings matrix. At a high level, matrix factorization means finding matrices whose product is (approximately) the original matrix. In the case of SVD, a single matrix A is approximated by the product of three matrices: A ≈ UΣVᵀ.
In our context, U contains a row for each username describing their subreddit submissions, V contains a row for each subreddit describing the users who submit to it, and Σ holds the singular values, which indicate how strongly U and V are related through each latent feature. We can approximate the full matrix by keeping only the most important features, i.e. the largest singular values in Σ. To read more about SVD, please refer to these research papers: [1] Netflix SVD [2] SVD for Recommendation Engine
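As a quick, self-contained illustration of the idea (a toy random matrix, not the Reddit data):

import numpy as np

# Toy user-item matrix: 5 users x 4 items
A = np.random.rand(5, 4)

# Full SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (the strongest latent features)
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k product is a low-dimensional approximation of the original matrix
print(np.round(np.abs(A - A_approx).max(), 3))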
Now that you have enough theoretical background on SVD, let’s look at the application. In this project, I use the sparsesvd package, which works on SciPy sparse matrices. With this package, the requirements are to have the data in matrix format and to state the number of latent factors, i.e. the number of important features we would like to keep. I will explain more about this in a later section.
Libraries
Below are the libraries I used:
import numpy as np
import pandas as pd
import math as mt
import csv
from pandas import DataFrame,Series,read_csv
import scipy
import scipy.sparse as sp
from sparsesvd import sparsesvd #used for matrix factorization
from scipy.sparse import csc_matrix #used for sparse matrix
from scipy.sparse.linalg import * #used for matrix multiplication
from scipy.linalg import sqrtm
from nltk.tokenize import TreebankWordTokenizer #used this to create the document on my matrix
Evaluating our SVD Model with Train and Test Data by Sampling 500 users
Using a sample of 500 users from the entire dataset, I created train and test sets (with 20% of each user's data held out for testing) to evaluate the predictions of my SVD model.
sample_username = list(reddit_df.username.unique())[300:800]
sample_df = reddit_df[reddit_df.username.isin(sample_username)]
users = list(sample_df.username.unique())
subreddits = list(sample_df.subreddit.unique())
Let’s look at the subreddits that make up the top ~65% of the sample dataset. About 134 subreddits contribute roughly 65% of all submissions in the sample. We will use this number as the latent factor count in the SVD model later.
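The exact calculation is in the notebook; a minimal version of it could look like this (I’m assuming the contribution is measured in submission counts):

# Share of all submissions in the sample contributed by each subreddit,
# largest subreddits first
sub_share = sample_df['subreddit'].value_counts(normalize=True)

# Cumulative share; count how many subreddits are needed to reach ~65%
cum_share = sub_share.cumsum()
n_latent = int((cum_share <= 0.65).sum())
print('Top', n_latent, 'subreddits contribute a total of',
      round(cum_share.iloc[n_latent - 1] * 100, 1),
      '% to the total subreddits in the dataset')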
Top 134 subreddits contribute a total of 64.9 % to the total subreddits in the dataset
Splitting Train and Test Dataset based on Timestamp (utc)
I wanted my train and test data to be as realistic as possible, so I used the timestamp (UTC) to split the data instead of doing a random 20% split. In this case, I sorted each user's submissions from earliest to most recent and took the most recent 20% of each user's data as the test set (a sketch of this split is shown after the table below).
data = sample_df.groupby(['username', 'subreddit']).agg({'subreddit': 'count',
                                                         'utc': 'max'}).\
       rename(columns={'subreddit': 'submission_freq',
                       'utc': 'most_recent_timestamp'}).reset_index()
data.head(10)
| | username | subreddit | submission_freq | most_recent_timestamp |
|---|---|---|---|---|
| 0 | -SA-HatfulOfHollow | news | 1 | 1.482761e+09 |
| 1 | -SA-HatfulOfHollow | reddevils | 1 | 1.482742e+09 |
| 2 | -SA-HatfulOfHollow | soccer | 1 | 1.482771e+09 |
| 3 | -SA-HatfulOfHollow | worldnews | 11 | 1.476293e+09 |
| 4 | -_-_-_-otalp-_-_-_- | Android | 3 | 1.475605e+09 |
| 5 | -_-_-_-otalp-_-_-_- | AskAnthropology | 2 | 1.480134e+09 |
| 6 | -_-_-_-otalp-_-_-_- | AskReddit | 2 | 1.482744e+09 |
| 7 | -_-_-_-otalp-_-_-_- | BlackPeopleTwitter | 6 | 1.482560e+09 |
| 8 | -_-_-_-otalp-_-_-_- | CrazyIdeas | 1 | 1.480079e+09 |
| 9 | -_-_-_-otalp-_-_-_- | DC_Cinematic | 2 | 1.476638e+09 |
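A minimal sketch of that per-user split (the column names follow the aggregated frame above; the notebook's exact code may differ):

# Sort each user's rows from earliest to most recent submission
data = data.sort_values(['username', 'most_recent_timestamp'])

# Hold out the most recent ~20% of each user's rows as the test set
test = data.groupby('username', group_keys=False).apply(
    lambda g: g.tail(max(1, int(round(len(g) * 0.2)))))
train = data.drop(test.index)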
Defining The Rating in The Train and Test Data
Evaluating our SVD model means comparing the predicted versus the actual rating that a user gives to a subreddit. Since there is no obvious rating in the Reddit data, we define the rating as the number of times user i has submitted a post in subreddit j, vi,j, divided by that user's total number of submissions.
In a related study by Jay Baxter, user upvotes were used to capture the rating. However, upvote data are protected and cannot be retrieved through PRAW (the Reddit API for Python), so I used user submissions instead.
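Written out, the implicit rating of user i for subreddit j is simply:

r_{i,j} = v_{i,j} / Σ_k v_{i,k}

where v_{i,j} is the number of submissions user i has made in subreddit j and the sum runs over all subreddits that user has submitted to.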
user_sum = data.groupby(['username'], as_index=False).agg({'submission_freq':'sum'})
temp = pd.merge(left=data, right=user_sum, how='left',
                left_on='username', right_on='username').\
       rename(columns={'submission_freq_y': 'user_sum',
                       'submission_freq_x': 'submission_freq'})
data['user_implicit_rating'] = temp.submission_freq/temp.user_sum
data.drop(columns=['submission_freq'], inplace=True)
data = pd.concat([data.iloc[:,:2],data.iloc[:,-1:],data['most_recent_timestamp']], axis=1)
data.dropna(inplace = True)
Let’s look at the train and test data for one example user.
Train Data for User "-_-_-_-otalp-_-_-_-" :
username subreddit user_implicit_rating
11220 -_-_-_-otalp-_-_-_- worldnews 0.001017
11222 -_-_-_-otalp-_-_-_- OCOCTATIAT 0.001017
11365 -_-_-_-otalp-_-_-_- nottheonion 0.001017
11405 -_-_-_-otalp-_-_-_- millionairemakers 0.001017
11600 -_-_-_-otalp-_-_-_- changemyview 0.001017
11650 -_-_-_-otalp-_-_-_- news 0.001017
12684 -_-_-_-otalp-_-_-_- Android 0.003052
13535 -_-_-_-otalp-_-_-_- chomsky 0.002035
13634 -_-_-_-otalp-_-_-_- DC_Cinematic 0.002035
14440 -_-_-_-otalp-_-_-_- TheoryOfReddit 0.002035
14453 -_-_-_-otalp-_-_-_- UpliftingNews 0.001017
14629 -_-_-_-otalp-_-_-_- the_meltdown 0.001017
15198 -_-_-_-otalp-_-_-_- pokemon 0.001017
15316 -_-_-_-otalp-_-_-_- australia 0.009156
15339 -_-_-_-otalp-_-_-_- Political_Revolution 0.005086
15684 -_-_-_-otalp-_-_-_- funny 0.004069
15685 -_-_-_-otalp-_-_-_- Jokes 0.001017
16391 -_-_-_-otalp-_-_-_- gifs 0.001017
16633 -_-_-_-otalp-_-_-_- counting 0.001017
17434 -_-_-_-otalp-_-_-_- iphone 0.004069
19305 -_-_-_-otalp-_-_-_- CrazyIdeas 0.001017
19452 -_-_-_-otalp-_-_-_- badhistory 0.001017
19464 -_-_-_-otalp-_-_-_- AskAnthropology 0.002035
21413 -_-_-_-otalp-_-_-_- otalp 0.001017
23433 -_-_-_-otalp-_-_-_- depressedrobots 0.001017
23808 -_-_-_-otalp-_-_-_- EnoughTrumpSpam 0.053917
Test Data for User "-_-_-_-otalp-_-_-_-" :
username subreddit user_implicit_rating
25662 -_-_-_-otalp-_-_-_- me_irl 0.001017
25708 -_-_-_-otalp-_-_-_- socialism 0.002035
26122 -_-_-_-otalp-_-_-_- spacemacs 0.001017
26438 -_-_-_-otalp-_-_-_- radiohead 0.008138
26816 -_-_-_-otalp-_-_-_- meirl 0.001017
27038 -_-_-_-otalp-_-_-_- Music 0.005086
27570 -_-_-_-otalp-_-_-_- indieheads 0.001017
28582 -_-_-_-otalp-_-_-_- Showerthoughts 0.003052
28839 -_-_-_-otalp-_-_-_- movies 0.014242
29181 -_-_-_-otalp-_-_-_- politics 0.060020
29186 -_-_-_-otalp-_-_-_- BlackPeopleTwitter 0.006104
29432 -_-_-_-otalp-_-_-_- dagandred 0.010173
29917 -_-_-_-otalp-_-_-_- LiverpoolFC 0.379451
29974 -_-_-_-otalp-_-_-_- Fuck2016 0.018311
30335 -_-_-_-otalp-_-_-_- bidenbro 0.001017
30379 -_-_-_-otalp-_-_-_- apple 0.005086
30955 -_-_-_-otalp-_-_-_- atheism 0.001017
30956 -_-_-_-otalp-_-_-_- AskReddit 0.002035
30969 -_-_-_-otalp-_-_-_- botsrights 0.027467
31486 -_-_-_-otalp-_-_-_- soccer 0.348932
Transforming the Dataframe into Utility Matrix and SVD Computation
As mentioned earlier, one of the requirements for the SVD model is to transform our data into a matrix. In this case, every subreddit a user has never submitted to becomes an empty (unrated) entry in the utility matrix. Below is a sketch of building that matrix, followed by the SVD function.
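The construction of the utility matrix X passed to svd() below isn’t shown in this post; a minimal sketch, assuming the train frame from the split above, might be:

# One row per user, one column per subreddit; pairs without a submission stay NaN,
# which svd() below masks and fills with the per-subreddit average
X = train.pivot_table(index='username',
                      columns='subreddit',
                      values='user_implicit_rating')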
def svd(train, k):
    utilMat = np.array(train)
    # the nan or unavailable entries are masked
    mask = np.isnan(utilMat)
    masked_arr = np.ma.masked_array(utilMat, mask)
    item_means = np.mean(masked_arr, axis=0)
    # nan entries will be replaced by the average rating for each item
    utilMat = masked_arr.filled(item_means)
    x = np.tile(item_means, (utilMat.shape[0], 1))
    # we remove the per-item average from all entries;
    # the nan entries mentioned above are now essentially zero
    utilMat = utilMat - x
    # The magic happens here. U and V are the user and item features
    U, s, V = np.linalg.svd(utilMat, full_matrices=False)
    s = np.diag(s)
    # we keep only the k most significant features
    s = s[0:k, 0:k]
    U = U[:, 0:k]
    V = V[0:k, :]
    s_root = sqrtm(s)
    Usk = np.dot(U, s_root)
    skV = np.dot(s_root, V)
    UsV = np.dot(Usk, skV)
    # add the per-item averages back to get the predicted ratings
    UsV = UsV + x
    print("svd done")
    return UsV
Now let’s run our prediction on the test data and compare it against the actual values. In this case, I will use MSE and MAE.
def mse(true, pred):
    x = true - pred
    return sum([xi * xi for xi in x]) / len(x)

def mae(true, pred):
    x = abs(true - pred)
    return sum([xi for xi in x]) / len(x)

# to test the performance over different numbers of latent features
no_of_features = [134]
for k in no_of_features:
    svdout = svd(X, k=k)
    # look up the predicted rating for each (user, subreddit) pair in the test set;
    # user_index and sub_index are assumed to be dicts mapping names to the
    # row/column positions of X (their construction is not shown in this post)
    pred = [svdout[user_index[u], sub_index[s]]
            for u, s in zip(test['username'], test['subreddit'])]
    print('mse:', mse(test['user_implicit_rating'], pred))
    print('mae:', mae(test['user_implicit_rating'], pred))
Based on the output, we got a pretty good result, with an MAE of around 0.5%.
svd done
mse: 0.0001907806366327251
mae: 0.005219784277697061
Creating The Recommendation System Using The Complete Dataset (15K Users)
Now that we have evaluated the model, let’s use the entire dataset and make some demo recommendations. To do this, we follow the same steps of converting the data into a matrix and finding the number of latent factors.
In the full dataset, 293 subreddits account for the top 65% of submissions, so I used this number as the number of latent factors in the SVD model.
Top 293 subreddits contribute a total of 65.0 % to the total subreddits in the dataset
Given that this dataset is much larger than the sample, we will use another method to organize the data and turn it into matrix form for the SVD.
In this case, I reshaped the dataset to one row per user and tokenized all the subreddits where each user has posted a submission, using the TreebankWordTokenizer from the nltk library (a sketch of building this per-user "document" frame follows).
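The doc_df frame used below isn’t constructed in this post; one plausible way to build it (an assumption on my part) is to join each user's subreddit names into one space-separated string per row:

# One row per user; the 'subreddit' column becomes a space-separated "document"
# of every subreddit the user has submitted to (with repetition)
doc_df = (reddit_df.groupby('username')['subreddit']
          .apply(lambda subs: ' '.join(str(s) for s in subs))
          .reset_index())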
tokenizer = TreebankWordTokenizer()
document = doc_df.iloc[:, 1]
document = document.apply(lambda row: tokenizer.tokenize(row))
document.head()
0 [Testosterone, Testosterone, Testosterone, Tes...
1 [DestinyTheGame, DestinyTheGame, DestinyTheGam...
2 [AceAttorney, AceAttorney, AceAttorney, AceAtt...
3 [LGBTeens, Patriots, asktransgender, Patriots,...
4 [tdi, tdi, tdi, AskReddit, tdi, tdi, tdi, tdi,...
Name: subreddit, dtype: object
Creating User-Subreddit Matrix
Once we have tokenized the subreddits, it's time to create our matrix for the SVD. The matrix is very sparse, i.e. it contains many zeros, because there are many user-subreddit combinations where the user has never posted a submission. A large, sparse matrix like this can take a long time to process, so I use the sparse matrix format from the scipy library, which stores the matrix in far less memory.
corpus_of_subs = []
for subreddits in subreddit:
    corpus_of_subs.append(subreddits)

voc2id = dict(zip(corpus_of_subs, range(len(corpus_of_subs))))

rows, cols, vals = [], [], []
for r, d in enumerate(document):
    for e in d:
        if voc2id.get(e) is not None:
            rows.append(r)
            cols.append(voc2id[e])
            vals.append(1)
user_subreddit_matrix = csc_matrix((vals, (rows, cols)), dtype=np.float32)
print((user_subreddit_matrix.shape))
(14999, 29280)
Now our matrix is ready, and we can run the SVD and generate recommendations from the full matrix.
SVD Function
def computeSVD(user_subreddit_matrix, no_of_latent_factors):
    """Compute the SVD of the given matrix.
    :user_subreddit_matrix: a numeric matrix
    :no_of_latent_factors : numeric scalar value
    :U : User to concept matrix
    :S : Strength of the concepts matrix
    :Vt : Subreddit to concept matrix
    """
    U, s, Vt = sparsesvd(user_subreddit_matrix, no_of_latent_factors)
    dim = (len(s), len(s))
    S = np.zeros(dim, dtype=np.float32)
    for i in range(0, len(s)):
        S[i, i] = mt.sqrt(s[i])
    U = csc_matrix(np.transpose(U), dtype=np.float32)
    S = csc_matrix(S, dtype=np.float32)
    Vt = csc_matrix(Vt, dtype=np.float32)
    return U, S, Vt
Recommendation Output Function
# Compute estimated recommendations for the given user
def computeEstimatedRecommendation(U, S, Vt, uTest):
    """Compute the recommendation for the given user.
    :U : User to concept matrix
    :S : Strength of the concepts matrix
    :Vt : Subreddit to concept matrix
    :uTest : Index of the user for which the recommendation has to be made
    :recom : List of recommendations made to the user
    """
    # constants defining the dimensions of the estimated rating matrix
    MAX_PID = len(subreddit)
    MAX_UID = len(user)
    rightTerm = S * Vt
    EstimatedRecommendation = np.zeros(shape=(MAX_UID, MAX_PID), dtype=np.float16)
    for userTest in uTest:
        prod = U[userTest, :] * rightTerm
        # Converting the vector to dense format in order to get the indices
        # of the subreddits with the best estimated ratings
        EstimatedRecommendation[userTest, :] = prod.todense()
        recom = (-EstimatedRecommendation[userTest, :]).argsort()[:293]
    return recom
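computeEstimatedRecommendation returns column indices into the matrix. The demos below print subreddit names and skip subreddits the user has already posted in; a small sketch of that last step (the helper below is my own illustration, assuming subreddit is the array of unique subreddit names in column order):

def printRecommendations(recom, user_history, subreddit, n=5):
    # Map matrix column indices back to subreddit names and print the top n
    # that the user has not already posted in
    shown = 0
    for idx in recom:
        name = str(subreddit[idx])
        if name not in user_history:
            print(name)
            shown += 1
            if shown == n:
                break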
Below are some recommendation demos based on the output of our SVD model using the entire dataset of ~15K users and ~29K subreddits.
Recommendation Demo 1
no_of_latent_factors = 293
uTest = [np.where(user == 'CarnationsPls')[0][0]]
U, S, Vt = computeSVD(user_subreddit_matrix, no_of_latent_factors)
recom = computeEstimatedRecommendation(U, S, Vt, uTest)
------------------------------------------------------------------------------------
Redditor: CarnationsPls
------------------------------------------------------------------------------------
User Subreddit History :
sports
gaming
gifs
AskReddit
fo4
todayilearned
TheLastAirbender
realrule34
------------------------------------------------------------------------------------
Recommendation for CarnationsPls :
Calgary
Amd
pcgaming
techsupport
NoMansSkyTheGame
------------------------------------------------------------------------------------
Recommendation Demo 2
uTest = [np.where(user == 'comicfan815')[0][0]]
U, S, Vt = computeSVD(user_subreddit_matrix, no_of_latent_factors)
recom = computeEstimatedRecommendation(U, S, Vt, uTest)
------------------------------------------------------------------------------------
Redditor: comicfan815
------------------------------------------------------------------------------------
User Subreddit History :
reactiongifs
cringepics
nba
lakers
NBA2k
------------------------------------------------------------------------------------
Recommendation for comicfan815 :
rockets
warriors
bostonceltics
sixers
torontoraptors
------------------------------------------------------------------------------------
Conclusion
SVD is definitely a great technique for collaborative filtering, and it handles large datasets well because it can isolate the most important features to use for prediction. There is much more we can do with SVD to produce better collaborative filtering predictions, including using gradient descent to minimize the error produced by the SVD model. While this project is only a POC (proof of concept) and I am only using the off-the-shelf SVD routines from the scipy and sparsesvd libraries, I hope to combine this method with other techniques such as the Restricted Boltzmann Machine (RBM) in future work.
References
Hutchinson, A. (2018, April 20). Reddit Now Has as Many Users as Twitter, and Far Higher Engagement Rates. Retrieved from https://www.socialmediatoday.com/news/reddit-now-has-as-many-users-as-twitter-and-far-higher-engagement-rates/521789/.
Baxter, J. (2016). A comparative analysis of subreddit recommenders for Reddit. Retrieved from http://jaybaxter.net/redditrecommender.pdf
Salakhutdinov, R. & Mnih, A. & Hinton G. (2007). Restricted Boltzmann Machines for Collaborative Filtering. Retrieved from https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf
Thank you for reading! Please feel free to contact me directly with any comments, feedback, or suggestions. You can leave a comment below as well!
Analysis and Report Written by : Jennifer Siwu