Simke DS

How to build a basic recommender system for movies

Introduction

The goal of a recommender system is to generate meaningful recommendations to a collection of users for items or products that might interest them. Suggestions for books on Amazon, or movies on Netflix, are real-world examples of industry-strength recommender systems in operation. The design of such recommendation engines depends on the domain and the particular characteristics of the available data. Recommender systems differ in the way they analyze these data sources to develop notions of affinity between users and items, which can be used to identify well-matched pairs. Collaborative filtering systems analyze historical interactions alone, content-based filtering systems rely on profile attributes, and hybrid techniques attempt to combine both designs. In this example, both approaches will be showcased.

Data

The MovieLens movie-ratings dataset was used.

Libraries

Libraries used: pandas, math, scipy, os, numpy, csv, itertools, sklearn
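
For reference, here is a minimal set of imports matching the code shown below; csv, math, os, and itertools are standard-library modules used elsewhere in the project.

import pandas as pd
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity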

Collaborative RS

Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something, such as a video, a book or a product. Collaborative filtering is also known as social filtering.

Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.

One such collaborative algorithm is Slope One. For every pair of items (i, j) it precomputes the average deviation dev(j, i), i.e. the mean of r_j − r_i over all users who rated both items, together with the number of such co-raters. A user's rating for item j is then predicted by adding dev(j, i) to their rating of each item i they did rate, and averaging these estimates weighted by the co-rater counts. That is exactly what the class below does.
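
As a toy illustration with made-up numbers (not from the dataset):

# Toy Slope One example with hypothetical ratings.
# User A: item i = 4, item j = 5; user B: item i = 3, item j = 4.
dev_j_i = ((5 - 4) + (4 - 3)) / 2   # average deviation of j over i = 1.0

# A new user who rated item i as 2 is predicted to rate item j at:
predicted_j = 2 + dev_j_i           # 3.0
print(predicted_j)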

class SlopeOnePredictor():
    def __init__(self, user):
        self.user = user

    def fit(self):
        # Build a user x movie rating matrix; missing ratings become 0.
        ratings = pd.read_csv('PROCESSEDRATINGDATA1.csv')
        baba = ratings.pivot_table(index=['userID'], columns=['movieID'], values='rating')
        baba.fillna(0, inplace=True)
        return self.sloppy_one(baba)

    # "Obrada podataka" (data processing): average rating deviation between
    # items `one` and `two` over users who rated both, plus the co-rater count.
    def obradapodataka(self, ff, one, two):
        p1 = ff[ff[one] != 0]
        l1 = p1[p1[two] != 0]
        k1 = len(l1)
        if k1 == 0:                      # no users rated both items
            return 0, 0
        qq = (l1[one].sum() - l1[two].sum()) / k1
        return qq, k1

    def sloppy_one(self, scp):
        # Pairwise deviation and cardinality matrices over all item pairs.
        dev = pd.DataFrame(index=scp.columns, columns=scp.columns)
        cardinality = pd.DataFrame(index=scp.columns, columns=scp.columns)
        for columnName, columnData in scp.items():       # iteritems() was removed in pandas 2.0
            for columnName1, columnData1 in scp.items():
                if columnName1 != columnName:
                    df = pd.concat([scp[columnName1], scp[columnName]], axis=1)
                    t1, t2 = self.obradapodataka(df, columnName, columnName1)
                    dev.loc[columnName1, columnName] = t1          # dev.loc[i, j] = mean(r_j - r_i)
                    cardinality.loc[columnName1, columnName] = t2  # number of co-raters of i and j
        return self.nadjusera(scp, dev, cardinality)

    # "Nadji usera" (find the user): predict this user's score for every movie.
    def nadjusera(self, scp, dev, car):
        userid = self.user
        dd = scp.transpose()
        nein2 = scp.transpose()
        dd[userid] = dd[userid].replace([0], np.nan)     # treat 0 as "not rated"
        for columnName, columnData in scp.items():
            # Estimate the rating for this movie from every movie the user
            # did rate: (r_i + dev(j, i)) weighted by the co-rater count.
            nein = pd.DataFrame()                        # fresh frame per movie
            nein['rat'] = (dd[userid] + dev[columnName]) * car[columnName]
            nein['card'] = car[columnName]
            nein.fillna(0, inplace=True)
            nein = nein[nein['rat'] != 0]
            ses = nein['card'].sum()
            ses2 = nein['rat'].sum()
            nein2.loc[columnName, userid] = ses2 / ses if ses else np.nan
        kk = nein2[userid].sort_values(ascending=False)
        k1 = pd.DataFrame(columns=[userid])
        k1[userid] = kk
        k1[k1 > 5] = 5                                   # cap at the rating-scale maximum of 5
        k1.columns = ['Score']
        return k1
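
A minimal usage sketch, assuming PROCESSEDRATINGDATA1.csv is in the working directory with userID, movieID and rating columns (the user ID below is made up):

predictor = SlopeOnePredictor(user=1)
scores = predictor.fit()    # DataFrame of predicted scores, best first
print(scores.head(10))      # top 10 recommendations for user 1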

There is an article that explains Slope One nicely: https://medium.com/@andresespinosapc/today-im-writing-about-slope-one-predictors-for-online-rating-based-collaborative-filtering-by-3c06677f75c6

Content-based predictor

A content-based recommender works from data we collect about the user, either explicitly (ratings) or implicitly (clicks on links). From this data we build a user profile, which is then used to make suggestions; as the user provides more input or acts on more recommendations, the engine becomes more accurate.
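
To make the profile idea concrete, here is a minimal, hypothetical sketch using made-up one-hot genre features and ratings (using the numpy and sklearn imports listed above); note that the ItemBasedPredictor below instead derives item-to-item similarities from the ratings matrix:

# Hypothetical one-hot genre features for four movies: [action, comedy, drama]
features = np.array([[1, 0, 0],
                     [1, 1, 0],
                     [0, 0, 1],
                     [0, 1, 0]])
user_ratings = np.array([5, 4, 1, 0])   # 0 = not rated

# Profile = rating-weighted average of the feature vectors of rated movies.
rated = user_ratings > 0
profile = (features[rated] * user_ratings[rated, None]).sum(axis=0) / user_ratings[rated].sum()

# Score all movies by cosine similarity to the profile; the unrated comedy
# gets ranked by how well it matches what the user liked so far.
scores = cosine_similarity(profile.reshape(1, -1), features)[0]
print(scores)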

class ItemBasedPredictor():
    def __init__(self, usr, threshold=0, min_values=0):
        self.threshold = threshold
        self.min_values = min_values
        self.corrMatrix = pd.DataFrame()
        self.userID1 = usr
        # min_values is handled upstream: the processed data keeps only movies
        # with 1000+ reviews, so fitting doesn't take forever to run.

    def fit(self):
        ratings = pd.read_csv('processed3_data.csv')
        moviedata = pd.read_csv('movies.csv')
        ratings = pd.merge(moviedata, ratings, left_on='id', right_on='movieID')
        baba = ratings.pivot_table(index=['userID'], columns=['movieID'], values='rating')
        # Mean-center each user's ratings, then compute item-item cosine similarity.
        biba = baba.sub(baba.mean(axis=1), axis=0)
        biba.fillna(0, inplace=True)
        sparse_df = sparse.csr_matrix(biba.values)
        scp = sparse_df.transpose()
        self.corrMatrix = pd.DataFrame(cosine_similarity(scp), index=biba.columns, columns=biba.columns)
        self.corrMatrix[self.corrMatrix < 0] = 0                # drop negative similarities
        self.corrMatrix[self.corrMatrix < self.threshold] = 0   # and anything below the threshold
        ratings = pd.read_csv('processed3_data.csv')
        rd = ratings[ratings.userID == self.userID1]
        # Score every movie as a similarity-weighted average of the user's ratings.
        num = 0
        den = 0
        for index, row in rd.iterrows():
            num = num + self.corrMatrix[int(row['movieID'])] * row['rating']
            den = den + self.corrMatrix[int(row['movieID'])]
        k1 = pd.DataFrame(columns=['Score'])
        k1['Score'] = num / den
        return k1

    def predict(self, userID1, movieID1):
        # Predict a single movie's score for the user fixed at construction time.
        ratings = pd.read_csv('processed3_data.csv')
        rd = ratings[ratings.userID == self.userID1]
        num = 0
        den = 0
        for index, row in rd.iterrows():
            num = num + self.corrMatrix[int(row['movieID'])] * row['rating']
            den = den + self.corrMatrix[int(row['movieID'])]
        return (num / den)[movieID1]

    def similarity(self, p1, p2):
        print(self.corrMatrix[p1][p2])

    def similarItems(self, item, n):
        df = self.corrMatrix[item].copy()   # copy so we don't zero the matrix itself
        df[df > 0.999] = 0                  # mask the item's similarity with itself
        dd = df.sort_values(ascending=False)
        print(dd.head(n))

    def mostSimilar(self):
        df = self.corrMatrix.copy()         # copy so we don't zero the matrix itself
        df[df > 0.999] = 0                  # mask the diagonal (self-similarity)
        dd = df.unstack()
        dd1 = dd.sort_values(kind="quicksort", ascending=False)
        print(dd1.head(20))
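
A minimal usage sketch, assuming processed3_data.csv and movies.csv are in the working directory (the user and movie IDs below are made up):

rec = ItemBasedPredictor(usr=1, threshold=0)
scores = rec.fit()                  # similarity-weighted score for every movie
print(scores.sort_values('Score', ascending=False).head(10))
rec.similarItems(2571, 5)           # 5 movies most similar to movie 2571
print(rec.predict(1, 2571))         # predicted rating for movie 2571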

There is an article explaining content-based recommender systems: https://www.analyticssteps.com/blogs/what-content-based-recommendation-system-machine-learning.

Conclusion

After building these two algorithms for a recommender system, we can either combine them in a hybrid approach, using both at the same time, or use them individually (a minimal blend is sketched below). RSs are important because they reduce the transaction costs of finding and selecting items in an online shopping environment, and they have also been shown to improve the decision-making process and its quality. Every major online streaming service, retailer, and social media platform is enhanced by the use of RSs. They are a key element for retaining customers and drawing them back in; without RSs, we wouldn't have big platforms like YouTube, Instagram, etc.
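
A hybrid can be as simple as blending the two score tables; a minimal sketch, assuming both predictors were fit over the same set of movies (the 0.5/0.5 weights are arbitrary):

slope_scores = SlopeOnePredictor(user=1).fit()
item_scores = ItemBasedPredictor(usr=1).fit()

# Align on movieID and average the two 'Score' columns.
hybrid = 0.5 * slope_scores['Score'] + 0.5 * item_scores['Score']
print(hybrid.sort_values(ascending=False).head(10))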