Introduction
The goal of a recommender system is to generate meaningful recommendations to a collection of users for items or products that might interest them. Suggestions for books on Amazon, or movies on Netflix, are real-world examples of the operation of industry-strength recommender systems. The design of such recommendation engines depends on the domain and the particular characteristics of the data available.
Recommender systems differ in the way they analyze these data sources to develop notions of affinity between users and items, which can be used to identify well-matched pairs. Collaborative Filtering systems analyze historical interactions alone, while Content-based Filtering systems are based on profile attributes; and hybrid techniques attempt to combine both of these designs. In this example both algorithms will be showcased.
Data
Movielens dataset for movies was used.
Libraries
Libraries used:
Pandas
Math
Scipy
Os
Numpy
Csv
Itertools
Sklearn
Collaborative RS
Collaborative filtering is the predictive process behind recommendation engines. Recommendation engines analyze information about users with similar tastes to assess the probability that a target individual will enjoy something, such as a video, a book or a product. Collaborative filtering is also known as social filtering.
Collaborative filtering uses algorithms to filter data from user reviews to make personalized recommendations for users with similar preferences. Collaborative filtering is also used to select content and advertising for individuals on social media.
One of the collaborative algorithms is the SlopeOne algorithm.
class SlopeOnePredictor():
def __init__(self,user):
self.user=user
def fit(self):
ratings = pd.read_csv('PROCESSEDRATINGDATA1.csv')
baba = ratings.pivot_table(index=['userID'],columns=['movieID'],values='rating')
baba.fillna(0, inplace=True)
return(self.sloppy_one(baba))
def obradapodataka(self,ff,one,two):
p1=ff[ff[one]!=0]
l1=p1[p1[two]!=0]
k1=len(l1)
qq=((l1[one].sum()-l1[two].sum())/k1)
return(qq,k1)
def sloppy_one(self,scp):
dev = pd.DataFrame(index=scp.columns,columns=scp.columns)
cardinality= pd.DataFrame(index=scp.columns,columns=scp.columns)
for (columnName, columnData) in scp.iteritems():
for (columnName1, columnData1) in scp.iteritems():
if(columnName1!=columnName):
ff=scp[columnName1]
fq=scp[columnName]
df = pd.concat([ff, fq], axis=1)
t1,t2=self.obradapodataka(df,columnName,columnName1)
dev[columnName1][columnName]=t1
cardinality[columnName1][columnName]=t2
qq=self.nadjusera(scp,dev,cardinality)
return(qq)
def nadjusera(self,scp,dev,car):
userid=self.user
dd=scp.transpose()
dd2=scp.transpose()
nein= pd.DataFrame()
nein2 = scp.transpose()
dd[userid] = dd[userid].replace([0],np.nan)
for (columnName, columnData) in scp.iteritems():
ff=(dd[userid]+dev[columnName])*car[columnName]
nein['rat']=ff
nein['card']=car[columnName]
nein.fillna(0,inplace=True)
nein=nein[nein['rat']!= 0]
ses=nein['card'].sum()
ses2=nein['rat'].sum()
nein2[userid][columnName]=(ses2/ses)
kk=nein2[userid].sort_values(ascending=False)
k1=pd.DataFrame(columns=[userid])
k1[userid]=kk
k1[k1>5]=5
k1.columns = ['Score']
return(k1)
There is an article which explains slope one nicely: https://medium.com/@andresespinosapc/today-im-writing-about-slope-one-predictors-for-online-rating-based-collaborative-filtering-by-3c06677f75c6
Content based predictor
A Content-Based Recommender works by the data that we take from the user, either explicitly (rating) or implicitly (clicking on a link). By the data we create a user profile, which is then used to suggest to the user, as the user provides more input or take more actions on the recommendation, the engine becomes more accurate.
class ItemBasedPredictor():
def __init__(self, usr,threshold=0 , min_values=0):
self.threshold = threshold
self.min_values = min_values
self.corrMatrix = pd.DataFrame()
self.userID1 = usr
#solved problem of min values by taking only movies with 1000+ reviews so it doesn't take forever to run
def fit(self):
ratings = pd.read_csv('processed3_data.csv')
moviedata= pd.read_csv('movies.csv')
ratings = pd.merge(moviedata,ratings,left_on='id', right_on= 'movieID')
baba = ratings.pivot_table(index=['userID'],columns=['movieID'],values='rating')
biba=baba.sub(baba.mean(axis=1), axis=0)
biba.fillna(0, inplace=True)
sparse_df = sparse.csr_matrix(biba.values)
scp=sparse_df.transpose()
self.corrMatrix = pd.DataFrame(cosine_similarity(scp),index=biba.columns,columns=biba.columns)
self.corrMatrix[self.corrMatrix < 0]=0
self.corrMatrix[self.corrMatrix < self.threshold]=0
ratings = pd.read_csv('processed3_data.csv')
rd=ratings[ratings.userID == self.userID1]
sum=0
sum1=0
for index, row in rd.iterrows():
sum=sum+self.corrMatrix[int(row[1])]*row[2]
sum1=sum1+self.corrMatrix[int(row[1])]
k1=pd.DataFrame(columns=['Score'])
k1['Score']=sum/sum1
return (k1)
def predict(self,userID1,movieID1):
moviedata= pd.read_csv('movies.csv')
md=moviedata[moviedata.id == movieID1]
md1=str(*md['title'])
ratings = pd.read_csv('processed3_data.csv')
rd=ratings[ratings.userID == self.userID1]
sum=0
sum1=0
for index, row in rd.iterrows():
sum=sum+self.corrMatrix[int(row[1])]*row[2]
sum1=sum1+self.corrMatrix[int(row[1])]
return (sum/sum1)[movieID1]
def similarity(self, p1, p2):
print(self.CM[p1][p2])
def similarItems(self, item, n):
df=self.corrMatrix[item]
df[df>0.999]= 0
dd=df.sort_values(ascending=False)
print(dd.head(n))
def mostSimilar(self):
df=self.corrMatrix
df[df>0.999]= 0
dd=df.unstack()
dd1=dd.sort_values(kind="quicksort",ascending=False)
print(dd1.head(20))
There is article explaining CBRS: https://www.analyticssteps.com/blogs/what-content-based-recommendation-system-machine-learning.
Conclusion
After bulding these two algorithms for RS, we can then either create hybrid approach of using both algorithms at same time or we can use them individually. The reason RSs are important is because they reduce transaction costs of finding and selecting items in an online shopping environment and they have also proved to improve decision making process and quality. Every bigger online streaming service, retailer, social media, etc is enhanced by use of RSs. They are key element for keeping the customers and drawing them back in, so without RS, we wouldn’t have big platforms like Youtube, IG, etc.