
Developing a Recommender System for Movie Recommendation in Python

“People who bought item X also bought item Y.” Whenever you buy something from Amazon or eBay, you have probably seen suggestions like these. Ever wondered how Amazon and eBay find related products for their users? The answer lies in a class of statistical algorithms known as recommender systems or recommendation engines.

A recommender system is a system that tries to find similarities between products, or between the users who purchased these products, on the basis of certain characteristics. There are two intuitions behind recommender systems:

If a user buys a certain product, they are likely to buy another product with similar characteristics.
If multiple users buy a set of products together, then a new user may also buy that set of products.

Owing to their huge success, more and more companies now employ recommender systems for different types of recommendations. For instance, YouTube uses them to recommend videos, Facebook uses them to recommend friends, and Netflix uses them to recommend movies.

Recommender System Types

Depending on whether they rely on product attributes or on user behavior, recommender systems can be broadly divided into two categories:

Content-Based Filtering

Recommender systems based on content filtering find similarities between products based on their attributes. The intuition behind content-based filtering is that if a person buys a product A with characteristics X, Y, and Z, they are likely to buy another product B with the same characteristics.
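As a rough illustration of this idea, the sketch below ranks products by how many attributes they share with a target product. The catalogue, the attribute names, and the choice of Jaccard similarity are all assumptions made for the example; real content-based systems typically use richer features.

```python
# Hypothetical catalogue: each product is described by a set of attributes.
products = {
    "A": {"X", "Y", "Z"},
    "B": {"X", "Y", "Z"},
    "C": {"X", "W"},
    "D": {"V", "W"},
}


def jaccard(a, b):
    """Share of attributes two products have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_products(target, catalogue):
    """Rank every other product by attribute overlap with `target`."""
    scores = {
        name: jaccard(catalogue[target], attrs)
        for name, attrs in catalogue.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = similar_products("A", products)
print(ranking)
```

Product B shares every attribute with A and therefore ranks first, while D shares none and ranks last.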

Collaborative Filtering

Collaborative filtering depends on user choices. The intuition behind collaborative filtering is that if a user X buys products A, B, and C, then another user Y who buys products A and B is also likely to buy product C.
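The following minimal sketch mirrors that example with made-up purchase histories. The `recommend` helper and its overlap-based weighting are illustrative assumptions, not a standard algorithm from any library.

```python
from collections import Counter

# Hypothetical purchase histories, one set of products per user.
purchases = {
    "X": {"A", "B", "C"},
    "Y": {"A", "B"},
    "Z": {"D"},
}


def recommend(user, histories):
    """Suggest products bought by users who share purchases with `user`."""
    own = histories[user]
    votes = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(own & items)
        if overlap == 0:
            continue  # no shared purchases, so this user tells us nothing
        for item in items - own:
            votes[item] += overlap  # weight by how much the two users agree
    return [item for item, _ in votes.most_common()]


suggestions = recommend("Y", purchases)
print(suggestions)
```

User Y shares products A and B with user X, so the only suggestion is product C. User Z shares nothing with anyone and gets no suggestions.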

In this article, we will see how we can develop a movie recommender system based on collaborative filtering. Our recommendation system will find similarities between movies based on user ratings.

Movie Recommendation based on User Ratings

In this section, we will develop a very simple movie recommendation system in Python. The intuition behind it is as follows: if user A gives similar ratings to movies X and Y, and another user B has watched movie X and rated it the same way user A did, then user B is likely to enjoy movie Y as well.

Importing Required Libraries

As always, the first step is to import the libraries required. Look at the following script:

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

sns.set_style('dark')

%matplotlib inline

In the script above we import the pandas, numpy, matplotlib, and the seaborn libraries.

Importing the Dataset

We will use the MovieLens dataset to develop our recommender system. The dataset can be freely downloaded from this link. The file that you will need to download is the “ml-latest-small.zip”. It contains 100,000 reviews by 600 users for over 9000 different movies. Download and extract the file. You will see the following files in the folder:

To develop our system, we only need movies.csv and ratings.csv files.

Let’s first import the ratings.csv file and see what it looks like:

movie_ratings = pd.read_csv(r"G:\ml-latest-small\ratings.csv")

movie_ratings.head()

In the script above, we imported the ratings.csv file using the “read_csv” method of the pandas library and then called the “head” method to see its first five records. The output looks like this:

From the output, you can see that the dataset has four columns: userId, which contains the id of the user; movieId, which contains the id of the movie; rating, which ranges from 0.5 to 5 in half-star steps, with 5 being the highest rating; and timestamp, which records when the user left the rating. In place of movieId we need the name of the movie.

The movie names are stored in the movies.csv file. Let’s now import that file.

movie_names_data = pd.read_csv(r"G:\ml-latest-small\movies.csv")

movie_names_data.head()

In the script above, we import the movies.csv file and display its first 5 records. The output looks like this:

The movies.csv file contains three columns: movieId, title, and genres. We need a dataset that contains the movie titles as well as the ratings. We can create such a dataset by merging ratings.csv and movies.csv using movieId as the common column. Look at the following script:

complete_movie_dataset = pd.merge(movie_ratings , movie_names_data , on='movieId')

complete_movie_dataset.head()

In the script above, we create a new dataset complete_movie_dataset which contains the following columns:

You now see that our dataset has userId, the title of the movie as well as the rating for each movie.
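To see what the merge does without the real files, here is a small sketch on toy data. The frames `toy_ratings` and `toy_movies` and their values are invented for illustration; only the column names follow the MovieLens files.

```python
import pandas as pd

# Toy stand-ins for ratings.csv and movies.csv (values are illustrative only).
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.0, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A (2000)", "Movie B (2001)", "Movie C (2002)"],
})

# pd.merge defaults to an inner join: only movieIds present in both frames survive.
toy_merged = pd.merge(toy_ratings, toy_movies, on="movieId")
print(toy_merged)
```

Each rating row gains the matching title, and "Movie C (2002)" drops out because nobody rated it.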

Note: It is important to mention that the dataset is updated regularly by the owner of the dataset and you might see different movies in the list.

Exploratory Data Analysis

Let’s explore the dataset a bit and see if we can find any patterns. Let’s first see the number of records for each rating. Execute the following script:

sns.countplot(x='rating', data=complete_movie_dataset)

The output looks like this:

From the output, you can see that most of the ratings are 3 or 4.

Let’s now take a look at average ratings for all the movies. To do so, we can use the groupby method of the pandas dataframe, which can be used to perform aggregate operations on the dataset. We will group the records by the title of the movie and then use the mean function to find the average ratings for the movie. Look at the following script:

complete_movie_dataset.groupby('title')['rating'].mean().head()

The output looks like this:

Let’s sort the records in the descending order of the average ratings:

complete_movie_dataset.groupby('title')['rating'].mean().sort_values(ascending=False).head()

The output looks like this:

From the output, you can see that the movies with an average rating of five are not well known, which looks strange at first glance. The reason is that a movie with only a single five-star rating makes it to the top of the list. The mean alone is therefore not a good indicator of whether a movie is good. We also need the number of ratings, since famous movies usually receive more ratings than obscure ones. Let’s find the number of ratings for each movie and display the results in descending order:

complete_movie_dataset.groupby('title')['rating'].count().sort_values(ascending=False).head()

The output looks like this:

The output contains the number of ratings. However, we need both average rating and the number of ratings per movie. Let’s create a new dataset which contains both of these attributes:

# Average rating per title
mean_count_rating = pd.DataFrame(complete_movie_dataset.groupby('title')['rating'].mean())

# Number of ratings per title, assigned as a Series so it aligns on the title index
mean_count_rating['Rating_Counts'] = complete_movie_dataset.groupby('title')['rating'].count()

Now if you print the mean_count_rating dataframe, you should see the following results:

You can see that the dataset is indexed by movie title and contains two columns: the average rating (rating) and the number of ratings (Rating_Counts).
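The sketch below reproduces the effect on a tiny, made-up dataset: a movie with a single perfect rating outranks a widely rated one when sorting by mean alone. It also shows `agg`, which can compute both statistics in one pass; using it here is an alternative to the two-step approach above, not something the original script does.

```python
import pandas as pd

# Toy ratings: "Obscure Film" has one perfect score, "Popular Film" has four ratings.
toy = pd.DataFrame({
    "title": ["Obscure Film"] + ["Popular Film"] * 4,
    "rating": [5.0, 4.0, 5.0, 4.0, 3.0],
})

# Passing a list of function names yields one column per statistic.
summary = toy.groupby("title")["rating"].agg(["mean", "count"])
print(summary.sort_values("mean", ascending=False))
```

Sorting by the mean puts "Obscure Film" first even though it rests on a single rating, which is why the count column is needed.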

Developing a Recommendation System

Enough exploratory data analysis. Let’s now see how we can develop a very simple recommender system based on similarities between movies. To find the similarities, we will calculate correlations between the ratings of the movies. To do so, we need a new dataset in which each movie title is a column and each user is a row. We need the data in this form because pandas has built-in methods for computing the correlation between columns.

Let’s create such a dataset:

movie_rating_user = complete_movie_dataset.pivot_table(index='userId', columns='title', values='rating')

movie_rating_user.head()

The output looks like this:

From the output, you can see user ids in the row headers and movie titles in the column headers. Most of the values are NaN because each user has rated only a small fraction of the movies, and the users shown here have not rated the movies in the first few columns. If you look at the complete dataset, you will find ratings for some of the movies.
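The reshaping step can be tried on a handful of made-up ratings. The frame `toy_long` below is hypothetical; it only borrows the column names used in this article.

```python
import pandas as pd

# Toy long-format ratings: one row per (user, movie) pair.
toy_long = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# One row per user, one column per title; unrated combinations become NaN.
toy_wide = toy_long.pivot_table(index="userId", columns="title", values="rating")
print(toy_wide)

# Fraction of user/movie cells that have no rating.
missing_share = toy_wide.isna().to_numpy().mean()
print(missing_share)
```

Even in this tiny table, four of the nine cells are empty. In the real dataset the share of missing values is far higher, because every user rates only a few of the thousands of movies.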

Now let’s suppose we want to recommend movies to users who have watched “The Shawshank Redemption (1994)”, which is listed in the dataset as “Shawshank Redemption, The (1994)”. First, we have to retrieve all the ratings for the movie. The following script does that:

shawshank_redumption_ratings = movie_rating_user['Shawshank Redemption, The (1994)']

Next, we can use the “corrwith” method to compute the correlation of every movie with “The Shawshank Redemption (1994)”, based on the user ratings. The following script does that:

movies_like_shawshank_redumption = movie_rating_user.corrwith(shawshank_redumption_ratings)

movies_like_shawshank_redumption_corr = pd.DataFrame(movies_like_shawshank_redumption, columns=['Correlation'])

movies_like_shawshank_redumption_corr.dropna(inplace=True)

movies_like_shawshank_redumption_corr.head()

In the output, you can see a few movies along with their correlation with “The Shawshank Redemption (1994)”. Let’s see the movies with the highest correlation with “The Shawshank Redemption (1994)”.

movies_like_shawshank_redumption_corr.sort_values('Correlation', ascending=False).head(5)

From the output, you can see that the most highly correlated movies are not very famous, whereas “The Shawshank Redemption (1994)” is a very famous movie. We would expect the movies that correlate with it to be well known too.
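To make the corrwith step concrete, the sketch below runs it on a hypothetical four-user table. The column names and ratings are invented. Note how a movie rated by only two of the target's raters ends up with a correlation of exactly 1.0, because two points always lie on a straight line.

```python
import numpy as np
import pandas as pd

# Toy user-by-title table; NaN means the user did not rate the movie.
toy_wide = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0],
        "Widely Rated": [4.0, 5.0, 2.0, 2.0],
        "Barely Rated": [np.nan, 3.0, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)

# Pearson correlation of every column with the target column,
# computed only over users who rated both movies.
toy_corr = toy_wide.corrwith(toy_wide["Target"])
print(toy_corr.sort_values(ascending=False))

# Number of users who rated both the target and each other movie.
shared = toy_wide[toy_wide["Target"].notna()].count()
print(shared)
```

"Barely Rated" outranks "Widely Rated" even though its score rests on just two shared raters, which is the same pattern seen in the real output.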

One reason for these high correlations is that such movies may have only one or a handful of ratings, and those few ratings happen to line up exactly with the same users’ ratings of “The Shawshank Redemption (1994)”. Let’s keep only the movies that have more than 100 ratings. To do that, we first have to add the “Rating_Counts” column to the correlation dataset. The following script does that:

movies_like_shawshank_redumption_corr = movies_like_shawshank_redumption_corr.join(mean_count_rating['Rating_Counts'])

movies_like_shawshank_redumption_corr.head()

Now let’s keep only the movies with more than 100 ratings and sort them by their correlation with “The Shawshank Redemption (1994)”:

movies_like_shawshank_redumption_corr[movies_like_shawshank_redumption_corr['Rating_Counts']>100].sort_values('Correlation', ascending=False).head()

In the output, the following movies will be displayed:

The movie “The Shawshank Redemption (1994)” is displayed at the top because, of course, it has the highest correlation with itself. The rest of the movies are also very famous. These are the movies that will be recommended to users who have watched “The Shawshank Redemption (1994)”.
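The steps in this section can be wrapped into one reusable function. The sketch below is one possible way to do it, not part of the original script: the name `similar_movies` and its parameters are hypothetical, and it counts ratings directly from the user-by-title table instead of joining with mean_count_rating. It also drops the target movie from its own results.

```python
import numpy as np
import pandas as pd


def similar_movies(ratings_wide, title, min_ratings=100, top_n=5):
    """Return the titles most correlated with `title`, ignoring rarely rated movies.

    `ratings_wide` is a user-by-title table such as movie_rating_user.
    """
    correlations = ratings_wide.corrwith(ratings_wide[title]).rename("Correlation")
    # Number of users with a rating for each title.
    counts = ratings_wide.count().rename("Rating_Counts")
    result = pd.concat([correlations, counts], axis=1).dropna(subset=["Correlation"])
    result = result[result["Rating_Counts"] > min_ratings]
    result = result.drop(index=title, errors="ignore")  # skip the movie itself
    return result.sort_values("Correlation", ascending=False).head(top_n)


# Toy table (invented values) to exercise the function with a low threshold.
toy = pd.DataFrame(
    {
        "Target": [5.0, 4.0, 2.0, 1.0],
        "Similar": [4.0, 5.0, 2.0, 2.0],
        "Opposite": [1.0, 2.0, 4.0, 5.0],
        "Rare": [np.nan, 3.0, np.nan, 2.0],
    },
    index=pd.Index([1, 2, 3, 4], name="userId"),
)
top = similar_movies(toy, "Target", min_ratings=2, top_n=2)
print(top)
```

With the real data, a call such as `similar_movies(movie_rating_user, 'Shawshank Redemption, The (1994)')` should give a comparable ranking to the script above, minus the movie itself.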

Conclusion

Recommender systems are among the most useful applications of statistical learning algorithms. Large businesses have successfully employed them to boost their sales and marketing. In this article, we saw a very simple example of a recommender system and how collaborative filtering can be used to recommend movies based on the correlation between user ratings.

VSH is a leading Python development and consulting company with expertise in developing complex machine learning algorithms to provide you with an intelligent recommendation engine. Schedule an introductory call with our consultant to learn how we can add value to your business and help you serve your customers better.
