Movie recommender web application using IMDB dataset
pipenv install --dev
Populate environment variables:
$DATABASE_URL: The JDBC mongo URI to connect to
$ENVIRONMENT: ‘production’ or ‘development’
$DEBUG: ‘true’ or ‘false’
Run the application
./run.py [port]
The file data_final.json houses the 173-movie database that we’ll be using for the recommender system. These were fetched using an open API.
There is a data dump in the data/ folder that can be used to directly populate your database with the required collections.
mongorestore --db <db_name> db_dump/test
To load the database, just run the mongoimport command to read from the JSON.
mongoimport --db <db_name> --collection movies --file data_final.json --jsonArray --drop
User data is provided in the users.json file. Run gen_users.py to generate new users as well.
./gen_users.py <no_users>
mongoimport --db <db_name> --collection users --file users.json --jsonArray
A MongoDB-backed Flask application that uses Jinja for its frontend rendering.
Each user has a ratings key that contains a list of all the movies they have rated with their respective ratings. This can be used (supplemented by additional ratings and movie information) to personalize recommendations.
Each subset was looked up on imdb_top250 and the data for the movies obtained was dumped into data_final.json. Various movies were selected from the categories on the basis of their statistical share hold across the genres.
Thus 176 total movies were obtained which:
The entire dataset was injected into MongoDB with a special onehot vector which represents the frequency of each base genre occurrence for the movie. This allows for a quantized representation of how strong the influence of the movie is in our dataset.
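As a minimal sketch of how such a per-movie genre vector could be built: the base genre list below is taken from the table in this README, while the `genre_vector` helper, the `genres` field and the example movie are assumptions made for illustration, not the project's actual schema.

```python
from collections import Counter

# Assumption: the eight base genres from the table in this README, in a fixed order.
BASE_GENRES = ["Animation", "Comedy", "Crime", "Horror",
               "Action", "SciFi", "Drama", "Romance"]

def genre_vector(tags):
    """Return one count per base genre for a movie's list of genre tags."""
    counts = Counter(tags)
    return [counts[g] for g in BASE_GENRES]

# Hypothetical movie document; the field names are illustrative only.
movie = {"title": "Example Movie", "genres": ["Action", "SciFi", "Action"]}
movie["onehot"] = genre_vector(movie["genres"])
print(movie["onehot"])  # [0, 0, 0, 0, 2, 1, 0, 0]
```

Keeping the genre order fixed means vectors from different movies can be compared or summed position by position.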
12-24   12-14   14-24   24-34   Ratio    Category    Male   Female   Ratio
15      22      8       15      1        Animation   75     65       1.153846154
33.5    35      32      30      1.1166   Comedy      90     91       0.989010989
8.5     5       12      11      0.7727   Crime       79     84       0.9404761905
4       0.5     7.5     5.5     0.7272   Horror      57     47       1.212765957
40      40      40      40      1        Action      90     86       1.046511628
18      18      18      18      1        SciFi       76     62       1.225806452
23.5    20      27      27      0.8703   Drama       80     89       0.8988764045
10.5    8       13      11      0.9545   Romance     55     77       0.7142857143
These probabilities were then used to distribute the various basic genres over the 500 users.
Next, to further prepopulate the users, 12 total genres were picked from the IMDB dataset and an n x n matrix was formed denoting the probabilistic distribution of the genres with one another. Entry $A[i][j]$ represents the occurrences of tag $j$ given base tag $i$. Entry $(i, i)$ represents the probability of this tag occurring alone. The genre distribution was augmented with these probabilities and user genre sets were updated to include these as well.
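The augmentation step described above might look roughly like the following sketch. The three-genre list, the matrix values and the `augment` helper are all hypothetical and chosen only to illustrate the idea; the real matrix covers 12 genres and its values are not reproduced here.

```python
import random

GENRES = ["Action", "Comedy", "Drama"]  # hypothetical subset of the 12 genres

# Illustrative values only. Row i is the distribution of companion genres for
# base genre i; the diagonal entry is the probability that genre i occurs alone.
A = [
    [0.5, 0.2, 0.3],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]

def augment(base_genres, rng):
    """Return the base genres plus one sampled companion per base genre."""
    result = set(base_genres)
    for g in base_genres:
        i = GENRES.index(g)
        j = rng.choices(range(len(GENRES)), weights=A[i], k=1)[0]
        result.add(GENRES[j])  # j == i leaves the set unchanged (genre occurs alone)
    return sorted(result)

rng = random.Random(0)  # seeded so repeated runs give the same users
print(augment(["Action"], rng))
```

Sampling one companion per base genre is a design choice of this sketch; drawing several companions, or none, would follow the same pattern.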
json
to be imported into the database.
Note: In case the user database size was substantial ($> 10^7$), even a random genre sampling would have been okay. But since we’re looking at a relatively small system, having this “non-random” seed data will benefit the recommendation choice and give more organic results due to the natural nature of the “dummy” users.
User-user collaborative filtering is a form of collaborative filtering for recommender systems which identifies other users with similar tastes to a target user and combines their ratings to make recommendations for that user.
Karl Pearson’s correlation is used to measure how similar two users are. Because it mean-centers each user’s ratings, it normalizes user optimism (users who consistently rate high or low) and ensures equal-scaled similarity matching.
Given that our user genres have been modeled off real-world statistics, this correlation metric is representative of a real-world setting.
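A minimal sketch of the Pearson similarity between two users, computed over the movies both have rated. The `pearson` function name and the movie-id-to-rating dictionary shape are assumptions for illustration; the application's own data layout may differ.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users have rated.

    ratings_a / ratings_b map movie id -> rating (hypothetical shape).
    Returns 0.0 when there are fewer than two common movies or no variance.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(ratings_a[m] for m in common) / n
    mean_b = sum(ratings_b[m] for m in common) / n
    cov = sum((ratings_a[m] - mean_a) * (ratings_b[m] - mean_b) for m in common)
    var_a = sum((ratings_a[m] - mean_a) ** 2 for m in common)
    var_b = sum((ratings_b[m] - mean_b) ** 2 for m in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3}  # same taste, but rates one point lower
print(pearson(alice, bob))  # 1.0
```

The two example users differ only by a constant offset, so the correlation is 1.0, which is the "user optimism" normalization in action.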
Item-item collaborative filtering is a form of collaborative filtering for recommender systems based on the similarity between items, calculated using users’ ratings of those items.
Item-Item models use rating distributions per item, not per user. With more users than items, each item tends to have more ratings than each user, so an item’s average rating usually doesn’t change quickly. This leads to more stable rating distributions in the model, so the model doesn’t have to be rebuilt as often. When users consume and then rate an item, that item’s similar items are picked from the existing system model and added to the user’s recommendations.
To make a prediction for a target item i, we look at the set of items the target user has rated, compute how similar each of them is to i, and then select the k most similar items. Similarity between two items is calculated by taking the ratings of the users who have rated both items and applying a similarity function to them.
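The procedure above can be sketched as follows. Cosine similarity is used here purely as an assumption, since the text does not name the similarity function; the `RATINGS` data, `item_similarity` and `predict` are likewise hypothetical names for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item -> rating}
RATINGS = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5, "C": 2},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"A": 5, "C": 1},
}

def item_similarity(i, j, ratings):
    """Cosine similarity over users who rated both items (cosine is an assumption)."""
    pairs = [(r[i], r[j]) for r in ratings.values() if i in r and j in r]
    if not pairs:
        return 0.0
    dot = sum(a * b for a, b in pairs)
    norm_i = sqrt(sum(a * a for a, _ in pairs))
    norm_j = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_i * norm_j)

def predict(user, target, ratings, k=2):
    """Weighted average of the user's ratings on the k items most similar to target."""
    sims = [(item_similarity(target, item, ratings), rating)
            for item, rating in ratings[user].items() if item != target]
    top = sorted(sims, reverse=True)[:k]
    total = sum(s for s, _ in top)
    if total == 0:
        return 0.0
    return sum(s * r for s, r in top) / total

# u4 has not rated B; estimate it from the items u4 did rate.
print(predict("u4", "B", RATINGS))
```

Because item similarities depend only on the rating matrix, they can be computed ahead of time and reused across users, which matches the stability argument made for item-item models.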
Matrix factorization is a class of recommender system algorithms which work by decomposing the user-item interaction matrix into the product of two lower-dimensionality rectangular matrices (matrix decomposition).
The strength of matrix factorization is that it can incorporate implicit feedback: information that is not directly given but can be derived by analyzing user behavior. Using this strength, we can estimate whether a user is going to like a movie that they have never seen. If that estimated rating is high, we can recommend that movie to the user.
Technique used: SVD++
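To illustrate the decomposition idea, here is a minimal sketch of plain matrix factorization trained with stochastic gradient descent. This is deliberately not SVD++: it omits the bias and implicit-feedback terms that SVD++ adds on top of this basic model. The toy ratings, hyperparameters and function names are all assumptions.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.05, epochs=500, seed=0):
    """Fit factors P, Q so that mu + dot(P[u], Q[i]) approximates each known rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - (mu + sum(P[u][f] * Q[i][f] for f in range(k)))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mu, P, Q

def predict(mu, P, Q, u, i):
    """Estimated rating of item i by user u."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

# Toy (user, item, rating) triples: users 0-1 prefer items 0-1, users 2-3 prefer items 2-3.
TRIPLES = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 2, 1), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 5), (3, 3, 5),
]
mu, P, Q = factorize(TRIPLES, n_users=4, n_items=4)
print(predict(mu, P, Q, 1, 1))  # unseen pair, expected to score above the mean
print(predict(mu, P, Q, 0, 3))  # unseen pair, expected to score below the mean
```

The two printed pairs were never rated in the toy data, so their scores come entirely from the learned factors, which is the "estimate a movie the user never saw" case described above.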
Given the end-to-end nature of the application, I am planning to develop it as a standalone platform as well. A major limitation of the application is the live recalculation of the filtering algorithms; these values need to be either cached in the database or precalculated by a trigger before the filtering is called.