Two ways to find a film
Each tab uses a different technique under the hood.
Search — what are you looking for?
You describe what kind of film you want in plain words — "funny film for a rainy evening" or "intense spy thriller". The app converts your query into a TF-IDF vector (a numerical representation of which words appear and how important they are), then finds the movies whose title and genre descriptions are closest to it using cosine similarity.
The results also factor in popularity: a film with thousands of ratings gets a small boost over an equally relevant but obscure title. The final score is a blend — 65% relevance to your query, 35% how widely watched the film is.
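The search pipeline above can be sketched in a few lines. This is a minimal illustration, not the app's actual code: the film catalogue, rating counts, and the log-scaled popularity normalisation are all made up for the example; only the TF-IDF + cosine similarity approach and the 65/35 blend come from the description.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue; title + genre text stands in for the real metadata fields.
films = [
    ("Spy Game", "action thriller espionage"),
    ("The Big Laugh", "comedy feel-good"),
    ("Quiet Rain", "drama romance"),
]
rating_counts = np.array([12000, 300, 4500])  # hypothetical popularity signal

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(t + " " + g for t, g in films)

# Vectorise the free-text query and score every film against it.
query_vec = vectorizer.transform(["intense spy thriller"])
relevance = cosine_similarity(query_vec, doc_matrix).ravel()

# Normalise popularity to 0..1 (log scale tames the heavy tail), then blend.
popularity = np.log1p(rating_counts) / np.log1p(rating_counts).max()
score = 0.65 * relevance + 0.35 * popularity

best = films[int(score.argmax())][0]
```

Here the spy thriller wins on both relevance and popularity, so the blend is unambiguous; the 35% popularity term matters most when two candidates are equally relevant.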
Similar films — more like this one
You enter a film you already like. The app computes similarity using three signals blended together:
Genome tags (50%) — MovieLens includes 1,128 crowd-sourced relevance tags per film: descriptors like "feel-good", "thought-provoking", "plot twist", or "atmospheric". Each film gets a score (0–1) for every tag. These 1,128-dimensional vectors are compressed to 100 dimensions with SVD before computing cosine similarity — capturing nuanced taste rather than just broad genre.
Collaborative filtering (35%) — based on the ratings of 25 million users. Films that the same audiences tend to rate similarly end up close together in the latent space. This is the same principle behind Netflix and Spotify recommendations. Technically: a sparse user–item matrix is decomposed with Truncated SVD, and the resulting item factors are used for cosine similarity.
Genre overlap (15%) — a lightweight genre vector (e.g. Action + Sci-Fi + Thriller) as a tie-breaker.
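The genome-tag signal can be sketched as follows. Random numbers stand in for the real 1,128 tag relevance scores, and the film count is arbitrary; the 1,128 → 100 SVD compression and cosine similarity follow the description above.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_films, n_tags = 200, 1128               # MovieLens ships 1,128 genome tags
tag_scores = rng.random((n_films, n_tags))  # stand-in for real 0-1 relevance scores

# Compress the tag vectors to 100 dimensions before measuring similarity.
svd = TruncatedSVD(n_components=100, random_state=0)
factors = svd.fit_transform(tag_scores)

# Similarity of film 0 to every film; its five nearest neighbours, skipping itself.
sims = cosine_similarity(factors[0:1], factors).ravel()
neighbours = sims.argsort()[::-1][1:6]
```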
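The collaborative signal follows the same decompose-then-compare pattern on the ratings side. A sketch with a small synthetic sparse user–item matrix (the matrix size, sparsity, and factor count are illustrative, not the app's real parameters):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
n_users, n_items = 500, 50

# Synthetic sparse ratings: roughly 5% of user-item pairs rated 1-5.
mask = rng.random((n_users, n_items)) < 0.05
ratings = csr_matrix(mask * rng.integers(1, 6, (n_users, n_items)))

# Decompose the user-item matrix; the components give latent item factors.
svd = TruncatedSVD(n_components=20, random_state=1)
svd.fit(ratings)
item_factors = svd.components_.T          # shape: (n_items, n_components)

# Films rated similarly by the same audiences land close together here.
item_sims = cosine_similarity(item_factors)
```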
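Once the three per-film similarity scores exist, the 50/35/15 blend is just a weighted sum. A toy example with made-up scores for three candidate films:

```python
import numpy as np

# Hypothetical per-film similarity scores from the three signals, each in 0..1.
genome_sim = np.array([0.90, 0.40, 0.70])
collab_sim = np.array([0.80, 0.95, 0.30])
genre_sim  = np.array([1.00, 0.33, 0.66])

# Weighted blend: genome tags 50%, collaborative filtering 35%, genre 15%.
blended = 0.50 * genome_sim + 0.35 * collab_sim + 0.15 * genre_sim
ranking = blended.argsort()[::-1]   # best match first
```

Note that film 1 tops the collaborative signal but still ranks behind film 0, because the genome tags carry half the weight.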
The dataset
Movie data comes from MovieLens 25M, a public dataset released by GroupLens Research at the University of Minnesota. It contains 25 million ratings across 62,000 films. The app uses a filtered subset — films with enough ratings to be meaningful — along with genre metadata.
Posters, overviews, and TMDB ratings are fetched live from the TMDB API (The Movie Database) and cached in memory for the duration of each server session.
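The per-session cache can be as simple as memoising the fetch function. This sketch stubs out the network call (the real app would hit the TMDB API here; `fetch_from_tmdb` and its call counter are inventions for the example):

```python
from functools import lru_cache

# Hypothetical fetcher; the real app would call the TMDB API here.
def fetch_from_tmdb(movie_id: int) -> dict:
    fetch_from_tmdb.calls += 1
    return {"id": movie_id, "poster": f"/poster/{movie_id}.jpg"}
fetch_from_tmdb.calls = 0

@lru_cache(maxsize=None)  # cache lives for the process (server session) lifetime
def movie_details(movie_id: int) -> dict:
    return fetch_from_tmdb(movie_id)

first = movie_details(42)    # triggers a fetch
second = movie_details(42)   # served from the in-memory cache, no second fetch
```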
Tech stack
Why TF-IDF instead of sentence embeddings?
An earlier version used all-MiniLM-L6-v2 via the sentence-transformers library for semantic search. Loading PyTorch and the transformer model required ~400 MB of RAM — more than the free tier allows. TF-IDF achieves comparable results for genre and title matching at a fraction of the memory footprint, with no GPU or heavy dependencies required.