Behind the screen

How it works

The tech behind Picture House

Two ways to find a film

Each tab uses a different technique under the hood.

Search — what are you looking for?

You describe what kind of film you want in plain words — "funny film for a rainy evening" or "intense spy thriller". The app converts your query into a TF-IDF vector (a numerical representation of which words appear and how important they are), then finds the movies whose title and genre descriptions are closest to it using cosine similarity.
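The TF-IDF step can be sketched with scikit-learn. The toy catalogue and query below are illustrative, not the app's actual data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy catalogue: each document is a title plus its genre description.
films = [
    "Toy Story animation comedy family",
    "Heat crime thriller action",
    "Tinker Tailor Soldier Spy thriller espionage drama",
]

vectorizer = TfidfVectorizer()
film_vectors = vectorizer.fit_transform(films)  # one TF-IDF vector per film

# The user's plain-words query goes through the same vectorizer...
query_vector = vectorizer.transform(["intense spy thriller"])

# ...and cosine similarity ranks the films against it.
relevance = cosine_similarity(query_vector, film_vectors)[0]
best = int(relevance.argmax())  # index of the closest film
```

Here "spy" is rarer than "thriller" across the catalogue, so TF-IDF weights it higher and the spy film ranks first.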

The results also factor in popularity: a film with thousands of ratings gets a small boost over an equally relevant but obscure title. The final score is a blend — 65% relevance to your query, 35% how widely watched the film is.

Similar films — more like this one

You enter a film you already like. The app computes similarity using three signals blended together:

Genome tags (50%) — MovieLens includes 1,128 crowd-sourced relevance tags per film: descriptors like "feel-good", "thought-provoking", "plot twist", or "atmospheric". Each film gets a score (0–1) for every tag. These 1,128-dimensional vectors are compressed to 100 dimensions with SVD before computing cosine similarity — capturing nuanced taste rather than just broad genre.

Collaborative filtering (35%) — based on the ratings of 25 million users. Films that the same audiences tend to rate similarly end up close together in the latent space. This is the same principle behind Netflix and Spotify recommendations. Technically: a sparse user–item matrix is decomposed with Truncated SVD, and the resulting item factors are used for cosine similarity.

Genre overlap (15%) — a lightweight genre vector (e.g. Action + Sci-Fi + Thriller) as a tie-breaker.
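The whole pipeline above can be sketched in a few lines. The matrices below are tiny random stand-ins (the real ones are 1,128 tag dimensions and a 25-million-rating user–item matrix); only the 50/35/15 split comes from the description:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Toy stand-ins, rows = films.
tag_scores = rng.random((20, 40))  # genome tag relevances (really 1,128-dim)
ratings = csr_matrix(rng.integers(0, 6, size=(200, 20)).astype(float))  # users x films
genres = rng.integers(0, 2, size=(20, 5)).astype(float)  # multi-hot genre vectors

# Compress each dense signal with SVD, then compare films pairwise.
tag_factors = TruncatedSVD(n_components=10, random_state=0).fit_transform(tag_scores)
item_factors = TruncatedSVD(n_components=10, random_state=0).fit_transform(ratings.T)

sim = (0.50 * cosine_similarity(tag_factors)
       + 0.35 * cosine_similarity(item_factors)
       + 0.15 * cosine_similarity(genres))

# Most similar films to film 0, excluding itself:
neighbours = sim[0].argsort()[::-1][1:4]
```

Note that the collaborative-filtering signal decomposes the *transposed* user–item matrix, so the SVD factors describe films rather than users.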

The dataset

Movie data comes from MovieLens 25M, a public dataset released by GroupLens Research at the University of Minnesota. It contains 25 million ratings across 62,000 films. The app uses a filtered subset — films with enough ratings to be meaningful — along with genre metadata.
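Filtering to "films with enough ratings" is a one-liner in pandas. The data and threshold here are toy values; the app's actual cutoff isn't stated:

```python
import pandas as pd

ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 3, 3],
    "rating":  [4.0, 5.0, 3.5, 2.0, 4.5, 4.0],
})
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title":   ["Toy Story", "Obscure Film", "Heat"],
    "genres":  ["Animation|Comedy", "Drama", "Action|Crime"],
})

MIN_RATINGS = 3  # illustrative threshold only

counts = ratings.groupby("movieId").size()          # ratings per film
keep = counts[counts >= MIN_RATINGS].index          # ids that clear the bar
subset = movies[movies["movieId"].isin(keep)]       # the filtered catalogue
```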

Posters, overviews, and TMDB ratings are fetched live from the TMDB API (The Movie Database) and cached in memory for the duration of each server session.
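A per-session in-memory cache is essentially a dict keyed by film id. This sketch uses an injectable fetch function so the caching logic is visible on its own; the real app presumably wraps an HTTP call to TMDB:

```python
_cache = {}  # lives for the duration of the server process

def get_movie(movie_id, fetch):
    """Return details for movie_id, calling fetch at most once per process."""
    if movie_id not in _cache:
        _cache[movie_id] = fetch(movie_id)  # e.g. an HTTP call to the TMDB API
    return _cache[movie_id]

# Demo with a fake fetcher that records how often the "API" is hit:
calls = []
def fake_fetch(movie_id):
    calls.append(movie_id)
    return {"id": movie_id, "title": f"Film {movie_id}"}

first = get_movie(603, fake_fetch)
second = get_movie(603, fake_fetch)  # served from the cache, no second call
```

Because the dict lives at module level, the cache is cleared only when the server process restarts, matching the "duration of each server session" behaviour described above.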

Tech stack

Python · Flask: backend API and static file serving
scikit-learn: TF-IDF, SVD, cosine similarity
pandas · numpy: data loading and matrix operations
TMDB API: posters, overviews, and ratings
Gunicorn: WSGI server for production
Railway: cloud deployment platform

Why TF-IDF instead of sentence embeddings?

An earlier version used all-MiniLM-L6-v2 via the sentence-transformers library for semantic search. Loading PyTorch and the transformer model required ~400 MB of RAM — more than the hosting free tier allows. TF-IDF achieves comparable results for genre and title matching at a fraction of the memory footprint, with no GPU or heavy dependencies required.