Book Recommender Case Study

Screenshot: book recommender demo with KPI tiles, a seed-book card for The Hunger Games, and four side-by-side algorithm result columns (hybrid, ALS, content, popularity). Demo: pick any seed book and see the top-10 recommendations from each algorithm rendered side by side, with per-item explanations and latency badges.

Screenshot: book recommender mobile view, stacked in a single column with KPI tiles, seed card, and one algorithm column at a time. The same demo on mobile (iPhone 15 viewport): KPI tiles, search box, seed card, and stacked algorithm columns.

Problem

Recommender systems are among the highest-value BI surfaces in any business that has a catalogue and engagement data. They have to handle three realities at once: dense head users with rich behaviour, brand-new cold users with no behaviour at all, and a long tail of items that few people have rated. A senior-grade recommender is not a single model; it is a portfolio of models with a defensible evaluation harness around them.

The project needed to demonstrate that portfolio end to end: data into a relational store, multiple algorithmic families implemented properly, a real evaluation framework with confidence intervals, cold-start handling, low-latency serving, and an explainable interactive demo.

Users or audience

The primary audience is hiring managers and technical interviewers evaluating practical recommender systems engineering: algorithm choice, evaluation rigour, cold-start handling, API design, ANN search, observability, and visualisation.

The interactive demo is built for portfolio demonstration on the public Goodbooks-10k dataset. It is not a production reading service.

Solution

Five recommender models are implemented behind a common protocol: a Bayesian-shrunk popularity baseline, a TF-IDF content model over title, author, and tags, implicit-feedback ALS (via the implicit library) with rating-weighted confidence, a small PyTorch two-tower model trained with a weighted BPR loss, and a hybrid that blends normalised scores and has an explicit cold-user fallback.
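To make the "common protocol" idea concrete, here is a minimal sketch of what such an interface and the popularity baseline could look like. The names `Recommender`, `ShrunkPopularity`, and `prior_strength` are hypothetical, and the pseudo-count value is an assumption; the project's actual protocol and shrinkage prior are not given in this write-up.

```python
from collections import defaultdict
from typing import Protocol, Sequence


class Recommender(Protocol):
    """Hypothetical common interface; the project's real protocol may differ."""

    def recommend(self, user_id: int, k: int = 10) -> list[tuple[int, float]]:
        """Return up to k (item_id, score) pairs, best first."""
        ...


class ShrunkPopularity:
    """Popularity baseline: each item's mean rating is shrunk toward the global mean.

    score(i) = (sum_i + m * global_mean) / (n_i + m)
    where m (prior_strength) is an assumed pseudo-count, not a project value.
    """

    def __init__(self, ratings: Sequence[tuple[int, int, float]], prior_strength: float = 50.0):
        sums: dict[int, float] = defaultdict(float)
        counts: dict[int, int] = defaultdict(int)
        self.seen: dict[int, set[int]] = defaultdict(set)
        for user_id, item_id, rating in ratings:
            sums[item_id] += rating
            counts[item_id] += 1
            self.seen[user_id].add(item_id)
        global_mean = sum(sums.values()) / sum(counts.values())
        self.scores = {
            item: (sums[item] + prior_strength * global_mean) / (counts[item] + prior_strength)
            for item in counts
        }

    def recommend(self, user_id: int, k: int = 10) -> list[tuple[int, float]]:
        # Same ranking for everyone, minus items the user has already rated.
        seen = self.seen.get(user_id, set())
        ranked = sorted(self.scores.items(), key=lambda kv: kv[1], reverse=True)
        return [(item, score) for item, score in ranked if item not in seen][:k]
```

Because `Recommender` is a structural `Protocol`, any class with a matching `recommend` method satisfies it without inheriting from it, which is one way an evaluation harness can treat every model identically. The shrinkage keeps an item with a single five-star rating from outranking an item with many five-star ratings.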

Every model is evaluated on the same held-out test set with precision, recall, MAP, NDCG (all with 1,000-sample bootstrap 95% confidence intervals), coverage, novelty, diversity, and intra-list similarity. Cold users are reported as a separate segment. A FastAPI service on the VPS exposes scored similar-items and per-user recommendations from a FAISS ANN index built over the ALS and content vectors. A Next.js frontend on Cloudflare Pages renders four of the algorithms (hybrid, ALS, content, popularity) side by side so the differences are immediately visible.

Architecture

Goodbooks-10k CSVs: 10,000 books, 5.9M explicit ratings, 53,366 users, 34,252 tags, 912k to-read wishlist signals

↓

PostgreSQL warehouse: idempotent psycopg COPY loader; least-privilege bookrec_app role; core, ml, and ops schemas; pg_trgm for title search

↓

Dataset builder: per-user one-item holdout, seeded shuffles, and a cold-user segment (bottom-quartile train history) and cold-item segment broken out for separate metrics

↓

Algorithm zoo: popularity (Bayes-shrunk mean), TF-IDF content with cosine similarity, implicit ALS (96 factors, alpha=12), neural two-tower (48-dim, weighted BPR), and a hybrid blend with hand-set weights and cold-start fallback

↓

Evaluation framework: precision, recall, MAP, and NDCG with 1,000-sample bootstrap CIs, plus coverage, novelty, diversity, and intra-list similarity (ILS); latency p50/p95 measured per algorithm; results persisted to ml.eval_metric

↓

FastAPI on the VPS: JSON endpoints for /similar-items and /recommend/user; FAISS inner-product ANN over ALS factors and a 256-dim SVD of the TF-IDF vectors; request logging into ops.request_log; served behind nginx with Let's Encrypt TLS

↓

Next.js on Cloudflare Pages: static export (Next.js 14, TypeScript, Tailwind, Recharts); single-route dashboard with KPI tiles, search, seed card, four side-by-side algorithm columns, and a leaderboard table plus bar charts
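The dataset-builder step in the pipeline above can be sketched roughly as follows. This is an illustration only: the function names, the 4-star positive threshold, and the default seed are assumptions, and the real builder works against the PostgreSQL warehouse rather than in-memory tuples.

```python
import random
from collections import defaultdict


def holdout_split(ratings, seed=42, positive_threshold=4):
    """Hold out one randomly chosen positive rating per user.

    ratings: iterable of unique (user_id, item_id, rating) rows.
    Returns (train_rows, test) where test maps user_id -> held-out item_id.
    The 4-star positive threshold is an assumption for this sketch.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, item_id, rating in ratings:
        by_user[user_id].append((item_id, rating))

    train, test = [], {}
    # Users and candidate items are sorted so the split depends only on the seed,
    # not on the order of the input rows.
    for user_id in sorted(by_user):
        rows = by_user[user_id]
        positives = sorted(item for item, rating in rows if rating >= positive_threshold)
        # Keep at least one train row per user: skip users with a single rating.
        held_out = rng.choice(positives) if positives and len(rows) > 1 else None
        if held_out is not None:
            test[user_id] = held_out
        train.extend((user_id, item, rating) for item, rating in rows if item != held_out)
    return train, test


def cold_users(train, test, quantile=0.25):
    """Test users whose train-history length falls in the bottom quartile.

    Ties at the cutoff are included, so the segment can exceed 25% of users.
    """
    if not test:
        return set()
    counts = defaultdict(int)
    for user_id, _, _ in train:
        counts[user_id] += 1
    lengths = sorted(counts[u] for u in test)
    cutoff = lengths[int(quantile * (len(lengths) - 1))]
    return {u for u in test if counts[u] <= cutoff}
```

Using a dedicated `random.Random(seed)` instance instead of the module-level generator keeps the split reproducible even if other code draws random numbers in between.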

Results

Evaluated on a 3,000-user held-out sample, with one held-out positive rating per user. All ranking metrics carry 1,000-sample bootstrap 95% confidence intervals; coverage and latency are computed over the full recommendation lists.
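With exactly one held-out item per user, the ranking metrics simplify considerably: recall@k becomes a hit indicator, average precision becomes the reciprocal rank of the hit, and the ideal DCG is 1. The sketch below shows those per-user metrics and a percentile bootstrap over users. It is an illustration under those assumptions, with hypothetical function names, and is not the project's harness.

```python
import math
import random


def per_user_metrics(recommended, held_out, k=10):
    """Ranking metrics for one user with exactly one held-out relevant item."""
    top_k = list(recommended[:k])
    if held_out not in top_k:
        return {"precision": 0.0, "recall": 0.0, "ap": 0.0, "ndcg": 0.0}
    rank = top_k.index(held_out) + 1  # 1-based position of the hit
    return {
        "precision": 1.0 / k,
        "recall": 1.0,
        "ap": 1.0 / rank,  # one relevant item, so AP is the reciprocal rank
        "ndcg": 1.0 / math.log2(rank + 1),  # ideal DCG is 1 with one relevant item
    }


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-user metric values.

    Returns (mean, lower, upper). Users are resampled with replacement.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    return sum(values) / n, means[lo_idx], means[hi_idx]
```

MAP@10 for an algorithm is then the mean of the per-user `ap` values, and `bootstrap_ci` applied to that list of values gives the interval reported next to it.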

0.0792: hybrid MAP@10 (vs. 0.0013 for the popularity baseline, a ~61× lift).

+64%: hybrid lift in MAP@10 over standalone ALS (+76% over standalone content).

40 ms: hybrid p95 latency for top-10 from the 10,000-item catalogue.

42.9%: catalogue coverage from the hybrid (vs. 0.2% for popularity and 33% for ALS).

0.17: hybrid Recall@10 on held-out users, compared with 0.14 for standalone ALS.

26: property-based and unit tests that pin metric correctness and cold-start logic.
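As an illustration of what a bounds property on a ranking metric looks like, the sketch below checks that a binary-relevance NDCG@k stays in [0, 1] over randomly generated rankings. It uses the standard library's `random` module as a stand-in for a property-based testing library, and both function names are hypothetical; it is not the project's test suite.

```python
import math
import random


def ndcg_at_k(recommended, relevant, k=10):
    """Binary-relevance NDCG@k; returns 0.0 when there are no relevant items."""
    if not relevant:
        return 0.0
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 2) for pos in range(min(k, len(relevant))))
    return dcg / ideal


def check_ndcg_bounds(n_cases=2000, seed=0):
    """Generate random duplicate-free rankings and relevance sets; assert 0 <= NDCG <= 1."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        catalogue = list(range(rng.randint(1, 30)))
        recommended = rng.sample(catalogue, k=rng.randint(0, len(catalogue)))
        relevant = set(rng.sample(catalogue, k=rng.randint(0, len(catalogue))))
        score = ndcg_at_k(recommended, relevant, k=rng.randint(1, 15))
        assert 0.0 <= score <= 1.0 + 1e-12, (recommended, relevant, score)
    return n_cases
```

The generator deliberately covers edge cases such as empty recommendation lists, empty relevance sets, and k larger than the list, since those are where metric implementations most often divide by zero or exceed 1.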

Key features

Five recommender algorithms behind a single typed protocol so the evaluation harness treats them identically.
Bootstrap 95% confidence intervals on every ranking metric; coverage, novelty, diversity, and intra-list similarity reported alongside accuracy.
Explicit cold-user segment (bottom-quartile train history) and cold-item segment with separate metrics.
Cold-start logic in the hybrid: when a user has no train history, weights collapse to a content-plus-popularity blend instead of zero-vector collaborative noise.
FAISS inner-product ANN over ALS item factors and a 256-dim truncated SVD of the TF-IDF vectors for sub-millisecond similar-items lookup.
FastAPI endpoints with Pydantic-validated I/O, request logging into the ops.request_log table, and a healthcheck that lists loaded algorithms.
Next.js dashboard on Cloudflare Pages with KPI tiles, a search box, a seed-book card, and four side-by-side algorithm result columns with per-item explanations and live latency badges.
VPS deploy artefacts: a systemd unit for the API, a weekly retrain timer, an nginx site with TLS via Let's Encrypt, and a bash bootstrap script for the Postgres role and schema.
Property-based tests on every ranking metric (precision, recall, MAP, NDCG are bounded in [0, 1] across millions of generated cases) and pinned tests for cold-start behaviour.
Reproducibility via a single RANDOM_SEED that controls train/test splits, ALS init, and two-tower weights.
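The cold-start behaviour of the hybrid described in the list above can be sketched as a weighted sum of min-max-normalised component scores, with a separate weight set for users who have no train history. The weight values and the names `WARM_WEIGHTS`, `COLD_WEIGHTS`, and `hybrid_scores` are hypothetical; the write-up states only that the weights are hand-set and that the cold path is a content-plus-popularity blend.

```python
def min_max(scores):
    """Rescale a {item: score} dict to [0, 1]; a constant dict maps to all zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


# Hypothetical hand-set weights; the project's actual values are not published here.
WARM_WEIGHTS = {"als": 0.6, "content": 0.3, "popularity": 0.1}
COLD_WEIGHTS = {"als": 0.0, "content": 0.6, "popularity": 0.4}


def hybrid_scores(component_scores, n_train_interactions):
    """Blend normalised component scores; cold users skip the collaborative term.

    component_scores: {"als": {item: score}, "content": {...}, "popularity": {...}}
    """
    weights = COLD_WEIGHTS if n_train_interactions == 0 else WARM_WEIGHTS
    blended = {}
    for name, weight in weights.items():
        if weight == 0.0:
            continue  # ignore the component entirely rather than adding zero-vector noise
        for item, score in min_max(component_scores.get(name, {})).items():
            blended[item] = blended.get(item, 0.0) + weight * score
    return blended


def top_k(scores, k=10):
    """Item ids sorted by blended score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Normalising each component before blending matters because raw ALS dot products, cosine similarities, and shrunk mean ratings live on different scales; without it, the weights would not mean what they appear to mean.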

Tradeoffs and constraints

The Goodbooks-10k dataset has no rating timestamps, so the train/test split is per-user random-positive rather than temporal. A production deployment would need timestamps for a leakage-proof time split, an item popularity decay term, and an online learning loop for the long tail.

The two-tower model here is intentionally small (48-dim, 3 epochs, uniform random negatives) so the full leaderboard regenerates in under three minutes on CPU. A real deployment would push embedding dimension to 128 or higher, add side-feature towers (author, publisher, language, publication year), and train on GPU with popularity-weighted hard negatives.

Methodology

Appropriate use: portfolio demonstration of recommender systems engineering on a public dataset.

Inappropriate use: as a production reading recommendation service or as ground truth for editorial decisions; the dataset is from 2017 and item popularity has shifted since.

Limitations

The recommender operates on a static 2017 snapshot of Goodreads-style ratings. Cover images are loaded from the original Goodreads CDN, so the links may break over time. Ratings have no time component, so the evaluation cannot test temporal drift.

The hybrid blend weights are hand-set, not learned end to end. Treating the blend weights as model parameters and training them with a small Adam loop on a validation set is the obvious next step.

What I would improve next

A learned hybrid blend with a small linear head trained on a held-out validation slice. A second two-tower model that consumes side features (author, language, publication year, tag bag-of-words) so the cold-item path stops relying purely on TF-IDF. A nightly ingestion job for new Goodreads exports so the catalogue stays fresh. A small offline canary that re-runs the evaluation harness against each new model artefact and refuses to promote it unless MAP@10 is within the bootstrap CI of the prior best.
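The canary rule could be as small as the sketch below. It reads "within the bootstrap CI of the prior best" as "not below the prior best's lower bound", so a candidate that beats the interval is also promoted; that reading, along with the names `EvalResult` and `should_promote`, is an assumption for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """MAP@10 point estimate with its bootstrap 95% interval."""

    map_at_10: float
    ci_low: float
    ci_high: float


def should_promote(candidate: EvalResult, prior_best: EvalResult) -> bool:
    """Promote unless the candidate falls below the prior best's CI lower bound."""
    return candidate.map_at_10 >= prior_best.ci_low
```

A stricter gate could require the two intervals to overlap, or run a paired bootstrap on per-user differences, which is more sensitive because both models are evaluated on the same users.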
