arXiv Recommender Case Study

Figure: arXiv recommender dashboard with KPI tiles, seed paper card, and four side-by-side algorithm columns (hybrid, neural, TF-IDF, citation ALS). Demo: pick any arXiv paper and see the top-10 most-similar papers rendered side by side by four algorithms, with per-item explanations and live latency badges.

Figure: arXiv recommender mobile view, stacked in a single column. The same demo on mobile (iPhone 15 viewport): KPI tiles, search box, seed card, and stacked algorithm columns.

Problem

Research is moving faster than any one person can read. The default arXiv "related papers" surface relies on a single form of content similarity, but academic search has two strong signals: text (titles and abstracts) and citation relationships. A senior-grade recommender uses both, evaluates them honestly, and is fast enough to serve from a single VPS.

The project needed to demonstrate that end to end: a real scholarly dataset, multiple algorithmic families implemented properly, a held-out evaluation harness with confidence intervals, low-latency serving, and an explainable interactive demo.

Users or audience

The primary audience is hiring managers and technical interviewers evaluating practical recommender systems engineering: algorithm choice, evaluation rigor, cold-start handling, API design, ANN search, observability, and visualisation.

The interactive demo is built for portfolio demonstration on a public OpenAlex snapshot of recent Computer Science arXiv papers. It is not a production literature-discovery service.

Solution

Five recommender models are implemented behind a common protocol: a popularity baseline by cited-by count, a classical TF-IDF model on title, abstract, authors, and topic, a MiniLM (sentence-transformers/all-MiniLM-L6-v2) content tower over abstracts, implicit-feedback ALS over a bipartite citation graph (citing papers as users, cited papers as items), and a hybrid that blends min-max normalised scores from the other four.
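A minimal sketch of what a common protocol could look like, using the popularity baseline as the simplest implementation. The names `Recommender`, `similar`, `PopularityRecommender`, and `top_ids` are hypothetical and chosen for illustration; the project's actual interface may differ.

```python
from typing import Protocol


class Recommender(Protocol):
    """Hypothetical shared interface; the real protocol's names may differ."""

    name: str

    def similar(self, seed_id: str, k: int = 10) -> list[tuple[str, float]]:
        """Return up to k (paper_id, score) pairs, best first, excluding the seed."""
        ...


class PopularityRecommender:
    """Popularity baseline: rank every paper by cited-by count."""

    name = "popularity"

    def __init__(self, cited_by_count: dict[str, int]) -> None:
        # Sort once; ties are broken by paper id so the output is deterministic.
        self._ranked = sorted(cited_by_count.items(), key=lambda kv: (-kv[1], kv[0]))

    def similar(self, seed_id: str, k: int = 10) -> list[tuple[str, float]]:
        hits = [(pid, float(n)) for pid, n in self._ranked if pid != seed_id]
        return hits[:k]


def top_ids(model: Recommender, seed_id: str, k: int = 10) -> list[str]:
    """Harness-side helper: works for any object that satisfies the protocol."""
    return [pid for pid, _ in model.similar(seed_id, k)]


# Toy cited-by counts (illustrative values, not taken from the dataset).
counts = {"W1": 120, "W2": 45, "W3": 300, "W4": 45}
baseline = PopularityRecommender(counts)
print(top_ids(baseline, seed_id="W3", k=3))  # ['W1', 'W2', 'W4']
```

Because `Protocol` uses structural typing, each model only needs a matching `similar` method; no shared base class is required for the harness to treat them identically.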

Every model is evaluated on the same held-out test set with precision, recall, MAP, NDCG (all with 1,000-sample bootstrap 95% confidence intervals), coverage, diversity, and intra-list similarity. A FastAPI service on a Linux VPS exposes scored similar-items queries from a FAISS inner-product index (IndexFlatIP, which performs exact rather than approximate search). A Next.js frontend on Cloudflare Pages renders four of the algorithms side by side so the differences are immediately visible.
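The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap over per-seed metric values. This is an illustrative version under stated assumptions: the function name `bootstrap_ci`, the percentile method, and the toy per-seed values are not from the project, which may compute its intervals differently.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_samples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the mean of per-seed metric values.

    Returns (point_estimate, ci_low, ci_high).
    """
    if not values:
        raise ValueError("need at least one per-seed value")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample the seeds with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_samples))
    lo_idx = int((alpha / 2) * n_samples)
    hi_idx = min(n_samples - 1, int((1 - alpha / 2) * n_samples))
    return mean(values), means[lo_idx], means[hi_idx]


# Toy per-seed NDCG@10 values (made up for illustration).
per_seed_ndcg = [0.0, 0.0, 0.31, 1.0, 0.5, 0.0, 0.63, 0.0, 0.39, 0.0]
point, low, high = bootstrap_ci(per_seed_ndcg)
print(f"NDCG@10 = {point:.3f} (95% CI {low:.3f} to {high:.3f})")
```

Resampling at the seed level, rather than at the level of individual recommendations, keeps each seed's ranked list intact, which matches how the metrics are averaged.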

Architecture

OpenAlex Computer Science works hosted on arXiv: 33,000 publications from 2019 onwards with at least 3 citations, pulled via the polite-pool cursor API with abstract reconstruction from the inverted index

↓

PostgreSQL warehouse: Idempotent loader using psycopg COPY. Least-privilege arxrec_app role. Core, ml, and ops schemas. pg_trgm for title search

↓

Dataset builder: Per-seed 10% citation-edge holdout, seeded shuffles, cold-seed segment (citing papers with very few train edges) broken out for separate metrics

↓

Algorithm zoo: Popularity by cited-by count, TF-IDF on title, abstract, authors, and topic, MiniLM sentence-transformer over abstracts, implicit ALS over the citation graph, and a fixed-weight hybrid blend

↓

Evaluation framework: Precision, recall, MAP, NDCG with 1k-sample bootstrap CIs, plus coverage, diversity, ILS. Latency p50 and p95 measured per algorithm. Results persisted to ml.eval_metric

↓

FastAPI on the VPS: JSON endpoints for /similar and /papers, FAISS IndexFlatIP over MiniLM and ALS vectors, request logging into ops.request_log, behind nginx with Let's Encrypt TLS

↓

Next.js on Cloudflare Pages: Static export (Next.js 14, TypeScript, Tailwind, Recharts), single-route dashboard with KPI tiles, search, seed card, four side-by-side algorithm columns, and a leaderboard table plus bar charts
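The abstract reconstruction step in the first stage of the pipeline can be sketched as follows. OpenAlex serves abstracts as an inverted index (each token mapped to the list of positions where it occurs) rather than as plain text. The function name `reconstruct_abstract` is hypothetical, and the project's loader may handle edge cases differently.

```python
def reconstruct_abstract(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index.

    inverted_index maps each token to the list of word positions at which
    it appears, e.g. {"networks": [2, 5]}. Missing abstracts arrive as None.
    """
    if not inverted_index:
        return ""
    # Flatten to (position, token) pairs, then order by position.
    positioned = [
        (pos, token)
        for token, positions in inverted_index.items()
        for pos in positions
    ]
    positioned.sort()
    return " ".join(token for _, token in positioned)


example = {
    "Graph": [0],
    "neural": [1],
    "networks": [2, 5],
    "beat": [3],
    "shallow": [4],
}
print(reconstruct_abstract(example))  # Graph neural networks beat shallow networks
```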

Results

Evaluated as a similar-items task. For each test seed we hold out 10% of its outgoing citation edges before training, then ask whether the held-out citations appear in the top 10 when the seed is used as the query. Metrics carry 1,000-sample bootstrap 95% confidence intervals.
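The holdout procedure described above can be sketched like this. It is a simplified illustration, not the project's dataset builder: the names `holdout_split` and `hits_at_k` are hypothetical, and the rounding rule (at least one held-out edge for seeds with two or more edges, and never the last train edge) is an assumption.

```python
import random
from collections import defaultdict


def holdout_split(edges, frac=0.10, seed=42):
    """Per-seed holdout: move about `frac` of each citing paper's edges to test."""
    by_seed = defaultdict(list)
    for citing, cited in edges:
        by_seed[citing].append(cited)

    rng = random.Random(seed)
    train, test = [], []
    for citing in sorted(by_seed):  # sorted so the seeded shuffle is reproducible
        cited = sorted(by_seed[citing])
        rng.shuffle(cited)
        # Assumption: hold out at least one edge, but always keep one for training.
        n_test = min(len(cited) - 1, max(1, round(frac * len(cited))))
        test += [(citing, c) for c in cited[:n_test]]
        train += [(citing, c) for c in cited[n_test:]]
    return train, test


def hits_at_k(recommended, held_out, k=10):
    """Count how many held-out citations appear in the top-k recommendations."""
    return len(set(recommended[:k]) & set(held_out))


# Toy graph: A cites ten papers, B cites one, C cites two.
edges = [("A", f"P{i}") for i in range(10)] + [("B", "P0"), ("C", "P1"), ("C", "P2")]
train, test = holdout_split(edges)
print(len(train), len(test))  # 11 2
```

Seeds with a single outgoing edge contribute nothing to the test set under this rule, which is one reason a cold-seed segment is worth reporting separately.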

~20×: Best algorithm MAP@10 lift over the popularity baseline.

p95 < 100 ms: Top-10 similar-paper response from a 28k-paper catalogue.

17: Pytest + hypothesis tests; property-based on every ranking metric.

46,406: Citation edges inside the eval subset (closed subgraph).

28,424: Papers in the eval set after quality filtering.

5: Algorithms benchmarked head to head on the same held-out split.

Key features

Five recommender algorithms behind a single typed protocol so the evaluation harness treats them identically.
Bootstrap 95% confidence intervals on every ranking metric; coverage, diversity, and intra-list similarity reported alongside accuracy.
Citation-graph collaborative filtering: citing papers as users, cited papers as items, implicit ALS over a bipartite matrix.
Sentence-transformer content tower (MiniLM, 384-d, L2-normalised) on titles plus abstracts so the same FAISS index pattern works for both content and collaborative similarity.
Cold-seed handling: the hybrid blend down-weights ALS when the seed has very few train edges, so content carries the recommendation.
FastAPI endpoints with Pydantic-validated I/O, request logging into the ops.request_log table, and a healthcheck that lists loaded algorithms.
Next.js dashboard on Cloudflare Pages with KPI tiles, a search box, a seed-paper card, and four side-by-side algorithm result columns with per-item explanations and live latency badges.
VPS deploy artefacts: a systemd unit for the API, a weekly refresh timer that re-pulls OpenAlex and retrains, an nginx site with TLS via Let's Encrypt, and a bash bootstrap script for the Postgres role and schema.
Property-based tests on every ranking metric (precision, recall, MAP, NDCG bounded in [0, 1] across hundreds of generated cases) and pinned tests for ALS, TF-IDF, and top-k correctness.
Reproducibility via a single RANDOM_SEED that controls the train/test split, ALS init, and bootstrap sampling.
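The ranking metrics and the bounded-in-[0, 1] property from the list above can be sketched as below. These are standard binary-relevance definitions and may differ in detail from the project's code (for example in how AP is normalised). The project uses hypothesis for its property tests; this sketch substitutes a seeded random loop so it runs with the standard library only.

```python
import math
import random


def precision_at_k(ranked, relevant, k=10):
    return sum(1 for item in ranked[:k] if item in relevant) / k


def recall_at_k(ranked, relevant, k=10):
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)


def average_precision_at_k(ranked, relevant, k=10):
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at each rank where a hit occurs
    return total / min(len(relevant), k)


def ndcg_at_k(ranked, relevant, k=10):
    if not relevant:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    # Ideal DCG: every relevant item (up to k) placed at the top of the list.
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal


METRICS = (precision_at_k, recall_at_k, average_precision_at_k, ndcg_at_k)

# Property check: every metric stays in [0, 1] on randomly generated cases.
# Ranked lists are sampled without replacement, so they contain no duplicates.
rng = random.Random(0)
catalogue = [f"W{i}" for i in range(50)]
for _ in range(300):
    k = rng.randint(1, 10)
    ranked = rng.sample(catalogue, rng.randint(0, 20))
    relevant = set(rng.sample(catalogue, rng.randint(0, 15)))
    for metric in METRICS:
        value = metric(ranked, relevant, k)
        assert 0.0 <= value <= 1.0, (metric.__name__, value)
```

The bound relies on the ranked list containing no duplicate ids; a recommender that can return the same paper twice would need deduplication before scoring.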

Tradeoffs and constraints

OpenAlex is the source of truth for both metadata and the citation graph. Most papers cite older work outside our 2019-onwards subset, so the in-set citation graph is sparser than a full graph would be (46,406 edges over 28,424 papers, roughly 1.6 edges per paper). Collaborative ALS therefore underperforms content here. The hybrid is set up to absorb a denser graph without other code changes; a real production deployment would widen the date window or pull a second hop of citations as shadow nodes.

The MiniLM encoder is intentionally small (22M parameters, 384-d output) so the full leaderboard regenerates in under ten minutes on CPU. A real deployment would swap to a stronger SPECTER-style scholarly encoder and run on GPU, and would extend the hybrid head from a fixed linear blend to a tiny trainable model fit on a held-out validation slice.

Methodology

Appropriate use: portfolio demonstration of recommender systems engineering on a public scholarly dataset.

Inappropriate use: as a production literature-search service or as ground truth for hiring, funding, or editorial decisions; the snapshot is point-in-time and the citation graph is intentionally restricted to a closed subset.

Limitations

The recommender operates on an OpenAlex snapshot of Computer Science arXiv papers published since 2019 with at least three citations. The citation graph is restricted to in-subset edges, so collaborative coverage is lower than a full-graph deployment would be. Cover and PDF links resolve through the original source URLs and may rot over time.

The hybrid blend weights are hand-set (45% neural / 35% ALS / 15% TF-IDF / 5% popularity), not learned. Replacing the fixed blend with a small linear head trained on a held-out validation slice is the obvious next step.
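A minimal sketch of a fixed-weight blend over min-max normalised scores, using the weights quoted above. The function names, the model keys, and two behaviours are assumptions made for illustration: a candidate missing from one model's list contributes zero from that model, and a model whose scores are all equal contributes zero for every candidate.

```python
# Hand-set weights from the text; the dictionary keys are hypothetical labels.
WEIGHTS = {"neural": 0.45, "als": 0.35, "tfidf": 0.15, "popularity": 0.05}


def min_max(scores):
    """Rescale one model's scores to [0, 1] so models are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: constant scores carry no ranking signal.
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def blend(per_model_scores, weights=WEIGHTS, k=10):
    """Weighted sum of normalised scores, returned best first."""
    combined = {}
    for model, scores in per_model_scores.items():
        w = weights.get(model, 0.0)
        for pid, s in min_max(scores).items():
            combined[pid] = combined.get(pid, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy candidate scores for one seed (illustrative values only).
scores = {
    "neural": {"W1": 0.9, "W2": 0.5, "W3": 0.1},
    "als": {"W2": 3.0, "W3": 1.0},
    "tfidf": {"W1": 0.2, "W3": 0.6},
    "popularity": {"W1": 10, "W2": 400, "W3": 50},
}
print([pid for pid, _ in blend(scores)])  # ['W2', 'W1', 'W3']
```

In this toy case W2 wins because it is the only candidate that scores well on both the neural and the ALS signal, which together carry 80% of the weight.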

What I would improve next

A learned hybrid head trained end to end. A second hop of citation edges so cold seeds get a denser graph view. SPECTER-style scholarly encoders for higher-quality abstract embeddings. An author-recommendation surface that asks "which authors should this seed read next" (the same ALS factors, exposed differently). A nightly OpenAlex refresh job so the catalogue stays current, and an offline canary that refuses to promote a new model unless MAP@10 is within the bootstrap CI of the prior best.
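The offline canary could be as small as a single comparison. The sketch below reads "within the bootstrap CI of the prior best" as "not below the lower bound of the prior best's interval", which is an assumption, so a candidate that beats the interval is also promoted. The `EvalResult` type, the function name, and the example numbers are all hypothetical, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """MAP@10 point estimate with its bootstrap 95% interval."""

    map_at_10: float
    ci_low: float
    ci_high: float


def should_promote(candidate: EvalResult, prior_best: EvalResult) -> bool:
    """Refuse promotion when the candidate falls below the prior best's CI."""
    return candidate.map_at_10 >= prior_best.ci_low


# Made-up numbers for illustration only.
prior = EvalResult(map_at_10=0.120, ci_low=0.105, ci_high=0.135)
print(should_promote(EvalResult(0.110, 0.096, 0.124), prior))  # True
print(should_promote(EvalResult(0.090, 0.078, 0.102), prior))  # False
```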
