Building A Hybrid Two-Tower Anime Recommender

From MyAnimeList (MAL) ratings to FAISS + ONNX inference: a production baseline recommender with honest offline metrics and a live portfolio demo.

Why I Built This

Recommendation systems sit at the intersection of data engineering, representation learning, and product UX. I wanted a portfolio piece that shows the full loop: ingest real community data, train a ranking model with held-out evaluation, export lean inference artifacts, and expose them through a secured API that a Next.js portfolio can demo without leaking secrets.

Sagasu (探す, “to search”) is that system, scoped as a personal baseline, not a state-of-the-art benchmark.

Architecture

The pipeline splits cleanly between local MLOps and deployed inference:

Data lake (local SeaweedFS): raw MAL CSVs, processed Parquet features, artifact backups under s3://sagasu-lake
Feature jobs: pandas-based joins and SBERT synopsis embeddings (PySpark deferred on Windows)
Training: hybrid two-tower model with BPR loss, MLflow tracking, warm-user metrics, and a second-stage cold-start adapter for the portfolio demo
Serving: FAISS IndexFlatIP over normalized 128-d item embeddings + trained ONNX cold-start user tower
API: FastAPI on Heroku Basic with API-key auth, rate limits, and Redis-compatible cache hooks
Portfolio: Next.js BFF routes proxy requests server-side; the browser never sees SAGASU_API_KEY
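The serving step in the list above can be sketched in a few lines. This is a minimal illustration, not the project's code: it uses NumPy as a stand-in for the exact inner-product search that FAISS IndexFlatIP performs, it mean-pools the liked items in place of the trained ONNX user tower, and the function names and random embeddings are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so inner product equals cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def recommend(item_emb: np.ndarray, liked_ids: list[int], k: int = 10) -> list[int]:
    """Return the top-k item ids by inner product, excluding the liked items."""
    # Assumption: mean pooling stands in for the trained user tower.
    query = l2_normalize(item_emb[liked_ids].mean(axis=0, keepdims=True))[0]
    scores = item_emb @ query            # exact scores against every item
    scores[liked_ids] = -np.inf          # never recommend what was already picked
    top = np.argpartition(-scores, k)[:k]  # unordered top-k candidates
    return top[np.argsort(-scores[top])].tolist()  # sorted best-first


# Hypothetical data: 1,000 random 128-d item embeddings.
rng = np.random.default_rng(0)
items = l2_normalize(rng.standard_normal((1000, 128)).astype(np.float32))
print(recommend(items, liked_ids=[3, 17, 42], k=10))
```

Because the embeddings are normalized, a flat inner-product index returns the same ordering as cosine similarity, and at ~30k items an exact search is cheap enough that an approximate index is not needed.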

Try Sagasu

Search titles, pick up to 5 favorites, and get hybrid two-tower recommendations.

Data and Features

The baseline uses MyAnimeList community ratings and anime metadata:

~825k rating rows ingested; ~543k training pairs after filtering
4,274 users and 30,130 anime in the production feature tables
User history vectors, genre multi-hot features, SBERT synopsis embeddings, and English/Japanese title aliases feed the pipeline
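As a small illustration of the genre multi-hot features mentioned above, the sketch below builds a fixed genre vocabulary and encodes each title as a 0/1 vector. The helper names and the sample genres are hypothetical; the real feature job may encode genres differently.

```python
def build_genre_vocab(genre_lists):
    """Map each distinct genre to a column index (sorted for a stable layout)."""
    genres = sorted({g for genres in genre_lists for g in genres})
    return {g: i for i, g in enumerate(genres)}


def multi_hot(genres, vocab):
    """Encode one title's genres as a 0/1 vector; unknown genres are ignored."""
    vec = [0.0] * len(vocab)
    for g in genres:
        if g in vocab:
            vec[vocab[g]] = 1.0
    return vec


# Hypothetical sample rows, not taken from the MAL data.
catalog = [["Action", "Drama"], ["Comedy"], ["Action"]]
vocab = build_genre_vocab(catalog)
print(multi_hot(["Drama", "Action"], vocab))
```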

Model and Offline Metrics

Training uses Bayesian Personalized Ranking (BPR) on implicit feedback derived from ratings ≥ 7. The deployed demo uses an anonymous cold-start path: the visitor picks 1–5 liked titles, the service pools their item embeddings, runs that context through a trained ONNX user tower, and retrieves the nearest anime with FAISS.
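For readers unfamiliar with BPR, here is a minimal NumPy sketch of the two ideas in that paragraph: turning ratings into implicit positives at the ≥ 7 threshold, and scoring a (user, positive, negative) triple with the BPR objective. The function names and toy vectors are assumptions for illustration; the actual training loop, negative sampling, and tower architecture are not shown in this post.

```python
import numpy as np


def implicit_positives(ratings, threshold=7):
    """Keep (user, item) pairs whose rating meets the threshold."""
    return [(u, i) for u, i, r in ratings if r >= threshold]


def bpr_loss(user_vec, pos_vec, neg_vec):
    """Mean BPR loss: -log sigmoid(score(u, pos) - score(u, neg))."""
    pos = np.sum(user_vec * pos_vec, axis=-1)   # dot product per triple
    neg = np.sum(user_vec * neg_vec, axis=-1)
    # -log(sigmoid(x)) == log(1 + exp(-x)); logaddexp keeps it numerically stable.
    return float(np.mean(np.logaddexp(0.0, -(pos - neg))))


# Toy data (hypothetical): tuples are (user_id, anime_id, rating).
ratings = [(1, 10, 9), (1, 11, 5), (2, 10, 7)]
print(implicit_positives(ratings))

u = np.array([[1.0, 0.0]])
liked = np.array([[1.0, 0.0]])
unseen = np.array([[-1.0, 0.0]])
print(bpr_loss(u, liked, unseen))
```

The loss only cares about the score gap between a liked title and a sampled negative, which is why BPR suits ranking better than predicting the raw rating.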

Metric	Validation	Test
Warm HitRate@10	2.81%	1.87%
Warm NDCG@10	0.0030	0.0019
Cold-start HitRate@10	19.67%	14.72%
Cold-start NDCG@10	0.0318	0.0209
Cold-start Recall@100	7.94%	7.63%

Reading the table: users are split 80/10/10 before training, and validation and test users never appear in BPR training. Warm scores therefore rank with untrained user-ID embeddings. This is a standard user-holdout check for “new user” collaborative filtering, and it stays low here (1.87–2.81% HitRate@10). Cold-start scores build the query from liked titles, the same way the sandbox does, so they measure item-similarity retrieval and read much higher. The warm and cold-start rows answer different questions; don’t compare the cold-start numbers to industry warm-user benchmarks.
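A user-level split like the one described can be sketched as follows. The seed, function name, and rounding behavior are assumptions; the point is only that the split happens over user IDs, so every interaction of a held-out user stays out of training.

```python
import random


def split_users(user_ids, seed=42, train=0.8, val=0.1):
    """Shuffle distinct user ids and cut them into train/val/test sets."""
    users = sorted(set(user_ids))            # sort first so the seed is reproducible
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * train)
    n_val = int(len(users) * val)
    return (
        set(users[:n_train]),
        set(users[n_train:n_train + n_val]),
        set(users[n_train + n_val:]),        # remainder goes to test
    )


train_u, val_u, test_u = split_users(range(4274))
print(len(train_u), len(val_u), len(test_u))
```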

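HitRate@10 is an offline retrieval metric over ~30k candidates (whether at least one held-out title lands in the top 10), not subjective accuracy. Judge the live demo by the cold-start rows, since it uses the same query path.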
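The post does not give formulas for these metrics, so the sketch below uses the common per-user definitions with binary relevance; treat it as an assumption about how the numbers are computed. Under these definitions, Recall@100 can sit below HitRate@10 when users have many held-out titles, because a hit needs only one match while recall divides by the full held-out set.

```python
import math


def hit_rate_at_k(ranked, relevant, k=10):
    """1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k=100):
    """Share of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical single-user example: one of two held-out titles is retrieved.
ranked = [5, 9, 2, 7]
relevant = {9, 42}
print(hit_rate_at_k(ranked, relevant), recall_at_k(ranked, relevant), ndcg_at_k(ranked, relevant))
```

The reported figures would then be the mean of each per-user value over the validation or test users.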

The ONNX export matches PyTorch to within a maximum absolute difference of ~3.0×10⁻⁸, and the FAISS index covers all 30,130 production items.
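A parity check of this kind usually reduces to comparing two output arrays. The sketch below shows only that comparison step; in a real check the two arrays would come from the PyTorch forward pass and the ONNX runtime on the same input. Here they are simulated with a float64 array and its float32 rounding, and the 1e-6 tolerance is an assumed threshold, not the project's.

```python
import numpy as np


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))


def assert_parity(reference, exported, atol=1e-6):
    """Raise if the exported outputs drift from the reference beyond atol."""
    diff = max_abs_diff(reference, exported)
    if diff > atol:
        raise AssertionError(f"max abs diff {diff:.3e} exceeds tolerance {atol:.1e}")
    return diff


# Simulated outputs (assumption): unit-length 128-d vectors and their float32 copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 128))
reference /= np.linalg.norm(reference, axis=1, keepdims=True)
exported = reference.astype(np.float32)
print(f"{assert_parity(reference, exported):.2e}")
```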

Inference and Latency

Latest smoke tests:

Local: /health ~8 ms · /ready ~1 ms · search ~171 ms · recommend ~5 ms
Heroku Basic: /health ~390 ms · /ready ~83 ms · search ~399 ms · recommend ~104 ms
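The post does not say how these figures were measured. One simple way to collect comparable numbers is to time a callable several times and report the median, as in the standard-library sketch below. The helper name, repeat counts, and the placeholder workload are assumptions; in practice the callable would wrap an HTTP request to an endpoint such as /health.

```python
import statistics
import time


def time_call(fn, repeats=20, warmup=3):
    """Run fn a few times to warm up, then return latency stats in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Placeholder workload standing in for a request to the API.
stats = time_call(lambda: sum(range(1000)))
print(stats)
```

Reporting the median alongside the maximum helps separate steady-state latency from one-off spikes such as a cold dyno or a cache miss.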