2026-06-20
Building A Hybrid Two-Tower Anime Recommender
From MAL ratings to FAISS + ONNX inference, a production baseline recommender with honest offline metrics and a live portfolio demo.
Why I Built This
Recommendation systems sit at the intersection of data engineering, representation learning, and product UX. I wanted a portfolio piece that shows the full loop: ingest real community data, train a ranking model with held-out evaluation, export lean inference artifacts, and expose them through a secured API that a Next.js portfolio can demo without leaking secrets.
Sagasu (探す, “to search”) is that system, scoped as a personal baseline, not a state-of-the-art benchmark.
Architecture
The pipeline splits cleanly between local MLOps and deployed inference:
- Data lake (local SeaweedFS): raw MAL CSVs, processed Parquet features, artifact backups under `s3://sagasu-lake`
- Feature jobs: pandas-based joins and SBERT synopsis embeddings (PySpark deferred on Windows)
- Training: hybrid two-tower model with BPR loss, MLflow tracking, warm-user metrics, and a second-stage cold-start adapter for the portfolio demo
- Serving: FAISS `IndexFlatIP` over normalized 128-d item embeddings + trained ONNX cold-start user tower
- API: FastAPI on Heroku Basic with API-key auth, rate limits, and Redis-compatible cache hooks
- Portfolio: Next.js BFF routes proxy requests server-side; the browser never sees `SAGASU_API_KEY`
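The training stage listed above uses BPR loss. As a rough illustration, here is that loss for a single (user, liked item, sampled item) triple. It assumes dot-product scoring between the two towers, which the post does not spell out, and uses plain Python lists in place of tensors. The function names are hypothetical, not taken from the Sagasu code.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """BPR loss for one triple: -log(sigmoid(score_pos - score_neg)).

    Assumption: scores are dot products of tower outputs.
    """
    margin = dot(user_vec, pos_item_vec) - dot(user_vec, neg_item_vec)
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss shrinks as the liked item outscores the sampled negative.
well_ranked = bpr_loss([1.0, 0.0], [1.0, 0.0], [-1.0, 0.0])
badly_ranked = bpr_loss([1.0, 0.0], [-1.0, 0.0], [1.0, 0.0])
```

In a real training loop this would be averaged over a batch of triples and computed on framework tensors so gradients can flow back into both towers.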
Try Sagasu
Search titles, pick up to 5 favorites, and get hybrid two-tower recommendations.
Data and Features
The baseline uses MyAnimeList community ratings and anime metadata:
- ~825k rating rows ingested; ~543k training pairs after filtering
- 4,274 users and 30,130 anime in the production feature tables
- User history vectors, genre multi-hot features, SBERT synopsis embeddings, and English/Japanese title aliases feed the pipeline
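To make the genre multi-hot features concrete, here is a small sketch of one way to encode them. The titles, genre names, and helper functions are made up for illustration; the actual feature jobs run on pandas and may be structured differently.

```python
def build_genre_vocab(genre_lists):
    """Map each distinct genre to a column index (sorted for determinism)."""
    distinct = sorted({g for genres in genre_lists for g in genres})
    return {genre: idx for idx, genre in enumerate(distinct)}


def multi_hot(genres, vocab):
    """Return a 0/1 vector with a 1 in every column whose genre is present."""
    vec = [0.0] * len(vocab)
    for genre in genres:
        idx = vocab.get(genre)
        if idx is not None:  # skip genres unseen when the vocab was built
            vec[idx] = 1.0
    return vec


# Hypothetical two-title catalog.
catalog = {"Title A": ["Action", "Sci-Fi"], "Title B": ["Drama"]}
vocab = build_genre_vocab(catalog.values())
features = {title: multi_hot(genres, vocab) for title, genres in catalog.items()}
```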
Model and Offline Metrics
Training uses Bayesian Personalized Ranking (BPR) on implicit feedback derived from ratings ≥ 7. The deployed demo uses an anonymous cold-start path: pick 1-5 liked titles, pool their item embeddings, run that context through a trained ONNX user tower, and retrieve nearest anime with FAISS.
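The cold-start path can be sketched in a few lines of plain Python. Several things here are assumptions rather than facts from the pipeline: pooling is taken to be a simple mean, the ONNX user tower is replaced by an identity placeholder, liked titles are excluded from the results, and a brute-force inner-product scan stands in for the FAISS index (which computes the same exact inner-product ranking). All function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_pool(vectors):
    """Element-wise mean of equal-length vectors (assumed pooling strategy)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def recommend(liked_ids, item_embeddings, user_tower=lambda v: v, k=10):
    """Rank items by inner product with a query built from liked titles.

    item_embeddings: dict of item id -> L2-normalized vector.
    user_tower: placeholder for the trained ONNX model (identity by default).
    """
    if not 1 <= len(liked_ids) <= 5:
        raise ValueError("pick between 1 and 5 liked titles")
    context = mean_pool([item_embeddings[i] for i in liked_ids])
    query = l2_normalize(user_tower(context))
    liked = set(liked_ids)
    scored = [
        (sum(q * x for q, x in zip(query, vec)), item_id)
        for item_id, vec in item_embeddings.items()
        if item_id not in liked  # assumption: never recommend the seeds back
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]
```

Because both the query and the item vectors are unit length, the inner product here is cosine similarity, which is why an inner-product index over normalized embeddings is sufficient.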
| Metric | Validation | Test |
|---|---|---|
| Warm HitRate@10 | 2.81% | 1.87% |
| Warm NDCG@10 | 0.0030 | 0.0019 |
| Cold-start HitRate@10 | 19.67% | 14.72% |
| Cold-start NDCG@10 | 0.0318 | 0.0209 |
| Cold-start Recall@100 | 7.94% | 7.63% |
Reading the table: users are split 80/10/10 before training, so validation and test users never appear in BPR updates. Warm scores therefore rank with untrained user-ID embeddings. This is a standard user-holdout check for “new user” collaborative filtering, and it stays low here (~2–3%). Cold-start scores build the query from liked titles (the same path as the sandbox), so they measure item-similarity retrieval and read much higher. The warm and cold-start rows answer different questions; don’t compare cold-start to industry warm-user benchmarks.
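A user-level 80/10/10 split like the one described can be sketched as follows. The seed, the rounding rule, and the function name are assumptions for illustration; the key property is that every user lands in exactly one partition.

```python
import random


def split_users(user_ids, seed=42, train=0.8, val=0.1):
    """Shuffle distinct users and cut them into train/val/test sets."""
    users = sorted(set(user_ids))  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * train)
    n_val = int(len(users) * val)
    return (
        set(users[:n_train]),
        set(users[n_train:n_train + n_val]),
        set(users[n_train + n_val:]),  # remainder goes to test
    )


train_users, val_users, test_users = split_users(range(100))
```

Splitting by user rather than by rating row is what keeps held-out users entirely out of training.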
HitRate@10 is offline recall@10 over ~30k candidates, not subjective accuracy. Judge the live demo with cold-start metrics.
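For reference, here are the common per-user definitions of these metrics with binary relevance. This is only a sketch: the evaluation code behind the table may differ in details such as tie handling or how held-out items are chosen.

```python
import math


def hit_rate_at_k(ranked, relevant, k=10):
    """1.0 if any held-out item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k=100):
    """Fraction of held-out items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: discounted gain divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 -> log2(2)
        for rank, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function scores one user; the reported figure is the mean over all evaluated users. When a user has a single held-out item, hit rate and recall at the same k coincide. With several held-out items they diverge.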
ONNX export matches PyTorch within ~3.0×10⁻⁸ max diff; FAISS index covers all 30,130 production items.
Inference and Latency
Latest smoke tests:
- Local: `/health` ~8 ms · `/ready` ~1 ms · search ~171 ms · recommend ~5 ms
- Heroku Basic: `/health` ~390 ms · `/ready` ~83 ms · search ~399 ms · recommend ~104 ms
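Numbers like these can be collected with a small timing helper. The sketch below measures the median wall-clock latency of any zero-argument callable. How the actual smoke tests were run is not stated, so the helper name and the choice of median over five runs are assumptions.

```python
import statistics
import time


def measure_ms(call, runs=5):
    """Median wall-clock latency, in milliseconds, of a zero-argument callable."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; in practice, wrap the HTTP request to each endpoint.
latency = measure_ms(lambda: sum(range(1000)))
```

Using the median rather than the mean keeps one slow request, such as a cold dyno wake-up, from dominating the reported figure.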