2026-06-20

Building A Hybrid Two-Tower Anime Recommender

From MyAnimeList (MAL) ratings to FAISS + ONNX inference: a production baseline recommender with honest offline metrics and a live portfolio demo.

Why I Built This

Recommendation systems sit at the intersection of data engineering, representation learning, and product UX. I wanted a portfolio piece that shows the full loop: ingest real community data, train a ranking model with held-out evaluation, export lean inference artifacts, and expose them through a secured API that a Next.js portfolio can demo without leaking secrets.

Sagasu (探す, “to search”) is that system, scoped as a personal baseline, not a state-of-the-art benchmark.

Architecture

The pipeline splits cleanly between local MLOps and deployed inference:

  • Data lake (local SeaweedFS): raw MAL CSVs, processed Parquet features, artifact backups under s3://sagasu-lake
  • Feature jobs: pandas-based joins and SBERT synopsis embeddings (PySpark deferred on Windows)
  • Training: hybrid two-tower model with BPR loss, MLflow tracking, warm-user metrics, and a second-stage cold-start adapter for the portfolio demo
  • Serving: FAISS IndexFlatIP over normalized 128-d item embeddings + trained ONNX cold-start user tower
  • API: FastAPI on Heroku Basic with API-key auth, rate limits, and Redis-compatible cache hooks
  • Portfolio: Next.js BFF routes proxy requests server-side; the browser never sees SAGASU_API_KEY
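
The Serving bullet relies on a standard identity: once vectors are scaled to unit length, their inner product equals their cosine similarity, which is why a flat inner-product index can serve nearest-neighbor retrieval. The dependency-free sketch below illustrates the idea with toy 3-d vectors. It is not the Sagasu code: the function names are hypothetical, and the real system delegates the search to FAISS over 128-d embeddings.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def top_k_inner_product(query, item_vectors, k):
    """Score every item exhaustively and return the k best (index, score) pairs."""
    scores = [
        (idx, sum(q * v for q, v in zip(query, item)))
        for idx, item in enumerate(item_vectors)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 3-d embeddings standing in for the 128-d item vectors (assumed data).
items = [
    l2_normalize(v)
    for v in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0])
]
query = l2_normalize([2.0, 0.0, 0.0])
results = top_k_inner_product(query, items, k=2)
```

Because both sides are normalized, the query's length does not affect the ranking: only direction matters, and the best possible score is 1.0.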

    Data and Features

    The baseline uses MyAnimeList community ratings and anime metadata:

    • ~825k rating rows ingested; ~543k training pairs after filtering
    • 4,274 users and 30,130 anime in the production feature tables
    • User history vectors, genre multi-hot features, SBERT synopsis embeddings, and English/Japanese title aliases feed the pipeline

    Model and Offline Metrics

    Training uses Bayesian Personalized Ranking (BPR) on implicit feedback derived from ratings ≥ 7. The deployed demo uses an anonymous cold-start path: pick 1-5 liked titles, pool their item embeddings, run that context through a trained ONNX user tower, and retrieve nearest anime with FAISS.
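
As a rough illustration of the training objective, the sketch below derives implicit positives from ratings ≥ 7 and evaluates the BPR loss for one (positive, negative) pair. The sample ratings, the scores, and the function name are assumptions made for the example; how Sagasu samples negatives is not stated here, so the negative score is simply given.

```python
import math


def bpr_loss(pos_score, neg_score):
    """BPR pair loss: -log(sigmoid(pos_score - neg_score)).

    Written as a softplus with two branches so exp() never overflows.
    """
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical (user, anime, rating) rows; ratings >= 7 count as positives.
ratings = [("u1", "a1", 9), ("u1", "a2", 5), ("u2", "a1", 7)]
positives = [(user, anime) for user, anime, rating in ratings if rating >= 7]

# The loss shrinks as the positive item outscores the negative one.
loss_good_order = bpr_loss(pos_score=5.0, neg_score=0.0)
loss_bad_order = bpr_loss(pos_score=0.0, neg_score=5.0)
loss_tie = bpr_loss(pos_score=0.0, neg_score=0.0)
```

The loss depends only on the score gap, so the model is trained to rank liked titles above others and never to predict the rating value itself.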

    Metric                   Validation   Test
    Warm HitRate@10          2.81%        1.87%
    Warm NDCG@10             0.0030       0.0019
    Cold-start HitRate@10    19.67%       14.72%
    Cold-start NDCG@10       0.0318       0.0209
    Cold-start Recall@100    7.94%        7.63%

    Reading the table: users are split 80/10/10 into train, validation, and test sets before training. Validation and test users never appear in BPR training, so the warm rows rank items with untrained user-ID embeddings. This is a standard user-holdout check for “new user” collaborative filtering, and HitRate@10 stays low here (~2–3%). The cold-start rows build the query from liked titles (the same path the live demo uses), so they measure item-similarity retrieval and read much higher. The two groups of rows answer different questions; don’t compare the cold-start numbers to industry warm-user benchmarks.
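
A user-level holdout of this kind can be sketched as follows. The seed, the helper name, and the toy user IDs are assumptions for illustration, not Sagasu's actual split code.

```python
import random


def split_users(user_ids, seed=0, train_frac=0.8, val_frac=0.1):
    """Shuffle unique users once and cut them into train/val/test lists."""
    users = sorted(set(user_ids))
    random.Random(seed).shuffle(users)
    n_train = int(len(users) * train_frac)
    n_val = int(len(users) * val_frac)
    train = users[:n_train]
    val = users[n_train:n_train + n_val]
    test = users[n_train + n_val:]
    return train, val, test


# Toy population of 100 users (assumed data).
train_users, val_users, test_users = split_users(range(100))

# Only interactions from train users would feed the ranking loss.
interactions = [(0, "a1"), (1, "a2"), (2, "a3")]
train_set = set(train_users)
train_pairs = [(u, a) for u, a in interactions if u in train_set]
```

Splitting by user rather than by interaction is what guarantees that held-out users contribute nothing to training.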

    HitRate@10 is an offline, recall@10-style retrieval measure over ~30k candidates, not a measure of subjective recommendation quality. Judge the live demo by the cold-start metrics.
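
For readers unfamiliar with these metrics, the sketch below uses their common binary-relevance definitions for a single user. Sagasu's exact evaluation code is not shown here, so treat the function names and the toy ranking as assumptions.

```python
import math


def hit_at_k(ranked_items, held_out, k=10):
    """Return 1.0 if any held-out item appears in the top k, else 0.0."""
    return 1.0 if any(item in held_out for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, held_out, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in held_out
    )
    ideal_hits = min(len(held_out), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking: the single held-out title lands in second place.
ranked = ["a", "b", "c"]
held_out = {"b"}
user_hit = hit_at_k(ranked, held_out)
user_ndcg = ndcg_at_k(ranked, held_out)
```

Averaging these per-user values over all evaluation users gives the reported HitRate@10 and NDCG@10. NDCG rewards a hit near the top more than one near rank 10, which is why it can stay small even when the hit rate is respectable.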

    The ONNX export matches the PyTorch outputs to within a maximum difference of ~3.0×10⁻⁸, and the FAISS index covers all 30,130 production items.
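
An export parity check of this sort usually reduces to comparing two output vectors element by element. The sketch below assumes the difference is measured as a maximum absolute difference; the sample outputs and the 1e-6 tolerance are illustrative, not values from the project.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    return max((abs(x - y) for x, y in zip(reference, candidate)), default=0.0)


# Hypothetical outputs from the two runtimes for the same input.
pytorch_out = [0.12, -0.5, 0.33]
onnx_out = [0.12, -0.50000001, 0.33]

TOLERANCE = 1e-6  # assumed threshold for the example
diff = max_abs_diff(pytorch_out, onnx_out)
export_ok = diff < TOLERANCE
```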

    Inference and Latency

    Latest smoke tests:

    • Local: /health ~8 ms · /ready ~1 ms · search ~171 ms · recommend ~5 ms
    • Heroku Basic: /health ~390 ms · /ready ~83 ms · search ~399 ms · recommend ~104 ms
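
Figures like these can be gathered with a small timing helper. The sketch below is one possible approach, not the smoke-test script behind the numbers above: the helper name and repeat count are assumptions, and a trivial local function stands in for the HTTP call so the example runs without a server.

```python
import statistics
import time


def median_latency_ms(call, repeats=5):
    """Run `call` several times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_health_check():
    """Stand-in for a request to an endpoint such as /health."""
    return sum(range(1000))


latency = median_latency_ms(fake_health_check)
```

Reporting a median rather than a single run dampens one-off spikes, which matters on a small dyno where the first request after idle can be much slower than the rest.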