ZAYA1-8B Matches DeepSeek-R1 on Math with Less Than 1B Active Parameters

TLDR

  • Zyphra’s ZAYA1-8B runs at 760M active parameters (8.4B total MoE) and matches DeepSeek-R1 on AIME 2025 and HMMT math benchmarks.
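As a rough illustration of what the 760M-of-8.4B figure means, the sketch below computes the share of weights active per token and a rule-of-thumb compute ratio. The two parameter counts come from the summary above; the "about 2 FLOPs per parameter per token" estimate is a common approximation that ignores attention cost, and it is an assumption here rather than a figure from the source.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS = 8.4e9    # total parameters across the MoE model
ACTIVE_PARAMS = 760e6   # parameters activated per token


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights that take part in one token's forward pass."""
    return active / total


def approx_forward_flops(params: float) -> float:
    """Rule-of-thumb estimate (assumption, not from the source):
    roughly 2 FLOPs per participating parameter per token, ignoring attention."""
    return 2.0 * params


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
flops_ratio = approx_forward_flops(TOTAL_PARAMS) / approx_forward_flops(ACTIVE_PARAMS)

print(f"active share of weights: {frac:.1%}")
print(f"per-token compute vs. a dense model of the same total size: ~{flops_ratio:.1f}x lower")
```

Under these assumptions, about 9% of the weights are used per token, for roughly 11x less per-token compute than a dense 8.4B model. Memory is a separate matter: all 8.4B parameters still have to be stored, so only the compute side resembles a sub-1B model.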

Key Takeaways

  • MoE architecture activates only 760M of 8.4B parameters per token, keeping per-token inference compute close to that of a sub-1B dense model.
  • Markovian RSA generates parallel reasoning traces in chunks and carries forward only the tail of each chunk, which keeps the context window bounded, so reasoning can scale with the compute budget rather than hitting a fixed context ceiling.
  • Co-design matters: applying Markovian RSA to Qwen3-4B without co-training produced significantly smaller gains, so the method is not plug-and-play.
  • Agentic benchmarks are weak: BFCL-V4 at 39.22 and TAU2 at 43.12 trail Qwen3-4B-Thinking by ~10 points, so the model is not yet suited to tool-calling or multi-step agent workflows.
  • Trained end-to-end on a 1,024-node AMD Instinct MI300X cluster built with IBM and networked with AMD Pensando Pollara interconnect; described as the most capable model publicly demonstrated on AMD hardware.
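To make the bounded-window idea concrete, here is a toy sketch of a chunked, tail-only aggregation loop. It is one reading of the Markovian RSA takeaway, not Zyphra's implementation: it assumes that only a fixed-length tail of each chunk is carried forward (in line with the "tail-token cutting" mentioned in the comments) and that a random subset of tails seeds each new trace. The function names, parameters, and the stub generator `toy_generate` are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[str]


def markovian_rsa_sketch(
    prompt: Tokens,
    generate_chunk: Callable[[Tokens, int], Tokens],
    n_traces: int = 4,      # parallel traces per round (hypothetical default)
    k_aggregate: int = 2,   # tails combined into each new context
    chunk_len: int = 8,     # tokens generated per trace per round
    tail_len: int = 3,      # tokens kept from the end of each chunk
    rounds: int = 5,
    seed: int = 0,
) -> Tuple[List[Tokens], int]:
    """Return the final round's tails and the largest window (context + chunk) seen."""
    assert tail_len > 0 and k_aggregate <= n_traces
    rng = random.Random(seed)
    contexts = [list(prompt) for _ in range(n_traces)]
    tails: List[Tokens] = []
    max_window = 0
    for _ in range(rounds):
        tails = []
        for ctx in contexts:
            chunk = generate_chunk(ctx, chunk_len)
            max_window = max(max_window, len(ctx) + len(chunk))
            tails.append(chunk[-tail_len:])  # everything before the tail is dropped
        # Each new context is the prompt plus a sampled subset of tails,
        # so its length never depends on how many rounds have run.
        contexts = []
        for _ in range(n_traces):
            ctx = list(prompt)
            for tail in rng.sample(tails, k_aggregate):
                ctx.extend(tail)
            contexts.append(ctx)
    return tails, max_window


def toy_generate(context: Tokens, n: int) -> Tokens:
    """Stand-in for a model call: emits n placeholder tokens."""
    return [f"t{len(context)}_{i}" for i in range(n)]


_, window_5 = markovian_rsa_sketch(["q1", "q2"], toy_generate, rounds=5)
_, window_50 = markovian_rsa_sketch(["q1", "q2"], toy_generate, rounds=50)
print(window_5, window_50)  # same value: the window does not grow with rounds
```

In this toy setup the window is capped at len(prompt) + k_aggregate * tail_len + chunk_len no matter how many rounds run, which is the property that lets extra compute be spent on more rounds or more traces instead of a longer context.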

Hacker News Comment Review

  • Commenters agree the agentic gap is the real blocker for coding harness adoption; most production coding agents rely on tool calls for context gathering, where ZAYA1-8B currently underperforms.
  • Practical friction noted for local deployment: standard vLLM fails, requiring Zyphra’s own fork – worth verifying before integrating into existing inference stacks.
  • Broader sentiment leans toward small-model optimism, with Qwen3 27B already cited as a working single-GPU agentic coding option, framing ZAYA1-8B as part of a real trend rather than an outlier.

Notable Comments

  • @sva_: Notes potential to improve Markovian RSA beyond fixed tail-token cutting with a tunable parameter, suggesting the inference method is not yet fully optimized.
