The Engineering Unlocks Behind DeepSeek | YC Decoded
Summary based on the YouTube transcript and episode description.
YC GP Diana Hu breaks down DeepSeek V3 and R1’s core engineering innovations and why the $5.5M training cost figure is misleading.
- DeepSeek’s $5.5M figure covers only the final V3 training run, excluding R&D and hardware costs likely in the hundreds of millions.
- V3 uses fp8 training with a periodic fp32 accumulation fix, cutting memory overhead without compounding numerical errors (see the accumulation sketch after this list).
- Mixture-of-experts architecture activates only 37B of 671B parameters per token, roughly 11x fewer than Llama 3's full 405B activation (routing sketch below).
- Multi-head latent attention (MLA), first published in DeepSeek V2 (May 2024), compresses the KV cache by 93.3% and boosts generation throughput 5.76x (shape sketch below).
- R1-Zero achieved top-tier reasoning using pure RL with no human or AI reasoning examples, graded only on output accuracy and formatting via GRPO (reward sketch below).
- A UC Berkeley lab reproduced R1-Zero’s core techniques in a smaller model for just $30.
- OpenAI released o3-mini two weeks after R1, outperforming R1 on key benchmarks, signaling rapid acceleration at the frontier.
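The fp8 accumulation point is easiest to see in a toy example. The sketch below is not DeepSeek's actual kernel (which runs on GPU tensor cores); it uses float16 as a stand-in for fp8, since NumPy has no fp8 dtype, and only illustrates the idea of flushing a low-precision partial sum into an fp32 accumulator at a fixed interval so rounding errors stop compounding.

```python
import numpy as np

def dot_with_periodic_promotion(a, b, interval=4):
    """Toy dot product: multiply in low precision (float16 standing in for fp8)
    and flush the running partial sum into a float32 accumulator every
    `interval` products, so low-precision rounding error cannot pile up."""
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    acc32 = np.float32(0.0)        # high-precision master accumulator
    partial16 = np.float16(0.0)    # low-precision running partial sum
    for i in range(a16.size):
        partial16 = np.float16(partial16 + a16[i] * b16[i])
        if (i + 1) % interval == 0:          # periodic promotion to fp32
            acc32 += np.float32(partial16)
            partial16 = np.float16(0.0)
    return acc32 + np.float32(partial16)

rng = np.random.default_rng(0)
x, y = rng.normal(size=4096), rng.normal(size=4096)
print("fp64 reference     :", float(x @ y))
print("fp16, never flushed:", float(dot_with_periodic_promotion(x, y, interval=4096)))
print("fp16, flush every 4:", float(dot_with_periodic_promotion(x, y, interval=4)))
```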
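The 37B-of-671B activation figure follows from top-k expert routing: only the experts the router picks for a token actually run. The sketch below is a generic top-k gating layer, not DeepSeek's exact design (V3 also uses shared experts and its own load-balancing scheme); the dimensions, expert count, and k are illustrative assumptions.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Generic top-k mixture-of-experts layer for a single token.

    x       : (d,) token activation
    gate_w  : (d, n_experts) router weights
    experts : list of (w1, w2) feed-forward weights, one pair per expert
    Only the k highest-scoring experts run, so the parameters touched per
    token are roughly k/n_experts of the layer's total.
    """
    scores = x @ gate_w                    # router logits, one per expert
    top = np.argsort(scores)[-k:]          # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()               # softmax over the selected experts only
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):
        w1, w2 = experts[idx]
        out += w * (np.maximum(x @ w1, 0.0) @ w2)   # tiny ReLU FFN per expert
    return out

rng = np.random.default_rng(0)
d, hidden, n_experts = 64, 256, 8
gate_w = rng.normal(size=(d, n_experts)) * 0.02
experts = [(rng.normal(size=(d, hidden)) * 0.02,
            rng.normal(size=(hidden, d)) * 0.02) for _ in range(n_experts)]
token = rng.normal(size=d)
y = moe_forward(token, gate_w, experts, k=2)
print(y.shape, "(only 2 of 8 expert FFNs ran for this token)")
```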
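The MLA cache saving is shape arithmetic: instead of storing full per-head keys and values for every past token, the model caches one small latent vector per token and re-expands it when attention needs K and V. The sketch below shows only that arithmetic; the dimensions are made-up assumptions, and DeepSeek's decoupled rotary-embedding handling is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model, d_latent, seq_len = 16, 128, 2048, 256, 1024

# Standard multi-head attention caches full per-head K and V per past token.
kv_cache_std = 2 * n_heads * d_head * seq_len              # floats cached

# MLA caches one small latent vector per token instead.
w_down = rng.normal(size=(d_model, d_latent)) * 0.02            # compress h_t -> c_t
w_up_k = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # reconstruct K
w_up_v = rng.normal(size=(d_latent, n_heads * d_head)) * 0.02   # reconstruct V

h = rng.normal(size=(seq_len, d_model))     # hidden states of past tokens
c = h @ w_down                              # (seq_len, d_latent): all that is cached
k = (c @ w_up_k).reshape(seq_len, n_heads, d_head)   # expanded on the fly
v = (c @ w_up_v).reshape(seq_len, n_heads, d_head)

kv_cache_mla = d_latent * seq_len
print(f"standard KV cache: {kv_cache_std:,} floats")
print(f"MLA latent cache : {kv_cache_mla:,} floats "
      f"({1 - kv_cache_mla / kv_cache_std:.1%} smaller)")
```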
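The R1-Zero training signal is rule-based rather than learned: each sampled answer is scored for correctness and for following the required format, and GRPO turns those scores into advantages by normalizing within the sampled group instead of using a value network. The sketch below uses illustrative reward weights and a trivial arithmetic task, and shows only the reward-and-advantage step, not the full policy-gradient update.

```python
import re
import numpy as np

def reward(completion, ground_truth):
    """Rule-based reward in the spirit of R1-Zero: accuracy plus format, no learned judge."""
    fmt_ok = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                            completion, flags=re.S))
    m = re.search(r"<answer>(.*?)</answer>", completion, flags=re.S)
    correct = m is not None and m.group(1).strip() == ground_truth
    return 1.0 * correct + 0.2 * fmt_ok      # weights are illustrative assumptions

def group_relative_advantages(rewards):
    """GRPO's core step: advantage = (r - mean(group)) / std(group),
    computed over a group of completions sampled for the same prompt."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# A group of sampled completions for the prompt "What is 7 * 6?"
group = [
    "<think>7 times 6 is 42</think> <answer>42</answer>",
    "<think>guessing</think> <answer>48</answer>",
    "42",                                   # right number, wrong format
]
rewards = [reward(c, "42") for c in group]
print("rewards   :", rewards)
print("advantages:", group_relative_advantages(rewards).round(3))
```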
2025-02-05 · Watch on YouTube