The Engineering Unlocks Behind DeepSeek | YC Decoded

Summary based on the YouTube transcript and episode description.

YC GP Diana Hu breaks down DeepSeek V3 and R1’s core engineering innovations and why the $5.5M training cost figure is misleading.

  • DeepSeek’s $5.5M figure covers only the final V3 training run, excluding R&D and hardware costs likely in the hundreds of millions.
  • V3 uses fp8 training with periodic promotion of partial sums into fp32 accumulators, cutting memory overhead without letting numerical errors compound (see the accumulation sketch after this list).
  • Mixture-of-experts architecture activates only 37B of 671B parameters per token, roughly 11x fewer than Llama 3 405B, which activates every parameter (see the routing sketch after this list).
  • Multi-head latent attention (MLA), first published in DeepSeek V2 (May 2024), compresses the KV cache by 93.3% and boosts generation throughput 5.76x (see the latent-cache sketch after this list).
  • R1-Zero achieved top-tier reasoning through pure RL, with no human or AI reasoning examples: outputs were graded only on accuracy and formatting via GRPO (see the reward sketch after this list).
  • A UC Berkeley lab reproduced R1-Zero’s core techniques in a smaller model for just $30.
  • OpenAI released o3-mini two weeks after R1, outperforming R1 on key benchmarks, signaling rapid acceleration at the frontier.
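
The accumulation sketch: the fp8 fix amounts to never letting a long reduction live entirely in low precision. Below is a minimal numpy illustration of the idea, using float16 as a stand-in since numpy has no fp8 dtype; the 128-element promotion interval follows the V3 technical report's description, while the function name and toy sizes are mine.

```python
import numpy as np

def blocked_lowprec_matmul(a, b, promote_every=128):
    """Partial products run in low precision (float16 here, standing in
    for fp8), but every `promote_every` elements of the inner dimension
    the partial sum is promoted into an fp32 accumulator, so rounding
    error cannot compound across the whole reduction."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    a_lo = a.astype(np.float16)
    b_lo = b.astype(np.float16)
    out = np.zeros((m, n), dtype=np.float32)   # high-precision running total
    for start in range(0, k, promote_every):
        end = min(start + promote_every, k)
        # Low-precision partial product over one 128-element slice
        # (on real hardware this is the fp8 tensor-core step) ...
        partial = a_lo[:, start:end] @ b_lo[start:end, :]
        # ... promoted to fp32 before joining the running total.
        out += partial.astype(np.float32)
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 4096)).astype(np.float32)
b = rng.standard_normal((4096, 4)).astype(np.float32)
exact = a @ b
approx = blocked_lowprec_matmul(a, b)
print("max abs error vs fp32 matmul:", np.abs(approx - exact).max())
```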
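The routing sketch: a minimal top-k mixture-of-experts layer in numpy, showing why only a fraction of parameters fire per token. This is generic softmax gating with toy sizes, not V3's exact router (V3 adds a shared expert and an auxiliary-loss-free load-balancing scheme).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 64, 256, 8, 2   # toy sizes, not V3's

# Each expert is a small two-layer FFN (random weights for illustration).
experts = [
    (rng.standard_normal((d_model, d_ff)) * 0.02,
     rng.standard_normal((d_ff, d_model)) * 0.02)
    for _ in range(n_experts)
]
w_gate = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-k experts; only those experts run."""
    scores = x @ w_gate                              # (tokens, n_experts)
    probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    topk = np.argsort(-probs, axis=-1)[:, :top_k]    # chosen expert ids
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # per token
        gate = probs[t, topk[t]]
        gate = gate / gate.sum()                     # renormalize over chosen
        for g, e in zip(gate, topk[t]):
            w1, w2 = experts[e]
            out[t] += g * (np.maximum(x[t] @ w1, 0) @ w2)
    return out

x = rng.standard_normal((5, d_model))
y = moe_layer(x)
print(f"output {y.shape}, ~{top_k / n_experts:.0%} of expert params active per token")
```

In V3 the ratio is far sharper: 8 routed experts out of 256 per layer, plus one shared expert, which is how 671B total parameters reduce to 37B active.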
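The latent-cache sketch: the core of MLA is caching one small down-projected vector per token and re-expanding keys and values from it at decode time. Dimensions below are toy (so the printed compression ratio differs from the 93.3% reported for V2), and the sketch omits MLA's decoupled rotary-embedding path.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 64   # toy sizes

# Down-projection: produces the only tensor that must be cached per token.
w_down = rng.standard_normal((d_model, d_latent)) * 0.02
# Up-projections: reconstruct per-head keys and values from the latent.
w_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
w_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

seq = rng.standard_normal((512, d_model))        # 512 cached tokens

# Standard attention caches K and V per head: 2 * n_heads * d_head floats/token.
standard_cache = 2 * n_heads * d_head
# MLA caches only the latent: d_latent floats/token.
latent = seq @ w_down                            # (512, d_latent)
mla_cache = d_latent
print(f"cache per token: {standard_cache} -> {mla_cache} floats "
      f"({1 - mla_cache / standard_cache:.1%} smaller)")

# At decode time, K and V are re-expanded from the cached latent.
k = (latent @ w_up_k).reshape(-1, n_heads, d_head)
v = (latent @ w_up_v).reshape(-1, n_heads, d_head)

q = rng.standard_normal((n_heads, d_head))       # query for the new token
scores = np.einsum('hd,thd->ht', q, k) / np.sqrt(d_head)
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
out = np.einsum('ht,thd->hd', weights, v)        # (n_heads, d_head)
print("attention output:", out.shape)
```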
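The reward sketch: the two pieces the R1-Zero bullet names are a rule-based reward (accuracy plus formatting, no learned reward model) and GRPO's group-relative advantage, which replaces a value network with normalization over G samples from the same prompt. The reward rubric and tag names below are illustrative, not DeepSeek's published rubric, and the clipped policy-gradient update itself is omitted.

```python
import re
import numpy as np

def reward(completion, reference_answer):
    """Rule-based reward in the spirit of R1-Zero: graded only on
    output accuracy and formatting."""
    r = 0.0
    if re.search(r"<think>.*?</think>", completion, re.S):
        r += 0.1                                  # format: reasoning tags present
    m = re.search(r"<answer>(.*?)</answer>", completion, re.S)
    if m and m.group(1).strip() == reference_answer:
        r += 1.0                                  # accuracy: final answer correct
    return r

def grpo_advantages(rewards):
    """GRPO's core trick: advantages are relative to the group of G
    samples for the same prompt, so no value network is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of G = 4 sampled completions, reference answer "42".
group = [
    "<think>6 * 7</think><answer>42</answer>",
    "<think>6 + 7</think><answer>13</answer>",
    "<answer>42</answer>",                        # right answer, missing tags
    "no structure at all",
]
rewards = [reward(c, "42") for c in group]
for c, r, a in zip(group, rewards, grpo_advantages(rewards)):
    print(f"r={r:.1f}  adv={a:+.2f}  {c[:40]!r}")
# Completions rewarded above the group mean get positive advantage and
# are reinforced; the rest are pushed down.
```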

2025-02-05 · Watch on YouTube