OpenAI vs. DeepSeek vs. Qwen: Comparing Open Source LLM Architectures
YC’s Ankit Gupta compares GPT OSS, Qwen 3, and DeepSeek V3 architectures, finding similar benchmark results from surprisingly different engineering choices.
- GPT OSS is OpenAI’s first open-weights model since GPT-2 in 2019, released as 120B- and 20B-parameter mixture-of-experts (MoE) models.
- DeepSeek V3 activates only 37B of its 671B parameters per token; V3.1 adds hybrid thinking mode and two-phase long-context training.
- Qwen 3 was trained on 36 trillion tokens, twice as many as Qwen 2.5, including trillions of synthetic tokens generated by prior Qwen models.
- Qwen’s RL reasoning stage used only ~4,000 query-verifier pairs, suggesting that strong reasoning results can come from far less RL data than commonly assumed.
- The three models reach 128K context via different routes: GPT OSS bakes it in at pre-training; DeepSeek stages it via fine-tuning; Qwen applies YaRN scaling at inference without extra retraining.
- DeepSeek V3 uses MLA (multi-head latent attention), which compresses keys and values into a smaller latent vector before caching them; per the talk, this beats GQA (grouped-query attention) on both KV-cache memory and modeling quality at scale.
- Dataset engineering is likely the core moat: labs reveal architecture but obscure data, making replication hard despite open weights.
- Gupta’s meta-observation: top models draw on broadly the same toolbox yet reach similar benchmark scores through very different combinations of choices, and there is no first-principles explanation for why any one approach wins.
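
To make the KV-cache point about MLA and GQA concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the talk. The layer count, head count, head size, and latent sizes are assumptions chosen to approximate DeepSeek V3's published configuration, and the 8-group GQA variant is purely hypothetical, included only for comparison. The sketch counts how many values each attention scheme must cache per token and scales that to a 128K-token context at 2 bytes per value.

```python
# Back-of-the-envelope KV-cache sizing for three attention schemes.
# All dimensions below are illustrative assumptions, not figures from the talk.

N_LAYERS = 61          # assumed number of transformer layers
N_HEADS = 128          # assumed number of attention heads
HEAD_DIM = 128         # assumed per-head dimension
N_KV_HEADS = 8         # hypothetical GQA setting: 8 shared key/value heads
LATENT_DIM = 512       # assumed MLA compressed KV latent size
ROPE_DIM = 64          # assumed MLA decoupled positional key size
BYTES_PER_VALUE = 2    # 16-bit cache entries
CONTEXT = 128 * 1024   # 128K tokens


def mha_per_layer(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value per head."""
    return 2 * n_heads * head_dim


def gqa_per_layer(n_kv_heads: int, head_dim: int) -> int:
    """Grouped-query attention caches keys/values only for the shared KV heads."""
    return 2 * n_kv_heads * head_dim


def mla_per_layer(latent_dim: int, rope_dim: int) -> int:
    """MLA caches one compressed latent plus a small positional key per token."""
    return latent_dim + rope_dim


def cache_gib(per_layer_values: int) -> float:
    """Total cache size in GiB for one sequence of CONTEXT tokens."""
    total_bytes = per_layer_values * N_LAYERS * CONTEXT * BYTES_PER_VALUE
    return total_bytes / 2**30


if __name__ == "__main__":
    schemes = {
        "MHA": mha_per_layer(N_HEADS, HEAD_DIM),
        "GQA (8 KV heads, hypothetical)": gqa_per_layer(N_KV_HEADS, HEAD_DIM),
        "MLA": mla_per_layer(LATENT_DIM, ROPE_DIM),
    }
    for name, per_layer in schemes.items():
        print(f"{name}: {per_layer} values/token/layer, {cache_gib(per_layer):.2f} GiB")
```

Under these assumed dimensions, a single 128K-token sequence needs about 488 GiB of cache with plain multi-head attention, 30.5 GiB with the hypothetical 8-group GQA, and roughly 8.6 GiB with MLA. The sketch covers only the memory side of the claim; it says nothing about modeling quality.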
2025-08-29 · Watch on YouTube