This post examines how a fixed training compute budget in large language models should be split between model parameter count and the amount of training data.
Key Takeaways
The core tension: given a fixed compute budget, you can train a larger model for fewer steps or a smaller model longer.
Kaplan et al.'s (2020) scaling laws suggested parameter count dominated; later work (Hoffmann et al.'s Chinchilla, 2022) challenged this by showing that models of that era were substantially undertrained on data.
Optimal allocation depends on whether you optimize for training cost alone or total cost including inference: a smaller, well-trained model wins on inference cost once serving volume is large.
The question is practically relevant for teams choosing between buying more GPUs for longer runs versus upgrading to larger architectures.
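The trade-off in the takeaways above can be sketched numerically. A minimal example, assuming the common C ≈ 6·N·D FLOPs approximation and the Chinchilla-style heuristic of roughly 20 training tokens per parameter (both are rough rules of thumb, not exact laws; the budget value is hypothetical):

```python
def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens affordable under a fixed budget, using the
    common C ~= 6 * N * D FLOPs approximation (a rough estimate)."""
    return compute_flops / (6 * n_params)

def heuristic_optimal_params(compute_flops: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal parameter count under the Chinchilla-style
    rule of thumb D ~= 20 * N: solve C = 6 * N * (20 * N) for N."""
    return (compute_flops / (6 * tokens_per_param)) ** 0.5

C = 1e23  # hypothetical training budget in FLOPs

# Fixed budget: a bigger model sees fewer tokens, a smaller one sees more.
for n in (7e9, 70e9):
    d = tokens_for_budget(C, n)
    print(f"{n/1e9:.0f}B params -> {d/1e9:.0f}B tokens ({d/n:.1f} tokens/param)")

n_opt = heuristic_optimal_params(C)
d_opt = tokens_for_budget(C, n_opt)
print(f"heuristic optimum: {n_opt/1e9:.0f}B params, {d_opt/1e9:.0f}B tokens")
```

Under this heuristic, a 70B-parameter model on the same budget sees only a tenth of the tokens a 7B model would, which is the tension the takeaways describe.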