How to Scale AI Application Inference 100x ft. Fireworks’ Lin Qiao
Watch on YouTube ↗
Summary based on the YouTube transcript and episode description.
Fireworks AI CEO Lin Qiao argues inference cost can drop 100x through joint post-training and inference co-optimization tuned to each application’s data distribution.
- Inference optimization is a 3D problem: quality, speed, and cost (driven by concurrency) must be solved jointly, not one dimension at a time.
- Today’s inference cost sets the ‘waterline’ — dropping it 10–100x unlocks a wave of applications that can’t yet reach sustainable scale.
- The future scaling law is application-specific: align model training data distribution to your production workload distribution.
- Off-the-shelf models plus prompt engineering make a weak moat; data flywheels and post-training tuned to your application are the actual moat.
- Combinatorial explosion: speculative decoding, hardware selection, model sharding, distributed inference, kernel selection, and tuning mechanisms create 100,000+ optimization combinations.
- A food-chain company scaled one AI feature from 1 store to 1,000 stores in 3 months on Fireworks.
- A software dev company rolled out an AI feature from 100,000 to 25 million developers in 3 months on Fireworks.
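The "100,000+ combinations" claim follows directly from multiplying the choices along each optimization axis. A minimal sketch of that arithmetic, using hypothetical per-dimension option counts (the specific options below are illustrative assumptions, not Fireworks' actual search space):

```python
from math import prod

# Hypothetical option counts per optimization dimension.
# These names and counts are illustrative, not Fireworks' real configuration.
dimensions = {
    "speculative_decoding": ["off", "draft-model", "medusa", "eagle"],      # 4
    "hardware":             ["H100", "A100", "MI300X", "L40S", "B200"],     # 5
    "sharding":             ["none", "tensor", "pipeline", "both"],         # 4
    "distribution":         ["single-node", "multi-node", "disaggregated"], # 3
    "kernel_set":           [f"kernels_v{i}" for i in range(10)],           # 10
    "quantization":         ["fp16", "fp8", "int8", "int4"],                # 4
    "batching_config":      [f"batch_cfg_{i}" for i in range(12)],          # 12
}

# Independent choices multiply: 4 * 5 * 4 * 3 * 10 * 4 * 12 = 115,200.
total = prod(len(options) for options in dimensions.values())
print(f"configurations to evaluate: {total:,}")  # → configurations to evaluate: 115,200
```

Even modest option counts per axis push the joint search space past 100,000 configurations, which is why Fireworks treats the search as something to automate per application rather than hand-tune.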
2025-05-19