How to Scale AI Application Inference 100x ft. Fireworks’ Lin Qiao
Watch on YouTube ↗
Summary based on the YouTube transcript and episode description.
Fireworks AI CEO Lin Qiao argues inference cost can drop 100x through joint post-training and inference co-optimization tuned to each application’s data distribution.
- Inference optimization is a 3D problem: quality, speed, and cost (driven by concurrency) must be solved jointly, not one dimension at a time.
- Today’s inference cost sets the ‘waterline’ — dropping it 10–100x unlocks a wave of applications that can’t yet reach sustainable scale.
- The future scaling law is application-specific: align model training data distribution to your production workload distribution.
- Off-the-shelf models plus prompt engineering make a weak moat; data flywheels and post-training tuned to your application are the actual moat.
- Combinatorial explosion: speculative decoding, hardware selection, model sharding, distributed inference, kernel selection, and tuning mechanisms create 100,000+ optimization combinations.
- A food-chain company scaled one AI feature from 1 store to 1,000 stores in 3 months on Fireworks.
- A software dev company rolled out an AI feature from 100,000 to 25 million developers in 3 months on Fireworks.
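The "100,000+ combinations" claim follows directly from multiplying the choices along each optimization axis. A minimal sketch of that arithmetic, using hypothetical per-dimension option counts (the specific options below are illustrative assumptions, not Fireworks' actual search space):

```python
from math import prod

# Hypothetical option counts per optimization dimension.
# These names and counts are illustrative, not Fireworks' real configuration.
dimensions = {
    "speculative_decoding": ["off", "draft-model", "medusa", "eagle"],      # 4
    "hardware":             ["H100", "A100", "MI300X", "L40S", "B200"],     # 5
    "sharding":             ["none", "tensor", "pipeline", "both"],         # 4
    "distribution":         ["single-node", "multi-node", "disaggregated"], # 3
    "kernel_set":           [f"kernels_v{i}" for i in range(10)],           # 10
    "quantization":         ["fp16", "fp8", "int8", "int4"],                # 4
    "batching_config":      [f"batch_cfg_{i}" for i in range(12)],          # 12
}

# Independent choices multiply: 4 * 5 * 4 * 3 * 10 * 4 * 12 = 115,200.
total = prod(len(options) for options in dimensions.values())
print(f"configurations to evaluate: {total:,}")  # → configurations to evaluate: 115,200
```

Even modest option counts per axis push the joint search space past 100,000 configurations, which is why Fireworks treats the search as something to automate per application rather than hand-tune.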
2025-05-19