Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Article
TL;DR
The paper proposes disaggregating the prefill phase (which produces the KV cache) from decoding, running the two phases in different datacenters to cut LLM inference cost.
Key Takeaways
- Prefill is compute-heavy and cacheable; routing it to idle capacity reduces per-token cost
- Analogous to CDN edge caching for live video, except the cached artifacts are per-user, time-sensitive, and very large
- The real economic win likely comes from time-of-use pricing (prefilling where and when compute sits idle and cheap), not geography alone
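To make the time-of-use argument concrete, here is a minimal break-even sketch: remote prefill pays off only when the compute saving from cheaper off-peak GPU pricing exceeds the cost of shipping the resulting KV cache back. All model dimensions, prices, and throughputs below are hypothetical placeholders, not figures from the paper.

```python
# Toy break-even model for cross-datacenter prefill (all numbers hypothetical).

def kv_cache_bytes(tokens, layers=32, heads=8, head_dim=128, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) * layers * heads * head_dim per token."""
    return 2 * layers * heads * head_dim * bytes_per_elem * tokens

def prefill_cost(tokens, price_per_gpu_hour, tokens_per_gpu_hour):
    """Compute cost of prefilling `tokens` at a given GPU price and throughput."""
    return tokens / tokens_per_gpu_hour * price_per_gpu_hour

def remote_prefill_saving(tokens, local_price, remote_price,
                          tokens_per_gpu_hour, egress_price_per_gb):
    """Saving per request (positive = remote wins) from prefilling off-site
    and transferring the resulting KV cache back to the decode datacenter."""
    compute_saving = (prefill_cost(tokens, local_price, tokens_per_gpu_hour)
                      - prefill_cost(tokens, remote_price, tokens_per_gpu_hour))
    transfer_cost = kv_cache_bytes(tokens) / 1e9 * egress_price_per_gb
    return compute_saving - transfer_cost

# Example: 100k-token context, on-peak $4/GPU-h vs off-peak $1/GPU-h,
# 50k prefill tokens per GPU-hour, $0.02/GB cross-datacenter transfer.
saving = remote_prefill_saving(100_000, 4.0, 1.0, 50_000, 0.02)
print(f"saving per request: ${saving:.2f}")  # → saving per request: $5.74
```

Note the asymmetry the sketch exposes: at these (assumed) numbers the KV cache for 100k tokens is about 13 GB, yet transfer costs only cents while the off-peak compute discount saves dollars, which is why pricing, not distance, drives the economics.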
Discussion
Top comments:
- [martinald]: This is standard CDN caching logic applied to LLM artifacts that are per-user, time-sensitive, and huge