Prefill-as-a-Service: KV Cache of Next-Generation Models Could Go Cross-Datacenter
Article
TL;DR
The paper proposes sharing LLM KV-cache prefill results across datacenters, so that a prompt prefilled once in one datacenter need not be recomputed in another.
Key Takeaways
- Cross-datacenter KV-cache sharing could substantially cut inference cost on repeated popular prompts, since prefill is compute-heavy and its output is reusable
- The constraints are severe: cache entries are time-sensitive, per-user files are massive, and most reuse is scoped to a single user
- Off-peak pricing arbitrage may yield bigger wins than geographic prefill distribution
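To make the reuse idea concrete, here is a minimal sketch of a content-addressed KV-cache router: a prompt prefix is hashed, the local cache is checked first, then a remote datacenter's index, and only on a full miss is prefill recomputed. All names here (`KVCacheRouter`, `fetch_remote`, `run_prefill`) are hypothetical illustrations, not APIs from the paper.

```python
import hashlib

class KVCacheRouter:
    """Illustrative sketch of cross-datacenter KV-cache reuse keyed by prompt prefix."""

    def __init__(self):
        self.local = {}         # prefix_hash -> KV-cache blob held in this datacenter
        self.remote_index = {}  # prefix_hash -> id of the datacenter that holds the blob

    @staticmethod
    def prefix_hash(prompt: str) -> str:
        # Content-address the prompt so identical popular prompts map to the
        # same cache entry regardless of which datacenter computed the prefill.
        return hashlib.sha256(prompt.encode()).hexdigest()

    def get_kv(self, prompt: str):
        key = self.prefix_hash(prompt)
        if key in self.local:
            return self.local[key], "local-hit"
        if key in self.remote_index:
            # Pull the already-computed KV cache from the remote datacenter
            # instead of re-running the prefill locally.
            blob = self.fetch_remote(self.remote_index[key], key)
            self.local[key] = blob
            return blob, "remote-hit"
        # Full miss: pay the prefill compute cost once, then cache the result.
        blob = self.run_prefill(prompt)
        self.local[key] = blob
        return blob, "miss"

    def fetch_remote(self, dc_id: str, key: str) -> str:
        return f"kv-from-{dc_id}:{key[:8]}"  # placeholder for a network transfer

    def run_prefill(self, prompt: str) -> str:
        return f"kv-computed:{self.prefix_hash(prompt)[:8]}"  # placeholder for GPU prefill
```

A repeated popular prompt hits the local cache on the second request, which is the cost saving the paper targets; the per-user and time-sensitivity constraints above would show up as cache eviction and access-control policies layered on top of this lookup.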
Discussion
Top comments:
- [martinald]: This is standard CDN caching logic applied to per-user, time-sensitive, massive LLM files