Prefill-as-a-Service: KV Cache of Next-Generation Models Could Go Cross-Datacenter

· ai llm · Source ↗

Article

TL;DR

The paper proposes sharing LLM KV-cache prefill results across datacenters, so repeated prompts can reuse a cached prefill instead of recomputing it.

Key Takeaways

  • Cross-datacenter KV cache sharing could substantially cut inference cost for frequently repeated prompts
  • Constraints are extreme: caches are time-sensitive, enormous per user, and scoped to a single user's context
  • Off-peak pricing arbitrage may yield bigger wins than geographic prefill distribution
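The reuse idea in the first takeaway can be sketched as a content-addressed lookup: key each prefill result by a hash of the prompt prefix, so an identical prefix served later (possibly from another datacenter) skips the prefill step. This is a minimal illustration, not the paper's design; the class name, the dict standing in for a shared store, and the `prefill_fn` callback are all hypothetical.

```python
import hashlib

class PrefixKVCache:
    """Toy content-addressed cache for prefill results (illustrative only)."""

    def __init__(self):
        self._store = {}   # key -> opaque KV-cache blob; a dict stands in
                           # for a real cross-datacenter object store
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt_prefix: str) -> str:
        # Content-addressed key: identical prefixes map to the same entry.
        return hashlib.sha256(prompt_prefix.encode("utf-8")).hexdigest()

    def get_or_prefill(self, prompt_prefix: str, prefill_fn):
        key = self._key(prompt_prefix)
        if key in self._store:
            self.hits += 1           # reuse: skip the expensive prefill
            return self._store[key]
        self.misses += 1
        blob = prefill_fn(prompt_prefix)   # stand-in for attention prefill
        self._store[key] = blob
        return blob

cache = PrefixKVCache()
fake_prefill = lambda p: f"kv({p})"        # placeholder for real compute
cache.get_or_prefill("system prompt v1", fake_prefill)   # miss: computes
cache.get_or_prefill("system prompt v1", fake_prefill)   # hit: reuses
print(cache.hits, cache.misses)  # → 1 1
```

The second takeaway is what makes this hard in practice: unlike CDN objects, these blobs are huge, short-lived, and usually valid only for one user's context, so the hit rate of such a cache is the whole question.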

Discussion

Top comments:

  • [martinald]: This is standard CDN caching logic applied to per-user, time-sensitive, massive LLM files

Discuss on HN