Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

· ai hardware ·


TL;DR

The paper proposes splitting LLM inference across datacenters: the compute-heavy prefill phase, which builds the KV cache, runs on remote idle capacity, while latency-sensitive decode stays close to the user, cutting per-token inference cost.

Key Takeaways

  • Prefill is compute-heavy and cacheable; routing it to idle capacity reduces per-token cost
  • Analogous to per-user CDN edge caching for live video: huge files, time-sensitive, user-scoped
  • Real economic win likely comes from time-of-use pricing, not geography alone
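The split described above can be sketched in a few lines. This is a toy illustration with hypothetical function names and stand-in arithmetic, not the paper's actual system: a prefill worker processes the whole prompt once and emits a serializable KV cache (in a real model, per-layer attention keys and values), which is shipped as bytes to a decode worker that extends it token by token instead of recomputing the prompt.

```python
import pickle

def prefill(prompt_tokens):
    # Compute-heavy pass over the full prompt; cacheable and user-scoped.
    # Real systems would store per-layer key/value tensors; we fake them.
    kv_cache = [{"k": t * 2, "v": t * 3} for t in prompt_tokens]
    return pickle.dumps(kv_cache)  # serialized so it can cross datacenters

def decode(kv_blob, n_new_tokens):
    # Latency-sensitive loop; reuses the shipped cache rather than
    # redoing prefill. Token selection here is stand-in arithmetic.
    kv_cache = pickle.loads(kv_blob)
    out = []
    for _ in range(n_new_tokens):
        next_tok = sum(entry["k"] for entry in kv_cache) % 50257
        out.append(next_tok)
        kv_cache.append({"k": next_tok * 2, "v": next_tok * 3})
    return out

blob = prefill([1, 2, 3])   # could run on idle remote capacity (off-peak)
tokens = decode(blob, 4)    # runs near the user, reusing the cache
```

The design point the paper leans on is visible even in this sketch: `prefill` is a one-shot batch job with no latency constraint, so it can be scheduled wherever compute is cheapest, while `decode` only needs the (large but shippable) cache bytes.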

Discussion

Top comments:

  • [martinald]: This is standard CDN caching logic applied to per-user, time-sensitive, huge LLM files

Discuss on HN