Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

· ai hardware ·


TL;DR

The paper proposes splitting LLM inference across datacenters: the compute-heavy prefill phase, which builds the KV cache, runs on remote idle capacity, while latency-sensitive decode stays close to the user, cutting per-token inference cost.

Key Takeaways

  • Prefill is compute-heavy and cacheable; routing it to idle capacity reduces per-token cost
  • Analogous to per-user CDN edge caching for live video: huge files, time-sensitive, user-scoped
  • Real economic win likely comes from time-of-use pricing, not geography alone
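The split described above can be sketched in a few lines. This is a toy illustration with hypothetical function names and stand-in arithmetic, not the paper's actual system: a prefill worker processes the whole prompt once and emits a serializable KV cache (in a real model, per-layer attention keys and values), which is shipped as bytes to a decode worker that extends it token by token instead of recomputing the prompt.

```python
import pickle

def prefill(prompt_tokens):
    # Compute-heavy pass over the full prompt; cacheable and user-scoped.
    # Real systems would store per-layer key/value tensors; we fake them.
    kv_cache = [{"k": t * 2, "v": t * 3} for t in prompt_tokens]
    return pickle.dumps(kv_cache)  # serialized so it can cross datacenters

def decode(kv_blob, n_new_tokens):
    # Latency-sensitive loop; reuses the shipped cache rather than
    # redoing prefill. Token selection here is stand-in arithmetic.
    kv_cache = pickle.loads(kv_blob)
    out = []
    for _ in range(n_new_tokens):
        next_tok = sum(entry["k"] for entry in kv_cache) % 50257
        out.append(next_tok)
        kv_cache.append({"k": next_tok * 2, "v": next_tok * 3})
    return out

blob = prefill([1, 2, 3])   # could run on idle remote capacity (off-peak)
tokens = decode(blob, 4)    # runs near the user, reusing the cache
```

The design point the paper leans on is visible even in this sketch: `prefill` is a one-shot batch job with no latency constraint, so it can be scheduled wherever compute is cheapest, while `decode` only needs the (large but shippable) cache bytes.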

Discussion

Top comments:

  • [martinald]: This is standard CDN caching logic applied to per-user, time-sensitive, huge LLM files

Discuss on HN