Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Article
TL;DR
Paper proposes CDN-style cross-datacenter KV cache sharing to eliminate redundant LLM prefill compute.
Key Takeaways
- KV caches are massive, time-sensitive, and per-user — analogous to live video CDN per session
- Cross-datacenter prefill sharing could eliminate redundant cost for shared system prompts at scale
- Top commenter calls it standard caching applied to LLMs — not fundamentally new architecture
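The takeaways above hinge on one mechanism: if two requests share the same prompt prefix (e.g. a common system prompt), the KV cache computed during prefill at one datacenter can be reused at another instead of being recomputed. A minimal sketch of that idea, assuming a content-addressed registry keyed by a hash of the prefix tokens (the `PrefixKVCache` class and its methods are illustrative, not from the paper):

```python
import hashlib

class PrefixKVCache:
    """Toy cross-datacenter KV-cache registry: maps a hash of a shared
    prompt prefix (e.g. a system prompt) to the datacenter that holds
    its precomputed KV cache, so other sites can fetch the cache over
    the network instead of redoing the prefill compute locally."""

    def __init__(self):
        # prefix hash -> (datacenter, kv_blob)
        self._index = {}

    @staticmethod
    def _key(prefix_tokens):
        # Content-address the prefix so identical shared prompts
        # collide on purpose, regardless of which site computed them.
        joined = " ".join(str(t) for t in prefix_tokens)
        return hashlib.sha256(joined.encode()).hexdigest()

    def publish(self, prefix_tokens, datacenter, kv_blob):
        """Register a freshly prefilled KV cache under its prefix hash."""
        self._index[self._key(prefix_tokens)] = (datacenter, kv_blob)

    def lookup(self, prefix_tokens):
        """Return (datacenter, kv_blob) on a hit; None means the
        caller must run prefill locally (a cache miss)."""
        return self._index.get(self._key(prefix_tokens))
```

Usage mirrors the shared-system-prompt scenario: one site publishes the cache, a second site hits it and skips prefill entirely; only a novel prefix misses. The real system would also need eviction, staleness handling across model versions, and transfer-vs-recompute cost accounting, none of which this sketch covers.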
Discussion
Top comments:
- [martinald]: Standard CDN caching concepts applied to LLMs — time-sensitive, huge, per-user files. Sort of reminds me of video streaming on CDNs for live video (but per user)?