Prefill-as-a-Service: KV Cache of Next-Generation Models Could Go Cross-Datacenter

· ai llm hardware ·

TL;DR

Sharing precomputed KV-cache prefill results across datacenters could slash the cost of redundant inference compute.

Key Takeaways

  • Prefill is expensive; caching and distributing prefill results like CDN video chunks avoids recomputing the same work repeatedly (see the sketch after this list)
  • Per-user context, time sensitivity, and massive cache sizes make this harder than standard CDN caching
  • Time-of-use inference pricing may matter more in practice than cross-DC prefill sharing
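
The CDN analogy suggests a content-addressed lookup keyed by the prompt prefix. Here is a minimal sketch of that idea, assuming block-aligned prefixes; every name in it (`PrefixKVStore`, `BLOCK`, the serialized blob format) is hypothetical, and real systems such as vLLM's automatic prefix caching operate on token blocks in GPU memory rather than on serialized blobs:

```python
# Hypothetical sketch: content-addressed KV-cache lookup, analogous to
# fetching CDN chunks. Not any production system's actual API.
from __future__ import annotations

import hashlib

BLOCK = 256  # assumed tokens per cacheable prefill "chunk"


class PrefixKVStore:
    """Maps a hash of a token prefix to a serialized KV blob."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    @staticmethod
    def _key(tokens: list[int]) -> str:
        # The key covers the *entire* prefix, so a hit guarantees the
        # cached KV tensors are valid for exactly these tokens in order.
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens: list[int], kv_blob: bytes) -> None:
        self._blobs[self._key(tokens)] = kv_blob

    def longest_prefix_hit(self, tokens: list[int]) -> tuple[int, bytes | None]:
        """Return (n_cached_tokens, blob) for the longest block-aligned hit."""
        for n in range(len(tokens) - len(tokens) % BLOCK, 0, -BLOCK):
            blob = self._blobs.get(self._key(tokens[:n]))
            if blob is not None:
                return n, blob  # prefill only needs to run on tokens[n:]
        return 0, None
```

Keying on a hash of the whole prefix is what makes sharing safe: a hit guarantees the cached tensors correspond to exactly those tokens, so a remote datacenter could reuse the blob and run prefill only on the unmatched tail.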

Discussion

Top comments:

  • [martinald]: Standard CDN caching logic but with time-sensitive, per-user, huge files
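
To put "huge files" in perspective, a back-of-envelope estimate of one prompt's KV-cache size. The model dimensions below are assumed for illustration (a 70B-class grouped-query-attention configuration), not taken from the article:

```python
# Assumed dims: 80 layers, 8 KV heads, head_dim 128, fp16 elements.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

# K and V each store n_kv_heads * head_dim values per layer per token.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(bytes_per_token)                           # 327,680 B ~ 320 KiB/token

context_tokens = 32_768
print(bytes_per_token * context_tokens / 2**30)  # ~10 GiB for one 32k prompt
```

At roughly 10 GiB for a single long prompt, moving a cache entry between datacenters only pays off when the transfer is cheaper and faster than simply recomputing the prefill, which is the crux of the comment.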

Discuss on HN