KV Sharing, mHC, and Compressed Attention


TLDR

  • Recent open-weight LLMs (Gemma 4, DeepSeek V4, ZAYA1, Laguna XS.2) are converging on architecture tricks to cut KV-cache memory and attention cost for long-context and agent workloads.

Key Takeaways

  • Gemma 4 E2B/E4B use cross-layer KV sharing: only the first ~15-24 layers compute KV projections; later layers reuse them, saving ~2.7-6 GB at 128K context in bfloat16.
  • Per-layer embeddings (PLE) in the Gemma 4 E-series decouple embedding capacity from transformer-stack compute, so per-token compute tracks the smaller (effective) parameter count even though the total parameter count is higher.
  • DeepSeek V4 introduces mHC (manifold-constrained hyper-connections, a change to the residual stream rather than an attention mechanism) alongside compressed attention; ZAYA1-8B uses compressed convolutional attention; Laguna XS.2 applies layer-wise attention budgeting.
  • KV sharing is an approximation that reduces model capacity, though the NeurIPS 2024 cross-layer attention paper reports minimal impact on the small models it tested.
  • No ablation comparing Gemma 4 E2B to a plain 2.3B or 5.1B dense model is publicly available, leaving the PLE tradeoff unquantified.
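
To make the KV-sharing arithmetic concrete, here is a minimal sketch of how cache size scales with the number of layers that store their own K/V. The configuration values (32 layers, 4 KV heads, head dimension 128) are illustrative assumptions, not the published Gemma 4 settings, and `kv_source_layer` uses a hypothetical "reuse the last KV-computing layer" rule rather than Gemma's actual mapping. The estimate also assumes every layer caches the full context, so it overstates the cache for models that mix in sliding-window layers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KVConfig:
    """Illustrative attention shape; all values are assumptions, not Gemma 4's."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_elem: int = 2  # bfloat16 stores 2 bytes per element


def kv_cache_bytes(cfg: KVConfig, seq_len: int, kv_layers: Optional[int] = None) -> int:
    """KV-cache size in bytes for one sequence.

    kv_layers is the number of layers that store their own K/V.
    None means every layer does (no cross-layer sharing).
    """
    layers = cfg.num_layers if kv_layers is None else kv_layers
    if not 0 < layers <= cfg.num_layers:
        raise ValueError("kv_layers must be in 1..num_layers")
    # Factor 2: one K tensor and one V tensor per caching layer.
    return 2 * layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_elem * seq_len


def kv_source_layer(layer_idx: int, kv_layers: int) -> int:
    """Hypothetical sharing rule: layers past the cutoff read the K/V
    produced by the last layer that still computes its own."""
    return min(layer_idx, kv_layers - 1)


GIB = 1024 ** 3
cfg = KVConfig(num_layers=32, num_kv_heads=4, head_dim=128)
context = 128 * 1024  # 128K tokens

full = kv_cache_bytes(cfg, context)
shared = kv_cache_bytes(cfg, context, kv_layers=16)
print(f"no sharing : {full / GIB:.1f} GiB")             # 8.0 GiB
print(f"16 KV layers: {shared / GIB:.1f} GiB")           # 4.0 GiB
print(f"saved      : {(full - shared) / GIB:.1f} GiB")   # 4.0 GiB
```

The saving is linear in the number of layers that stop caching, which is why the reported figure grows with context length and with how early the sharing cutoff sits.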

Hacker News Comment Review

  • No substantive HN discussion yet.
