KV Sharing, MHC, and Compressed Attention

May 19, 2026 · ai ai-agents design

TLDR

Recent open-weight LLMs (Gemma 4, DeepSeek V4, ZAYA1, Laguna XS.2) are converging on architecture tricks to cut KV-cache memory and attention cost for long-context and agent workloads.

Gemma 4 E2B/E4B use cross-layer KV sharing: only the first ~15-24 layers compute KV projections; later layers reuse them, saving ~2.7-6 GB at 128K context in bfloat16.
Per-layer embeddings (PLE) in the Gemma 4 E-series separate embedding capacity from transformer-stack compute, so the active parameter count (and per-token compute) stays near the smaller figure while the total parameter count is higher.
DeepSeek V4 introduces mHC (manifold-constrained hyper-connections); ZAYA1-8B uses compressed convolutional attention; Laguna XS.2 applies layer-wise attention budgeting.
KV sharing is an approximation that reduces model capacity, though the NeurIPS 2024 cross-layer attention paper reports minimal impact on the small models it tested.
No ablation comparing Gemma 4 E2B to a plain 2.3B or 5.1B dense model is publicly available, leaving the PLE tradeoff unquantified.
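
To make the KV-sharing savings concrete, here is a rough back-of-the-envelope sketch of the arithmetic. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not the published Gemma 4 configuration, and the sketch assumes every layer attends over the full context (it ignores sliding-window layers, which would shrink the cache further). The only point is that cache size scales linearly with the number of layers that store their own K and V.

```python
def kv_cache_bytes(kv_layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a single sequence.

    bytes_per_elem=2 corresponds to bfloat16.
    """
    # Factor of 2: each caching layer stores one K tensor and one V tensor.
    return 2 * kv_layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration, chosen for round numbers only.
TOTAL_LAYERS = 30        # layers in the stack (assumption)
KV_LAYERS = 15           # layers that compute and store their own K/V (assumption)
KV_HEADS = 2             # KV heads per layer (assumption)
HEAD_DIM = 256           # dimension per head (assumption)
SEQ_LEN = 128 * 1024     # 128K-token context
GIB = 1024 ** 3

# Without sharing, every layer keeps its own cache.
full = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
# With cross-layer sharing, only the first KV_LAYERS layers keep a cache;
# the remaining layers read K/V from an earlier layer.
shared = kv_cache_bytes(KV_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
saved_gib = (full - shared) / GIB

print(f"no sharing:   {full / GIB:.2f} GiB")
print(f"with sharing: {shared / GIB:.2f} GiB")
print(f"saved:        {saved_gib:.2f} GiB")
```

Under these made-up numbers the cache drops from 7.50 GiB to 3.75 GiB. Plugging in a model's real layer counts and head shapes would give the corresponding figure for that model.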