Accelerating Gemma 4: faster inference with multi-token prediction drafters


TLDR

  • Google releases MTP drafters for Gemma 4, delivering up to 3x inference speedup via speculative decoding with zero output quality degradation.

Key Takeaways

  • MTP drafters pair a lightweight drafter model with the heavy Gemma 4 target (e.g., 31B Dense, 26B MoE); the target verifies the drafted tokens in a single parallel forward pass (see the first sketch after this list).
  • Standard LLM decoding is memory-bandwidth bound, leaving compute idle on every token; speculative decoding uses that spare compute to check several drafted tokens in one target pass for roughly the latency of generating one.
  • Drafter models share the target’s KV cache and activations, avoiding redundant context recomputation.
  • Edge models (E2B, E4B) get an additional embedding-clustering optimization that cuts the logit-computation bottleneck (see the second sketch after this list).
  • Batch sizes of 4-8 unlock a ~2.2x speedup on Apple Silicon for the 26B MoE, with similar gains on an Nvidia A100. Available now on Hugging Face, Kaggle, vLLM, MLX, SGLang, and Ollama under Apache 2.0.
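
The first two takeaways describe the standard draft-and-verify loop of speculative decoding. Below is a minimal greedy-decoding sketch of that loop; the toy `Model` callables, `speculative_step`, and `k` are illustrative stand-ins, not the Gemma 4 or vLLM API.

```python
from typing import Callable, List

Token = int
# A "model" here is any function that maps a token sequence to a greedy
# next-token prediction for every position (i.e., one batched forward pass):
# preds[j] is the predicted token following seq[:j+1].
Model = Callable[[List[Token]], List[Token]]

def speculative_step(target: Model, drafter: Model,
                     prefix: List[Token], k: int = 4) -> List[Token]:
    """One draft-and-verify round of greedy speculative decoding.

    The cheap drafter proposes k tokens autoregressively; the expensive
    target then scores prefix + draft in a single parallel forward pass
    and keeps the longest agreeing prefix of the draft plus one token of
    its own, so the output matches running the target alone.
    """
    assert prefix, "prefix must be non-empty"

    # 1. Draft: k sequential steps of the *cheap* model.
    draft: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        nxt = drafter(ctx)[-1]              # drafter's greedy next token
        draft.append(nxt)
        ctx.append(nxt)

    # 2. Verify: ONE forward pass of the *expensive* model over
    #    prefix + draft scores every draft position at once.
    preds = target(prefix + draft)
    base = len(prefix) - 1                  # target's token right after prefix

    accepted: List[Token] = []
    for i, tok in enumerate(draft):
        if preds[base + i] != tok:            # first disagreement: stop and
            accepted.append(preds[base + i])  # keep the target's own token
            return accepted
        accepted.append(tok)                # agreement: this token was "free"

    # All k drafted tokens accepted; take a bonus token from the target.
    accepted.append(preds[base + k])
    return accepted

# Toy demo: both "models" predict (token + 1) % 100 at every position, so
# all k drafted tokens are accepted and each round yields k + 1 tokens.
toy = lambda seq: [(t + 1) % 100 for t in seq]
print(speculative_step(toy, toy, [1, 2, 3], k=4))   # -> [4, 5, 6, 7, 8]
```

Each accepted token costs one drafter step instead of one target step, which is where the headline speedup comes from; a disagreement wastes only the cheap drafted suffix, and the accept-or-replace rule is why output quality is unchanged.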
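The source doesn't spell out the embedding-clustering scheme used for the edge models, so the following is a hypothetical two-stage sketch of the general idea: score a small set of cluster centroids first, then compute exact logits only for tokens in the most promising clusters instead of over the full vocabulary. All names and sizes here (`approx_logits`, `C`, `top_clusters`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, C = 32_000, 256, 64                   # vocab size, hidden dim, clusters

emb = rng.standard_normal((V, D)).astype(np.float32)   # output embeddings

# Offline: partition the vocabulary by assigning each output embedding to
# one of C centroids (crude random-centroid assignment here; a real system
# would use k-means or similar).
centroids = emb[rng.choice(V, size=C, replace=False)]
assign = np.argmax(emb @ centroids.T, axis=1)
members = [np.flatnonzero(assign == c) for c in range(C)]

def approx_logits(h: np.ndarray, top_clusters: int = 8):
    """Two-stage logits: C centroid dot products, then exact logits for
    only the tokens in the best-scoring clusters, skipping most of the
    full V x D matmul that bottlenecks small models."""
    cluster_scores = centroids @ h                      # C dots, not V
    keep = np.argsort(cluster_scores)[-top_clusters:]   # promising clusters
    cand = np.concatenate([members[c] for c in keep])   # candidate token ids
    return cand, emb[cand] @ h                          # exact, but small

h = rng.standard_normal(D).astype(np.float32)
cand, logits = approx_logits(h)
print(cand[np.argmax(logits)])              # approximate greedy next token
```

This trades a small risk of missing the true argmax for a large cut in per-token compute, which matters most on edge hardware where the output projection over the vocabulary dominates.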

Hacker News Comment Review

  • Support for MTP drafters in MLX and llama.cpp is not yet merged, so local Mac users cannot use them despite drafter options already appearing in tools like LM Studio.
  • Commenters note that crossing 100 tokens/second is a qualitative threshold for usability; Qwen3 27B currently holds a speed edge over Gemma 4 on local hardware, making this release directly competitive.
  • Google’s choice to release Gemma 4 open-source without pushing paid cloud inference puzzles commenters; one theory is a pricing conflict with Gemini Flash and thin margins on small models.

Notable Comments

  • @dvt: MLX MTP support is still an open PR, confirming LM Studio’s drafter toggle is non-functional on Apple Silicon today.
