Google releases MTP drafters for Gemma 4, delivering up to a 3x inference speedup via speculative decoding with zero output-quality degradation.
Key Takeaways
MTP drafters pair a lightweight drafter model with the heavy Gemma 4 target (e.g., 31B Dense, 26B MoE); the target verifies drafted tokens in one parallel forward pass.
Standard LLM inference is memory-bandwidth bound: each generated token requires streaming the full model weights through one forward pass. Speculative decoding exploits the idle compute by having the cheap drafter propose several tokens, which the target then verifies in a single parallel forward pass.
Drafter models share the target’s KV cache and activations, avoiding redundant context recomputation.
Edge models (E2B, E4B) get an additional embedding-clustering optimization to cut the logit-calculation bottleneck.
Batch sizes of 4-8 unlock a ~2.2x speedup on Apple Silicon for the 26B MoE, with similar gains on an Nvidia A100. Available now on Hugging Face, Kaggle, vLLM, MLX, SGLang, and Ollama under Apache 2.0.
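The draft-then-verify mechanic behind these numbers can be sketched in a few lines. This is a toy illustration, not Gemma 4's implementation: `draft_next` and `target_next` are hypothetical stand-in next-token functions, and the greedy match-or-correct rule is a simplification of real acceptance sampling. The key property it shows is that each target "pass" yields at least one token and up to k+1 when the drafter guesses well.

```python
def draft_next(ctx):
    # Hypothetical cheap drafter: predicts the next token from a toy pattern.
    return (ctx[-1] + 1) % 10

def target_next(ctx):
    # Hypothetical heavy target: ground truth the drafter tries to anticipate.
    # It diverges from the drafter whenever the last token is 4.
    return (ctx[-1] + 1) % 10 if ctx[-1] != 4 else 0

def speculative_step(ctx, k=4):
    """One draft-and-verify step: draft k tokens, verify against the target.

    The accepted prefix is the run of drafted tokens that match what the
    target itself would have produced; at the first mismatch we keep the
    target's correction token, so at least one token is always emitted
    per target pass.
    """
    # Draft phase: the drafter runs k cheap sequential steps.
    drafted, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        drafted.append(t)
        c.append(t)
    # Verify phase: in a real system the target scores all k positions in
    # one parallel forward pass; here we emulate it position by position.
    accepted, c = [], list(ctx)
    for t in drafted:
        expect = target_next(c)
        if t == expect:
            accepted.append(t)
            c.append(t)
        else:
            accepted.append(expect)  # correction token from the target
            break
    return accepted
```

For example, `speculative_step([0, 1])` accepts three drafted tokens plus the target's correction, producing four tokens from a single (emulated) target pass.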
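The release does not detail the edge models' embedding-clustering optimization. One plausible shape, sketched here with made-up dimensions and a hypothetical `clustered_argmax`, is a coarse pass over cluster centroids followed by exact logits only for tokens in the top-scoring clusters, so the output projection touches far fewer than V embedding rows:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, C = 1000, 64, 16           # vocab size, hidden dim, cluster count (toy values)
E = rng.standard_normal((V, H))  # output embedding matrix
cluster_of = rng.integers(0, C, size=V)  # precomputed token->cluster assignment
centroids = np.stack([E[cluster_of == c].mean(axis=0) for c in range(C)])

def clustered_argmax(h, top_clusters=4):
    # Coarse pass: score C centroids instead of all V embedding rows.
    coarse = centroids @ h
    keep = np.argsort(coarse)[-top_clusters:]
    # Fine pass: exact logits only for tokens inside the selected clusters.
    cand = np.nonzero(np.isin(cluster_of, keep))[0]
    logits = E[cand] @ h
    return cand[np.argmax(logits)]

h = rng.standard_normal(H)  # a dummy hidden state
```

With `top_clusters=C` every token is scored and the result matches a full `argmax(E @ h)`; shrinking `top_clusters` trades a small chance of missing the true top token for a large cut in logit compute, which is the bottleneck the takeaway above refers to.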
Hacker News Comment Review
MLX and llama.cpp support is not yet merged, so local Mac users cannot use MTP drafters yet despite UI options appearing in tools like LM Studio.
Commenters note that crossing 100 tokens/second is a qualitative usability threshold; Qwen3 27B currently holds a speed edge over Gemma 4 on local hardware, so this release is directly competitive.
Google’s choice to release Gemma 4 open source without pushing paid cloud inference puzzles commenters; one theory is a pricing conflict with Gemini Flash and low margins on small models.
Notable Comments
@dvt: MLX MTP support is still an open PR, confirming LM Studio’s drafter toggle is non-functional on Apple Silicon today.