Google accelerates Gemma 4 inference with new Multi-Token Prediction drafters
The latest update to the open-source Gemma 4 family addresses memory-bandwidth bottlenecks on consumer-grade hardware, offering developers a significant efficiency boost for local and cloud deployments.
Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 family of open-source models. The update introduces a specialised speculative decoding architecture designed to speed up inference while maintaining the intelligence-per-parameter standard set by the recent Gemma 4 launch. The technology enables developers to achieve up to a 3x speedup in tokens per second without degrading output quality or reasoning logic.
The core innovation addresses the memory-bandwidth bottlenecks that typically constrain standard large language model inference, particularly on consumer-grade hardware. In conventional autoregressive generation, processors spend the majority of their time moving billions of parameters from VRAM to compute units to generate a single token, often under-utilising available compute power. The MTP drafters mitigate this inefficiency by pairing a heavy target model, such as the Gemma 4 31B, with a lightweight drafter. This drafter predicts multiple future tokens in parallel, which are then verified by the target model in a single forward pass.
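The bandwidth limit described above can be put into rough numbers. The following is a back-of-the-envelope sketch, not a benchmark: it assumes a dense model whose weights are all streamed from memory once per forward pass, and it ignores KV-cache traffic and the drafter's own cost. The 2 bytes per parameter, the 1,000 GB/s bandwidth figure and the three tokens per pass are illustrative assumptions, not measurements of any device or of Gemma 4.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s,
                          tokens_per_pass=1.0):
    """Rough upper bound on decode speed for a bandwidth-bound dense model.

    Assumes every forward pass streams all weights from memory exactly once,
    and that a pass yields `tokens_per_pass` accepted tokens on average.
    """
    weight_gb = params_billion * bytes_per_param   # 1e9 params * bytes = GB
    passes_per_second = bandwidth_gb_s / weight_gb
    return passes_per_second * tokens_per_pass


# Illustrative assumptions only: 31B parameters at 2 bytes each, and a
# hypothetical 1,000 GB/s of memory bandwidth.
baseline = max_tokens_per_second(31, 2, 1000)

# Hypothetical case where each verification pass yields 3 tokens on average.
with_drafter = max_tokens_per_second(31, 2, 1000, tokens_per_pass=3.0)

print(f"one token per pass:    {baseline:.1f} tokens/s")
print(f"three tokens per pass: {with_drafter:.1f} tokens/s")
```

Under these assumptions the ceiling is set by how often the weights can be read, not by arithmetic throughput, so any scheme that gets more than one accepted token out of each read of the weights raises the ceiling proportionally. That is the lever the drafter-plus-target pairing pulls.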
When the target model accepts every drafted token, an application can output the full drafted sequence plus one additional token in roughly the time it would normally take to generate a single token. By decoupling drafting from verification, the system puts otherwise idle compute to work, speeding up both obvious continuations and complex logic puzzles. The drafters reuse the target model's activations and share its KV cache, so context is not recomputed unnecessarily. This matters most for responsive mobile applications running entirely on-device and for production deployments where inference speed is a primary bottleneck.
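The draft-then-verify loop can be illustrated with a toy example. This is a minimal sketch of the greedy variant of speculative decoding, not Google's implementation: `toy_target`, `toy_draft`, `speculative_step` and `generate` are hypothetical names, the "models" are simple counting functions, and verification is written as a loop where a real system would score all positions in one batched forward pass. Sampling-based decoding would need a probabilistic acceptance rule instead of the exact-match check used here.

```python
from typing import Callable, List, Sequence, Tuple


def toy_target(context: Sequence[int]) -> int:
    """Stand-in for the heavy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[int], k: int) -> List[int]:
    """Stand-in for the lightweight drafter: proposes k tokens.

    It agrees with the target except after token 4, where it guesses wrong.
    """
    out, last = [], context[-1]
    for _ in range(k):
        last = 0 if last == 4 else (last + 1) % 10
        out.append(last)
    return out


def speculative_step(context: Sequence[int],
                     draft: Callable[[Sequence[int], int], List[int]],
                     target_next: Callable[[Sequence[int]], int],
                     k: int) -> List[int]:
    """Return the tokens produced by one draft-and-verify round (1 to k+1)."""
    ctx = list(context)
    drafted = draft(ctx, k)
    # Target's own choice at each of the k+1 positions. Looped here for
    # clarity; a real system would compute these in a single forward pass.
    target_choices = [target_next(ctx + drafted[:i]) for i in range(k + 1)]
    accepted: List[int] = []
    for i in range(k):
        if drafted[i] != target_choices[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(drafted[i])
    # Always add the target's token at the next position: a correction after
    # a mismatch, or a bonus token when the whole draft was accepted.
    accepted.append(target_choices[len(accepted)])
    return accepted


def generate(context: Sequence[int], k: int, n_new: int) -> Tuple[List[int], int]:
    """Generate n_new tokens; also report how many verification rounds ran."""
    out, rounds = list(context), 0
    while len(out) - len(context) < n_new:
        out.extend(speculative_step(out, toy_draft, toy_target, k))
        rounds += 1
    return out[: len(context) + n_new], rounds


tokens, rounds = generate([1], k=4, n_new=12)
print(tokens, rounds)
```

In this toy run, 12 new tokens are produced in 3 verification rounds rather than 12, and the output is identical to what the target function would produce on its own, because every emitted token is either confirmed or supplied by the target.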
Google has also implemented hardware-specific optimisations to maximise performance across platforms. For the 26B mixture-of-experts model on Apple Silicon, processing batch sizes of 4 to 8 unlocks a speedup of up to roughly 2.2x locally, overcoming routing challenges specific to batch size 1. Similar gains are observed on Nvidia A100 hardware when batch sizes are increased. For the E2B and E4B edge models, where logit calculation is a significant bottleneck, an efficient clustering technique in the embedder accelerates generation.
The MTP drafters are available immediately under the Apache 2.0 licence, the same open-source licence as the Gemma 4 models. Model weights can be downloaded from Hugging Face and Kaggle, and testing environments are available in Google AI Edge Gallery for Android and iOS. The release supports major software frameworks including Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama, allowing developers to integrate faster inference into coding assistants, autonomous agents, and other applications that depend on rapid multi-step planning.


