Tech

Google accelerates Gemma 4 inference with new Multi-Token Prediction drafters

The latest update to the open-source Gemma 4 family addresses memory-bandwidth bottlenecks on consumer-grade hardware, offering developers a significant efficiency boost for local and cloud deployments.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Hacker News

Artificial Intelligence Research

New speculative decoding architecture delivers output up to three times faster without compromising model quality or reasoning.

Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 family of open-source models, a significant step forward in inference efficiency. The update introduces a specialised speculative decoding architecture designed to speed up generation while maintaining the intelligence-per-parameter standard set by the recent Gemma 4 launch. Developers can achieve up to a 3x speedup in tokens per second without degrading output quality or reasoning.

The core innovation addresses the memory-bandwidth bottleneck that typically constrains large language model inference, particularly on consumer-grade hardware. In conventional autoregressive generation, processors spend most of their time moving billions of parameters from VRAM to the compute units just to generate a single token, leaving much of the available compute idle. The MTP drafters mitigate this inefficiency by pairing a heavy target model, such as Gemma 4 31B, with a lightweight drafter. The drafter quickly proposes several future tokens, which the target model then verifies together in a single forward pass.
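To put the bandwidth bottleneck described above in numbers, here is a rough back-of-the-envelope sketch. It assumes a dense model whose weights are all read from memory once per generated token; the 16-bit precision and the 1,000 GB/s bandwidth figure are illustrative assumptions, not measurements of Gemma 4 or of any particular device.

```python
def min_ms_per_token(params_billion: float, bytes_per_param: float,
                     bandwidth_gb_s: float) -> float:
    """Lower bound on decode latency if every weight is read once per token."""
    weight_gb = params_billion * bytes_per_param  # billions of params * bytes each = GB
    return weight_gb / bandwidth_gb_s * 1000.0    # seconds converted to milliseconds


# Hypothetical setup: 31B parameters at 2 bytes each, 1000 GB/s of memory bandwidth.
latency_ms = min_ms_per_token(31, 2, 1000)
print(f"at least {latency_ms:.0f} ms per token, "
      f"so at most {1000 / latency_ms:.1f} tokens/s from plain decoding")
```

Under these assumptions the ceiling is set by how fast weights can be read rather than by arithmetic throughput, which is why checking several drafted tokens during one read of the weights can pay off.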

When the target model agrees with the draft, applications can output the full drafted sequence plus one additional token in the time it usually takes to generate a single one. Standard decoding spends as much computation on an obvious continuation as on a step in a complex logic puzzle; by separating drafting from verification, the system puts otherwise idle compute to work on the easy tokens. The drafters reuse the target model's activations and share its KV cache, so context the larger model has already processed is not recomputed. This matters most for responsive mobile applications running entirely on-device and for production deployments where inference speed is the primary bottleneck.
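The draft-then-verify cycle can be sketched in a few lines of Python. This is a toy illustration of generic greedy speculative decoding, not Google's implementation or any Gemma API: `toy_target` and `toy_drafter` are hypothetical stand-ins for the two models, the drafter proposes its tokens one at a time, the verification loop only imitates what a real engine would do in one batched forward pass, and nothing here shares activations or a KV cache.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. Draft: the cheap model proposes k tokens.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. Verify: the target picks its own token at each of the k + 1 positions.
    #    (A real engine scores all positions in a single forward pass.)
    target_choice = [target(context + draft[:i]) for i in range(k + 1)]

    # 3. Accept the longest prefix on which both models agree.
    accepted: List[Token] = []
    for i in range(k):
        if draft[i] != target_choice[i]:
            break
        accepted.append(draft[i])

    # 4. The target always contributes one token: a correction at the first
    #    mismatch, or a bonus token when the whole draft was accepted.
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[Token],
             n_tokens: int, k: int) -> Tuple[List[Token], int]:
    """Generate n_tokens and report how many verification passes were needed."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(target, drafter, out, k)
        passes += 1
    return out[:len(prompt) + n_tokens], passes


# Hypothetical toy models: the target counts upwards modulo 10; the drafter
# does the same but guesses wrong whenever the previous token is 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, passes = generate(toy_target, toy_drafter, prompt=[0], n_tokens=12, k=4)
print(tokens, passes)
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target would have produced on its own under greedy decoding; the drafter only changes how many target passes are needed. Schemes that sample rather than decode greedily replace the exact-match check with a probabilistic acceptance rule.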

Hardware-specific optimisations maximise performance across different platforms. For the 26B mixture-of-experts model on Apple Silicon, processing batch sizes of 4 to 8 unlocks a local speedup of up to roughly 2.2x, overcoming routing challenges specific to batch size 1. Similar gains are observed on Nvidia A100 hardware when batch sizes are increased. For the E2B and E4B edge models, where logit calculation is a significant bottleneck, an efficient clustering technique in the embedder accelerates generation.

The MTP drafters are available immediately under the Apache 2.0 licence, the same open-source agreement as the Gemma 4 models. Model weights can be downloaded via Hugging Face and Kaggle, while testing environments are available on Google AI Edge Gallery for Android and iOS. The release supports major software frameworks including Hugging Face Transformers, MLX, vLLM, SGLang, and Ollama, allowing developers to integrate the faster inference capabilities into coding assistants, autonomous agents, and other rapid multi-step planning applications.
