Google accelerates local AI with Gemma 4 Multi-Token Prediction models

Google has updated its Gemma 4 family of open-source AI models with Multi-Token Prediction drafters, which use speculative decoding to speed up inference on local devices.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: Ars Technica · original
Google's Gemma 4 AI models get 3x speed boost by predicting future tokens
New open-source releases promise up to three times faster generation on consumer hardware without compromising output quality.

Google has released updated Gemma 4 open-source AI models featuring Multi-Token Prediction (MTP) drafters, a technology designed to accelerate text generation on local hardware. Using speculative decoding, a small drafter proposes several future tokens that the main model then verifies in parallel. Google says this enables generation up to three times as fast on consumer GPUs and mobile devices, with no reported loss in output quality.

The new models are licensed under Apache 2.0, a significant shift from the custom, less permissive licensing used in previous Gemma releases. The change brings the models in line with a widely used industry-standard license and removes prior restrictions, making it easier for developers to tinker with, distribute, and deploy them across various environments.

Autoregressive generation on non-enterprise silicon is typically limited by memory bandwidth rather than raw compute, and the MTP drafter is designed to be lightweight for that reason. The drafter, which has just 74 million parameters in the case of Gemma 4 E2B, shares the key-value cache with the main model to avoid recalculating context. It also uses sparse decoding to narrow down likely token clusters, so that compute cycles do useful work instead of sitting idle while parameters move between VRAM and compute units.
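The bandwidth argument can be made concrete with a rough back-of-the-envelope sketch. This is not Google's methodology: the function names and every number below are illustrative assumptions, and the acceptance model (each drafted token accepted independently with a fixed probability) is a common simplification rather than anything stated in the announcement. It also ignores the cost of running the drafter itself.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          params_billion: float,
                          bytes_per_param: float) -> float:
    """Rough ceiling on decode speed when generation is memory-bound.

    Assumes every weight is read from memory once per forward pass,
    so each pass costs (parameter count * bytes per parameter) of traffic.
    """
    bytes_per_pass = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_pass


def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass of the main model.

    Assumes each drafted token is accepted independently with probability
    accept_rate (a simplification). The pass always yields at least one
    token, because the main model supplies its own token at the first
    mismatch or after a fully accepted draft.
    """
    if accept_rate >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical device: 100 GB/s of memory bandwidth, a 2-billion-parameter
# model stored at 1 byte per parameter. None of these figures come from Google.
baseline = max_tokens_per_second(bandwidth_gb_s=100, params_billion=2, bytes_per_param=1)

# Hypothetical drafter: 4 drafted tokens per pass, 80% acceptance rate.
tokens_per_pass = expected_tokens_per_pass(accept_rate=0.8, draft_len=4)

print(f"baseline ceiling: {baseline:.1f} tokens/s")
print(f"tokens per main-model pass with drafting: {tokens_per_pass:.2f}")
```

Under these assumptions, the main model still reads its weights once per pass, but each pass now yields several tokens instead of one, which is where the speed-up comes from.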

Performance figures from Google's own testing illustrate the practical impact. The smaller E2B and E4B models achieved speed boosts of 2.8x and 3.1x, respectively, when running on Pixel phones. The larger 31B Dense model showed a 2.5x improvement on Apple M4 silicon, suggesting the approach carries across different consumer-grade architectures.

Google states that because the core Gemma model verifies all draft tokens, the process results in zero quality degradation. While the speculative predictions are not guaranteed to be perfect, the verification step ensures that errors common in generative AI systems do not increase. This means users can expect faster response times without a corresponding drop in the reliability of the generated text.
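The reason verification preserves quality can be shown with a toy draft-and-verify loop. This is a minimal sketch under stated assumptions, not Gemma's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the main model and the drafter, decoding is greedy, and verification is written as a plain loop where a real system would use one batched forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # hypothetical toy vocabulary size


def target_next(context: List[int]) -> int:
    """Hypothetical main model: a deterministic toy rule, not a real LLM."""
    return (sum(context) * 31 + len(context)) % VOCAB


def draft_next(context: List[int]) -> int:
    """Hypothetical cheap drafter: matches the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % VOCAB


def generate_baseline(prompt: List[int], n_new: int) -> List[int]:
    """Ordinary decoding: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def generate_speculative(prompt: List[int], n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Draft k tokens, then let the main model accept or correct them."""
    out = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(out) < goal:
        # Step 1: the drafter proposes k tokens cheaply.
        draft: List[int] = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Step 2: the main model checks every drafted position.
        verify_passes += 1
        accepted: List[int] = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the main model's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Whole draft accepted: the same pass also yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:goal], verify_passes


prompt = [1, 2, 3]
baseline = generate_baseline(prompt, 20)
speculative, passes = generate_speculative(prompt, 20)
print("identical output:", baseline == speculative)
print("main-model passes:", passes, "instead of", 20)
```

Every token in the final sequence is one the main model would have produced on its own, so a wrong guess from the drafter costs only wasted draft work, never a change in output.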

The optimised models are built on the same underlying technology as Google's frontier Gemini AI but are specifically tuned for local execution. They are compatible with popular inference frameworks including MLX, vLLM, SGLang, and Ollama, making the technology accessible to a wide range of developers and institutions looking to deploy AI locally.
