Tech

Google DeepMind releases DiffusionGemma with fourfold speed increase for local inference

The 26-billion-parameter Mixture of Experts model is optimised for Nvidia hardware, offering significantly faster output for local AI workloads compared to traditional autoregressive approaches.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Ars Technica · original



Google's latest DiffusionGemma open AI model comes with a 4x speed boost

Open-source model shifts text generation from linear to parallel processing

Google DeepMind has introduced DiffusionGemma, a new addition to the Gemma 4 open model family that utilises diffusion technology for text generation. Unlike standard autoregressive models that generate text linearly from left to right, DiffusionGemma produces output in parallel blocks. This architectural shift allows the model to achieve approximately four times the output speed of similarly sized autoregressive models, marking a significant departure from the sequential token generation that currently dominates the industry.
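The difference between the two approaches can be illustrated by counting forward passes through the model. The sketch below is a toy comparison, not Google's implementation: the function names are hypothetical, the block size of 256 comes from the parallel-token figure Google cites, and the number of refinement passes per block is an assumed value chosen purely for illustration.

```python
import math


def autoregressive_passes(num_tokens: int) -> int:
    """An autoregressive model needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_passes(num_tokens: int, block_size: int, refine_steps: int) -> int:
    """A block-parallel model refines every token in a block together.

    refine_steps is an assumed, illustrative number of passes per block;
    the article does not state the real figure.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * refine_steps


tokens = 1024
print("autoregressive passes:", autoregressive_passes(tokens))
print("block-parallel passes:", block_diffusion_passes(tokens, block_size=256, refine_steps=64))
```

Fewer passes do not translate one-to-one into speed, because each parallel pass does more arithmetic than a single-token pass. The count only shows why sequential dependence limits autoregressive output.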

The model uses a Mixture of Experts (MoE) architecture with 26 billion total parameters, of which only 3.8 billion are activated during inference. This design enables the model to fit within the 18GB memory allotment of high-end graphics cards, making it accessible for local deployment. Google collaborated with Nvidia to optimise DiffusionGemma for a range of hardware configurations, including quantised versions for high-end RTX GPUs and enterprise systems such as the DGX Spark platform.
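A rough calculation shows why quantisation matters for a model of this size. The sketch below estimates weight storage only, ignoring activations and caches, and the bit widths shown are common illustrative choices rather than the formats Google or Nvidia actually ship. The helper name is hypothetical.

```python
TOTAL_PARAMS = 26e9    # total parameters, from the article
ACTIVE_PARAMS = 3.8e9  # parameters activated during inference, from the article


def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


# Illustrative bit widths only; the actual quantisation scheme is not stated.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")

# Share of the weights that take part in computing any one token.
print(f"active share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions the full set of weights needs about 52 GB at 16 bits and about 13 GB at 4 bits. The small active share reduces the work per token, but the storage estimate uses the total count on the assumption that all experts stay resident in memory.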

Performance benchmarks indicate substantial speed improvements on local hardware. In testing, the model achieved output speeds of around 700 tokens per second on an Nvidia RTX 5090 and exceeded 1,000 tokens per second on a single Nvidia H100 AI accelerator. Because the model can generate up to 256 tokens in parallel, the performance bottleneck shifts from memory bandwidth to compute, reducing the wasted compute cycles and idle time often associated with lower memory bandwidth in local setups.
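To put those throughput figures in context, the sketch below converts them into wall-clock time for a response of a given length. The 2,000-token response length is an arbitrary example. The autoregressive baseline is an assumption derived by dividing the RTX 5090 figure by the roughly fourfold speed-up Google claims; it is not a measured number from the article.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to emit num_tokens at a steady output rate."""
    return num_tokens / tokens_per_second


RESPONSE_TOKENS = 2000          # arbitrary example length
RTX_5090_RATE = 700.0           # tokens per second, from the article
H100_RATE = 1000.0              # lower bound, from the article
ASSUMED_BASELINE_RATE = RTX_5090_RATE / 4  # hypothetical autoregressive rate

for label, rate in [
    ("RTX 5090", RTX_5090_RATE),
    ("H100 (at least)", H100_RATE),
    ("assumed autoregressive baseline", ASSUMED_BASELINE_RATE),
]:
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, rate):.2f} s")
```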

While diffusion technology is most commonly associated with image generation, its application to text offers distinct advantages for specific non-linear tasks. Google notes improved performance in areas such as in-line editing, molecular sequencing, mathematical graphing, and solving Sudoku puzzles. The model’s ability to continuously self-correct large sets of tokens makes it particularly effective for tasks where each token depends on future context, a challenge for standard autoregressive models.

Despite these gains, diffusion remains less suitable for large-scale cloud-based services like Google’s Gemini models. Cloud environments utilise high-bandwidth memory and can batch large numbers of compute jobs efficiently, whereas diffusion models can waste resources when generating short outputs and carry a higher error rate for discrete text. Google also recently implemented Multi-Token Prediction drafters to utilise wasted compute cycles, but states that diffusion technology remains faster than these MTP versions.

DiffusionGemma is available under the Apache 2.0 license via Hugging Face, consistent with the rest of the Gemma 4 family. Google describes the release as experimental, but it represents a viable avenue for researchers and developers seeking to leverage local AI processing for complex, parallelisable tasks without the latency constraints of traditional sequential generation.

