Xiaomi claims breakthrough in AI inference speed with MiMo-V2.5-Pro
The Chinese technology giant, working with TileRT, attributes the performance to extreme model-system codesign rather than specialised accelerators.
Xiaomi has released an UltraSpeed mode for its MiMo-V2.5-Pro large language model, developed in collaboration with TileRT. The update allows the 1-trillion-parameter model to generate text at more than 1,000 tokens per second, a milestone the company says had not previously been reached on commodity graphics processing units (GPUs).
The performance claim rests on what Xiaomi describes as extreme model-system codesign. The term implies that the speed comes from deep integration between the model architecture and the underlying system software, rather than from hardware upgrades or specialised AI accelerators alone. By optimising how the model interacts with standard commercial hardware, the firm aims to lower the barriers to deploying large-scale AI models.
Commodity GPUs are standard, commercially available graphics processing units, as opposed to the custom-built or specialised AI accelerators often used for high-performance computing. Token generation speed remains a critical metric for evaluating the efficiency and real-time usability of large language models in production environments. Achieving high throughput on widely available hardware could have significant implications for cost structures and deployment scalability.
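To put the metric in concrete terms, the short sketch below shows how token generation speed is typically calculated: tokens produced divided by wall-clock time, with the inverse giving the average time per token. The function names and the sample figures (2,048 tokens in 2.0 seconds) are illustrative assumptions, not measurements from Xiaomi's announcement.

```python
def tokens_per_second(token_count: int, elapsed_seconds: float) -> float:
    """Average generation throughput over one run."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return token_count / elapsed_seconds


def per_token_latency_ms(throughput: float) -> float:
    """Average time spent on each token, in milliseconds."""
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return 1000.0 / throughput


# Hypothetical run: 2,048 tokens generated in 2.0 seconds (illustrative only).
throughput = tokens_per_second(2048, 2.0)
latency_ms = per_token_latency_ms(throughput)
print(f"{throughput:.0f} tokens/s, {latency_ms:.3f} ms per token")
```

At 1,000 tokens per second, the same arithmetic leaves roughly one millisecond of compute time per generated token, which gives a sense of how tight the budget is for a model of this size.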
The announcement was published via the Xiaomi MiMo blog and reported on Hacker News, where the technical details of the UltraSpeed mode were highlighted. The source material does not specify the exact hardware specifications of the commodity GPUs used in the benchmark, nor does it elaborate on the technical methodology behind the extreme model-system codesign.
It remains unclear whether the reported speed is sustained under varying load conditions or is achieved only in controlled benchmark environments. As with any performance claim involving a new software mode, independent verification against established benchmarks would be needed to confirm that the results can be reproduced in different operational contexts.


