Tech

Huawei releases KVarN, a native vLLM backend for high-throughput AI inference

Huawei’s latest contribution to the vLLM ecosystem addresses the traditional trade-off between model precision and inference efficiency, targeting agentic and long-context workloads.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: Hacker News · original
New open-source tool claims three to five times the context capacity without sacrificing speed or accuracy

Huawei has released KVarN, a native backend for the vLLM inference library that quantises the key-value (KV) cache of large language models. The software targets agentic and long-context workloads, where operators usually have to trade context capacity against computational throughput. Using variance-normalised quantisation, Huawei says, KVarN delivers three to five times the context capacity of FP16 while keeping accuracy comparable to standard FP16 precision.

The release is the official vLLM implementation of the research paper "KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks" (arXiv:2606.03458). It is built on vLLM version 0.22.0 and distributed as a fork under the Apache 2.0 licence. KVarN computes in float16 and, according to Huawei, needs none of the calibration steps that other quantisation methods typically require; a single flag enables it in existing workflows.

In benchmarks on the Qwen3-32B model, using 16K-context bursts and a tensor-parallel size of two, KVarN achieved approximately four times the KV-cache capacity of FP16 while delivering higher throughput. The tested preset uses asymmetric round-to-nearest quantisation with 4 bits for keys and 2 bits for values, which Huawei states meets strict accuracy requirements for demanding production deployments.
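The capacity figure follows largely from bit widths. As a rough sketch, the arithmetic below compares 16-bit storage with the 4-bit key and 2-bit value preset. The per-group scale and zero point, the group size of 128 (borrowed from the tile size Huawei quotes for KVarN) and the 16-bit metadata width are assumptions made for illustration, not details confirmed by Huawei, and the function names are hypothetical.

```python
def kv_bits_per_element(key_bits, value_bits, group_size, meta_bits):
    """Average stored bits per KV element, including quantisation metadata.

    Assumption: every group of `group_size` elements stores one scale and one
    zero point (asymmetric quantisation), each `meta_bits` wide.
    """
    payload = (key_bits + value_bits) / 2      # keys and values are equal in count
    overhead = 2 * meta_bits / group_size      # scale + zero point, amortised
    return payload + overhead


def capacity_gain(key_bits=4, value_bits=2, group_size=128,
                  meta_bits=16, baseline_bits=16):
    """How many times more tokens fit in the same memory than at `baseline_bits`."""
    return baseline_bits / kv_bits_per_element(
        key_bits, value_bits, group_size, meta_bits)


if __name__ == "__main__":
    print(f"payload only:  {capacity_gain(meta_bits=0):.2f}x")
    print(f"with metadata: {capacity_gain():.2f}x")
```

Under these assumptions the ceiling before metadata is 16/3, a little over five times. Measured capacity, such as the roughly fourfold figure reported for Qwen3-32B, depends on overheads this sketch does not model.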

By contrast, earlier quantisation techniques such as TurboQuant increased capacity but often cut throughput by 40 to 52 per cent. KVarN applies a Hadamard rotation to raw float16 key-value tiles to spread channel outliers, then iteratively normalises variance to equalise standard deviations. This shrinks quantisation error before rounding, which Huawei says lets the system retain FP16-level accuracy while exceeding FP16 throughput.
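The rotate, normalise and round sequence can be illustrated with a small pure-Python sketch. It is not Huawei's kernel: the alternating row and column rescaling, the per-row quantisation groups and every function name here are assumptions chosen to make the idea concrete, and the paper's exact normalisation procedure may differ.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) scaling the transform is its own inverse.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return [x / math.sqrt(n) for x in out]


def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def variance_normalise(tile, iterations=4, eps=1e-8):
    """Alternately rescale rows and columns towards unit standard deviation.

    Assumed scheme, not the paper's exact procedure. Returns the normalised
    tile and the accumulated row and column scales needed to undo it.
    """
    work = [list(row) for row in tile]
    rows, cols = len(work), len(work[0])
    row_scale, col_scale = [1.0] * rows, [1.0] * cols
    for _ in range(iterations):
        for r in range(rows):
            s = std(work[r])
            s = s if s > eps else 1.0          # leave near-constant rows alone
            row_scale[r] *= s
            work[r] = [v / s for v in work[r]]
        for c in range(cols):
            s = std([work[r][c] for r in range(rows)])
            s = s if s > eps else 1.0
            col_scale[c] *= s
            for r in range(rows):
                work[r][c] /= s
    return work, row_scale, col_scale


def rtn_asymmetric(values, bits):
    """Asymmetric round-to-nearest: map [min, max] onto codes 0..2**bits - 1,
    then return the dequantised values."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0          # guard against a constant group
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantise_tile(tile, bits):
    """Rotate, normalise, quantise each row, then undo the scales and rotation."""
    rotated = [fwht(row) for row in tile]
    work, row_scale, col_scale = variance_normalise(rotated)
    deq = [rtn_asymmetric(row, bits) for row in work]
    rescaled = [[deq[r][c] * row_scale[r] * col_scale[c]
                 for c in range(len(deq[r]))] for r in range(len(deq))]
    return [fwht(row) for row in rescaled]


def mse(a, b):
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    tile = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(8)]
    for row in tile:
        row[3] *= 20.0                          # one outlier channel
    for bits in (2, 4):
        print(bits, "bits, reconstruction MSE:", mse(tile, quantise_tile(tile, bits)))
```

Each row stands for one token's channels, so the rotation mixes an outlier channel into all the others before the scales are equalised. A production implementation would store the integer codes and scales rather than dequantising immediately, which this sketch does only to measure reconstruction error.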

For deployment, Huawei notes that the tile and page size is currently fixed at 128, with additional sizes planned for future updates. On single-GPU setups with tight memory budgets, the vLLM CUDA-graph memory profiler may over-reserve memory, shrinking the effective KV-cache pool. In such cases, users are advised to set the environment variable VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS to 0 or to increase GPU memory utilisation to recover the full capacity benefit.
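One way to apply the environment-variable workaround from a Python launcher is sketched below. Only the variable name and the value 0 come from Huawei's guidance; the helper name is hypothetical, and the server launch command is left out because the flag that enables KVarN is not named here.

```python
import os


def env_without_cudagraph_estimate(base_env=None):
    """Copy an environment mapping and disable CUDA-graph memory estimation.

    Hypothetical helper: it only sets the variable Huawei recommends and
    leaves every other entry untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS"] = "0"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the inference server, so the setting applies to that process without changing the parent shell.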
