Tech

Unsloth enables local deployment of Z.ai’s GLM-5.2 with aggressive quantisation

New documentation from Unsloth details how to run Z.ai’s 744-billion-parameter GLM-5.2 locally using Dynamic GGUFs, substantially lowering the hardware needed by developers and institutions.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Hacker News · original


Open-source model rivals proprietary giants as storage requirements drop from 1.51TB to under 240GB

Unsloth has published documentation for running Z.ai’s GLM-5.2, a new open-source large language model with 744 billion parameters, on local hardware. Dynamic GGUF quantisation cuts the model’s storage footprint enough to allow inference through Unsloth Studio and llama.cpp on macOS, Windows, and Linux.

The model, which has 40 billion active parameters and a 1 million token context window, is positioned by Unsloth as delivering state-of-the-art performance in long-horizon coding, reasoning, and agentic tasks. The company claims it performs on par with major proprietary offerings, including Claude 4.8 Opus, GPT-5.5, and Gemini 3.1 Pro, citing benchmarks from Artificial Analysis.

Quantisation is central to the model’s accessibility. The full unquantised model requires 1.51TB of disk space. Unsloth’s Dynamic 2-bit GGUF (UD-IQ2_M) reduces this to 239GB, an 84 per cent reduction, while the 1-bit quantisation lowers it further to 217GB. To limit the accuracy lost at these sizes, important layers are upcast to 8-bit or 16-bit precision.
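The reduction figures can be checked with simple arithmetic. The sketch below is illustrative only: it assumes decimal units (1.51TB taken as 1,510GB), and the function name `reduction_pct` is introduced here rather than taken from Unsloth's documentation.

```python
# Sizes in GB as reported above; 1.51TB is assumed to mean 1510GB (decimal units).
FULL_GB = 1510
QUANT_GB = {"UD-IQ2_M (2-bit)": 239, "1-bit": 217}


def reduction_pct(full_gb: float, quant_gb: float) -> float:
    """Percentage of disk space saved relative to the unquantised model."""
    return (1 - quant_gb / full_gb) * 100


if __name__ == "__main__":
    for name, size in QUANT_GB.items():
        print(f"{name}: {size}GB, {reduction_pct(FULL_GB, size):.1f}% smaller")
```

On these assumptions the 2-bit figure rounds to the 84 per cent quoted, and the 1-bit file comes out only slightly smaller again.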

Hardware requirements vary by quantisation level. The 2-bit version requires approximately 245GB of total memory, so it fits on a Mac with 256GB of unified memory, or on a system with a single 24GB GPU and 256GB of RAM using Mixture of Experts (MoE) offloading. The 1-bit version requires roughly 223GB of RAM, while an 8-bit version demands 810GB. Unsloth recommends the 2-bit quantisation as a balance of accessibility and accuracy.
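As a rough way to read these numbers against a given machine, the following sketch compares the quoted requirements with the memory on hand. It is a simplification: treating RAM and VRAM as one additive pool is an assumption, it ignores operating system overhead and the KV cache, and the helper names `fits` and `largest_fitting` are hypothetical.

```python
from typing import Optional

# Approximate memory requirements in GB, as quoted in the article.
REQUIRED_GB = {"1-bit": 223, "2-bit": 245, "8-bit": 810}


def fits(quant: str, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """True if combined RAM and VRAM cover the quoted requirement (simplified)."""
    return ram_gb + vram_gb >= REQUIRED_GB[quant]


def largest_fitting(ram_gb: float, vram_gb: float = 0.0) -> Optional[str]:
    """Return the most memory-hungry quantisation that still fits, or None."""
    candidates = [q for q in REQUIRED_GB if fits(q, ram_gb, vram_gb)]
    return max(candidates, key=REQUIRED_GB.get) if candidates else None


if __name__ == "__main__":
    print(largest_fitting(ram_gb=256))              # 256GB unified-memory Mac
    print(largest_fitting(ram_gb=256, vram_gb=24))  # 24GB GPU plus 256GB RAM
```

Both example configurations land on the 2-bit version, in line with Unsloth's recommendation; the 8-bit version is far out of reach for either.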

Accuracy metrics from KLD (KL Divergence) testing indicate that dynamic 4-bit and 5-bit quantisations are generally lossless. The 2-bit quantisation achieves approximately 82 per cent top-1 accuracy, while the 1-bit version reaches 76.2 per cent. The model supports three thinking modes (Non-thinking, High, and Max), which Unsloth Studio lets users toggle through a web interface.

Unsloth Studio has also been updated to support HTTPS launches via free Cloudflare tunnels, giving more secure access to local deployments. The platform automatically offloads to RAM, detects multi-GPU setups, and tunes inference parameters on macOS, Windows, and Linux, simplifying setup for users.

The documentation highlights that for optimal performance with llama.cpp, users should ensure total available memory exceeds the quantised model file size. KV cache quantisation can further extend context lengths, with q4_0 quantisation potentially increasing context support by up to 3.5 times compared to default settings.
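The "up to 3.5 times" figure is consistent with the storage cost of each cache format, although the article does not say how Unsloth derived it. The sketch below assumes the default KV cache is 16-bit (f16) and uses the q4_0 block layout in llama.cpp, where 32 values occupy 18 bytes. The function names and the 32,768-token baseline are illustrative.

```python
# Assumption: the default KV cache stores each element as f16 (16 bits).
F16_BITS = 16.0
# q4_0 packs 32 values into 18 bytes: 16 bytes of 4-bit values plus a 2-byte scale.
Q4_0_BITS = 18 * 8 / 32  # 4.5 bits per element


def context_multiplier(default_bits: float, quant_bits: float) -> float:
    """How many times more tokens fit in the same KV cache memory."""
    return default_bits / quant_bits


def max_context(base_ctx: int, default_bits: float = F16_BITS,
                quant_bits: float = Q4_0_BITS) -> int:
    """Context length reachable in the memory that held base_ctx tokens at default precision."""
    return int(base_ctx * context_multiplier(default_bits, quant_bits))


if __name__ == "__main__":
    print(f"multiplier: {context_multiplier(F16_BITS, Q4_0_BITS):.2f}x")
    print(f"32,768-token budget becomes: {max_context(32768)} tokens")
```

This is an upper bound on memory grounds alone; it says nothing about the accuracy cost of a 4-bit cache.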
