Tech

Engineer bypasses memory wall to run Gemma 4 on 2016 Xeon server

A software engineer has successfully deployed the Gemma 4 large language model on a recycled Intel Xeon E5-2620 v4 server, achieving reading-speed generation without a GPU by utilising the ik_llama.cpp engine and extensive memory tuning.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: Hacker News
Christina Sørensen demonstrates that deep inference engine optimisation can unlock state-of-the-art AI on legacy hardware

Christina Sørensen, a software engineer and member of the NixOS Steering Committee, has published technical documentation detailing the deployment of the Gemma 4 large language model on a recycled 2016 Intel Xeon E5-2620 v4 server. The machine has 128 GB of DDR3 RAM and no dedicated graphics processing unit. Sørensen ran the ik_llama.cpp inference engine with approximately 25 custom flags to work around memory-bandwidth limits and avoid cache thrashing, achieving text generation at roughly reading pace.

The demonstration shows how speculative decoding and careful memory optimisation can extend the useful life of ageing silicon for state-of-the-art AI workloads. Sørensen noted that the system occupied 82 GB of DDR3 RAM, including approximately 25 GB of model weights and 56 GB of key-value cache at the full 262K-token context. The deployment paired a 26B-parameter verifier model with a small drafter model and relied on custom CPU kernels for Flash Attention, removing the need for a GPU during heavy context processing.
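The memory figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "GB" means 2^30 bytes and that "262K" means 262,144 tokens, neither of which the article states, and the variable names are our own.

```python
# Back-of-envelope check of the reported memory figures.
# Assumptions (not stated in the article): 1 GB = 2**30 bytes, 262K = 262,144 tokens.
GIB = 2**30

weights_gb = 25          # reported model weights (approximate)
kv_cache_gb = 56         # reported key-value cache at full context
reported_total_gb = 82   # reported overall footprint
context_tokens = 262_144

# How much of the reported total do the two named components explain?
accounted_gb = weights_gb + kv_cache_gb
unaccounted_gb = reported_total_gb - accounted_gb

# Average key-value cache cost of a single token of context.
kv_bytes_per_token = kv_cache_gb * GIB / context_tokens

print(f"weights + KV cache: {accounted_gb} GB")
print(f"not broken down in the article: {unaccounted_gb} GB")
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
```

Under these assumptions each token of context costs a little over 200 KiB of cache, which is why the key-value cache, rather than the weights, dominates the footprint at full context.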

Key optimisations included the use of the --cpu-moe flag to tune routing for CPU cache hierarchies and the --run-time-repack flag to align weight matrices with CPU cache layouts. Sørensen argued that the bottleneck to running state-of-the-art AI locally is not just silicon, but the need to deeply understand inference engine mechanics. The project underscores that for open-weight models, the usability moat is often defined by missing documentation and black-box wrappers rather than hardware constraints.
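As a rough illustration of how such flags are passed to the engine, the sketch below assembles a command line containing the two flags named in this paragraph. The executable name and the model path are placeholders that the article does not give, and the remaining flags from Sørensen's configuration are not reproduced because they are not listed here.

```python
import shlex

# Placeholders: the article does not give the executable name or the model path.
SERVER_BINARY = "./llama-server"
MODEL_PATH = "/path/to/gemma-4.gguf"

# Only the two flags named in the paragraph above; the rest of the roughly
# 25-flag configuration is not listed in the article.
CPU_FLAGS = ["--cpu-moe", "--run-time-repack"]


def build_command(binary, model, extra_flags):
    """Return an argv list suitable for subprocess.run (nothing is executed here)."""
    return [binary, "-m", model, *extra_flags]


cmd = build_command(SERVER_BINARY, MODEL_PATH, CPU_FLAGS)
print(shlex.join(cmd))
```

Keeping the flags in a list rather than one long string makes it easier to add or remove options one at a time while measuring their effect.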

The engine detected Multi-Token Prediction (MTP) layers and fell back from Graph Split to Layer Split, because it does not support vertically slicing MTP architectures. The --mlock flag was used to pin the 27 GB model in physical RAM so that it could not be swapped to disk, which required raising the locked-memory limit (ulimit -l). The --no-kv-offload flag was used to short-circuit GPU checks for the key-value cache, keeping it in system RAM.
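Whether a process may pin that much memory depends on its locked-memory resource limit. The following is a minimal pre-flight check using Python's standard resource module on a Unix-like system. It treats the 27 GB figure as 27 × 2^30 bytes, which is an assumption, and the helper name can_lock is our own rather than part of any tool mentioned in the article.

```python
import resource

# The 27 GB quoted above, taken as 27 * 2**30 bytes (an assumption).
MODEL_BYTES = 27 * 2**30


def can_lock(bytes_needed, soft_limit):
    """Return True if a soft RLIMIT_MEMLOCK of soft_limit permits locking bytes_needed.

    This ignores privileged processes, which may be exempt from the limit.
    """
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= bytes_needed


# Read the current process's locked-memory limits (values are in bytes).
soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

if can_lock(MODEL_BYTES, soft):
    print("locked-memory limit is high enough to pin the model")
else:
    print(f"soft limit is {soft} bytes; raise it (for example with ulimit -l) before using --mlock")
```

Running a check like this before launching the engine surfaces a limit that is too low as a clear message rather than a failed or silently skipped lock.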

Sørensen’s work suggests that the bleeding edge of open-weight AI is accessible to anyone with command-line tools and a thorough understanding of memory architecture. The demonstration serves as a practical case study for researchers and engineers looking to maximise performance on legacy hardware without relying on expensive data-centre graphics cards or corporate API tokens.
