Tech

Handwritten LLM Training Code Achieves 382-Fold Speedup on Apple Silicon

By leveraging Swift 6.2 features, relaxed floating-point math, and reverse-engineered AMX instructions, the author bridges the performance gap between Swift and C.

Author

Owen Mercer

Markets and Finance Editor


Source: Hacker News


A developer details a series of optimisations transforming a naive Swift matrix multiplication implementation from 2.8 Gflop/s to 1.1 Tflop/s.

A developer has published a detailed technical write-up on optimising handwritten large language model (LLM) training code in Swift for Apple Silicon. Starting from a naive implementation running at 2.8 Gflop/s, the author reports a 382-fold increase to roughly 1.1 Tflop/s. The gains came from a combination of Swift 6.2 features, relaxed floating-point arithmetic, CPU multi-threading, reverse-engineered undocumented AMX instructions, and custom Metal GPU shaders.
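As a point of reference, a naive matrix multiplication and the usual way of counting its throughput can be sketched as follows. This is a minimal illustration, not the author's code: the names matmulNaive and gflops are hypothetical, and the convention of counting two floating-point operations per inner-loop step (one multiply, one add) is assumed rather than taken from the original write-up.

```swift
/// Naive row-major matrix multiply: C (m x n) = A (m x k) * B (k x n).
/// Hypothetical helper for illustration only.
func matmulNaive(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            // Inner product of row i of A with column j of B.
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

/// Throughput in Gflop/s, assuming one multiply and one add per inner step.
func gflops(m: Int, k: Int, n: Int, seconds: Double) -> Double {
    return 2.0 * Double(m) * Double(k) * Double(n) / seconds / 1e9
}

// Consistency check on the reported figures:
// 2.8 Gflop/s multiplied by 382 is 1069.6 Gflop/s, which rounds to 1.1 Tflop/s.
let finalGflops = 2.8 * 382.0
```

Under that counting convention, a 1000 x 1000 x 1000 multiplication that takes one second corresponds to 2 Gflop/s.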

The work uses Andrej Karpathy's llm.c project as its reference for validation, aiming to reproduce that plain C implementation of a GPT-2-compatible model in Swift without external libraries. The initial naive Swift implementation was roughly 15 to 20 times slower than the reference C code, completing one training iteration every seven seconds. After optimisation, the author reports approximately 12 tokens per second for training.

The author attributes significant gains to two Swift 6.2 features, MutableSpan and InlineArray. MutableSpan removed the overhead of array buffer mutation checks, while InlineArray provided stack-allocated, fixed-size arrays that let the Swift code match the C compiler's unrolling strategies. With these changes, the Swift implementation was roughly equivalent to the C code for inference and slightly faster for training.
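To give a sense of how a fixed-size inline array can support unrolling, the sketch below keeps eight independent partial sums in an InlineArray while computing a dot product. It assumes a Swift 6.2 toolchain and an OS runtime that ships InlineArray. The function name dotUnrolled and the choice of eight lanes are hypothetical and are not taken from the author's code.

```swift
// Requires a Swift 6.2 toolchain; InlineArray was introduced by SE-0453.

/// Dot product using eight independent partial sums stored inline
/// (no heap allocation). Hypothetical example, not the author's kernel.
func dotUnrolled(_ x: [Float], _ y: [Float]) -> Float {
    precondition(x.count == y.count)
    var acc = InlineArray<8, Float>(repeating: 0)
    let blocks = x.count / 8
    for b in 0..<blocks {
        let base = b * 8
        // The fixed trip count of eight gives the compiler a loop it can unroll.
        for lane in 0..<8 {
            acc[lane] += x[base + lane] * y[base + lane]
        }
    }
    // Combine the partial sums.
    var total: Float = 0
    for lane in 0..<8 {
        total += acc[lane]
    }
    // Handle elements that did not fill a whole block of eight.
    for i in (blocks * 8)..<x.count {
        total += x[i] * y[i]
    }
    return total
}
```

Because the lane count is part of the type, the accumulator needs no heap allocation and its size is known at compile time. Whether the compiler actually unrolls or vectorises the loop depends on optimisation settings.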

To match C compiled with fast-math flags, the author adopted relaxed floating-point arithmetic from Swift Numerics, which allows the compiler to emit fused multiply-add operations. CPU multi-threading followed, implemented via DispatchQueue.concurrentPerform, which delivered a 5.4x improvement over the single-threaded version, although the author noted constraints related to memory traversal.
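The multi-threading step can be illustrated with a row-parallel matrix multiplication built on DispatchQueue.concurrentPerform. This is a minimal sketch assuming row-major storage and one output row per work item. The function name matmulParallel and the partitioning scheme are hypothetical, and the author's actual work division may differ.

```swift
import Dispatch

/// Row-major C (m x n) = A (m x k) * B (k x n), with output rows
/// computed in parallel. Hypothetical example for illustration.
func matmulParallel(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { cBuffer in
        // Copy the pointer value so the concurrent closure does not capture the inout parameter.
        let out = cBuffer
        a.withUnsafeBufferPointer { aPtr in
            b.withUnsafeBufferPointer { bPtr in
                // Each iteration writes a distinct output row, so no two threads write the same element.
                DispatchQueue.concurrentPerform(iterations: m) { row in
                    for p in 0..<k {
                        let aVal = aPtr[row * k + p]
                        // Walk B and C along contiguous rows for cache-friendly traversal.
                        for col in 0..<n {
                            out[row * n + col] += aVal * bPtr[p * n + col]
                        }
                    }
                }
            }
        }
    }
    return c
}
```

concurrentPerform blocks until every iteration has finished, so the result array is complete when the function returns.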

The author then used reverse-engineered AMX instructions to build a tiled matrix multiplication kernel, gaining a further 1.67x over the SIMD-optimised CPU code. Finally, custom Metal shaders offloaded computation to the GPU. This gave a slight training speedup over the AMX CPU implementation but slower inference than the CPU-based approaches.

The author released a test harness app, CwlLlmSwift, to validate the improvements. The final figure of 1.1 Tflop/s is a notable result for handwritten Swift code, though the author cautions that established frameworks such as Accelerate and Core ML remain better choices for production machine learning workloads.
