Tech

Handwritten LLM Training Code Achieves 382-Fold Speedup on Apple Silicon

By leveraging Swift 6.2 features, relaxed floating-point math, and reverse-engineered AMX instructions, the author bridges the performance gap between Swift and C.

Author: Owen Mercer, Markets and Finance Editor
Source: Hacker News
A developer details a series of optimisations that take a naive Swift matrix multiplication implementation from 2.8 Gflop/s to 1.1 Tflop/s.

A developer has published a detailed technical deep-dive on optimising handwritten large language model training code in Swift for Apple Silicon. Starting from a naive implementation running at 2.8 Gflop/s, the author reached 1.1 Tflop/s, a 382-fold increase. The gains came from a combination of Swift 6.2 features, relaxed floating-point arithmetic, CPU multi-threading, reverse-engineered, undocumented AMX instructions, and custom Metal GPU shaders.
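For readers unfamiliar with the starting point, a naive matrix multiplication and the arithmetic behind a Gflop/s figure can be sketched as follows. This is a minimal illustration, not the author's code: the function names `naiveMatmul` and `gflops` are hypothetical, and the sketch assumes the usual convention of counting 2·m·n·k floating-point operations per multiplication.

```swift
/// Naive row-major matrix multiply: C (m×n) = A (m×k) · B (k×n).
/// Hypothetical helper for illustration only.
func naiveMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            // Dot product of row i of A with column j of B.
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

/// Each output element costs k multiplies and k adds, so a full
/// multiplication is counted as 2·m·n·k floating-point operations.
func gflops(m: Int, k: Int, n: Int, seconds: Double) -> Double {
    Double(2 * m * n * k) / seconds / 1e9
}
```

Under that convention, a 1000×1000 by 1000×1000 multiplication that takes one second corresponds to 2 Gflop/s.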

The work uses Andrej Karpathy's llm.c project as its baseline for validation, aiming to replicate that plain C implementation of a GPT-2-compatible model in Swift without using external libraries. The initial naive Swift implementation was roughly 15–20 times slower than the reference C code, producing one training iteration every seven seconds. Successive optimisations improved this to approximately 12 tokens per second for training.

Significant performance gains were attributed to specific Swift 6.2 features, particularly MutableSpan and InlineArray. MutableSpan resolved overhead associated with array buffer mutation checks, while InlineArray allowed for stack-allocated arrays that matched C compiler unrolling strategies. These changes brought the Swift implementation to a point where it was roughly equivalent to the C code for inference and slightly faster for training.
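The idea of paying array mutation checks once rather than on every element can be sketched without Swift 6.2. The example below deliberately swaps in the long-standing `withUnsafeMutableBufferPointer` API as a stand-in for MutableSpan; it is not the article's approach, and the function `addBias` is a hypothetical example.

```swift
/// Adds `bias` to every row of a row-major `rows × cols` matrix in place.
/// Hypothetical helper; uses unsafe buffer pointers as a stand-in for
/// the MutableSpan approach described in the article.
func addBias(to out: inout [Float], bias: [Float], rows: Int, cols: Int) {
    precondition(out.count == rows * cols && bias.count == cols)
    bias.withUnsafeBufferPointer { biasBuf in
        // The array makes sure its storage is uniquely referenced once,
        // on entry to this closure, instead of on every element write.
        out.withUnsafeMutableBufferPointer { outBuf in
            for r in 0..<rows {
                for c in 0..<cols {
                    // Buffer pointer subscripts are bounds-checked
                    // only in debug builds.
                    outBuf[r * cols + c] += biasBuf[c]
                }
            }
        }
    }
}
```

The trade-off is that buffer pointers give up memory safety in release builds, which is the gap that span types are intended to close.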

The author then adopted the Relaxed floating-point operations from Swift Numerics. These enabled fused multiply-add operations, giving Swift an equivalent of the fast-math flags available to C. Following this, CPU multi-threading was implemented via DispatchQueue.concurrentPerform, which provided a 5.4x improvement over the single-threaded version, although the author noted constraints related to memory traversal.
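The two steps in this paragraph can be combined in a small sketch. To stay free of dependencies it swaps in the standard library's `addingProduct`, an explicit fused multiply-add, in place of the Swift Numerics Relaxed operations the article describes. Splitting the work by output row is an assumption made for illustration, and `parallelMatmul` is a hypothetical name.

```swift
import Dispatch

/// Row-parallel matrix multiply: C (m×n) = A (m×k) · B (k×n), row-major.
/// Hypothetical helper; the article's own partitioning may differ.
func parallelMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { cBuf in
        // Copy the pointer so the worker closure captures a plain value.
        let out = cBuf
        // One iteration per output row. Each iteration writes only to
        // its own row, so the workers never touch the same element.
        DispatchQueue.concurrentPerform(iterations: m) { i in
            for j in 0..<n {
                var sum: Float = 0
                for p in 0..<k {
                    // Fused multiply-add: sum + a * b with one rounding.
                    sum = sum.addingProduct(a[i * k + p], b[p * n + j])
                }
                out[i * n + j] = sum
            }
        }
    }
    return c
}
```

`concurrentPerform` blocks until every iteration has finished, so the result array is complete when the function returns.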

The author then used reverse-engineered AMX instructions to build a tiled matrix multiplication kernel, achieving a further 1.67x speedup over the SIMD-optimised CPU code. Finally, custom Metal shaders were written to offload computation to the GPU. This gave a slight training speedup over the AMX CPU implementation but slower inference than the CPU-based approaches.
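AMX instructions are undocumented and cannot be shown in portable code, but the tiling idea behind such a kernel can be sketched in plain Swift. The example below is only an illustration of tiling; it uses no AMX or Metal, and the name `tiledMatmul` and the default tile size of 16 are assumptions.

```swift
/// Tiled row-major matrix multiply: C (m×n) = A (m×k) · B (k×n).
/// Hypothetical helper showing tiling only, with no AMX instructions.
func tiledMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int,
                 tile: Int = 16) -> [Float] {
    precondition(a.count == m * k && b.count == k * n && tile > 0)
    var c = [Float](repeating: 0, count: m * n)
    // Walk the output in tile × tile blocks so that the slices of A and B
    // being combined are reused while they are still in cache.
    for i0 in stride(from: 0, to: m, by: tile) {
        for j0 in stride(from: 0, to: n, by: tile) {
            for p0 in stride(from: 0, to: k, by: tile) {
                // Clamp the block edges for sizes not divisible by `tile`.
                let iEnd = min(i0 + tile, m)
                let jEnd = min(j0 + tile, n)
                let pEnd = min(p0 + tile, k)
                for i in i0..<iEnd {
                    for p in p0..<pEnd {
                        let aip = a[i * k + p]
                        for j in j0..<jEnd {
                            c[i * n + j] += aip * b[p * n + j]
                        }
                    }
                }
            }
        }
    }
    return c
}
```

A hardware kernel would replace the three innermost loops with block-level instructions; the outer blocking structure is the part this sketch is meant to convey.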

The author released a test harness app, CwlLlmSwift, to validate the improvements. The final performance of 1.1 Tflop/s represents a significant milestone for raw Swift code, though the author cautions that established frameworks like Accelerate and CoreML remain superior choices for production machine learning workloads.
