Tech

Hands-on Workshop Released for Building Language Models from Scratch

Developers can now build a working GPT model in under an hour using PyTorch on Apple Silicon, NVIDIA GPUs, or Google Colab, and learn the fundamental mechanics of transformer training pipelines along the way.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: Hacker News · original
A new GitHub project simplifies training a GPT-style language model by stripping away high-level libraries and scaling the architecture down to fit standard hardware.

A new educational workshop, hosted on GitHub in the repository angelos-p/llm-from-scratch, guides users through building a language model from the ground up. The project aims to replicate the educational experience of Andrej Karpathy's nanoGPT by stripping away black-box libraries and requiring users to write every component of the training pipeline themselves. The target output is a functional GPT model capable of generating text, trained either on local hardware such as Apple Silicon or NVIDIA GPUs, or in Google Colab.

The workshop scales down the architecture of GPT-2, whose smallest version contains 124 million parameters, to a model with approximately 10 million parameters. This reduction means the model can be trained on a standard laptop in under an hour, which makes the workshop practical as a single-session exercise. By avoiding high-level abstractions, the project exposes the underlying mechanics of transformer training pipelines that frameworks often obscure.
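To see where those parameter counts come from, the arithmetic can be sketched in a few lines of Python. This is an illustration rather than the workshop's own code: it assumes a GPT-2-style layout (biases on every linear layer, a 4x-wide MLP, learned position embeddings, and an output head that shares weights with the token embedding). The GPT-2 figures are the published configuration of its smallest version; for the scaled-down model, only the vocabulary size of 65 and block size of 256 come from the workshop, while the 6 layers and 384-wide embeddings are assumed values borrowed from nanoGPT's character-level Shakespeare configuration.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd):
    """Count trainable parameters of a GPT-2-style decoder.

    Assumes biases on every linear layer, a 4x-wide MLP, learned position
    embeddings, and an output head tied to the token embedding (so the head
    adds no parameters). The number of attention heads does not change the
    count, so it is not an argument.
    """
    token_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    layer_norm = 2 * n_embd                      # scale + shift
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd  # fused query/key/value projection
    attn_out = n_embd * n_embd + n_embd          # attention output projection
    mlp_up = n_embd * 4 * n_embd + 4 * n_embd    # expand to 4x width
    mlp_down = 4 * n_embd * n_embd + n_embd      # project back down
    per_block = 2 * layer_norm + attn_qkv + attn_out + mlp_up + mlp_down
    # Embeddings, n_layer transformer blocks, and one final layer norm.
    return token_emb + pos_emb + n_layer * per_block + layer_norm


# Published configuration of the smallest GPT-2 model.
gpt2_small = gpt_param_count(vocab_size=50257, block_size=1024, n_layer=12, n_embd=768)

# Hypothetical workshop-sized config: n_layer and n_embd are assumptions.
scaled_down = gpt_param_count(vocab_size=65, block_size=256, n_layer=6, n_embd=384)

print(f"GPT-2 small: {gpt2_small / 1e6:.1f}M parameters")
print(f"Scaled-down: {scaled_down / 1e6:.1f}M parameters")
```

Under these assumptions the script reports about 124.4M parameters for GPT-2 and about 10.8M for the smaller model. Most of the saving comes from halving both the depth and the embedding width, since each transformer block's size grows with the square of the width.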

The implementation configures itself automatically for the available hardware: MPS on Apple Silicon, CUDA on NVIDIA GPUs, or a standard CPU. It also runs on Google Colab, where users can upload the repository, install dependencies, and run the training script directly in a notebook environment. This flexibility allows developers without a high-performance local setup to take part in the workshop and gain practical experience with the training process.
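Automatic device selection of this kind is usually a short check at the top of a training script. The sketch below shows one common way to do it and is not taken from the repository: the helper names `pick_device` and `detect_device` are hypothetical, and the preference order (CUDA, then MPS, then CPU) is an assumption. The PyTorch calls `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are the standard availability checks.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple Silicon, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Ask PyTorch which backends are usable; report "cpu" if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no MPS backend module, so look it up defensively.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


print(f"training on: {detect_device()}")
```

The returned string can be passed to `.to(device)` on a PyTorch model or tensor, so the rest of a training script does not need to know which hardware it is running on.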

The tutorial uses character-level tokenisation with a vocabulary size of 65 and a block size (context length) of 256, tailored to the Shakespeare dataset used in the guide. The author notes that Byte-Pair Encoding (BPE) tokenisation, standard in GPT-2, is omitted for this small dataset because most token bigrams would occur too rarely for the model to learn from them. BPE is reserved for larger datasets in Part 5 of the workshop, where there is enough data for the model to learn meaningful patterns.
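Character-level tokenisation is simple enough to show in full. The following is a minimal sketch under stated assumptions, not the workshop's implementation: a single line of Shakespeare stands in for the dataset, the block size is shrunk from 256 to 8 so the output is readable, and the names `encode`, `decode`, and `training_pair` are illustrative.

```python
text = "To be, or not to be, that is the question."  # stand-in for the dataset

# The vocabulary is every distinct character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s):
    """Turn a string into a list of integer token ids."""
    return [stoi[ch] for ch in s]


def decode(ids):
    """Turn a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)


def training_pair(ids, start, block_size):
    """Slice one (input, target) pair; the target is the input shifted by one."""
    x = ids[start : start + block_size]
    y = ids[start + 1 : start + block_size + 1]
    return x, y


ids = encode(text)
x, y = training_pair(ids, start=0, block_size=8)

print(len(chars), "distinct characters in the sample")
print(repr(decode(x)), "->", repr(decode(y)))  # 'To be, o' -> 'o be, or'
```

On the full Shakespeare text the same `sorted(set(text))` step is what yields the 65-character vocabulary described above. Each position in `y` is the character the model should predict after seeing the corresponding prefix of `x`.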

Upon completion, users will have produced three core files: model.py, train.py, and generate.py. These files are written without loading any pre-trained models, so users see how each component works within the system. The workshop is designed to provide the same foundational insight as Andrej Karpathy's original nanoGPT project, which changed how many developers approach artificial intelligence by revealing the code behind the black box.
