Hands-on Workshop Released for Building Language Models from Scratch
Developers can now build a functional GPT model in under an hour using PyTorch on Apple Silicon, NVIDIA GPUs, or Google Colab, and learn the fundamental mechanics of transformer training pipelines along the way.
A new educational project, hosted on GitHub in the repository angelos-p/llm-from-scratch, guides users through building a language model from the ground up. It aims to replicate the educational experience of Andrej Karpathy's nanoGPT by stripping away black-box libraries and requiring users to write every component of the training pipeline themselves. The target output is a functional GPT model capable of generating text, trained either on local hardware such as Apple Silicon or NVIDIA GPUs, or in Google Colab.
The workshop scales down the GPT-2 architecture, whose smallest variant contains 124 million parameters, to a model with approximately 10 million parameters. This reduction lets the model train on a standard laptop in under an hour, making the workshop suitable for a single session. By avoiding high-level abstractions, the project teaches the underlying mechanics of transformer training pipelines that higher-level frameworks often obscure.
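To make the scale difference concrete, the sketch below counts the parameters of a GPT-2 style decoder from its hyperparameters. The function name gpt_param_count is hypothetical, and the small configuration (6 layers, embedding width 384) is an assumption chosen to land near 10 million parameters; the workshop's actual layer count and width are not stated here. The count assumes GPT-2 conventions: learned position embeddings, biases on all linear layers, a 4x MLP expansion, and an output head that shares weights with the token embedding.

```python
def gpt_param_count(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    """Count parameters of a GPT-2 style model with a tied output head."""
    token_emb = vocab_size * n_embd          # token embedding (shared with the output head)
    pos_emb = block_size * n_embd            # learned position embedding

    # Per transformer block:
    attn_qkv = n_embd * 3 * n_embd + 3 * n_embd   # fused query/key/value projection + bias
    attn_out = n_embd * n_embd + n_embd           # attention output projection + bias
    mlp_in = n_embd * 4 * n_embd + 4 * n_embd     # MLP expansion + bias
    mlp_out = 4 * n_embd * n_embd + n_embd        # MLP contraction + bias
    layer_norms = 2 * (2 * n_embd)                # two LayerNorms, each with weight and bias
    per_block = attn_qkv + attn_out + mlp_in + mlp_out + layer_norms

    final_norm = 2 * n_embd                  # LayerNorm before the output head
    return token_emb + pos_emb + n_layer * per_block + final_norm


# Smallest GPT-2 variant: 12 layers, width 768, 50,257 BPE tokens, context 1,024.
gpt2_small = gpt_param_count(n_layer=12, n_embd=768, vocab_size=50257, block_size=1024)

# Assumed workshop-scale configuration (illustrative only).
small_model = gpt_param_count(n_layer=6, n_embd=384, vocab_size=65, block_size=256)

print(f"GPT-2 small: {gpt2_small:,}")   # 124,439,808
print(f"Small model: {small_model:,}")  # 10,770,816
```

Under these assumptions, almost all of the small model's parameters sit in the transformer blocks, because a 65-character vocabulary makes the embedding table negligible compared with GPT-2's 50,257-token table.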
The implementation automatically configures itself for the available hardware: MPS on Apple Silicon, CUDA on NVIDIA GPUs, or a standard CPU. It also runs on Google Colab, where users can upload the repository, install dependencies, and run the training script directly in a notebook environment. This flexibility allows developers without a local high-performance setup to participate in the workshop and gain practical experience with the training process.
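Device selection of this kind usually comes down to a short priority check. The following is a minimal sketch, not the workshop's actual code: the function names pick_device and detect_device are hypothetical, and the priority order (CUDA, then MPS, then CPU) is an assumption. It relies only on the standard PyTorch calls torch.cuda.is_available() and torch.backends.mps.is_available().

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device name given which backends are usable."""
    if cuda_available:
        return "cuda"   # NVIDIA GPU
    if mps_available:
        return "mps"    # Apple Silicon GPU via Metal Performance Shaders
    return "cpu"        # fallback that works everywhere


def detect_device() -> str:
    """Query PyTorch for available backends and pick one."""
    import torch  # imported here so pick_device stays usable without PyTorch

    return pick_device(
        cuda_available=torch.cuda.is_available(),
        mps_available=torch.backends.mps.is_available(),
    )
```

A training script could then call detect_device() once and move the model and each batch to the result with .to(device). On a Colab GPU runtime this check would be expected to return "cuda".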
The tutorial uses character-level tokenisation with a vocabulary size of 65 and a block size of 256, tailored to the Shakespeare dataset used in the guide. The author notes that Byte-Pair Encoding (BPE) tokenisation, standard in GPT-2, is omitted for this small dataset because most token bigrams would be too rare for the model to learn effectively. BPE is reserved for the larger datasets in Part 5 of the workshop, where there is enough data for the model to learn meaningful patterns.
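Character-level tokenisation is simple enough to show in a few lines. The sketch below is illustrative rather than taken from the workshop: the sample text, the helper names encode, decode, and get_example, and the block size of 8 are all assumptions made to keep the example short. On the full Shakespeare dataset the same approach would yield the 65-character vocabulary, and the block size would be 256.

```python
# A short stand-in for the Shakespeare corpus (assumption for illustration).
text = "To be, or not to be: that is the question."

# The vocabulary is every distinct character, sorted for a stable ordering.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for ch, i in stoi.items()}      # integer id -> character


def encode(s: str) -> list[int]:
    """Map each character to its integer id."""
    return [stoi[c] for c in s]


def decode(ids: list[int]) -> str:
    """Map integer ids back to the original string."""
    return "".join(itos[i] for i in ids)


def get_example(data: list[int], start: int, block_size: int) -> tuple[list[int], list[int]]:
    """Return an input window and its targets, shifted one position ahead."""
    x = data[start : start + block_size]
    y = data[start + 1 : start + block_size + 1]
    return x, y


data = encode(text)
block_size = 8  # the workshop uses 256; 8 keeps the printout readable
x, y = get_example(data, start=0, block_size=block_size)

print("vocab size:", len(chars))
print("input :", repr(decode(x)))
print("target:", repr(decode(y)))
```

Each target is the input shifted by one character, so a single window of length block_size gives the model block_size next-character predictions to learn from.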
Upon completion, users will have produced three core files: model.py, train.py, and generate.py. These files are written without any pre-trained model loading functions, so users come away understanding how each component functions within the system. The workshop is designed to provide the same foundational insight as Andrej Karpathy's original nanoGPT project, which changed how many developers approach artificial intelligence by revealing the code behind the black box.


