Training a GPT-2 From Scratch in Rust: What We Learned

The Case for Training From Scratch

Most engineers who work with LLMs never train one from scratch. Why would you? Pretraining a useful model costs millions in compute. But there are important reasons to understand what pretraining actually involves — and the best way to understand it is to do it on a small model where the cost is tractable.

This article summarizes the Feste blog series from Tag1 Consulting (tag1.com/how-to/), which documents training a GPT-2–scale model entirely in Rust. The series covers tokenizer implementation, architecture design, training loop, and optimization — the full stack, in a language without the neural network sugar of PyTorch. Working through it teaches you things you won't get from fine-tuning tutorials.

Why Rust?

Rust seems like an unlikely choice for ML. At the time the Feste series was written, its ecosystem lacked the mature autograd and GPU abstractions available in Python. So why do it?

The answer is that the constraints are pedagogically valuable. Without PyTorch's automatic differentiation, you implement backpropagation yourself, which forces a depth of understanding that the torch.autograd machinery hides. Without a high-level training loop, you implement the optimizer, gradient accumulation, learning rate scheduling, and checkpointing explicitly. The result is a full understanding of what actually happens during training.

Practically, Rust's performance characteristics also make it a credible alternative to Python-based stacks for CPU-bound inference. Projects like candle (Hugging Face's Rust ML framework) and burn demonstrate that Rust-based ML is increasingly viable, and Feste is in the tradition of those explorations.

The Tokenizer (Feste Part 1)

Part 1 of the Feste series implements BPE from scratch. The implementation covers: reading a text corpus, computing initial character-level statistics, implementing the merge algorithm, writing the tokenizer's vocabulary to disk, and implementing encode/decode. The Rust implementation is notable for its clarity: the code makes the algorithm's logic visible in a way that NumPy-optimized Python implementations sometimes obscure.

Key insight from the implementation: the merge rules file is the tokenizer. There is no magic. The tokenizer is an ordered list of pair-merge rules, applied in the order they were learned, plus a special-token table. Understanding this demystifies tokenization completely.
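
To make the "merge rules are the tokenizer" point concrete, here is a minimal sketch of applying learned merges to a single word. It is not the Feste code: the function name bpe_encode, the character-level starting tokens, and the representation of merges as string pairs are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Apply ordered BPE merge rules to one word (hypothetical helper, not from Feste).
/// `merges` lists pairs in the order they were learned; earlier rules win.
fn bpe_encode(word: &str, merges: &[(String, String)]) -> Vec<String> {
    // Rank of each rule: lower rank means learned earlier, so higher priority.
    let ranks: HashMap<(String, String), usize> = merges
        .iter()
        .enumerate()
        .map(|(i, pair)| (pair.clone(), i))
        .collect();

    // Assumption: start from individual characters.
    let mut tokens: Vec<String> = word.chars().map(|c| c.to_string()).collect();

    loop {
        // Find the adjacent pair with the lowest rank, as (rank, position).
        let mut best: Option<(usize, usize)> = None;
        for i in 0..tokens.len().saturating_sub(1) {
            // Cloning keeps the sketch simple; a real tokenizer would avoid it.
            let key = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        let i = match best {
            Some((_, i)) => i,
            None => break, // no applicable rule left
        };
        // Replace the pair at position i with its concatenation.
        let merged = format!("{}{}", tokens[i], tokens[i + 1]);
        tokens[i] = merged;
        tokens.remove(i + 1);
    }
    tokens
}
```

With the rules l+l, h+e, he+ll, hell+o (in that order), "hello" collapses to a single token, while "help" stops at he, l, p because no rule covers the remaining pairs.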

The Transformer Architecture (Feste Part 2)

Part 2 builds the transformer architecture in Rust without autograd. This means implementing forward passes for embeddings, layer normalization, multi-head self-attention, feed-forward networks, and the final language model head — and then implementing the corresponding backward passes by hand.

Implementing backpropagation for the attention mechanism by hand is illuminating. The gradients flow through the softmax, the dot products, and the value weighting in a specific way. Understanding this flow helps explain why gradient clipping is a useful safeguard (gradient norms can occasionally spike, even though a saturated softmax on its own passes back very small gradients) and why residual connections are so important for stable training.
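
As an illustration of what a hand-written backward pass looks like, the sketch below implements the softmax step for one row of attention scores. The function names and the use of plain f32 slices are assumptions for this example; the Feste code may be organized differently. The backward formula used is dL/ds_i = p_i * (dL/dp_i - sum_j p_j * dL/dp_j), where p = softmax(s).

```rust
/// Numerically stable softmax over one row of attention scores.
fn softmax(scores: &[f32]) -> Vec<f32> {
    // Subtract the row maximum so exp() cannot overflow.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Backward pass for softmax: given p = softmax(s) and dL/dp, return dL/ds.
fn softmax_backward(probs: &[f32], grad_out: &[f32]) -> Vec<f32> {
    // Shared term: sum_j p_j * dL/dp_j.
    let dot: f32 = probs.iter().zip(grad_out).map(|(&p, &g)| p * g).sum();
    probs
        .iter()
        .zip(grad_out)
        .map(|(&p, &g)| p * (g - dot))
        .collect()
}
```

A useful sanity check when writing this by hand: the entries of the returned gradient always sum to zero, because shifting every score by the same constant leaves the softmax output unchanged.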

The Training Loop (Feste Part 3)

Part 3 covers the training infrastructure: data loading, batching, the forward-backward-update cycle, AdamW optimizer implementation, learning rate scheduling (cosine with warmup), gradient clipping, and checkpointing. This is where the engineering complexity lives — and where most introductory ML courses skip the details.
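
The cosine-with-warmup schedule mentioned above can be sketched in a few lines. This is one common formulation, not necessarily the one Feste uses: the function name lr_at, the linear warmup shape, and the min_lr floor are assumptions.

```rust
use std::f32::consts::PI;

/// Learning rate at a given step: linear warmup, then cosine decay to `min_lr`.
/// (Hypothetical helper; parameter names are illustrative.)
fn lr_at(step: usize, warmup_steps: usize, total_steps: usize, max_lr: f32, min_lr: f32) -> f32 {
    if step < warmup_steps {
        // Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) as f32 / warmup_steps as f32;
    }
    if step >= total_steps {
        // Past the end of the schedule, hold at the floor.
        return min_lr;
    }
    // Fraction of the decay phase completed, in [0, 1).
    let progress = (step - warmup_steps) as f32 / (total_steps - warmup_steps) as f32;
    // Cosine factor goes smoothly from 1 down to 0.
    let cosine = 0.5 * (1.0 + (PI * progress).cos());
    min_lr + cosine * (max_lr - min_lr)
}
```

With 10 warmup steps, 100 total steps, max_lr = 1.0 and min_lr = 0.1, the rate starts at 0.1, reaches 1.0 at the end of warmup, passes through 0.55 halfway through the decay, and settles back at 0.1.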

The implementation reveals something important: training stability is not automatic. It requires careful initialization, appropriate learning rates, gradient clipping thresholds, and warmup periods. Running without these safeguards produces diverging loss curves. The Feste implementation includes detailed commentary on why each parameter was chosen.
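
One of these safeguards, gradient clipping, is small enough to show in full. The sketch below clips by global L2 norm across all parameter tensors, which is a common choice. Whether Feste clips by global norm or per tensor is not stated here, so treat the function name, the Vec<f32> gradient layout, and the 1e-6 epsilon as assumptions.

```rust
/// Scale all gradients in place so their combined L2 norm is at most `max_norm`.
/// Returns the norm measured before clipping (useful for logging).
fn clip_grad_norm(grads: &mut [Vec<f32>], max_norm: f32) -> f32 {
    // Global norm: square root of the sum of squares over every tensor.
    let total_sq: f32 = grads
        .iter()
        .flat_map(|g| g.iter())
        .map(|&x| x * x)
        .sum();
    let norm = total_sq.sqrt();

    if norm > max_norm {
        // Small epsilon guards against division issues for extreme values.
        let scale = max_norm / (norm + 1e-6);
        for g in grads.iter_mut() {
            for x in g.iter_mut() {
                *x *= scale;
            }
        }
    }
    norm
}
```

Clipping rescales the whole gradient vector rather than truncating individual entries, so the update direction is preserved and only its length changes.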

What We Learned

Several lessons stand out from the Feste series:

Initialization matters enormously. Xavier and Kaiming initialization, applied correctly to different layer types, is the difference between training that converges and training that diverges from the first step.
The optimizer is load-bearing. AdamW, rather than Adam with L2 regularization, is the standard choice for transformer training. AdamW applies weight decay directly to the parameters instead of adding it to the gradient, so the decay is not rescaled by Adam's per-parameter adaptive step size and regularizes every weight at a consistent rate.
Tokenization quality shows up in training loss. A poor tokenizer produces a floor on cross-entropy that a good tokenizer doesn't hit. The gap between GPT-2's tokenizer and a poorly designed BPE tokenizer is visible in the training curves within 1,000 steps.
Data quality trumps data quantity at GPT-2 scale. A carefully filtered 10GB dataset outperforms a noisy 100GB dataset by a substantial margin on downstream evaluation.
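
The AdamW item in the list above turns on where weight decay is applied. The sketch below shows a single AdamW update over a flat parameter slice, with the decay applied to the parameters themselves rather than folded into the gradient. The struct layout, field names, and default beta and epsilon values are assumptions for illustration, not taken from the Feste code.

```rust
/// Minimal AdamW state for a flat parameter vector (illustrative layout).
struct AdamW {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
    m: Vec<f32>, // first-moment (mean) estimate per parameter
    v: Vec<f32>, // second-moment (uncentered variance) estimate per parameter
    t: i32,      // number of steps taken so far
}

impl AdamW {
    fn new(n: usize, lr: f32, weight_decay: f32) -> Self {
        AdamW {
            lr,
            beta1: 0.9,   // commonly used defaults, assumed here
            beta2: 0.999,
            eps: 1e-8,
            weight_decay,
            m: vec![0.0; n],
            v: vec![0.0; n],
            t: 0,
        }
    }

    fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        self.t += 1;
        // Bias-correction terms for the zero-initialized moment estimates.
        let bc1 = 1.0 - self.beta1.powi(self.t);
        let bc2 = 1.0 - self.beta2.powi(self.t);
        for i in 0..params.len() {
            // Decoupled weight decay: shrink the parameter directly.
            // It never passes through the adaptive 1/sqrt(v) scaling below.
            params[i] -= self.lr * self.weight_decay * params[i];

            // Standard Adam moment updates using the raw gradient.
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * grads[i];
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * grads[i] * grads[i];
            let m_hat = self.m[i] / bc1;
            let v_hat = self.v[i] / bc2;
            params[i] -= self.lr * m_hat / (v_hat.sqrt() + self.eps);
        }
    }
}
```

With a zero gradient, this step still shrinks each weight by a factor of (1 - lr * weight_decay). Adam with L2 regularization would instead add the decay term to the gradient, where it gets divided by the square root of the second-moment estimate along with everything else.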

The full Feste series is required reading in Meridian AI's MS in LLM Engineering program and is freely available at tag1.com/how-to/.