Learning LLMs Through MicroGPT

I am slowly nerd-sniping myself into learning more about how Large Language Models work. Improving instructions, skills, and agents is interesting on its own, but I am also drawn to understanding the internals, at least at a basic level.

I had been meaning to look at Andrej Karpathy's MicroGPT, a 200-line Python script that builds a small model, including both training and inference. Over the last few weeks, I've finally had time to dive into it.

While reading the source, I wanted to make it a better learning tool by annotating the code, so I used AI to integrate Andrej's MicroGPT guide post directly into it. It grew to more than 600 lines, but now I don't have to jump between the IDE and the browser to understand some details (I have zero ML knowledge). [1] [2]

The next thing I noticed is that, while such compact code is great for sharing, I had to scroll up and down a lot, and except for the Value class, everything else is just functions inside a single file/module. I don't have an issue with multiple files, so I refactored the code into a few classes and modules. One thing I found interesting is that once the code was split into classes and modules, the model felt less "magical". The implementation remained almost identical, but the structure made it easier to reason about each piece independently.

The original script does training + inference, with fixed parameters and a fixed input filename. One of the first questions I wanted to answer was: what is actually required for inference once training is complete? So, after extracting most of the constants and hyperparameters to a configuration file, the first new feature I added was an inference-only mode (--load). It saves and loads model weights and some metadata as JSON files, and hashes the parameters it trained with into the filename, so you can freely experiment with training more layers, more attention heads, or simply feeding it more dataset documents.
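As a rough illustration of the hashed-filename idea, here is a minimal sketch. The function names (`weights_path`, `save_model`, `load_model`), the JSON layout, and the choice of SHA-256 are assumptions made for this example, not necessarily what the repository does.

```python
import hashlib
import json
from pathlib import Path


def weights_path(hyperparams: dict, directory: Path) -> Path:
    """Derive a filename from the hyperparameters (hypothetical naming scheme)."""
    # sort_keys makes the hash independent of the dict's insertion order
    canonical = json.dumps(hyperparams, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return directory / f"model_{digest}.json"


def save_model(hyperparams: dict, vocab: list, weights: dict, directory: Path) -> Path:
    """Store what inference needs: hyperparameters, vocabulary, and raw weights."""
    path = weights_path(hyperparams, directory)
    payload = {"hyperparams": hyperparams, "vocab": vocab, "weights": weights}
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def load_model(hyperparams: dict, directory: Path) -> dict:
    """Load the file that was saved for exactly these hyperparameters."""
    path = weights_path(hyperparams, directory)
    return json.loads(path.read_text(encoding="utf-8"))
```

With a scheme like this, changing any hyperparameter produces a different filename, so models trained with different settings never overwrite each other.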

I've also added support for different input datasets. I trained the model with five-character words from the first Dune book, and it was not too bad at coming up with new words (4 out of 20 were valid new words that did not exist in the dataset). It's really cool to see how it works in such a narrow scenario, where you understand most of it. [3]
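For reference, building that kind of dataset and checking which samples are new could look roughly like this. It is only a sketch: `fixed_length_words` and `novel_samples` are hypothetical helpers, and the tokenization rule (runs of ASCII letters) is an assumption.

```python
import re


def fixed_length_words(text: str, length: int = 5) -> list:
    """Return the unique lowercase words of the given length, in order of first appearance."""
    words = re.findall(r"[a-z]+", text.lower())
    seen = set()
    result = []
    for word in words:
        if len(word) == length and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def novel_samples(samples: list, dataset: list) -> list:
    """Keep only the generated samples that do not appear in the dataset."""
    known = set(dataset)
    return [sample for sample in samples if sample not in known]
```

This only tells you whether a sample is new. Deciding whether a new sample is also a valid word still requires a dictionary or a manual check.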

Other features that I added are:

  • A few command-line args to set the number of training steps, the temperature, or the number of results/samples to show in inference
  • Inference uses colors to show, for each sample, whether it exists in the input dataset and whether the model saw it during training
  • Inference still generates N random results by default. They are only partially random, as the seed is fixed, but you can unfix it. Most importantly, you can also use "inference input" mode, which asks for a starting sequence and shows N random completions of that "word"
  • Any config constant can be overridden easily, to try generating smarter models
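To make the temperature and fixed-seed points more concrete, here is a minimal sketch of temperature sampling with a seeded random generator. The logits are made up and the helper names are hypothetical; a real run would take the logits from the model's output for the current position.

```python
import math
import random


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def sample_token(logits: list, temperature: float, rng: random.Random) -> int:
    """Pick one token index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(42)   # fixed seed: the same "random" picks on every run
logits = [2.0, 1.0, 0.1]  # made-up scores for a three-token vocabulary
picks = [sample_token(logits, temperature=0.5, rng=rng) for _ in range(5)]
print(picks)
```

Creating the generator without a seed, as in `random.Random()`, gives different picks on each run. In an "inference input" style mode, the starting sequence would presumably be fed to the model first, and sampling would continue from the position after it.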

And probably more tiny tweaks. I don't know if I'll add anything else to the code, as it is already quite "tinkerable" right now, but if you want to check it out, it's up here: https://github.com/Kartones/microgpt.

This is an example of the default training output (my names.txt is just the original input.txt):

--- preparing model data ---
  input file          : datasets/names.txt
  training steps      : 1000
  temperature         : 0.5
  transformer layers  : 1
  embedding dimensions: 16
  context length      : 16
  attention heads     : 4
--- training ---
num docs: 32033
vocab size: 27
num params: 4192
step 1000 / 1000 | loss 2.6497

As you can see, the loss ends at ~2.65. The model hallucinates a significant number of names, but it has still learned how to come up with others that are valid and that it has never seen during training.
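To put that loss in context, it can be compared with a model that guesses uniformly among all tokens. The sketch below assumes the reported loss is an average cross-entropy per token measured with the natural logarithm. The two input numbers come from the training log above.

```python
import math

vocab_size = 27      # "vocab size" from the training log
final_loss = 2.6497  # final "loss" from the training log

# Loss of a model that assigns equal probability to every token
uniform_loss = math.log(vocab_size)

# exp(loss) is the perplexity: the "effective" number of equally likely choices per token
perplexity = math.exp(final_loss)

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"perplexity at the final loss: {perplexity:.2f}")
```

Under that assumption, a uniform guesser would sit at a loss of about 3.3, so ~2.65 means the model has narrowed its choice from 27 equally likely tokens to roughly 14 on average.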

I tested training a better model:

# at config_override.py
NUM_TRANSFORMER_LAYERS = 2
NUM_EMBEDDING_DIMENSIONS = 32
NUM_ATTENTION_HEADS = 8
DEFAULT_NUM_TRAINING_STEPS = 3000

--- preparing model data ---
  input file          : datasets/names.txt
  training steps      : 3000
  temperature         : 0.5
  transformer layers  : 2
  embedding dimensions: 32
  context length      : 16
  attention heads     : 8
--- training ---
num docs: 32033
vocab size: 27
num params: 26816
step 3000 / 3000 | loss 2.1404
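The "num params" values in both logs can be reproduced with a bit of arithmetic. This is a sketch under an assumed layout: separate token and position embeddings, an untied output projection, four attention matrices per layer, an MLP that expands to 4x the embedding width, and no biases or learned normalization parameters. `count_params` is a hypothetical helper, not a function from the repository.

```python
def count_params(vocab_size: int, context_length: int, embedding_dim: int, num_layers: int) -> int:
    """Count the weights of a small GPT-style model under the assumed layout."""
    token_embedding = vocab_size * embedding_dim
    position_embedding = context_length * embedding_dim
    output_head = vocab_size * embedding_dim  # separate projection back to the vocabulary

    attention = 4 * embedding_dim * embedding_dim  # query, key, value, and output matrices
    mlp = 2 * embedding_dim * (4 * embedding_dim)  # expand to 4x width, then project back
    per_layer = attention + mlp

    return token_embedding + position_embedding + output_head + num_layers * per_layer


print(count_params(27, 16, 16, 1))  # 4192, matches the first log
print(count_params(27, 16, 32, 2))  # 26816, matches the second log
```

Note that the number of attention heads does not appear in the formula: heads split the embedding dimension among themselves rather than adding weights, so going from 4 to 8 heads does not change the count by itself.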

This "medium" model seems to have fewer pure hallucinations based on a few experiments. As a side note, this was the only model configuration in my experiments that ever generated my name, diego, even though the name was not present in the training data. While I ran several experiments across the different model configurations, the sample size is far too small to draw any conclusions. Still, it was an interesting result.

Sample run with this model:

Enter starting sequence (max 15 characters): dieg
sample  1: diega
sample  2: diegr
sample  3: diegari
sample  4: diegria
sample  5: diegalia
sample  6: diege
sample  7: diegelan
sample  8: diegani
sample  9: diega
sample 10: diegon
sample 11: diegki
sample 12: diegari
sample 13: diegele
sample 14: diegmila
sample 15: diegalen
sample 16: dieghan
sample 17: diegris
sample 18: diego
sample 19: diegavi
sample 20: diegari

I also tried a big configuration:

# at config_override.py
NUM_TRANSFORMER_LAYERS = 4
NUM_EMBEDDING_DIMENSIONS = 32
NUM_ATTENTION_HEADS = 8
DEFAULT_NUM_TRAINING_STEPS = 10000

This larger model did not noticeably improve the generated output in my experiments, despite requiring significantly more training time and also showing slower inference. The loss improved, so maybe the inference quality reached a limit? Or maybe it needed even more training steps?

Summary

My main takeaway was not learning how transformers work internally, but discovering how approachable they become once you can run, modify, and retrain one yourself. MicroGPT is small enough that every change feels understandable, yet complex enough to expose many of the concepts behind larger models. Refactoring and annotating the code also turned out to be one of the most effective ways to understand it. Having deterministic training and inference makes experimentation more rigorous than simply tweaking parameters and relying on intuition.

Notes

[1] : This MicroGPT visual explanation is also great, and saves you from print()-dumping values as I did when exploring the code 😄

[2] : I've done a full pass checking them, but I'm pretty sure they can be improved, especially now that I have refactored parts of the code

[3] : I confess that I haven't gone through all the math operations yet

Tags: AI & ML Development Patterns & Practices Python Resources

Learning LLMs Through MicroGPT article, written by Kartones. Publication date: