Solo 18-month project implementing a full transformer engine in C: inference, training, a BPE tokenizer, and vision, supporting Gemma, Llama2, PaliGemma, and GPT2.
Key Takeaways
~19,300 lines across 7 files; math.c pairs every forward op with its backward counterpart, so it reads as annotated learning material.
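To make the pairing concrete, here is a minimal sketch of that style, one elementwise op with its gradient (function names and signatures are invented for illustration, not the repo's actual API):

```c
#include <stddef.h>

/* Sketch only: illustrative names, not the project's real math.c API. */

/* Forward: y = x * w, elementwise. */
void mul_forward(float *y, const float *x, const float *w, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * w[i];
}

/* Backward: given upstream dL/dy, accumulate dL/dx and dL/dw.
   Since d(x*w)/dx = w and d(x*w)/dw = x, each input's gradient is the
   upstream gradient times the *other* forward input. */
void mul_backward(float *dx, float *dw, const float *dy,
                  const float *x, const float *w, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dx[i] += dy[i] * w[i];
        dw[i] += dy[i] * x[i];
    }
}
```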
Supports SafeTensors and Karpathy weight formats, bf16/fp16/fp32 weights, AdamW with a cosine LR schedule, top-k/top-p sampling, and an mmap RAM mode.
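For a sense of what the optimizer piece involves, here is a minimal C sketch of an AdamW step driven by a cosine LR schedule (struct layout, names, and parameters are assumptions, not the project's code):

```c
#include <math.h>
#include <stddef.h>

typedef struct {                       /* hypothetical config, not the repo's */
    float lr_max, lr_min;              /* cosine schedule endpoints */
    float beta1, beta2, eps, weight_decay;
} AdamWConfig;

/* Cosine schedule: lr_max at step 0, decaying smoothly to lr_min. */
static float cosine_lr(const AdamWConfig *c, int step, int total_steps) {
    float t = (float)step / (float)total_steps;
    return c->lr_min + 0.5f * (c->lr_max - c->lr_min)
                            * (1.0f + cosf(3.14159265f * t));
}

/* One AdamW update over a flat parameter array: bias-corrected moments,
   decoupled weight decay. step starts at 1 so the corrections are nonzero. */
void adamw_step(float *p, const float *g, float *m, float *v, size_t n,
                const AdamWConfig *c, int step, int total_steps) {
    float lr  = cosine_lr(c, step, total_steps);
    float bc1 = 1.0f - powf(c->beta1, (float)step);
    float bc2 = 1.0f - powf(c->beta2, (float)step);
    for (size_t i = 0; i < n; i++) {
        m[i] = c->beta1 * m[i] + (1.0f - c->beta1) * g[i];
        v[i] = c->beta2 * v[i] + (1.0f - c->beta2) * g[i] * g[i];
        float mhat = m[i] / bc1;
        float vhat = v[i] / bc2;
        p[i] -= lr * (mhat / (sqrtf(vhat) + c->eps) + c->weight_decay * p[i]);
    }
}
```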
PaliGemma multimodal inference works on JPEG input with X11 display output; the X11 code and the JSON parser are the only AI-generated sections.
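The article doesn't show the image path, but with libjpeg as a dependency the JPEG decode step plausibly resembles this sketch (the helper name is hypothetical; production code would also install a setjmp-based error handler, since libjpeg's default error manager exits the process on fatal errors):

```c
#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

/* Decode a JPEG file into a packed RGB buffer; caller frees the result. */
unsigned char *load_jpeg_rgb(const char *path, int *w, int *h) {
    struct jpeg_decompress_struct cinfo;
    struct jpeg_error_mgr jerr;
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    cinfo.err = jpeg_std_error(&jerr);   /* default manager exit()s on error */
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, f);
    jpeg_read_header(&cinfo, TRUE);
    cinfo.out_color_space = JCS_RGB;     /* force 3-channel output */
    jpeg_start_decompress(&cinfo);

    *w = (int)cinfo.output_width;
    *h = (int)cinfo.output_height;
    size_t stride = (size_t)cinfo.output_width * cinfo.output_components;
    unsigned char *pixels = malloc(stride * cinfo.output_height);
    if (!pixels) { jpeg_destroy_decompress(&cinfo); fclose(f); return NULL; }

    while (cinfo.output_scanline < cinfo.output_height) {
        JSAMPROW row = pixels + cinfo.output_scanline * stride;
        jpeg_read_scanlines(&cinfo, &row, 1);
    }
    jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(f);
    return pixels;
}
```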
No cmake, no Python, no external ML frameworks – make is the only build step; the only deps are gcc 13+, libjpeg, and libx11.
float32 outperforms bf16/fp16 for CPU inference because CPUs lack native support for those formats, so every op pays a per-element conversion cost – the author notes this was surprising.
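The mechanics behind that result: bf16 is just the high 16 bits of an IEEE-754 float32, so on hardware without native bf16 support each value must be widened before the FPU can touch it. A sketch of that per-element tax (not the project's kernels):

```c
#include <stdint.h>
#include <string.h>

/* bf16 -> fp32: shift the 16 stored bits into the top half of a float32.
   The truncated mantissa bits come back as zeros. */
static inline float bf16_to_fp32(uint16_t b) {
    uint32_t bits = (uint32_t)b << 16;
    float f;
    memcpy(&f, &bits, sizeof f);   /* bit reinterpretation without UB */
    return f;
}

/* Dot product over bf16 weights: every multiply is preceded by a widen,
   whereas an fp32 kernel streams values straight into the FPU. */
float dot_bf16(const uint16_t *w, const float *x, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += bf16_to_fp32(w[i]) * x[i];
    return acc;
}
```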