Training an LLM in Swift, Part 1: Taking matrix multiplication from Gflop/s to Tflop/s


TLDR

  • Step-by-step optimization of a handwritten matrix multiplication in Swift on Apple Silicon, taking it from 2.8 Gflop/s into the Tflop/s range across 10 implementations, including Metal.

Key Takeaways

  • A naive Swift matmul runs 15-20x slower than the equivalent C compiled at -O3, with _ArrayBuffer.beginCOWMutation() uniqueness checks dominating the Instruments profile (naive loop sketched after this list).
  • Swift 6.2's MutableSpan eliminates the COW overhead at near-zero cost; adding a single mutableSpan call tripled training-iteration throughput (span-based sketch below).
  • C gains an edge from -ffast-math, which enables fused multiply-add (FMA) instructions; by default Swift emits separate fmul/fadd ops, hurting inner-loop throughput (FMA sketch below).
  • The benchmark baseline is Karpathy's llm.c GPT-2 model: ~0.19 trillion FLOPs per training iteration, measuring full forward+backward+weight-update cycles, not just inference (back-of-the-envelope arithmetic below).
  • Series covers CPU, SIMD, AMX, and Metal GPU paths on Apple Silicon; production use should still prefer Apple’s optimized ML frameworks.
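
The article's exact benchmark code isn't reproduced here; the following is a minimal sketch of the kind of naive triple-loop matmul the first takeaway describes, where every write into a plain Swift Array goes through copy-on-write bookkeeping:

```swift
// Naive triple-loop matmul over plain Swift Arrays (illustrative sketch,
// not the article's code). Element writes to `c` go through Array's
// copy-on-write machinery, and the uniqueness check
// (_ArrayBuffer.beginCOWMutation) is what shows up in Instruments.
func matmulNaive(_ a: [Float], _ b: [Float], _ c: inout [Float],
                 m: Int, n: Int, k: Int) {
    // a is m x k, b is k x n, c is m x n, all row-major.
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                acc += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = acc
        }
    }
}
```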
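
The MutableSpan takeaway can be sketched the same way, assuming Swift 6.2's `Array.span` / `Array.mutableSpan` accessors: the uniqueness check is paid once when the spans are created, not on every element access.

```swift
// Same loop, but element access goes through Span / MutableSpan
// (Swift 6.2). The COW uniqueness check happens once, when the spans
// are created, instead of on every element write. Sketch only.
func matmulSpan(_ a: [Float], _ b: [Float], _ c: inout [Float],
                m: Int, n: Int, k: Int) {
    let aSpan = a.span          // read-only view of a's storage
    let bSpan = b.span          // read-only view of b's storage
    var cSpan = c.mutableSpan   // exclusive mutable view of c's storage
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                acc += aSpan[i * k + p] * bSpan[p * n + j]
            }
            cSpan[i * n + j] = acc
        }
    }
}
```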
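
The FMA point can be illustrated without dropping to C: Swift does not fuse floating-point multiply and add by default, but `FloatingPoint.addingProduct(_:_:)` requests a fused multiply-add explicitly. A hedged sketch of the two inner-loop forms:

```swift
// Inner-loop accumulation two ways. By default Swift keeps
// `acc += x * y` as separate fmul + fadd; the explicit
// addingProduct(_:_:) asks for a single fused multiply-add.
@inline(never)
func dotSeparate(_ x: [Float], _ y: [Float]) -> Float {
    var acc: Float = 0
    for i in 0..<min(x.count, y.count) {
        acc += x[i] * y[i]                    // fmul + fadd
    }
    return acc
}

@inline(never)
func dotFused(_ x: [Float], _ y: [Float]) -> Float {
    var acc: Float = 0
    for i in 0..<min(x.count, y.count) {
        acc = acc.addingProduct(x[i], y[i])   // fused multiply-add
    }
    return acc
}
```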
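
The ~0.19 trillion FLOPs figure is consistent with the usual back-of-the-envelope estimate of ~6 FLOPs per parameter per token for a full forward+backward pass, taking GPT-2 small at ~124M parameters and assuming llm.c's default 4 sequences of 64 tokens per iteration (an assumption about the configuration, not stated in the summary above):

```swift
// Back-of-the-envelope check of the ~0.19 TFLOP/iteration figure.
// Assumes ~6 FLOPs per parameter per token (forward + backward),
// GPT-2 small (~124M parameters), and an assumed llm.c default of
// 4 sequences x 64 tokens per iteration.
let params = 124_000_000.0
let tokensPerIteration = 4.0 * 64.0
let flopsPerIteration = 6.0 * params * tokensPerIteration
print(flopsPerIteration / 1e12)   // ~0.19 trillion FLOPs
```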

Hacker News Comment Review

  • No substantive HN discussion yet.
