Step-by-step optimization of handwritten matrix multiplication in Swift on Apple Silicon, from 2.8 Gflop/s into the Tflop/s range across 10 implementations, including a Metal GPU version.
Key Takeaways
Basic Swift matmul runs 15-20x slower than equivalent C compiled at -O3, largely because _ArrayBuffer.beginCOWMutation() uniqueness checks dominate the Instruments profile.
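For reference, the kind of code that triggers this is a plain triple-loop matmul over [Float] arrays. This is an illustrative sketch, not the article's exact code; the point is that every subscript write on the output array goes through Array's copy-on-write machinery, which is where beginCOWMutation() shows up in Instruments.

```swift
// Naive row-major matmul: C (m x n) = A (m x k) * B (k x n).
// Each `c[...] = acc` write performs a COW uniqueness check.
func matmulNaive(_ a: [Float], _ b: [Float], _ c: inout [Float],
                 _ m: Int, _ n: Int, _ k: Int) {
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                acc += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = acc
        }
    }
}
```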
Swift 6.2's MutableSpan eliminates the COW overhead at near-zero cost; adding a single mutableSpan call tripled training-iteration throughput.
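A sketch of the span-based version, assuming a Swift 6.2 toolchain with the Span/MutableSpan APIs from SE-0447/SE-0467 (names follow the stdlib proposals, not necessarily the article's code). Taking the spans once, outside the loops, performs the uniqueness check a single time instead of on every element access:

```swift
// Same matmul, but element access goes through spans:
// `span` is a read-only view, `mutableSpan` does one COW check up front.
func matmulSpan(_ a: [Float], _ b: [Float], _ c: inout [Float],
                _ m: Int, _ n: Int, _ k: Int) {
    let aSpan = a.span
    let bSpan = b.span
    var cSpan = c.mutableSpan   // exclusively borrows c for its lifetime
    for i in 0..<m {
        for j in 0..<n {
            var acc: Float = 0
            for p in 0..<k {
                acc += aSpan[i * k + p] * bSpan[p * n + j]
            }
            cSpan[i * n + j] = acc
        }
    }
}
```

Because MutableSpan exclusively borrows the array, the compiler can prove there is no aliasing inside the loops, which is also what enables further vectorization.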
C gains a further edge via -ffast-math, which lets the compiler emit fused multiply-add (FMA) instructions; Swift emits separate fmul/fadd operations, hurting inner-loop throughput.
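Swift will not fuse `acc += x * y` into an FMA on its own, since fusing changes rounding behavior, but the standard library's FloatingPoint.addingProduct(_:_:) requests the fused operation explicitly. A hedged sketch of how the inner loop can be written to get a single fma instruction (this is a general technique, not necessarily the article's fix):

```swift
// Dot product using explicit fused multiply-add.
// addingProduct computes fma(x[i], y[i], acc) with a single rounding,
// matching what Clang emits for `acc += x*y` under -ffast-math.
func dot(_ x: [Float], _ y: [Float]) -> Float {
    var acc: Float = 0
    for i in 0..<min(x.count, y.count) {
        acc = acc.addingProduct(x[i], y[i])
    }
    return acc
}
```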
Benchmark baseline is Karpathy’s llm.c GPT-2 model: ~0.19 trillion FLOPs per training iteration, targeting full forward+backward+weight-update cycles, not just inference.
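As a rough sanity check on that figure (the batch shape here is an assumption based on llm.c's defaults, not stated in the takeaway): the standard transformer training estimate is FLOPs ≈ 6 × N_params × N_tokens, and for the 124M-parameter GPT-2 at a micro-batch of 4 sequences × 64 tokens that gives 6 × 124×10^6 × 256 ≈ 1.9×10^11 ≈ 0.19 trillion FLOPs per forward+backward iteration, consistent with the number above.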
Series covers CPU, SIMD, AMX, and Metal GPU paths on Apple Silicon; production use should still prefer Apple’s optimized ML frameworks.