Unsloth and NVIDIA combined three low-level GPU optimizations to cut LLM fine-tuning step time by roughly 25% on hardware ranging from RTX laptops to DGX Spark.
Key Takeaways
Packed-sequence metadata caching eliminates per-layer reconstruction of cu_seqlens and SDPA masks, yielding a +43.3% forward-pass speedup on Qwen3-14B QLoRA SFT.
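The idea can be illustrated with a minimal sketch: cu_seqlens (the cumulative sequence offsets a packed-attention kernel consumes) depends only on the per-sequence lengths of the batch, so it can be built once and reused by every layer instead of being reconstructed per layer. This is not Unsloth's actual code; the cache and function names are hypothetical.

```python
import numpy as np

# Hypothetical module-level cache, keyed on the batch's sequence lengths.
_cu_seqlens_cache = {}

def get_cu_seqlens(seq_lens):
    """Return cumulative offsets [0, l0, l0+l1, ...] for a packed batch.

    The first call for a given set of lengths computes the array; every
    later call (e.g. one per transformer layer) hits the cache.
    """
    key = tuple(seq_lens)
    cached = _cu_seqlens_cache.get(key)
    if cached is None:
        cached = np.concatenate(([0], np.cumsum(seq_lens))).astype(np.int32)
        _cu_seqlens_cache[key] = cached
    return cached
```

A derived artifact like an SDPA block mask could be cached the same way, keyed on the same tuple of lengths.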
Double-buffered gradient-checkpoint reloads overlap CPU-to-GPU activation copies with backward-pass compute, saving 207-279 ms per step on 8B-14B models at ~55.7 GB/s pinned-memory bandwidth.
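The overlap pattern can be sketched in plain Python: while layer i's backward pass runs, layer i-1's checkpointed activations are already being fetched on a second "stream". In the real setting the fetch would be a pinned-memory H2D copy on a dedicated CUDA stream; here a worker thread stands in for it, and both callables are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def backward_with_double_buffering(load_activation, compute_backward, num_layers):
    """Walk layers in reverse, prefetching each layer's activations
    while the previous layer's backward compute is still running."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Eagerly start the copy for the last layer (backward runs last-to-first).
        pending = copier.submit(load_activation, num_layers - 1)
        for layer in range(num_layers - 1, -1, -1):
            act = pending.result()          # blocks only if the copy is unfinished
            if layer > 0:
                # Kick off the next layer's copy before computing this one.
                pending = copier.submit(load_activation, layer - 1)
            results.append(compute_backward(layer, act))
    return results
```

Serializing instead (copy, then compute, then copy again) would put the full copy latency on the critical path of every layer; overlapping hides it behind compute, which is where the per-step savings come from.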
The MoE routing fix replaces an O(num_experts) torch.where loop with a single argsort+bincount pass, reducing data-dependent dispatch overhead.
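A minimal sketch of that dispatch pattern (in numpy for portability; the function name is illustrative, not Unsloth's API): instead of scanning the assignment tensor once per expert with a where-style filter, one stable argsort groups tokens by expert and one bincount gives the split points.

```python
import numpy as np

def dispatch_tokens(expert_ids, num_experts):
    """Group token indices by assigned expert in a single pass.

    Replaces a per-expert loop of `where(expert_ids == e)` filters,
    which scans the whole assignment array num_experts times.
    """
    order = np.argsort(expert_ids, kind="stable")            # tokens sorted by expert
    counts = np.bincount(expert_ids, minlength=num_experts)  # tokens per expert
    offsets = np.cumsum(counts)                              # split points into `order`
    return np.split(order, offsets[:-1])                     # token ids, one array per expert
```

The stable sort preserves token order within each expert's group, and `minlength` keeps experts that received no tokens as empty groups rather than dropping them.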
Memory cost for double buffering stays modest, +0.23 to +0.47 GB depending on model size, with a clean fallback when VRAM runs short.
All three optimizations share one pattern: eliminate repeated bookkeeping and overlap copy streams with compute rather than serializing them.