Cutting inference cold starts by 40x with LP, FUSE, C/R, and CUDA-checkpoint

· ai systems cloud · Source ↗

TLDR

  • Modal reduced GPU inference replica spin-up from ~2000 seconds to ~50 seconds using four layered infrastructure techniques.

Key Takeaways

  • Four techniques stack: cloud LP-managed GPU buffers, lazy content-addressed FUSE filesystem, CPU-side checkpoint/restore, and CUDA context checkpoint/restore.
  • GPU buffer pool is managed by a linear program (Google GLOP) fed live cloud prices and observed supply, keeping idle machines off the hot path.
  • Lazy container image loading via a content-addressed multi-tier cache eliminates sequential layer pulls; files are served on-demand rather than pre-loaded.
  • CUDA checkpoint/restore skips full GPU-side initialization by snapshotting and restoring CUDA contexts directly into GPU memory on resume.
  • Real-world GPU hardware failure rates are high enough that active health checks on boot plus weekly deep diagnostics (dcgmi diag) are required to keep the buffer reliable.

Hacker News Comment Review

  • No substantive HN discussion yet.

Original | Discuss on HN