Advanced Quantization Algorithm for LLMs

TLDR

  • Intel’s AutoRound uses sign-gradient descent to quantize LLMs/VLMs to 2-4 bits with minimal accuracy loss, targeting CPU/XPU/CUDA and major inference stacks.
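The core idea can be illustrated with a toy NumPy sketch (this is a simplified illustration of signed-gradient rounding, not AutoRound's actual implementation; all names and shapes here are made up): learn a per-weight rounding offset in [-0.5, 0.5] via sign-gradient descent so the quantized layer reproduces the float layer's outputs on calibration data, rather than just rounding each weight to nearest.

```python
import numpy as np

# Toy sketch of signed-gradient rounding (illustrative only, not the
# library's API): tune per-weight rounding offsets V so the quantized
# layer matches the float layer's outputs on calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))           # float weights of one linear layer
X = rng.normal(size=(32, 64))          # hypothetical calibration activations

bits = 4
qmax = 2**bits - 1
scale = (W.max() - W.min()) / qmax     # one scale for the whole tensor, for simplicity
zero = round(-W.min() / scale)

def fake_quant(V):
    """Quantize W with rounding offsets V, then dequantize."""
    q = np.clip(np.round(W / scale + zero + V), 0, qmax)
    return (q - zero) * scale

def out_mse(V):
    d = fake_quant(V) @ X - W @ X
    return float(np.mean(d * d))

V = np.zeros_like(W)                   # start from round-to-nearest (RTN)
best_V, best = V.copy(), out_mse(V)    # track the best iterate, RTN included
lr = 1.0 / 200                         # small steps; V can drift at most +/-0.5
for _ in range(200):
    # straight-through gradient of the output MSE w.r.t. V
    # (round() treated as identity; constant factors folded into lr,
    # since only the sign of the gradient is used)
    resid = fake_quant(V) @ X - W @ X
    g = (resid @ X.T) * scale
    V = np.clip(V - lr * np.sign(g), -0.5, 0.5)   # signSGD step
    if (m := out_mse(V)) < best:
        best_V, best = V.copy(), m

rtn = out_mse(np.zeros_like(W))
print(f"RTN output MSE: {rtn:.4f}  tuned: {best:.4f}")
```

Because the initial round-to-nearest solution is kept as a candidate, the tuned result can only match or beat it; the gains come from coordinated up/down rounding decisions that compensate for each other in the layer's output.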

Key Takeaways

  • Quantizes 7B models in ~10 minutes on a single GPU; INT2-mixed DeepSeek-R1 (~200GB) retains 97.9% accuracy.
  • Exports to AutoRound, AutoGPTQ, AutoAWQ, and GGUF formats; natively integrated into vLLM, SGLang, Transformers, and LLM-Compressor.
  • AutoScheme API auto-generates mixed-precision recipes (e.g., avg 3-bit across layers) in minutes with ~1.1-1.5x BF16 RAM overhead.
  • Supports MXFP4, NVFP4, FP8_BLOCK, W8A8, and GGUF variants alongside standard INT2-INT8 weight-only schemes.
  • Three recipes: auto-round (default), auto-round-best (3x slower, highest accuracy), auto-round-light (2-3x faster, slight accuracy drop).
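To make the "avg 3-bit across layers" idea concrete, here is a small sketch of one way a mixed-precision recipe could be assigned (this is an invented greedy heuristic with made-up sensitivity numbers, not AutoRound's AutoScheme algorithm): give more quantization-sensitive layers higher precision while keeping the average bit-width on budget.

```python
# Illustrative only: greedily assign {2, 4}-bit widths per layer to hit
# an average 3-bit budget, favoring the most sensitive layers.
sens = [0.9, 0.1, 0.6, 0.2, 0.8, 0.3]    # hypothetical per-layer sensitivity
budget_bits = 3 * len(sens)              # total budget = avg 3 bits/layer

bits = {i: 2 for i in range(len(sens))}  # start every layer at 2-bit
spent = 2 * len(sens)
# upgrade the most sensitive layers to 4-bit while the budget allows
for i in sorted(range(len(sens)), key=lambda i: -sens[i]):
    if spent + 2 <= budget_bits:
        bits[i] = 4
        spent += 2

avg = sum(bits.values()) / len(bits)
print(bits, avg)
```

With these numbers, the three most sensitive layers (indices 0, 4, 2) end up at 4-bit and the rest at 2-bit, for an exact 3-bit average.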

Hacker News Comment Review

  • No substantive HN discussion yet.
