Advanced Quantization Algorithm for LLMs

TLDR

  • Intel’s AutoRound uses sign-gradient descent to quantize LLMs/VLMs to 2-4 bits with minimal accuracy loss, targeting CPU/XPU/CUDA and major inference stacks.
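The core idea can be illustrated with a toy NumPy sketch (this is a simplified illustration of signed-gradient rounding, not AutoRound's actual implementation; all names and shapes here are made up): learn a per-weight rounding offset in [-0.5, 0.5] via sign-gradient descent so the quantized layer reproduces the float layer's outputs on calibration data, rather than just rounding each weight to nearest.

```python
import numpy as np

# Toy sketch of signed-gradient rounding (illustrative only, not the
# library's API): tune per-weight rounding offsets V so the quantized
# layer matches the float layer's outputs on calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 32))           # float weights of one linear layer
X = rng.normal(size=(32, 64))          # hypothetical calibration activations

bits = 4
qmax = 2**bits - 1
scale = (W.max() - W.min()) / qmax     # one scale for the whole tensor, for simplicity
zero = round(-W.min() / scale)

def fake_quant(V):
    """Quantize W with rounding offsets V, then dequantize."""
    q = np.clip(np.round(W / scale + zero + V), 0, qmax)
    return (q - zero) * scale

def out_mse(V):
    d = fake_quant(V) @ X - W @ X
    return float(np.mean(d * d))

V = np.zeros_like(W)                   # start from round-to-nearest (RTN)
best_V, best = V.copy(), out_mse(V)    # track the best iterate, RTN included
lr = 1.0 / 200                         # small steps; V can drift at most +/-0.5
for _ in range(200):
    # straight-through gradient of the output MSE w.r.t. V
    # (round() treated as identity; constant factors folded into lr,
    # since only the sign of the gradient is used)
    resid = fake_quant(V) @ X - W @ X
    g = (resid @ X.T) * scale
    V = np.clip(V - lr * np.sign(g), -0.5, 0.5)   # signSGD step
    if (m := out_mse(V)) < best:
        best_V, best = V.copy(), m

rtn = out_mse(np.zeros_like(W))
print(f"RTN output MSE: {rtn:.4f}  tuned: {best:.4f}")
```

Because the initial round-to-nearest solution is kept as a candidate, the tuned result can only match or beat it; the gains come from coordinated up/down rounding decisions that compensate for each other in the layer's output.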

Key Takeaways

  • Quantizes 7B models in ~10 minutes on a single GPU; INT2-mixed DeepSeek-R1 (~200GB) retains 97.9% accuracy.
  • Exports to AutoRound, AutoGPTQ, AutoAWQ, and GGUF formats; natively integrated into vLLM, SGLang, Transformers, and LLM-Compressor.
  • AutoScheme API auto-generates mixed-precision recipes (e.g., avg 3-bit across layers) in minutes with ~1.1-1.5x BF16 RAM overhead.
  • Supports MXFP4, NVFP4, FP8_BLOCK, W8A8, and GGUF variants alongside standard INT2-INT8 weight-only schemes.
  • Three recipes: auto-round (default), auto-round-best (3x slower, highest accuracy), auto-round-light (2-3x faster, slight accuracy drop).
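To make the "avg 3-bit across layers" idea concrete, here is a small sketch of one way a mixed-precision recipe could be assigned (this is an invented greedy heuristic with made-up sensitivity numbers, not AutoRound's AutoScheme algorithm): give more quantization-sensitive layers higher precision while keeping the average bit-width on budget.

```python
# Illustrative only: greedily assign {2, 4}-bit widths per layer to hit
# an average 3-bit budget, favoring the most sensitive layers.
sens = [0.9, 0.1, 0.6, 0.2, 0.8, 0.3]    # hypothetical per-layer sensitivity
budget_bits = 3 * len(sens)              # total budget = avg 3 bits/layer

bits = {i: 2 for i in range(len(sens))}  # start every layer at 2-bit
spent = 2 * len(sens)
# upgrade the most sensitive layers to 4-bit while the budget allows
for i in sorted(range(len(sens)), key=lambda i: -sens[i]):
    if spent + 2 <= budget_bits:
        bits[i] = 4
        spent += 2

avg = sum(bits.values()) / len(bits)
print(bits, avg)
```

With these numbers, the three most sensitive layers (indices 0, 4, 2) end up at 4-bit and the rest at 2-bit, for an exact 3-bit average.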

Hacker News Comment Review

  • No substantive HN discussion yet.
