Intel’s AutoRound uses sign-gradient descent to quantize LLMs and VLMs down to 2-4 bits with minimal accuracy loss, targeting CPU, XPU, and CUDA backends and the major inference stacks.
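To make the core idea concrete, here is a minimal, self-contained sketch of sign-gradient (signSGD) rounding, not the library's actual implementation: each weight gets a learnable rounding offset in [-0.5, 0.5], updated by the sign of a straight-through gradient so that the quantized layer reproduces its original output on a few calibration inputs. The weights, calibration data, and loop settings below are invented for illustration.

```python
def dequant(w, s, v, bits=4):
    # Symmetric signed grid: q in [-2^(b-1), 2^(b-1) - 1]
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / s + v)))
    return q * s

weights = [0.0137, -0.0271, 0.0412, 0.0049, -0.0188]   # toy layer weights
calib = [[0.9, -0.3, 0.4, 1.2, -0.7],                  # toy calibration inputs
         [0.1, 0.8, -0.5, 0.3, 0.6],
         [-0.4, 0.2, 0.7, -0.9, 0.5]]
s = max(abs(x) for x in weights) / (2 ** 3 - 1)        # 4-bit symmetric scale

def layer_mse(v):
    # Reconstruction error of the quantized dot product vs. the original.
    err = 0.0
    for x in calib:
        y = sum(w * xi for w, xi in zip(weights, x))
        yq = sum(dequant(w, s, vi) * xi for w, vi, xi in zip(weights, v, x))
        err += (yq - y) ** 2
    return err / len(calib)

v = [0.0] * len(weights)                  # v = 0 is plain round-to-nearest
best_v, best_mse = list(v), layer_mse(v)  # keep the best iterate seen
lr = 0.02
for step in range(100):
    for i in range(len(v)):
        # Straight-through gradient of the output MSE w.r.t. v[i]
        g = 0.0
        for x in calib:
            y = sum(w * xi for w, xi in zip(weights, x))
            yq = sum(dequant(w, s, vj) * xj
                     for w, vj, xj in zip(weights, v, x))
            g += 2.0 * (yq - y) * x[i] * s
        step_dir = (g > 0) - (g < 0)      # signSGD: only the sign is used
        v[i] = max(-0.5, min(0.5, v[i] - lr * step_dir))
    mse = layer_mse(v)
    if mse < best_mse:
        best_mse, best_v = mse, list(v)
```

Because the round-to-nearest solution (all offsets zero) is the starting candidate and the best iterate is kept, the learned rounding can only match or beat naive rounding on the calibration set; the real system applies the same principle at scale, per layer, with tuned scales as well.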
Key Takeaways
- Quantizes 7B models in ~10 minutes on a single GPU; an INT2-mixed DeepSeek-R1 (~200GB) retains 97.9% of the original model's accuracy.
- Exports to AutoRound, AutoGPTQ, AutoAWQ, and GGUF formats; natively integrated into vLLM, SGLang, Transformers, and LLM-Compressor.
- The AutoScheme API auto-generates mixed-precision recipes (e.g., an average of 3 bits across layers) in minutes, at roughly 1.1-1.5x the model's BF16 memory footprint.
- Supports MXFP4, NVFP4, FP8_BLOCK, W8A8, and GGUF variants alongside standard INT2-INT8 weight-only schemes.
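To illustrate what a mixed-precision "recipe" like the AutoScheme output means in practice, here is a hypothetical sketch (this is not AutoScheme's actual algorithm): pick each layer's bit-width from a menu so the average lands at a target, spending the extra precision on the most quantization-sensitive layers first. The layer names and sensitivity scores are invented.

```python
def build_recipe(sensitivity, target_avg, options=(2, 4, 8)):
    """Greedy bit allocation: start every layer at the cheapest width,
    then give layers one upgrade step each, most sensitive first,
    while the bit budget (target_avg * n_layers) allows."""
    n = len(sensitivity)
    bits = {name: min(options) for name in sensitivity}
    budget = target_avg * n - sum(bits.values())
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        for opt in sorted(options):
            step = opt - bits[name]
            if 0 < step <= budget:
                budget -= step
                bits[name] = opt
                break
    return bits

layer_sensitivity = {          # hypothetical per-layer error scores
    "attn.q": 0.9, "attn.k": 0.4, "attn.v": 0.8,
    "mlp.up": 0.3, "mlp.down": 0.7, "lm_head": 1.0,
}
recipe = build_recipe(layer_sensitivity, target_avg=3)
avg_bits = sum(recipe.values()) / len(recipe)
```

With these toy scores, the three most sensitive layers get 4 bits and the rest stay at 2, hitting the 3-bit average exactly; a real recipe search also weighs measured accuracy impact per candidate scheme, which is what the actual API automates.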