The paper introduces Orthrus, a dual-view diffusion framework that adds parallel decoding to frozen Qwen3 autoregressive (AR) models while provably preserving the output distribution.
Key Takeaways
Speedups of 4.25x (1.7B), 5.20x (4B), and 5.36x (8B) on Qwen3 backbones; the peak claim is 7.8 tokens generated per forward pass.
Output is strictly lossless: an AR head verifies the tokens generated by the diffusion head and accepts the longest matching prefix, preserving the exact predictive distribution (see the verification sketch after this list).
Only 16% of parameters are fine-tuned; the base LLM stays frozen, and both views share a single KV cache with O(1) memory overhead (a parameter-freezing sketch also follows the list).
Outperforms speculative decoding methods EAGLE-3 and DFlash on token acceptance rate, especially at longer context lengths.
Unlike diffusion LLMs (e.g., Fast-dLLM-v2), Orthrus shows no accuracy drop on MATH-500 at a ~6x speedup over the Qwen3-8B baseline.
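
A minimal sketch of the acceptance rule under greedy decoding, assuming the diffusion head drafts K tokens and the frozen AR head scores all K positions in one parallel forward pass; tensor names here are illustrative, not from the paper's code:

```python
import torch

def accept_longest_prefix(draft_tokens: torch.Tensor,
                          ar_logits: torch.Tensor) -> torch.Tensor:
    """Keep the longest prefix of the drafted tokens that the frozen AR head
    would itself have produced under greedy decoding.

    draft_tokens: (K,)   tokens proposed by the diffusion head
    ar_logits:    (K, V) AR-head logits at each drafted position, from a
                  single forward pass over the whole draft
    """
    ar_choice = ar_logits.argmax(dim=-1)       # what the AR model would emit
    match = draft_tokens == ar_choice          # position-wise agreement
    mismatches = (~match).nonzero(as_tuple=True)[0]
    # First disagreement bounds the accepted prefix; all-match keeps everything.
    n_accept = int(mismatches[0]) if mismatches.numel() > 0 else draft_tokens.numel()
    accepted = draft_tokens[:n_accept]
    # The AR token at the first mismatch is still valid output, so append it;
    # if the whole draft matched, this slice is empty and nothing is added.
    correction = ar_choice[n_accept:n_accept + 1]
    return torch.cat([accepted, correction])
```

Because every emitted token is one the AR head would itself have chosen, the output is token-for-token identical to plain AR decoding; the speedup comes from verifying a whole K-token draft in one forward pass instead of K sequential ones.

A sketch of the parameter split behind the 16% figure, assuming the trainable parameters live in injected modules whose names contain a marker such as `diffusion_attn` (a hypothetical naming convention, not from the repo):

```python
import torch.nn as nn

def freeze_base(model: nn.Module, trainable_key: str = "diffusion_attn") -> float:
    """Freeze everything except injected modules matched by `trainable_key`
    (hypothetical name) and return the trainable parameter fraction."""
    for name, param in model.named_parameters():
        param.requires_grad = trainable_key in name
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total  # the paper reports roughly 0.16
```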
Hacker News Comment Review
Discussion is minimal; a co-author confirmed the core mechanism: trainable diffusion attention is injected per layer, and K=32 tokens are projected in parallel and then AR-verified (a rough layer sketch follows).
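
A speculative reconstruction of that per-layer injection; class and attribute names are invented for illustration, and the frozen layer's call signature is simplified:

```python
import torch
import torch.nn as nn

K = 32  # number of tokens the diffusion view drafts per step (per the co-author)

class OrthrusLayer(nn.Module):
    """Hypothetical sketch: wraps a frozen AR decoder layer and adds a
    trainable diffusion-attention branch that cross-attends to the layer's
    hidden states to project K draft positions in parallel."""

    def __init__(self, frozen_layer: nn.Module, d_model: int, n_heads: int):
        super().__init__()
        self.frozen_layer = frozen_layer  # base Qwen3 layer, kept frozen
        for p in self.frozen_layer.parameters():
            p.requires_grad = False
        # Trainable part: K learned query slots plus a diffusion attention
        # that reads the same hidden states backing the shared KV cache.
        self.draft_queries = nn.Parameter(torch.randn(K, d_model) * 0.02)
        self.diffusion_attn = nn.MultiheadAttention(d_model, n_heads,
                                                    batch_first=True)

    def forward(self, hidden: torch.Tensor):
        # Simplified: real Qwen3 layers also take masks, position ids, etc.
        hidden = self.frozen_layer(hidden)            # ordinary AR path
        q = self.draft_queries.unsqueeze(0).expand(hidden.size(0), -1, -1)
        draft, _ = self.diffusion_attn(q, hidden, hidden)  # parallel K-token view
        return hidden, draft
```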
Community interest focused on broader model support, specifically Qwen3 32B and quantized variants, which are not yet addressed in the repo.
Notable Comments
@ilaksh: Asked about Qwen3 32B and quantized-model compatibility, pointing to real deployment gaps not covered by the current model zoo.