We got 207 tok/s with Qwen3.5-27B on an RTX 3090

https://github.com/Luce-Org/lucebox-hub

Article

  • Claims 207 tok/s on RTX 3090 via C++/ggml speculative decoder with block-diffusion draft.
  • Uses DFlash draft model; 5.46x speedup over autoregressive baseline at peak.
  • Average 129.5 tok/s on 10-prompt benchmark at budget=22.
  • Pitches local AI as default: private data, no per-token cost, no vendor lock-in.
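The speedup claim rests on speculative decoding: a cheap draft model proposes a block of tokens, the target model verifies them in one pass, and the longest agreeing prefix is accepted. A minimal sketch of the greedy-verification variant (the one the discussion says this repo uses), with toy stand-ins for the real draft and target networks:

```python
# Hypothetical sketch of greedy speculative decoding. draft_propose and
# target_greedy are toy stand-ins, not the repo's DFlash/Qwen models.

def draft_propose(prefix, k):
    # toy draft model: guesses the next k tokens as last+1, last+2, ...
    return [prefix[-1] + i + 1 for i in range(k)]

def target_greedy(prefix):
    # toy target model: its greedy next token is last+1
    return prefix[-1] + 1

def speculative_step(prefix, k):
    """Propose k draft tokens, verify against the target's greedy
    choices, and return the accepted tokens for this step."""
    draft = draft_propose(prefix, k)
    accepted = []
    ctx = list(prefix)
    for tok in draft:
        t = target_greedy(ctx)
        if t == tok:           # draft matches target: accept and continue
            accepted.append(tok)
            ctx.append(tok)
        else:                  # first mismatch: keep the target's token, stop
            accepted.append(t)
            break
    else:
        # all k drafts accepted; the verify pass yields one bonus token
        accepted.append(target_greedy(ctx))
    return accepted

print(speculative_step([0], 4))  # → [1, 2, 3, 4, 5]
```

Since verification is one target-model pass per block regardless of how many drafts are accepted, the speedup scales with the average accepted-prefix length, which is why the draft budget (e.g. budget=22 above) matters.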

Discussion

  • Skepticism dominant: speculative decoding ≠ same quality as standard sampling.
  • Top comment (Aurornis): vibecoded Claude-generated repo, one of hundreds spawned by paper releases.
  • Critics note only greedy decoding is supported; the standard sampling parameters exist for good reason.
  • Vulkan support requested; the current implementation requires CUDA, limiting hardware reach.

Discuss on HN


Type Link
Added Apr 20, 2026
Modified Apr 20, 2026