antirez’s ds4.c is a Metal-only, single-model inference engine for DeepSeek V4 Flash, targeting 128GB+ MacBooks with 2-bit quants and disk KV cache.
Key Takeaways
Runs DeepSeek V4 Flash (284B MoE) in q2 on a 128GB MacBook Pro M3 Max at ~27 t/s generation and ~59 t/s prefill via Metal.
Asymmetric 2-bit quant: only routed MoE experts are quantized (up/gate at IQ2_XXS, down at Q2_K); shared experts and projections stay at full precision.
The KV cache is treated as a first-class disk resident, exploiting the fast internal NVMe storage of modern MacBooks to enable long-context inference without holding the full cache in RAM.
ds4-server exposes OpenAI- and Anthropic-compatible HTTP endpoints with SSE streaming, tool calls, and prefix reuse across stateless clients.
Deliberately not a generic GGUF runner: only works with custom GGUFs from antirez/deepseek-v4-gguf; validated against official logits at multiple context sizes.
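The disk-resident KV cache idea above can be sketched with memory mapping: back the cache with a file and let the OS page cache decide what stays in RAM. This is only an illustration of the technique, not ds4.c's actual layout; the dimensions and fp16 packing below are hypothetical.

```python
import mmap
import os
import tempfile

# Hypothetical dimensions -- the real ds4.c cache layout is not documented here.
N_LAYERS, HEADS, HEAD_DIM, MAX_CTX = 4, 8, 64, 1024
BYTES_PER_TOKEN = N_LAYERS * 2 * HEADS * HEAD_DIM * 2  # K+V planes, fp16

path = os.path.join(tempfile.mkdtemp(), "kv.cache")
with open(path, "wb") as f:
    # Sparse file: pages are materialized only when written.
    f.truncate(MAX_CTX * BYTES_PER_TOKEN)

f = open(path, "r+b")
kv = mmap.mmap(f.fileno(), 0)  # OS page cache handles RAM/disk tiering

def write_token_kv(pos: int, payload: bytes) -> None:
    """Store the K/V bytes for one token position."""
    off = pos * BYTES_PER_TOKEN
    kv[off:off + len(payload)] = payload

def read_token_kv(pos: int) -> bytes:
    """Fetch the K/V bytes for one token position (may fault in from disk)."""
    off = pos * BYTES_PER_TOKEN
    return kv[off:off + BYTES_PER_TOKEN]

write_token_kv(3, b"\x01\x02" * (BYTES_PER_TOKEN // 2))
print(read_token_kv(3)[:4])  # → b'\x01\x02\x01\x02'
```

On fast NVMe, the page-fault cost of cold reads is low enough that contexts far larger than RAM remain usable, which is the bet the disk-first design makes.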
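Because ds4-server advertises OpenAI compatibility, a client consumes the standard `chat.completion.chunk` SSE format. A minimal parser for that stream shape (the sample payload below is fabricated for illustration):

```python
import json

def parse_sse_stream(raw: str) -> str:
    """Concatenate text deltas from an OpenAI-style streaming response."""
    parts = []
    for line in raw.splitlines():
        if not line.startswith("data: "):
            continue  # SSE comments/blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel terminating the stream
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)

sample = (
    'data: {"choices":[{"delta":{"role":"assistant"}}]}\n'
    'data: {"choices":[{"delta":{"content":"Hello"}}]}\n'
    'data: {"choices":[{"delta":{"content":" world"}}]}\n'
    'data: [DONE]\n'
)
print(parse_sse_stream(sample))  # → Hello world
```

Supporting this wire format is what lets existing OpenAI SDK clients point at a local ds4-server without code changes.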
Hacker News Comment Review
Commenters are excited about the focused single-model approach but skeptical that open-source inference economics can close the gap with frontier models long-term.
One commenter flagged that running competing models like Kimi 2.6 at useful throughput costs ~$20k/month in hardware, framing local inference as dependent on frontier-lab generosity for weight releases.
Notable Comments
@dakolli: argues that unit economics make local inference competitive only if hardware costs stay under ~$1k/month; otherwise the frontier APIs win.
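The comment's threshold can be made concrete with a back-of-envelope cost-per-token calculation. The $1k/month figure and the ~27 t/s throughput come from the thread above; the full-utilization assumption and the API price are illustrative assumptions, not data from the discussion.

```python
# Back-of-envelope break-even: assumes the box generates tokens 24/7,
# which is the best case for local inference.
hw_cost_per_month = 1_000.0    # $/month, @dakolli's competitiveness threshold
local_throughput_tps = 27.0    # tokens/s, the quoted ds4.c generation speed
seconds_per_month = 30 * 24 * 3600

local_tokens = local_throughput_tps * seconds_per_month  # 69,984,000 tokens
local_cost_per_mtok = hw_cost_per_month / (local_tokens / 1e6)

api_cost_per_mtok = 10.0       # assumed frontier API output price, $/Mtok

print(f"local ${local_cost_per_mtok:.2f}/Mtok vs API ${api_cost_per_mtok:.2f}/Mtok")
# → local $14.29/Mtok vs API $10.00/Mtok
```

Even under perfect utilization the local cost only approaches API pricing near the $1k/month mark, which is the shape of the argument: at $20k/month-class hardware the gap is an order of magnitude, so local inference wins on privacy and control, not raw economics.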