DeepSeek 4 Flash local inference engine for Metal

· ai · Source ↗

TLDR

  • antirez’s ds4.c is a single-model Metal inference engine for DeepSeek V4 Flash, targeting 128GB+ MacBooks and Mac Studios with 2-bit quantization and disk KV cache.

Key Takeaways

  • Deliberately narrow scope: one model (DeepSeek V4 Flash), Metal-only, not a generic GGUF runner; CUDA support is possible but unplanned.
  • 2-bit quants use asymmetric quantization – only routed MoE experts quantized (IQ2_XXS up/gate, Q2_K down) – fitting 81GB on 128GB machines.
  • KV cache is treated as a disk-first citizen; SSD persistence enables long-context inference without saturating RAM; 100k-token context uses ~26GB extra.
  • ds4-server exposes both OpenAI-compatible and Anthropic-compatible endpoints with SSE streaming, prefix reuse, and DSML tool-call translation.
  • Benchmarks on M3 Max 128GB: 58 t/s prefill / 26 t/s generation (short prompt); M3 Ultra 512GB reaches 468 t/s prefill.

Hacker News Comment Review

  • The narrow single-model focus resonated strongly; several builders noted it mirrors a broader trend of using AI-assisted kernel writing to create tight, hardware-specific inference stacks rather than extending large frameworks.
  • Prefill latency for large coding-agent prompts (25k+ tokens) was flagged as a practical pain point on M3 Max, though M3 Ultra’s ~468 t/s prefill and the KV prefix-reuse design in ds4-server reduce the real-world impact for repeat sessions.
  • Unit economics of self-hosting a 284B-parameter model drew skepticism; one commenter noted data-center inference is technically more energy-efficient per user than local hosting, citing antirez’s own 50W peak power figure during generation.

Notable Comments

  • @dejli: “clone and make it, and it just works, no python shenanigans” – highlights the C-native build as a rare friction-free setup in the LLM tooling ecosystem.
  • @lhl: argues SOTA AI-assisted kernel optimization makes single-hardware inference engines newly viable, citing his own RDNA3 W7900 project as evidence.

Original | Discuss on HN