DeepSeek 4 Flash local inference engine for Metal


TLDR

  • antirez’s ds4.c is a Metal-only, single-model inference engine for DeepSeek V4 Flash, targeting 128GB+ MacBooks with 2-bit quants and disk KV cache.

Key Takeaways

  • Runs DeepSeek V4 Flash (284B MoE) in q2 on a 128GB MacBook Pro M3 Max at ~27 t/s generation and ~59 t/s prefill via Metal.
  • Asymmetric 2-bit quant: only routed MoE experts are quantized (up/gate at IQ2_XXS, down at Q2_K); shared experts and projections stay at full precision.
  • Treats the KV cache as a first-class on-disk structure, exploiting the fast NVMe in modern MacBooks to enable long-context inference without holding the whole cache in RAM.
  • ds4-server exposes OpenAI- and Anthropic-compatible HTTP endpoints with SSE streaming, tool calls, and prefix reuse across stateless clients.
  • Deliberately not a generic GGUF runner: only works with custom GGUFs from antirez/deepseek-v4-gguf; validated against official logits at multiple context sizes.

Hacker News Comment Review

  • Commenters are excited about the focused single-model approach but skeptical that open-source inference economics can close the gap with frontier models long-term.
  • One commenter flagged that running competing models like Kimi 2.6 at useful throughput costs ~$20k/month in hardware, framing local inference as dependent on frontier-lab generosity for weight releases.

Notable Comments

  • @dakolli: argues unit economics make local inference competitive only if hardware costs stay under ~$1k/month, otherwise frontier API wins.

Original | Discuss on HN