DeepSeek 4 Flash local inference engine for Metal

May 7, 2026 · ai

TLDR

antirez’s ds4.c is a single-model Metal inference engine for DeepSeek V4 Flash. It targets MacBooks and Mac Studios with 128GB or more of RAM, using 2-bit quantization and a disk-backed KV cache.

Deliberately narrow scope: one model (DeepSeek V4 Flash), Metal-only, not a generic GGUF runner; CUDA support is possible but unplanned.
The 2-bit quants are asymmetric: only the routed MoE experts are quantized (IQ2_XXS for up/gate, Q2_K for down), which brings the weights to 81GB so they fit on 128GB machines.
KV cache is treated as a first-class disk citizen: SSD persistence enables long-context inference without saturating RAM, and a 100k-token context uses ~26GB extra.
ds4-server exposes both OpenAI-compatible and Anthropic-compatible endpoints with SSE streaming, prefix reuse, and DSML tool-call translation.
Benchmarks on M3 Max 128GB: 58 t/s prefill / 26 t/s generation (short prompt); M3 Ultra 512GB reaches 468 t/s prefill.
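
The size figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-envelope illustration, not anything taken from ds4.c: it assumes GB means 10^9 bytes, that the 81GB covers all 284B parameters mentioned in the discussion, and that the ~26GB of KV cache for 100k tokens would sit in RAM alongside the weights. The function names are hypothetical.

```python
# Back-of-envelope arithmetic on the quoted sizes.
# Assumptions (not from the source): GB = 10**9 bytes, the 81GB of
# weights covers all 284B parameters, and the ~26GB KV cache is in RAM.

GB = 10**9


def avg_bits_per_weight(weight_bytes: float, n_params: float) -> float:
    """Average storage cost per parameter, in bits."""
    return weight_bytes * 8 / n_params


def kv_bytes_per_token(kv_bytes: float, n_tokens: int) -> float:
    """Average KV-cache footprint per token of context."""
    return kv_bytes / n_tokens


weights = 81 * GB          # quoted size of the 2-bit quantized weights
params = 284e9             # parameter count quoted in the discussion
kv_100k = 26 * GB          # quoted extra memory for a 100k-token context

bits = avg_bits_per_weight(weights, params)
per_token = kv_bytes_per_token(kv_100k, 100_000)
headroom = 128 * GB - weights - kv_100k

print(f"average bits per weight: {bits:.2f}")
print(f"KV cache per token: {per_token / 1000:.0f} kB")
print(f"left over on a 128GB machine: {headroom / GB:.0f} GB")
```

The average comes out a little above 2 bits per weight, which is consistent with only the routed experts being quantized to 2 bits while the remaining tensors stay at higher precision. The narrow headroom under these assumptions also suggests why keeping the KV cache on SSD matters for long contexts.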

The narrow single-model focus resonated strongly; several builders noted it mirrors a broader trend of using AI-assisted kernel writing to create tight, hardware-specific inference stacks rather than extending large frameworks.
Prefill latency for large coding-agent prompts (25k+ tokens) was flagged as a practical pain point on M3 Max, though M3 Ultra’s ~468 t/s prefill and the KV prefix-reuse design in ds4-server reduce the real-world impact for repeat sessions.
Unit economics of self-hosting a 284B-parameter model drew skepticism; one commenter noted data-center inference is technically more energy-efficient per user than local hosting, citing antirez’s own 50W peak power figure during generation.
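
To put the prefill concern in numbers, here is a rough estimate using the throughput figures quoted above. It is a sketch under strong assumptions: prefill speed is treated as constant regardless of prompt length (the M3 Max figure was measured on a short prompt, so real long-prompt speed may differ), and the 24,000-token cached prefix is a made-up value chosen to illustrate prefix reuse. `prefill_seconds` is a hypothetical helper, not part of ds4-server.

```python
def prefill_seconds(prompt_tokens: int, tokens_per_second: float,
                    cached_prefix_tokens: int = 0) -> float:
    """Estimate prefill time, skipping tokens already covered by a cached prefix."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached / tokens_per_second


PROMPT_TOKENS = 25_000     # large coding-agent prompt, as in the discussion
M3_MAX_TPS = 58.0          # quoted prefill speed, M3 Max 128GB (short prompt)
M3_ULTRA_TPS = 468.0       # quoted prefill speed, M3 Ultra 512GB

cold_max = prefill_seconds(PROMPT_TOKENS, M3_MAX_TPS)
cold_ultra = prefill_seconds(PROMPT_TOKENS, M3_ULTRA_TPS)
# Hypothetical repeat session: 24,000 of the 25,000 tokens hit the cached prefix.
warm_max = prefill_seconds(PROMPT_TOKENS, M3_MAX_TPS, cached_prefix_tokens=24_000)

print(f"M3 Max, cold:   {cold_max:.0f} s")
print(f"M3 Ultra, cold: {cold_ultra:.0f} s")
print(f"M3 Max, warm:   {warm_max:.1f} s")
```

Under these assumptions a cold 25k-token prompt takes about seven minutes on the M3 Max and under a minute on the M3 Ultra, while a repeat session that reuses most of the prefix drops to seconds even on the slower machine.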

@dejli: “clone and make it, and it just works, no python shenanigans” – highlights the C-native build as a rare friction-free setup in the LLM tooling ecosystem.
@lhl: argues SOTA AI-assisted kernel optimization makes single-hardware inference engines newly viable, citing his own RDNA3 W7900 project as evidence.