Qwen3.6-27B runs locally on 32GB RAM at ~25 tok/s and handles 95% of real coding tasks for many developers.
Key Takeaways
Q4_K_M quantization fits in ~16.8GB; a 32GB machine is comfortable, and 24GB still leaves room for a ~91K-token context window.
Simon Willison benchmarked it on an M5 Pro: 54 tok/s prefill and 25 tok/s generation, faster than expected for a model this size.
Apple Silicon users report qwen3.6:35b-a3b-nvfp4 via ollama runs well on 32GB M-series machines.
Closes meaningful ground on Claude Opus 4.7 for routine coding; Opus still leads on complex multi-step reasoning.
Open-source model pricing is a fraction of Anthropic/OpenAI API costs, intensifying the competitive squeeze on frontier labs.
Hacker News Comment Review
Consensus: Easter 2026 (Gemma 4 + Qwen3.6) marks a step change in local-model viability; the gap to hosted frontier models is narrowing fast but is not closed.
Practical ceiling: commenters who run these daily keep Opus as a fallback for tasks requiring sustained reliability, since local models still “wander off.”
Lack of hardware transparency is a recurring complaint: the community wants model announcements to lead with consumer GPU/RAM requirements and tok/s numbers, not just benchmark scores.
Commenters are doing the KV-cache math: roughly 70MB of KV cache per 1K tokens of context, which is what yields the ~91K window on 24GB at Q4_K_M; stepping up to Q5 trades context headroom for quality.
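The budgeting commenters are doing can be sketched as a quick calculation. The weight size (~16.8GB) and KV-cache cost (~70MB per 1K tokens) come from the thread; the 1GB reserve for runtime overhead is an assumption added here to make the numbers line up:

```python
# Back-of-envelope context budgeting for Qwen3.6 at Q4_K_M.
# Weights ~16.8GB and ~70MB of KV cache per 1K tokens are figures from the
# thread; the 1.0GB overhead reserve (activations, runtime buffers) is assumed.

def max_context_tokens(vram_gb: float,
                       weights_gb: float = 16.8,
                       kv_mb_per_1k: float = 70.0,
                       overhead_gb: float = 1.0) -> int:
    """Tokens of context that fit after weights and overhead are resident."""
    free_mb = (vram_gb - weights_gb - overhead_gb) * 1024
    if free_mb <= 0:
        return 0
    return int(free_mb / kv_mb_per_1k * 1000)

print(max_context_tokens(24))  # ~91K tokens, matching the thread's estimate
print(max_context_tokens(32))  # a 32GB machine leaves far more headroom
```

With these assumptions, 24GB lands at roughly 91K tokens of context, consistent with the figure quoted in the comments; the overhead reserve is the knob that varies by runtime.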
Notable Comments
@simonw: Ran it on an M5 Pro with 128GB; benchmarked 25.57 tok/s generation on a 4,444-token output: “I like it better than the pelican I got from Opus 4.7.”
@syntaxing: Running Qwen3.6 35B and Gemma 4 26B fully local on an M4 MBP: “it does 95% of what I need which is already crazy.”
@zkmon: Q4_K_M on llama.cpp yields ~91K of context on 24GB of VRAM; budget ~70MB of KV cache per 1K tokens of context.
@jameson: Raises the moat question directly: open-source models at a fraction of Opus 4.6 pricing erode OpenAI’s and Anthropic’s margin buffer.
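As a sanity check on the throughput figures quoted above, the implied wall-clock time of a run like @simonw's can be estimated. The 54 tok/s prefill and 25.57 tok/s generation rates and the 4,444-token output are from the thread; the 2,000-token prompt size is an assumed example:

```python
# Estimate end-to-end latency from prefill and generation throughput.
# The rates (54 tok/s prefill, 25.57 tok/s generation) and the 4,444-token
# output come from the thread; the 2,000-token prompt is an assumed example.

def wall_clock_s(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float = 54.0, gen_tps: float = 25.57) -> float:
    """Seconds to process the prompt plus generate the output."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps

t = wall_clock_s(2000, 4444)
print(f"{t:.0f}s (~{t / 60:.1f} min)")
```

Under these assumptions a long generation lands in the minutes range on this hardware, which puts the "95% of tasks" framing in context.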