Qwen 3.5-9B Q4_K_S via LM Studio is the practical sweet spot for 24GB M4 Macs: ~40 tok/s, thinking enabled, 128K context, functional tool use.
Key Takeaways
Qwen 3.6 Q3, GPT-OSS 20B, and Devstral Small 24B technically fit in 24GB but were unusable in practice; Gemma 4B ran but failed at tool use.
Recommended inference settings for thinking/coding: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0; enable thinking via Prompt Template injection in LM Studio.
Both pi and OpenCode work as agent harnesses; pi needs manual config (models.json, hideThinkingBlock), OpenCode via opencode.json with @ai-sdk/openai-compatible.
Works best as an interactive, step-by-step assistant (lint fixes, git conflict resolution) rather than autonomous app-building; babysitting required.
Benefits: fully offline, no subscription, electricity-only cost assuming existing hardware.
Hacker News Comment Review
Commenters broadly agree 9B is too small for serious coding; Gemma 4 31B dense is cited as the new local baseline, requiring 36-70GB RAM on M5 Max for full 256K context.
Strong pushback on capability framing: local models at this size are not close to Opus 4.7 or frontier models, and a vocal HN minority is seen as overstating local LLM quality.
The 32-40GB RAM threshold comes up repeatedly as the practical minimum for genuinely useful local coding; 24GB is one tier short, and a 4090 with rotorquant/turboquant activation optimizations is a viable alternative path.
Notable Comments
@PAndreew: 30-35B+ models handle Excel, legal translation, email drafting, and PPT work well with the added benefit of keeping company data private.
@rapatel0: Qwen3.6 27B runs on a 4090 24GB with ~128K context using q4_xl+rotorquant; links reference implementation at github.com/rapatel0/rq-models.