Running local models on an M4 with 24GB memory


TLDR

  • Qwen 3.5-9B at Q4_K_S quant runs at ~40 tokens/sec with thinking, tool use, and 128K context on a 24GB M4 MacBook Pro via LM Studio.

Key Takeaways

  • Larger models such as Qwen 3.6 at Q3, GPT-OSS 20B, and Devstral Small 24B technically fit in 24GB but proved unusable in practice; Qwen 3.5-9B at Q4 is the sweet spot.
  • LM Studio requires manually injecting {%- set enable_thinking = true %} into the model's prompt template to enable Qwen's thinking mode; a sketch of the mechanism follows this list.
  • Recommended inference params for coding: temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, repetition_penalty=1.0 (example request after this list).
  • Works with both pi and OpenCode via LM Studio’s OpenAI-compatible endpoint at localhost:1234; the OpenCode config sets context_length=131072 and max_tokens=32768 (config sketch below).
  • Best used interactively with step-by-step guidance; one-shot complex app generation fails and pegs the CPU without useful output.
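
For context on the template fix: Qwen-style chat templates gate the reasoning block on an enable_thinking variable, and when it is false the template pre-fills an empty <think></think> block so the model skips reasoning. A minimal runnable sketch of that mechanism, using an illustrative stand-in template rather than Qwen's actual one:

```python
from jinja2 import Template

# Illustrative stand-in for a Qwen-style chat template, NOT the real one:
# when enable_thinking is false, the template pre-closes an empty <think>
# block so the model emits no reasoning. Pasting the set-line at the top
# of the template (as the article describes) flips the flag on.
TEMPLATE = """\
{%- set enable_thinking = true %}
<|im_start|>assistant
{%- if not enable_thinking %}
<think>

</think>
{%- endif %}
"""

print(Template(TEMPLATE).render())  # no empty <think> block -> thinking stays on
```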
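
The recommended params map onto LM Studio's OpenAI-compatible API roughly as follows. temperature and top_p are standard fields; top_k, min_p, and repetition_penalty are not part of the OpenAI schema, so the sketch below passes them via extra_body on the assumption that LM Studio forwards llama.cpp-style sampler options (exact field names, e.g. repeat_penalty, may differ):

```python
from openai import OpenAI

# Point the standard OpenAI client at LM Studio's local server.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3.5-9b",  # hypothetical id; use whatever LM Studio shows for the loaded model
    messages=[{"role": "user", "content": "Refactor this loop into a list comprehension."}],
    temperature=0.6,
    top_p=0.95,
    # Non-standard sampler fields; assumes LM Studio accepts llama.cpp-style options.
    extra_body={"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
)
print(resp.choices[0].message.content)
```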
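
And a rough sketch of the OpenCode side. Only the endpoint, context_length=131072, and max_tokens=32768 come from the article; the surrounding structure and key names are assumptions, so check OpenCode's docs for the real schema:

```python
import json

# Only base_url, context_length, and max_tokens come from the article;
# the provider/model key names here are placeholders, not OpenCode's schema.
config = {
    "provider": {
        "lmstudio": {
            "base_url": "http://localhost:1234/v1",
            "models": {
                "qwen3.5-9b": {  # hypothetical model id
                    "context_length": 131072,
                    "max_tokens": 32768,
                }
            },
        }
    }
}

with open("opencode.json", "w") as f:
    json.dump(config, f, indent=2)
```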

Hacker News Comment Review

  • Confusion arose over whether a 24GB M4 exists; commenters confirmed it does via Apple’s spec page, though the base M4 MacBook Pro ships with less memory.
  • One commenter asked for token-speed data, but the article already states the ~40 tokens/sec figure, so the request was answered in the post itself.

Original | Discuss on HN