Running LLMs on your iPhone: 40 tok/s Gemma 4 with MLX — Adrien Grondin, Locally AI

· video · Source ↗

Summary based on the YouTube transcript and episode description.

Adrien Grondin demos Gemma 4 running at 40 tok/s on an iPhone via MLX and announces that Locally AI has been acquired by LM Studio.

  • Locally AI was acquired by LM Studio, which runs local models via llama.cpp or MLX and exposes OpenAI- and Anthropic-compatible local servers (see the request sketch after this list).
  • Gemma 4, quantized to 4-bit or 8-bit, runs at 40 tokens/second on the latest iPhones using Apple’s MLX framework.
  • MLX Swift LM (GitHub) is the single package needed to add on-device LLM inference to any iOS/macOS/iPadOS app in under 10 minutes (a hedged sketch follows below).
  • The MLX Community on Hugging Face hosts ~4,000–5,000 quantized models; new releases appear in 4-bit/6-bit/8-bit quantizations within ~30 minutes of launch.
  • The recommended quantization range is 4-bit to 8-bit; below 4-bit, output quality degrades noticeably (see the footprint arithmetic below).
  • A 300–350M-parameter Liquid model in Locally AI runs fast enough to power Shortcuts automations for text-processing tasks.
  • MLX Swift LM supports tool calling; structured-generation support is still in progress via third-party packages.
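
The local servers LM Studio exposes speak the standard OpenAI chat-completions wire format, so any plain HTTP client works. Here is a minimal Swift sketch; the server address (LM Studio’s default port 1234) and the model name "gemma-4" are my assumptions, not details from the video, so substitute whatever your server actually reports.

```swift
import Foundation

struct ChatMessage: Codable { let role: String; let content: String }
struct ChatRequest: Codable { let model: String; let messages: [ChatMessage] }

/// Sends one user prompt to a local OpenAI-compatible server and
/// returns the assistant's reply.
func chat(prompt: String) async throws -> String {
    let url = URL(string: "http://localhost:1234/v1/chat/completions")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        ChatRequest(model: "gemma-4", // placeholder; use your server's model id
                    messages: [ChatMessage(role: "user", content: prompt)]))

    let (data, _) = try await URLSession.shared.data(for: request)
    // Pull choices[0].message.content out of the OpenAI-style response.
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let choices = json?["choices"] as? [[String: Any]]
    let message = choices?.first?["message"] as? [String: Any]
    return message?["content"] as? String ?? ""
}
```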
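For fully on-device inference, the talk points at MLX Swift LM. A hedged sketch of what “under 10 minutes” can look like: the names below (loadModel(id:), ChatSession, respond(to:)) follow my reading of the simplified API in the ml-explore/mlx-swift-examples packages, not anything shown in the video, and the model id is a placeholder for any MLX Community model; verify both against the repo.

```swift
import MLXLLM
import MLXLMCommon

// Run from an async context in an iOS/macOS app. On first run the
// weights typically download from Hugging Face, so the initial load
// dominates; later runs read from the local cache.
let model = try await loadModel(id: "mlx-community/Llama-3.2-1B-Instruct-4bit")
let session = ChatSession(model)
print(try await session.respond(to: "Summarize MLX in one sentence."))
```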
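The 4- to 8-bit recommendation maps directly onto memory: weight size scales linearly with bits per parameter, roughly params × bits / 8 bytes. A back-of-envelope calculation, ignoring activations, the KV cache, and per-group scale overhead (which the video does not break down); the 8B parameter count is an arbitrary example, not a model from the talk:

```swift
import Foundation

// Approximate weight memory at common quantization widths.
let params = 8.0e9 // e.g. an 8B-parameter model
for bits in [4.0, 6.0, 8.0] {
    let gigabytes = params * bits / 8 / 1e9
    print("\(Int(bits))-bit: ~\(String(format: "%.1f", gigabytes)) GB of weights")
}
```

At 4-bit, an 8B model already needs about 4 GB for weights alone, which is why small models plus aggressive-but-not-extreme quantization are the workable combination on phone-class RAM.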

2026-04-20 · Watch on YouTube