Google just casually disrupted the open-source AI narrative…
Fireship explains how Google’s Gemma 4 achieves frontier-class intelligence in a model small enough to run on a single RTX 4090.
- Gemma 4 is released under Apache 2.0 — genuinely free, not research-only or revenue-restricted like Meta’s Llama license.
- The 27B-parameter Gemma 4 runs locally at ~10 tok/s on a single RTX 4090 from a 20 GB download; a comparable Kimi K2.5 needs 600+ GB of weights and multiple H100s.
- Key architectural innovation: per-layer embeddings give each transformer layer its own token lookup, so layer-specific token information no longer has to be carried through every preceding layer's residual stream.
- Google simultaneously published TurboQuant, a quantization technique that converts weights to polar coordinates and applies Johnson-Lindenstrauss transforms, compressing each value to a single sign bit while approximately preserving distances.
- The real AI bottleneck is memory bandwidth, not compute — Gemma 4 attacks the cost of streaming weights from VRAM, not raw parameter count.
- Gemma 4 scores in the same benchmark range as Kimi K2.5 Thinking despite being roughly 20x smaller in download size.
- Google is the first FAANG company to release a competitive LLM under a truly open-source license with no commercial restrictions.
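The per-layer-embedding idea above can be sketched in a toy forward pass. This is a hypothetical wiring for illustration only — all sizes, names (`per_layer_emb`, `per_layer_proj`), and the `tanh` stand-in for the real transformer block are assumptions, not Gemma's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

VOCAB, D_MODEL, N_LAYERS, D_PLE = 100, 64, 4, 16  # toy sizes, illustrative only

# Standard shared input embedding, as in an ordinary transformer.
shared_emb = rng.standard_normal((VOCAB, D_MODEL))

# Per-layer embeddings: each layer gets its own small token table, so a layer
# can look up token-specific information directly instead of relying on it
# being carried through every earlier layer's residual stream.
per_layer_emb = rng.standard_normal((N_LAYERS, VOCAB, D_PLE))
per_layer_proj = rng.standard_normal((N_LAYERS, D_PLE, D_MODEL)) * 0.01

def forward(token_id: int) -> np.ndarray:
    h = shared_emb[token_id]
    for layer in range(N_LAYERS):
        # Inject this layer's own view of the token (hypothetical placement).
        h = h + per_layer_emb[layer, token_id] @ per_layer_proj[layer]
        h = np.tanh(h)  # stand-in for the real attention/MLP block
    return h

out = forward(token_id=42)
print(out.shape)
```

A practical side effect: only one row per table is needed per token, so the per-layer tables can live in slower memory and be fetched on demand rather than held in VRAM.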
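The rotate-then-keep-the-sign trick behind sign-bit quantization can be demonstrated in a few lines. This is a generic sketch of the Johnson-Lindenstrauss-style idea (random rotation, then 1 bit per coordinate, with angles recovered via the hyperplane-LSH estimate), not Google's actual TurboQuant implementation, and it omits the polar-coordinate step the video mentions:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 1024

# Random orthogonal rotation (QR of a Gaussian matrix) — rotations preserve
# dot products exactly, and they spread information evenly across coordinates.
rotation, _ = np.linalg.qr(rng.standard_normal((DIM, DIM)))

def sign_quantize(w: np.ndarray) -> np.ndarray:
    """Rotate a vector, then keep only the sign of each coordinate: 1 bit/dim."""
    return np.sign(rotation @ w)

def approx_dot(bits_a, bits_b, norm_a, norm_b):
    # For random rotations, P(signs match) = 1 - angle/pi, so the fraction of
    # matching bits lets us estimate the angle, and hence the dot product.
    matches = np.mean(bits_a == bits_b)
    return norm_a * norm_b * np.cos(np.pi * (1.0 - matches))

a = rng.standard_normal(DIM)
b = a + 0.3 * rng.standard_normal(DIM)  # correlated with a

est = approx_dot(sign_quantize(a), sign_quantize(b),
                 np.linalg.norm(a), np.linalg.norm(b))
print(f"true dot {a @ b:.1f}, 1-bit estimate {est:.1f}")
```

Despite storing 32x less per weight, the estimate lands within a few percent of the true dot product, which is why distance-preserving 1-bit codes are viable for inference.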
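The memory-bandwidth point can be made concrete with back-of-envelope arithmetic: during decoding, every generated token must stream essentially all weights from VRAM once, so throughput is bounded by bandwidth divided by model size. The bandwidth figure below is the RTX 4090's spec-sheet number; the 20 GB size and ~10 tok/s are the video's figures:

```python
# Decode throughput ceiling = memory bandwidth / bytes read per token.
RTX_4090_BANDWIDTH_GBPS = 1008  # GB/s, spec-sheet value
GEMMA_WEIGHTS_GB = 20           # download size quoted in the video

ceiling_tok_s = RTX_4090_BANDWIDTH_GBPS / GEMMA_WEIGHTS_GB
print(f"theoretical ceiling: {ceiling_tok_s:.0f} tok/s")

# The observed ~10 tok/s sits well under this ~50 tok/s ceiling: KV-cache
# reads, kernel overhead, and non-ideal access patterns eat the rest.
# The key point stands: halving bytes-in-VRAM roughly doubles the ceiling,
# which is why quantization beats adding FLOPs.
```

This is why the section frames the bottleneck as VRAM read cost: a smaller or more aggressively quantized model raises the ceiling directly, while more compute does not.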
2026-04-08 · Watch on YouTube