What's in a GGUF, besides the weights – and what's still missing?


TLDR

  • GGUF bundles chat templates, special tokens, and sampler config in one file, but lacks tool call grammars, think tokens, projection models, and feature flags.

Key Takeaways

  • Chat templates are Jinja2 programs (~250 lines for Gemma4); inference engines must ship a full Jinja interpreter (minijinja, minja, or llama.cpp’s own implementation).
  • Special tokens (EOS, BOS, tool call markers, turn delimiters) are stored in GGUF metadata and are essential for correct generation stopping and formatting.
  • GGUF now supports a general.sampling.sequence field for ordered sampler chains, fixing a gap where ollama and HF generation_config.json ignored step order.
  • Tool call output formats differ per model family (Qwen3, Qwen3.5, and Gemma4 all differ); no GGUF-standard grammar exists yet, so every engine hardcodes parsers for each model release.
  • Three practical gaps: think_token is missing from GGUF conversions despite HF adding it upstream; projection models require a second GGUF file, breaking single-file ergonomics; and there are no feature flags to signal image/tool/thinking support, forcing engines to rely on substring hacks.
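To make the first takeaway concrete, here is a minimal sketch of how an engine renders a GGUF-embedded chat template. The template below is a deliberately tiny ChatML-style illustration, not the actual ~250-line Gemma template; real templates are read from the `tokenizer.chat_template` metadata key and rendered with a Jinja implementation.

```python
from jinja2 import Template

# Illustrative ChatML-style template (real GGUF templates are far longer
# and are stored as a string in the model's metadata).
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

def render_prompt(messages, add_generation_prompt=True):
    """Render a message list into a model-ready prompt string."""
    return Template(CHAT_TEMPLATE).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )

prompt = render_prompt([{"role": "user", "content": "Hi"}])
print(prompt)
```

Because templates are full Jinja programs (loops, conditionals, filters), a simple string-substitution engine is not enough, which is why llama.cpp and others ship an interpreter.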
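The ordered-sampler point can be sketched as well. The step names and config shape below are illustrative assumptions (the summary only says the field stores an ordered sequence); the point is that applying top-k before temperature yields a different distribution than the reverse, so order must survive serialization.

```python
import math

def top_k(logits, k):
    """Keep the k largest logits; mask the rest to -inf."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def temperature(logits, t):
    """Scale logits by 1/t (t < 1 sharpens, t > 1 flattens)."""
    return [x / t for x in logits]

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

SAMPLERS = {"top_k": top_k, "temperature": temperature}

def apply_chain(logits, sequence):
    """Apply samplers in the serialized order, then normalize."""
    for step in sequence:
        logits = SAMPLERS[step["type"]](logits, step["value"])
    return softmax(logits)

probs = apply_chain([2.0, 1.0, 0.5, -1.0], [
    {"type": "top_k", "value": 2},
    {"type": "temperature", "value": 0.7},
])
```

A format that stores samplers as an unordered bag of parameters (as generation_config.json effectively does) cannot reproduce this chain faithfully.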

Hacker News Comment Review

  • No substantive HN discussion yet; comments are limited to an observation about a future publish date and brief GPU-compatibility anecdotes unrelated to GGUF internals.

Original | Discuss on HN