What's in a GGUF, besides the weights – and what's still missing?


TLDR

  • GGUF bundles chat templates, special tokens, and sampler config in one file, but lacks tool call grammars, think tokens, projection models, and feature flags.

Key Takeaways

  • Chat templates are Jinja2 programs (~250 lines for Gemma4); inference engines must ship a full Jinja interpreter (minijinja, minja, or llama.cpp’s own implementation).
  • Special tokens (EOS, BOS, tool call markers, turn delimiters) are stored in GGUF metadata and are essential for correct generation stopping and formatting.
  • GGUF now supports a general.sampling.sequence field for ordered sampler chains, fixing a gap where ollama and HF generation_config.json ignored step order.
  • Tool call output formats differ per model family (Qwen3, Qwen3.5, and Gemma4 all differ); no GGUF-standard grammar exists yet, so every engine hardcodes parsers for each model release.
  • Three practical gaps: think_token is missing from GGUF conversions despite HF adding it upstream; projection models require a second GGUF file, breaking single-file ergonomics; and there are no feature flags to signal image/tool/thinking support, forcing engines to rely on substring hacks.
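To make the first takeaway concrete, here is a minimal sketch of how an engine renders a GGUF-embedded chat template. The template below is a deliberately tiny ChatML-style illustration, not the actual ~250-line Gemma template; real templates are read from the `tokenizer.chat_template` metadata key and rendered with a Jinja implementation.

```python
from jinja2 import Template

# Illustrative ChatML-style template (real GGUF templates are far longer
# and are stored as a string in the model's metadata).
CHAT_TEMPLATE = (
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

def render_prompt(messages, add_generation_prompt=True):
    """Render a message list into a model-ready prompt string."""
    return Template(CHAT_TEMPLATE).render(
        messages=messages, add_generation_prompt=add_generation_prompt
    )

prompt = render_prompt([{"role": "user", "content": "Hi"}])
print(prompt)
```

Because templates are full Jinja programs (loops, conditionals, filters), a simple string-substitution engine is not enough, which is why llama.cpp and others ship an interpreter.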
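The ordered-sampler point can be sketched as well. The step names and config shape below are illustrative assumptions (the summary only says the field stores an ordered sequence); the point is that applying top-k before temperature yields a different distribution than the reverse, so order must survive serialization.

```python
import math

def top_k(logits, k):
    """Keep the k largest logits; mask the rest to -inf."""
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else float("-inf") for x in logits]

def temperature(logits, t):
    """Scale logits by 1/t (t < 1 sharpens, t > 1 flattens)."""
    return [x / t for x in logits]

def softmax(logits):
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

SAMPLERS = {"top_k": top_k, "temperature": temperature}

def apply_chain(logits, sequence):
    """Apply samplers in the serialized order, then normalize."""
    for step in sequence:
        logits = SAMPLERS[step["type"]](logits, step["value"])
    return softmax(logits)

probs = apply_chain([2.0, 1.0, 0.5, -1.0], [
    {"type": "top_k", "value": 2},
    {"type": "temperature", "value": 0.7},
])
```

A format that stores samplers as an unordered bag of parameters (as generation_config.json effectively does) cannot reproduce this chain faithfully.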

Hacker News Comment Review

  • No substantive HN discussion yet; comments are limited to an observation about a future publish date and brief GPU-compatibility anecdotes unrelated to GGUF internals.

Original | Discuss on HN