Granite 4.1: IBM's 8B Model Matching 32B MoE


TLDR

  • IBM’s Granite 4.1 8B dense model matches or beats the previous 32B MoE Granite 4.0-H-Small across benchmarks and is Apache 2.0 licensed.

Key Takeaways

  • Three sizes (3B, 8B, 30B), all dense decoder-only transformers, trained on 15T tokens across five phased data mixtures with shifting math/code ratios (a minimal loading sketch follows this list).
  • The 8B scores 69.0 on ArenaHard, 68.3 on BFCL V3 tool calling, and 92.5 on GSM8K, beating the older 32B MoE (9B active params) on each benchmark.
  • Four-stage RL pipeline caught a mid-training regression: RLHF improved chat by ~18.9 pts but dropped math; a dedicated math RL stage recovered GSM8K by 3.8 pts and DeepMind-Math by 23.5 pts.
  • 4.1M fine-tuning samples passed a six-dimension LLM-as-Judge filter with automatic rejection for hallucinations, false premises, and incorrect computations.
  • 512K context achieved via staged extension (32K to 128K to 512K) plus model merging at each stage to preserve short-context performance; RULER scores degrade gradually rather than sharply (see the merging sketch after this list).
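
For readers who want to try the weights locally, here is a minimal sketch of loading an instruct checkpoint with Hugging Face transformers. The repo id, and the assumption that the instruct variant ships a chat template, are guesses based on the ibm-granite collection naming; verify the exact model names on Hugging Face before running.

```python
# Minimal sketch: load a Granite 4.1 instruct checkpoint and generate a reply.
# The repo id below is an assumption based on the ibm-granite collection naming;
# check the collection for the exact name before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.1-8b-instruct"  # assumed name, verify on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# Standard chat-template flow for an instruct model.
messages = [{"role": "user", "content": "Summarize this table as key-value pairs."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```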
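
The context-extension bullet mentions model merging at each stage to preserve short-context performance. The summary does not describe IBM's actual recipe, but the general idea can be illustrated with plain weight interpolation between a short-context and a long-context checkpoint; the checkpoint paths and the 0.5 interpolation weight below are hypothetical.

```python
# Illustrative sketch of checkpoint merging by weight interpolation, the general
# technique behind "model merging at each context-extension stage". The paths and
# the interpolation weight are hypothetical; the source does not give IBM's recipe.
import torch
from transformers import AutoModelForCausalLM

short_ctx_path = "checkpoints/granite-32k"   # hypothetical short-context checkpoint
long_ctx_path = "checkpoints/granite-128k"   # hypothetical long-context checkpoint

short_model = AutoModelForCausalLM.from_pretrained(short_ctx_path, torch_dtype=torch.float32)
long_model = AutoModelForCausalLM.from_pretrained(long_ctx_path, torch_dtype=torch.float32)

alpha = 0.5  # weight toward the long-context checkpoint (assumed value)
short_state = short_model.state_dict()
long_state = long_model.state_dict()

merged_state = {
    name: (1 - alpha) * short_state[name] + alpha * long_state[name]
    for name in long_state
}

# Write the merged weights back into the long-context model and save it.
long_model.load_state_dict(merged_state)
long_model.save_pretrained("checkpoints/granite-128k-merged")
```

The released checkpoints already include whatever merging IBM applied; the sketch only illustrates why short-context scores can survive the extension.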

Hacker News Comment Review

  • Practical testers find the 8B competitive on commodity hardware but rank Qwen3.6 35B MoE as still stronger locally; Granite’s more recent training data cutoff is cited as its clearest differentiator.
  • The pivot away from MoE at the 8B scale is noted, with Mistral cited as making a similar move; commenters see the model’s clinical, low-emoji tone as a feature for data-processing workloads.
  • Some commenters suspect the source article is AI-written, which reduces trust in its analysis; IBM’s own research blog and the Hugging Face weights are flagged as the authoritative references.

Notable Comments

  • @cbg0: Flags granite-vision-4.1-4b as a potential sleeper for table and semantic key-value extraction if benchmarks hold at that size.
  • @m3at: Points to IBM Research blog and ibm-granite Hugging Face collection as primary sources over the linked article.

Original | Discuss on HN