IBM’s Granite 4.1 8B dense model matches or beats the previous-generation 32B MoE Granite 4.0-H-Small across benchmarks; all models are Apache 2.0 licensed.
Key Takeaways
Three sizes (3B, 8B, 30B), all dense decoder-only transformers, trained on 15T tokens across five phased data mixtures with shifting math/code ratios (a mixture-schedule sketch follows this list).
8B scores 69.0 on ArenaHard, 68.3 on BFCL V3 tool calling, and 92.5 on GSM8K, beating the older 32B MoE (9B active params) on each benchmark.
Four-stage RL pipeline caught a mid-training regression: RLHF improved chat by ~18.9 pts but regressed math; a dedicated math RL stage recovered 3.8 pts on GSM8K and 23.5 pts on DeepMind-Math.
4.1M fine-tuning samples passed a six-dimension LLM-as-Judge filter with automatic rejection for hallucinations, false premises, and incorrect computations (see the filter sketch below).
512K context achieved via staged extension (32K to 128K to 512K) plus model merging at each stage to preserve short-context performance (see the merge sketch below); RULER scores degrade gradually, not sharply.
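How a phased mixture schedule might look in practice: the five-phase structure, the 15T total budget, and the rising math/code share come from the takeaways above, but every ratio and phase boundary below is an illustrative placeholder, not IBM's published schedule.

```python
# Hypothetical sketch of a phased pretraining data mixture. Five phases and a
# growing math/code share, per the summary; all numbers are placeholders.
PHASES = [
    # (token budget, {domain: sampling weight})
    (4.0e12, {"web": 0.80, "code": 0.12, "math": 0.03, "other": 0.05}),
    (4.0e12, {"web": 0.70, "code": 0.18, "math": 0.06, "other": 0.06}),
    (3.0e12, {"web": 0.60, "code": 0.24, "math": 0.10, "other": 0.06}),
    (2.5e12, {"web": 0.50, "code": 0.28, "math": 0.16, "other": 0.06}),
    (1.5e12, {"web": 0.40, "code": 0.30, "math": 0.24, "other": 0.06}),
]

assert abs(sum(t for t, _ in PHASES) - 15e12) < 1e9  # 15T tokens total

def weights_at(tokens_seen: float) -> dict[str, float]:
    """Return the sampling mixture active after `tokens_seen` tokens."""
    cursor = 0.0
    for budget, mix in PHASES:
        cursor += budget
        if tokens_seen < cursor:
            return mix
    return PHASES[-1][1]
```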
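A minimal sketch of the LLM-as-Judge gate described in the takeaways. Only the three hard-rejection criteria (hallucinations, false premises, incorrect computations) are named in the source; the remaining dimension names and the `judge` callable are assumptions.

```python
# Minimal sketch of a six-dimension LLM-as-Judge filter. Three hard-rejection
# dimensions come from the summary; the other three names are hypothetical.
import json

DIMENSIONS = [
    "hallucination",          # named in the summary: auto-reject on failure
    "false_premise",          # named in the summary: auto-reject on failure
    "incorrect_computation",  # named in the summary: auto-reject on failure
    "instruction_following",  # hypothetical remaining dimensions
    "completeness",
    "clarity",
]
HARD_REJECT = {"hallucination", "false_premise", "incorrect_computation"}

def keep_sample(judge, prompt: str, response: str) -> bool:
    """Score one fine-tuning sample with a judge model; drop it on any
    hard failure, keep it only if every dimension passes."""
    verdict = json.loads(judge(
        "Return JSON with a pass/fail verdict per dimension: "
        f"{DIMENSIONS}\n\nPROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
    ))
    if any(verdict.get(d) == "fail" for d in HARD_REJECT):
        return False  # automatic rejection
    return all(verdict.get(d) == "pass" for d in DIMENSIONS)
```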
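The summary says each context-extension stage was merged back toward its predecessor to protect short-context quality, but does not specify the recipe; a plain linear interpolation over state dicts is one common way such a merge is done.

```python
# Linear-interpolation merge of two checkpoints, sketched under the assumption
# that a simple weighted average is used; the actual recipe is not public.
import torch

def merge_checkpoints(short_ctx: dict[str, torch.Tensor],
                      long_ctx: dict[str, torch.Tensor],
                      alpha: float = 0.5) -> dict[str, torch.Tensor]:
    """Elementwise blend: alpha=0 keeps the short-context model,
    alpha=1 keeps the long-context model."""
    assert short_ctx.keys() == long_ctx.keys()
    return {name: (1 - alpha) * short_ctx[name] + alpha * long_ctx[name]
            for name in short_ctx}

# e.g. at the 32K -> 128K stage (alpha=0.5 is a hypothetical choice):
# merged = merge_checkpoints(model_32k.state_dict(), model_128k.state_dict())
```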
Hacker News Comment Review
Hands-on testers find the 8B competitive on commodity hardware but rank Qwen3.6 35B MoE as still stronger locally; Granite’s more recent training data cutoff is cited as its clearest differentiator.
The pivot away from MoE at the 8B scale is noted, with Mistral making a similar move; commenters see the model’s clinical, low-emoji tone as a feature for data-processing workloads.
Several commenters suspect the source article is AI-written, which lowers trust in its analysis; IBM’s own research blog and the Hugging Face weights are flagged as the authoritative references.
Notable Comments
@cbg0: Flags granite-vision-4.1-4b as a potential sleeper for table and semantic key-value extraction if benchmarks hold at that size.
@m3at: Points to IBM Research blog and ibm-granite Hugging Face collection as primary sources over the linked article.