Moonshot AI’s open-weights Kimi K2.6 won a 10-model real-time programming contest (Word Gem Puzzle), outscoring Claude Opus 4.7, GPT-5.5, and Gemini Pro 3.1.
Key Takeaways
Kimi K2.6 finished 7-1-0 with 22 match points; Xiaomi's MiMo V2-Pro took second; every Western frontier-lab model placed third or lower.
Kimi won with a greedy tile-sliding loop: score each candidate move by the new positive-value words it unlocks, play the best one, repeat. That loop produced the highest cumulative score (77) in the tournament; see the sketch after this list.
MiMo never slid a single tile; it blasted claims for every word it found in the initial grid in one TCP packet (also sketched below), so it scored only on boards where seed words survived the scramble.
Claude and Grok also never slid, which collapsed their scores on 30×30 grids where reconstruction was the only path to points.
Kimi K2.6 scores 54 on the Artificial Analysis Intelligence Index vs. GPT-5.5 at 60 and Claude at 57, putting open weights within a few index points of closed frontier models.
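Here is a minimal sketch of the greedy loop described in the takeaway above. The contest's actual grid mechanics, dictionary, and scoring rules aren't published here, so the swap-adjacent-tiles move model, the placeholder DICTIONARY, and the word_value rule are all assumptions for illustration.

```python
from typing import Iterator

Grid = list[list[str]]
Move = tuple[int, int, int, int]  # (r1, c1, r2, c2): swap two adjacent tiles

DICTIONARY = {"gem", "word", "slide"}  # placeholder word list (assumption)

def legal_slides(grid: Grid) -> Iterator[Move]:
    """Yield every swap of horizontally or vertically adjacent tiles."""
    rows, cols = len(grid), len(grid[0])
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                yield (r, c, r, c + 1)
            if r + 1 < rows:
                yield (r, c, r + 1, c)

def apply_slide(grid: Grid, move: Move) -> Grid:
    """Return a copy of the grid with the two tiles swapped."""
    r1, c1, r2, c2 = move
    new = [row[:] for row in grid]
    new[r1][c1], new[r2][c2] = new[r2][c2], new[r1][c1]
    return new

def words_on_board(grid: Grid) -> set[str]:
    """Collect dictionary words readable left-to-right or top-down."""
    lines = ["".join(row) for row in grid]
    lines += ["".join(col) for col in zip(*grid)]
    found = set()
    for line in lines:
        for i in range(len(line)):
            for j in range(i + 2, len(line) + 1):  # substrings of length >= 2
                if line[i:j] in DICTIONARY:
                    found.add(line[i:j])
    return found

def word_value(word: str) -> int:
    """Hypothetical scoring: longer words positive, short words penalized."""
    return len(word) - 3

def greedy_turn(grid: Grid) -> tuple[Grid, int]:
    """Score every legal slide by the new positive-value words it unlocks,
    play the best one, and report the points gained."""
    known = words_on_board(grid)
    best_move, best_gain = None, 0
    for move in legal_slides(grid):
        candidate = apply_slide(grid, move)
        gain = sum(word_value(w)
                   for w in words_on_board(candidate) - known
                   if word_value(w) > 0)
        if gain > best_gain:
            best_move, best_gain = move, gain
    if best_move is None:
        return grid, 0  # no slide unlocks new positive-value words
    return apply_slide(grid, best_move), best_gain
```

Calling greedy_turn in a loop until the clock expires reproduces the score-each-move, play-best, repeat strategy; its weakness is the classic greedy one, taking locally best swaps with no lookahead.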
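And here is what MiMo's one-shot opening might look like. The game's wire format is not documented in this piece, so the host/port and the CLAIM line syntax are invented for illustration; only the shape of the behavior (one buffered write, no slides afterwards) comes from the source.

```python
import socket

def blast_claims(host: str, port: int, words: list[str]) -> None:
    """Submit every word found in the seed grid in a single send(),
    then stop playing (no tile slides follow)."""
    payload = "".join(f"CLAIM {w}\n" for w in words).encode()
    with socket.create_connection((host, port)) as sock:
        sock.sendall(payload)  # one buffered write: one TCP packet for small payloads
        print(sock.recv(4096).decode())  # read whatever the server replies

# blast_claims("localhost", 9000, ["gem", "word", "slide"])
```

A strategy like this only pays off when the scrambled board happens to contain intact seed words, which matches why MiMo scored on some boards and zeroed on others.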
Hacker News Comment Review
Commenters broadly flagged that a single novel-protocol contest is a narrow signal; performance on 3D spatial reasoning, long-context, and tool-use tasks tells a different story for Kimi.
The debate split between "open-weights parity is real and accelerating" and "task-specific wins don't generalize," with both sides citing their own internal evals and reaching contradictory conclusions.
Practical operators read the short-word scoring penalty as a proxy for instruction-following under structured constraints, citing Muse's -15,309 score as a concrete failure mode to watch for in production deployments.
Notable Comments
@sieve: Reports that Kimi consistently beat Sonnet on a real C+Python compiler/VM project on the OpenCode Go plan, never hitting context limits.
@ponyous: Kimi fails on 3D model code generation evals – “lacks spatial understanding and makes many more code errors before it succeeds.”