Alibaba’s Qwen3.7-Max is a proprietary agent-focused model claiming top-tier scores on SWE-Bench, MCP, and reasoning benchmarks, available soon via Alibaba Cloud Model Studio.
Key Takeaways
Reported to beat or match Claude Opus 4.6 and DeepSeek-V4-Pro on SWE-Bench Verified (80.4), GPQA Diamond (92.4), Terminal Bench 2.0 (69.7), and MCP-Atlas (76.4).
In a 35-hour autonomous run with 1,158 tool calls, it achieved a 10x geometric-mean speedup on SGLang’s Extend Attention kernel on previously unseen T-Head ZW-M890 PPU hardware.
Cross-harness generalization is a core design goal: training decouples Task, Harness, and Verifier so the model performs consistently across Claude Code, OpenClaw, Qwen Code, and custom scaffolds.
Environment scaling (expanding diversity of agentic training environments) drives capability gains that generalize to out-of-domain benchmarks, not just tuned eval sets.
No open weights yet; API access on Alibaba Cloud Model Studio is listed as “coming soon.”
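The 10x figure in the takeaways is a geometric mean, which averages per-workload speedup ratios multiplicatively, so one outsized win does not dominate the result. A minimal sketch of the arithmetic follows; the function name and the timings are hypothetical and are not taken from the article or from SGLang.

```python
import math


def geometric_mean_speedup(baseline_ms, optimized_ms):
    """Geometric mean of per-workload speedups (baseline / optimized)."""
    if len(baseline_ms) != len(optimized_ms) or not baseline_ms:
        raise ValueError("need two equal-length, non-empty lists")
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    # Average in log space so ratios combine multiplicatively.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical kernel timings in milliseconds (illustrative only).
baseline = [100.0, 400.0, 90.0]
optimized = [50.0, 8.0, 9.0]

speedup = geometric_mean_speedup(baseline, optimized)
arithmetic = sum(b / o for b, o in zip(baseline, optimized)) / len(baseline)
print(f"geometric mean: {speedup:.2f}x, arithmetic mean: {arithmetic:.2f}x")
```

With these made-up numbers the per-workload speedups are 2x, 50x, and 10x: the geometric mean is about 10x, while the arithmetic mean is pulled above 20x by the single 50x case. This is why kernel benchmarks usually report the geometric mean.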
Hacker News Comment Review
Commenters broadly flag cherry-picked comparisons: benchmarks use Opus 4.6 and older baselines while skipping GPT-4.7, Claude 4.7, and GPT-o3, which are available and likely stronger.
Community interest in open-weight releases in the 60-150B range is high, particularly for a MoE variant around 120B-A14B aimed at prosumer hardware.
No reports of real-world coding-agent usage surfaced in the discussion, leaving the benchmark claims unvalidated by practitioners.
Notable Comments
@tarruda: Specifically asks for open-weight 122B and 397B releases from Qwen.