The Powerful Alternative To Fine-Tuning
Poetiq CEO Ian Fischer explains how a 7-person ex-DeepMind team beat Claude Opus 4.6 on Humanity’s Last Exam using recursive self-improvement harnesses instead of fine-tuning.
- Poetiq’s system beat Anthropic’s Claude Opus 4.6 on Humanity’s Last Exam: 55% vs. 53.1%, with optimization costs under $100k.
- On ARC-AGI v2, Poetiq scored 54% vs. Gemini 3 Deep Think’s 45%, at less than half the cost per problem ($32 vs. ~$70+).
- Adding a reasoning harness on top of plain prompting took one benchmark task from 5% to 95% accuracy with Gemini 1.5 Flash (a minimal harness sketch follows this list).
- Fine-tuning is a trap for startups: it costs millions, and the next frontier model renders the work obsolete; harnesses stay model-agnostic.
- The Poetiq meta-system auto-generates reasoning strategies as code, not just better prompts: DSPy-style, but recursively self-improving (sketched after this list).
- The generated prompts for ARC-AGI included a factually wrong example that nonetheless improved performance; the system found strategies no human would have written.
- The entire company is 7 people (research scientists and engineers); no retraining is needed when a new base model is released, since the harness simply wraps it.
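Poetiq has not published its harness code, so the following is only a minimal sketch of the general pattern described in the interview: wrap a black-box model call in a sample-and-verify loop instead of touching its weights. Every name here (`harness_solve`, the `llm` callable standing in for any model API, the prompt wording) is a hypothetical illustration, not Poetiq's implementation.

```python
from typing import Callable

# A "harness" treats the model as a black box: it wraps one underlying
# call (llm) in a sampling + verification loop instead of changing weights.
# All names and prompts here are hypothetical.

def harness_solve(
    llm: Callable[[str], str],   # any model call: prompt in, text out
    task: str,
    n_candidates: int = 8,
) -> str:
    """Best-of-n with model-graded verification: sample several candidate
    solutions, ask the same model to score each, return the top-scored one."""
    candidates = [
        llm(f"Solve step by step, then state FINAL ANSWER on its own line:\n{task}")
        for _ in range(n_candidates)
    ]

    def score(candidate: str) -> int:
        # Re-use the model as a verifier; a cheaper model often suffices here.
        verdict = llm(
            "Rate this solution 0-10 for correctness. Reply with only the number.\n"
            f"Task: {task}\nSolution: {candidate}"
        )
        try:
            return int(verdict.strip().split()[0])
        except (ValueError, IndexError):
            return 0  # an unparseable verdict counts as a failing score

    return max(candidates, key=score)
```

Because the harness only consumes prompt-in/text-out calls, swapping in a new frontier model means changing the `llm` callable and nothing else, which is the model-agnosticism the bullets above refer to.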
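The recursive self-improvement loop can be sketched under similar assumptions: the model proposes strategies as executable code, each strategy is scored on a held-out dev set, and the scores feed back into the next round of generation. `improve_strategies`, the prompt wording, and the use of `exec` are illustrative assumptions only; a real system would sandbox model-written code.

```python
import statistics
from typing import Callable

def improve_strategies(
    llm: Callable[[str], str],
    dev_set: list[tuple[str, str]],   # (task, expected_answer) pairs
    generations: int = 3,
) -> str:
    """Hypothetical recursive loop: generate strategy code, evaluate it,
    and feed the results back so later generations improve."""
    best_code, best_acc, feedback = "", 0.0, "none yet"
    for _ in range(generations):
        # 1. The model proposes a new reasoning strategy as code.
        code = llm(
            "Write a Python function solve(llm, task) -> str implementing a "
            f"novel reasoning strategy. Prior feedback: {feedback}"
        )
        # 2. Evaluate the generated strategy on held-out tasks.
        namespace: dict = {}
        try:
            exec(code, namespace)  # UNSAFE outside a sandbox; shown for brevity
            solve = namespace["solve"]
            acc = statistics.mean(
                float(solve(llm, task).strip() == answer)
                for task, answer in dev_set
            )
        except Exception as err:
            feedback = f"crashed: {err}"
            continue
        # 3. Keep the winner; the score becomes feedback for the next round.
        feedback = f"last strategy scored {acc:.0%}"
        if acc > best_acc:
            best_code, best_acc = code, acc
    return best_code
```

Generating strategies as code rather than as prompt text is what distinguishes this from plain prompt optimization: a strategy can branch, verify, and call the model multiple times, which is how a factually wrong in-prompt example can still end up kept if it scores well.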
2026-02-27 · Watch on YouTube