Can LLMs model real-world systems in TLA+?

May 9, 2026 · ai coding · Source ↗

The Specula team benchmarked frontier LLMs on generating TLA+ specs from real system code—finding near-perfect syntax scores mask catastrophic conformance failures on distributed systems.

What Matters

SysMoBench evaluates 11 real systems (Etcd, ZooKeeper, RedisRaft, CURP) across four phases: syntax, runtime, conformance, invariant.
Frontier LLMs average ~46% conformance and ~41% invariant scores despite near-100% syntax scores.
Claude reproduced the Raft paper appendix spec when asked for Etcd’s spec—a concrete recitation-vs-modeling failure.
Claude Sonnet 4.6 scores only 25% overall on RedisRaft and CURP, among the hardest systems.
Two systematic failure modes: specs enter states real systems never reach, and miss states real systems always reach.
ZooKeeper FLE example: Claude used set union for recvset instead of map-overwrite, admitting impossible multi-vote states.
Specula, their TLA+ agent built on Claude Code/Codex, achieves full conformance and invariant scores on current SysMoBench tasks.

Original | Discuss on HN

Bentos

Topics

Can LLMs model real-world systems in TLA+?

What Matters

Bentos

Topics

What Matters

Related coverage

How fast is N tokens per second really?

I Don't Vibe Code

Anthropic is expanding to Colossus2. Will use GB200

Long-term editing of brain circuits using an engineered electrical synapse