I benchmarked Claude Code's caveman plugin against "be brief."


TLDR

  • A solo dev ran 24 prompts across 5 arms (baseline, “be brief.”, caveman lite/full/ultra) and found that the two-word prompt matched the plugin on both tokens and quality.

Key Takeaways

  • “Be brief.” cut tokens 34% vs baseline (mean 636 → 419); caveman lite and full landed close, while ultra averaged 449 – longer than brief.
  • All 5 arms scored within 1.5% of each other on correctness; 100% key-point coverage, zero must_avoid triggers across 120 scored responses.
  • Caveman’s Auto-Clarity escape deliberately suspends compression for destructive ops and multi-step sequences – that design choice, not a bug, explains the token variance in the setup and security categories.
  • Ultra triggered tool-use behavior on a Dockerfile prompt (it issued a Write tool call, got blocked, then dumped the file inline), adding ~1300 tokens to its setup-category mean – a side effect of the compression style.
  • Caveman’s real differentiator is consistent output structure via SessionStart/UserPromptSubmit hooks and a slash-command intensity dial – neither is replicable with two words in CLAUDE.md.
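The percentage claims above follow directly from the reported means; a quick sanity check, using only the numbers quoted in the takeaways (not recomputed from the raw benchmark data):

```python
# Mean output tokens per arm, as reported in the takeaways above.
baseline_mean = 636
brief_mean = 419
ultra_mean = 449

# "Be brief." reduction vs baseline -- should come out to ~34%.
brief_cut = (baseline_mean - brief_mean) / baseline_mean
print(f"brief cut vs baseline: {brief_cut:.0%}")  # → brief cut vs baseline: 34%

# Ultra's mean exceeds brief's, matching the "longer than brief" note.
print(ultra_mean > brief_mean)  # → True
```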

Hacker News Comment Review

  • Author confirmed the benchmark design: 24 prompts, 5 arms, rubric-judged by claude-sonnet-4-6 against required facts, required terms, and must_avoid traps – methodology is open source.
  • One commenter dismissed caveman outright on UX grounds, arguing compressed robot-speak degrades the interaction regardless of token savings.
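The rubric shape the author describes (required facts, required terms, must_avoid traps) can be sketched mechanically. This is an illustrative assumption, not the benchmark's actual code: the real harness uses claude-sonnet-4-6 as a judge rather than substring matching, and all names below are hypothetical.

```python
# Illustrative sketch only: the open-source benchmark judges with an LLM
# (claude-sonnet-4-6), not substring checks. Field names follow the
# article's terminology (required facts/terms, must_avoid); everything
# else -- function name, return shape, sample rubric -- is assumed.

def score_response(response: str, rubric: dict) -> dict:
    text = response.lower()
    facts_hit = [f for f in rubric["required_facts"] if f.lower() in text]
    terms_hit = [t for t in rubric["required_terms"] if t.lower() in text]
    tripped = [a for a in rubric["must_avoid"] if a.lower() in text]
    return {
        "fact_coverage": len(facts_hit) / len(rubric["required_facts"]),
        "term_coverage": len(terms_hit) / len(rubric["required_terms"]),
        "must_avoid_ok": not tripped,  # True when no trap phrase appears
    }

# Hypothetical rubric for a single prompt.
rubric = {
    "required_facts": ["restart the daemon"],
    "required_terms": ["systemctl"],
    "must_avoid": ["rm -rf"],
}
print(score_response("Run systemctl to restart the daemon.", rubric))
```

A full-coverage, trap-free response under this sketch would score 1.0 on both coverage fields with `must_avoid_ok` true, matching the article's "100% key-point coverage, zero must_avoid triggers" result.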

Notable Comments

  • @lofaszvanitt: “Caveman is useless for me” – frames the tradeoff as comfort vs. efficiency, not tokens vs. quality.
