Evals Will Break and You Won't See It Coming

TLDR

  • Evaluation infrastructure is structurally reactive and calibrated for current capability regimes, making it blind to qualitative shifts before they happen.

Key Takeaways

  • Benchmarks like GPQA, SWE-bench, and ARC-AGI measure current model behavior; none predict post-regime-change capabilities.
  • Schaeffer et al. (2023) showed many emergent ability “jumps” are metric artifacts, but this sharpens the problem: we still can’t distinguish real transitions from measurement noise.
  • The author frames evaluation as upstream of training: through Goodhart dynamics, bad evals silently corrupt training objectives, safety layers, and scaling decisions.
  • Shan, Li, and Sompolinsky (PNAS 2026) derived statistical-mechanics order parameters that predict phase transitions in continual learning; Nanda et al. (2023) found mechanistic progress measures predicting grokking before it surfaces.
  • Proposed fix: self-evolving eval systems that co-evolve with models, monitoring meta-signals like benchmark score distribution shifts and cross-eval correlation structure changes.
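
The metric-artifact point attributed to Schaeffer et al. above can be illustrated with a toy calculation. This is a minimal sketch, not the paper's code: the answer length, the accuracy values, and the assumption that token errors are independent are all hypothetical choices made for illustration. Under those assumptions, smooth gains in per-token accuracy look like a sudden jump when scored by exact match.

```python
# Toy illustration (hypothetical numbers): a smooth underlying improvement
# can look like a sharp "emergent" jump under an all-or-nothing metric.

ANSWER_LEN = 20  # assumed number of tokens that must all be correct


def exact_match_rate(per_token_acc: float, answer_len: int) -> float:
    """Chance that every token is correct, assuming independent token errors."""
    return per_token_acc ** answer_len


def largest_step(values):
    """Largest increase between consecutive entries of a sequence."""
    return max(b - a for a, b in zip(values, values[1:]))


# Assumed per-token accuracy across successive model scales (smooth growth).
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.99]
exact = [exact_match_rate(p, ANSWER_LEN) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact match {e:.4f}")

print(f"largest step, per-token metric: {largest_step(per_token):.3f}")
print(f"largest step, exact-match metric: {largest_step(exact):.3f}")
```

Exact match stays near zero for most of the range and then climbs steeply, even though the per-token metric never moves by more than about 0.1 between steps. The sketch shows only the artifact direction. It does not help decide whether a given observed jump is a real transition, which is the open problem the takeaway describes.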

Hacker News Comment Review

  • Discussion is thin so far, but one commenter flags a concrete gap: evals can conflict with one another, and desirable emergent behaviors may be penalized because current benchmarks were not designed to accommodate them.
