The Human Creativity Benchmark – Evaluating Generative AI in Creative Work

ai · coding · design

TLDR

  • Contra Labs research proposes splitting the evaluation of AI creative work into two signals, convergence (adherence to best practices) and divergence (taste), and finds that no current model is reliably both correct and steerable.

Key Takeaways

  • Standard benchmarks flatten evaluator disagreement into noise; the Human Creativity Benchmark treats divergence of taste as signal rather than as an error to be reconciled.
  • The study drew ~15,000 judgments from evaluators in Contra's network of 1.5M+ verified creatives, across five domains: landing pages, desktop apps, ad images, brand assets, and product videos.
  • The evaluation combined three methods: pairwise forced-choice ranking (scored with Bradley-Terry/ELO; see the sketch after this list), scalar Likert ratings on prompt adherence, usability, and visual appeal, and open-ended qualitative follow-ups.
  • Prompt adherence and usability produce higher evaluator agreement; visual appeal remains structurally divergent across all domains, especially in the video and brand-asset domains (an agreement proxy is sketched below).
  • Current generative models trend toward mode collapse (converging on safe, averaged aesthetics), which fails professional creative workflows that depend on differentiated, directional output.

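The pairwise method above reduces forced-choice judgments to a single strength score per model. The article names Bradley-Terry/ELO but does not publish its fitting procedure, so the sketch below uses the standard minorization-maximization update for the Bradley-Terry model; the wins matrix, iteration settings, and ELO mapping are all illustrative assumptions.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 500, tol: float = 1e-10) -> np.ndarray:
    """Fit latent strengths p via the MM update p_i <- W_i / sum_j n_ij / (p_i + p_j).

    wins[i, j] = times model i was preferred over model j (diagonal zero).
    Assumes every model has at least one win and one comparison, the usual
    Bradley-Terry connectivity condition."""
    pairings = wins + wins.T            # n_ij: total comparisons of i vs j
    total_wins = wins.sum(axis=1)       # W_i: total wins for model i
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        denom = (pairings / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= np.exp(np.log(new_p).mean())   # pin scale: geometric mean = 1
        if np.max(np.abs(new_p - p)) < tol:
            return new_p
        p = new_p
    return p

def to_elo(p: np.ndarray, base: float = 1000.0) -> np.ndarray:
    """Map strengths onto an ELO-like scale (400 points per 10x strength)."""
    return base + 400.0 * np.log10(p)

# Toy usage: three hypothetical models judged pairwise by evaluators.
wins = np.array([[0, 14, 18],
                 [6,  0, 11],
                 [2,  9,  0]], dtype=float)
print(to_elo(bradley_terry(wins)).round(1))
```

The scale pin (geometric mean of strengths equal to 1) is arbitrary: pairwise data only identifies strength ratios, so only ELO differences between models are meaningful.
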
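The agreement/divergence contrast in the takeaways can be quantified in several ways; the article does not say which statistic it used, so as an assumed proxy, here is the mean pairwise Pearson correlation between evaluators' Likert scores (all data below is hypothetical).

```python
import numpy as np
from itertools import combinations

def mean_pairwise_agreement(ratings: np.ndarray) -> float:
    """ratings: (n_raters, n_items) Likert scores, NaN where a rater skipped
    an item. Returns the mean Pearson correlation over all rater pairs;
    near 1 means convergent judgments, near 0 (or negative) means divergent.
    Assumes each rater's overlapping scores are not all identical, since a
    constant row makes the correlation undefined."""
    corrs = []
    for a, b in combinations(range(ratings.shape[0]), 2):
        both = ~np.isnan(ratings[a]) & ~np.isnan(ratings[b])
        if both.sum() >= 2:
            corrs.append(np.corrcoef(ratings[a, both], ratings[b, both])[0, 1])
    return float(np.mean(corrs))

# Toy usage: adherence scores cluster across raters, appeal scores scatter.
adherence = np.array([[4, 5, 2, 3], [4, 4, 2, 3], [5, 5, 1, 3]], float)
appeal    = np.array([[5, 1, 4, 2], [2, 5, 1, 4], [3, 2, 5, 1]], float)
print(mean_pairwise_agreement(adherence))  # high
print(mean_pairwise_agreement(appeal))     # low or negative
```
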
Hacker News Comment Review

  • No substantive HN discussion yet.

Original | Discuss on HN