Contra Labs research proposes separating AI creative evaluation into convergence (best practices) and divergence (taste) signals, finding that no current model is reliably both correct and steerable.
Key Takeaways
Standard benchmarks flatten evaluator disagreement into noise; the Human Creativity Benchmark treats taste divergence as signal, not error to reconcile.
The study drew ~15,000 judgments from Contra's network of 1.5M+ verified creatives, spanning five domains: landing pages, desktop apps, ad images, brand assets, and product videos.
Three evaluation methods were combined: pairwise forced-choice ranking (Bradley-Terry Elo), scalar Likert ratings on prompt adherence, usability, and visual appeal, and open-ended qualitative follow-ups; a minimal Bradley-Terry sketch follows this list.
Prompt adherence and usability produce higher evaluator agreement; visual appeal remains structurally divergent across all domains, especially product videos and brand assets (see the agreement-score sketch after this list).
Current generative models trend toward mode collapse, converging on safe, averaged aesthetics, which fails professional creative workflows that depend on differentiated, directional output; a simple diversity proxy is sketched below.
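To make the pairwise method concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise judgments via the standard minorization-maximization update, then mapping them onto an Elo-like scale. This is an illustration under stated assumptions, not the study's actual pipeline; the model names, win counts, and the 1500-point baseline are hypothetical.

```python
import math
from collections import defaultdict

def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths p_i from pairwise win counts.

    wins[(a, b)] = number of judgments preferring a over b.
    Uses the classic MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the comparisons between i and j.
    """
    items = {x for pair in wins for x in pair}
    p = {i: 1.0 for i in items}
    total_wins = defaultdict(float)
    n_pairs = defaultdict(float)
    for (a, b), w in wins.items():
        total_wins[a] += w
        n_pairs[frozenset((a, b))] += w
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            denom = sum(
                n_pairs[frozenset((i, j))] / (p[i] + p[j])
                for j in items if j != i and n_pairs[frozenset((i, j))] > 0
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        # Normalize to fix the model's scale invariance.
        scale = len(items) / sum(new_p.values())
        new_p = {i: v * scale for i, v in new_p.items()}
        if max(abs(new_p[i] - p[i]) for i in items) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical pairwise preferences over three models' outputs.
wins = {("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
        ("model_a", "model_c"): 6, ("model_c", "model_a"): 4,
        ("model_b", "model_c"): 5, ("model_c", "model_b"): 5}
strengths = fit_bradley_terry(wins)
# Elo is a Bradley-Terry-type model, so strengths map to an Elo-like
# scale via 400 * log10 (hypothetical 1500 baseline).
elo = {i: 1500 + 400 * math.log10(v) for i, v in strengths.items()}
print(sorted(elo.items(), key=lambda kv: -kv[1]))
```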
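The agreement/divergence distinction in the takeaways above can be quantified in many ways; as a simple illustrative proxy (my construction, not the benchmark's metric), the sketch below scores a rating dimension as one minus the ratio of within-item rater variance to pooled variance, so convergent dimensions score near 1 and taste-driven dimensions near 0. The rater counts and Likert values are hypothetical.

```python
import statistics

def agreement_score(ratings_by_item):
    """Agreement proxy: 1 - mean within-item variance / pooled variance.

    ratings_by_item: list of lists, each inner list holding the raters'
    scores for one item. Near 1.0 means raters agree; near 0 means rater
    disagreement is as large as the spread across items, i.e. divergence
    dominates and is better read as taste signal than as noise.
    """
    all_scores = [s for item in ratings_by_item for s in item]
    pooled_var = statistics.pvariance(all_scores)
    within_var = statistics.mean(
        statistics.pvariance(item) for item in ratings_by_item
    )
    return 1 - within_var / pooled_var if pooled_var > 0 else float("nan")

# Hypothetical 1-7 Likert ratings from four raters on three items.
prompt_adherence = [[6, 6, 7, 6], [3, 3, 2, 3], [5, 5, 5, 6]]
visual_appeal    = [[2, 6, 7, 3], [5, 1, 6, 2], [7, 3, 2, 6]]

print(agreement_score(prompt_adherence))  # high -> convergent dimension
print(agreement_score(visual_appeal))     # low  -> divergent "taste" signal
```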
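Mode collapse can likewise be made measurable. A minimal sketch, assuming generated outputs have already been embedded into vectors (the toy embeddings below are hypothetical), scores diversity as the mean pairwise cosine distance between samples for the same brief; values near zero suggest convergence onto one averaged mode.

```python
import math

def mean_pairwise_cosine_distance(embeddings):
    """Diversity proxy: average cosine distance over all sample pairs.

    Values near 0 suggest the generator has collapsed onto one 'safe
    averaged' mode; higher values indicate differentiated output.
    """
    def cos_dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return 1 - dot / (norm_u * norm_v)

    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos_dist(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Hypothetical 3-D embeddings of four generations for the same brief.
collapsed = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.01], [0.91, 0.09, 0.0], [0.9, 0.11, 0.02]]
diverse   = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.7]]

print(mean_pairwise_cosine_distance(collapsed))  # near 0 -> mode collapse
print(mean_pairwise_cosine_distance(diverse))    # larger -> differentiated
```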