Contra Labs research proposes separating AI creative evaluation into convergence (best practices) and divergence (taste) signals, finding that no current model is reliably both correct and steerable.
Key Takeaways
Standard benchmarks flatten evaluator disagreement into noise; the Human Creativity Benchmark treats taste divergence as signal, not error to reconcile.
The study drew ~15,000 judgments from Contra's network of 1.5M+ verified creatives, spanning five domains: landing pages, desktop apps, ad images, brand assets, and product videos.
Three evaluation methods were combined: pairwise forced-choice ranking (Bradley-Terry Elo), scalar Likert ratings on prompt adherence, usability, and visual appeal, and open-ended qualitative follow-ups; a minimal Bradley-Terry sketch follows this list.
Prompt adherence and usability produce higher evaluator agreement; visual appeal remains structurally divergent across all domains, especially product videos and brand assets (see the agreement-score sketch after this list).
Current generative models trend toward mode collapse, converging on safe, averaged aesthetics, which fails professional creative workflows that depend on differentiated, directional output; a simple diversity proxy is sketched below.
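To make the pairwise method concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise judgments via the standard minorization-maximization update, then mapping them onto an Elo-like scale. This is an illustration under stated assumptions, not the study's actual pipeline; the model names, win counts, and the 1500-point baseline are hypothetical.

```python
import math
from collections import defaultdict

def fit_bradley_terry(wins, n_iter=200, tol=1e-9):
    """Fit Bradley-Terry strengths p_i from pairwise win counts.

    wins[(a, b)] = number of judgments preferring a over b.
    Uses the classic MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is i's total wins and n_ij the comparisons between i and j.
    """
    items = {x for pair in wins for x in pair}
    p = {i: 1.0 for i in items}
    total_wins = defaultdict(float)
    n_pairs = defaultdict(float)
    for (a, b), w in wins.items():
        total_wins[a] += w
        n_pairs[frozenset((a, b))] += w
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            denom = sum(
                n_pairs[frozenset((i, j))] / (p[i] + p[j])
                for j in items if j != i and n_pairs[frozenset((i, j))] > 0
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        # Normalize to fix the model's scale invariance.
        scale = len(items) / sum(new_p.values())
        new_p = {i: v * scale for i, v in new_p.items()}
        if max(abs(new_p[i] - p[i]) for i in items) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical pairwise preferences over three models' outputs.
wins = {("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
        ("model_a", "model_c"): 6, ("model_c", "model_a"): 4,
        ("model_b", "model_c"): 5, ("model_c", "model_b"): 5}
strengths = fit_bradley_terry(wins)
# Elo is a Bradley-Terry-type model, so strengths map to an Elo-like
# scale via 400 * log10 (hypothetical 1500 baseline).
elo = {i: 1500 + 400 * math.log10(v) for i, v in strengths.items()}
print(sorted(elo.items(), key=lambda kv: -kv[1]))
```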
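The agreement/divergence distinction in the takeaways above can be quantified in many ways; as a simple illustrative proxy (my construction, not the benchmark's metric), the sketch below scores a rating dimension as one minus the ratio of within-item rater variance to pooled variance, so convergent dimensions score near 1 and taste-driven dimensions near 0. The rater counts and Likert values are hypothetical.

```python
import statistics

def agreement_score(ratings_by_item):
    """Agreement proxy: 1 - mean within-item variance / pooled variance.

    ratings_by_item: list of lists, each inner list holding the raters'
    scores for one item. Near 1.0 means raters agree; near 0 means rater
    disagreement is as large as the spread across items, i.e. divergence
    dominates and is better read as taste signal than as noise.
    """
    all_scores = [s for item in ratings_by_item for s in item]
    pooled_var = statistics.pvariance(all_scores)
    within_var = statistics.mean(
        statistics.pvariance(item) for item in ratings_by_item
    )
    return 1 - within_var / pooled_var if pooled_var > 0 else float("nan")

# Hypothetical 1-7 Likert ratings from four raters on three items.
prompt_adherence = [[6, 6, 7, 6], [3, 3, 2, 3], [5, 5, 5, 6]]
visual_appeal    = [[2, 6, 7, 3], [5, 1, 6, 2], [7, 3, 2, 6]]

print(agreement_score(prompt_adherence))  # high -> convergent dimension
print(agreement_score(visual_appeal))     # low  -> divergent "taste" signal
```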
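Mode collapse can likewise be made measurable. A minimal sketch, assuming generated outputs have already been embedded into vectors (the toy embeddings below are hypothetical), scores diversity as the mean pairwise cosine distance between samples for the same brief; values near zero suggest convergence onto one averaged mode.

```python
import math

def mean_pairwise_cosine_distance(embeddings):
    """Diversity proxy: average cosine distance over all sample pairs.

    Values near 0 suggest the generator has collapsed onto one 'safe
    averaged' mode; higher values indicate differentiated output.
    """
    def cos_dist(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return 1 - dot / (norm_u * norm_v)

    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cos_dist(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)

# Hypothetical 3-D embeddings of four generations for the same brief.
collapsed = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.01], [0.91, 0.09, 0.0], [0.9, 0.11, 0.02]]
diverse   = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.7]]

print(mean_pairwise_cosine_distance(collapsed))  # near 0 -> mode collapse
print(mean_pairwise_cosine_distance(diverse))    # larger -> differentiated
```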