Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar


Summary based on the YouTube transcript and episode description; 79,979 of the transcript's 105,137 characters were used as input.

Hamel Husain and Shreya Shankar argue that manual error analysis on real traces, rather than handing the analysis to an AI, is the highest-ROI activity for AI product builders.

  • Their Maven evals course has trained 2,000+ PMs and engineers across 500 companies, including large portions of OpenAI and Anthropic teams.
  • The first and most important step is manual “open coding”: a single domain expert reads 40-100 real LLM traces and writes freeform notes on the first error they spot in each.
  • LLMs fail at automated error analysis because they lack product context: ChatGPT will call a response that offers a virtual tour “correct” because it does not know the product has no such feature.
  • After open coding, feed all notes to an LLM to cluster them into 5-6 “axial codes” (categories); this is where LLM automation is appropriate and saves time.
  • LLM-as-judge scores should be binary, not 1-5 scales; validate the judge against human labels before trusting it in CI.
  • Upfront time investment is 3-4 days; ongoing maintenance drops to ~30 minutes per week once a cron-based sampling pipeline is in place.
  • Husain argues that “evals” is just data science applied to LLM products: A/B tests, thumbs-up metrics, and error analysis are all part of the same spectrum, not competing approaches.
  • Dogfooding fails for most AI products because users (e.g., doctors) are not tolerant power-users like engineers, so it cannot substitute for systematic trace review.
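
The counting and validation steps in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' tooling. The `LabeledTrace` record, the helper names, and the sample data are hypothetical. Using true-positive and true-negative rates to compare the judge with human labels is an assumption about one reasonable way to run that check.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LabeledTrace:
    """One reviewed trace (hypothetical record layout)."""
    trace_id: str
    axial_code: Optional[str]  # failure category from clustering; None if no error was noted
    human_pass: bool           # domain expert's binary verdict
    judge_pass: bool           # LLM judge's binary verdict


def tally_axial_codes(traces):
    """Count how many traces fall under each failure category."""
    return Counter(t.axial_code for t in traces if t.axial_code is not None)


def judge_agreement(traces):
    """Compare the judge's binary verdicts with the human labels.

    tpr: share of human-passed traces that the judge also passed.
    tnr: share of human-failed traces that the judge also failed.
    A rate is None when there are no traces of that kind.
    """
    passed = [t for t in traces if t.human_pass]
    failed = [t for t in traces if not t.human_pass]
    tpr = sum(t.judge_pass for t in passed) / len(passed) if passed else None
    tnr = sum(not t.judge_pass for t in failed) / len(failed) if failed else None
    return {"tpr": tpr, "tnr": tnr}


# Hypothetical sample: category names are illustrative only.
traces = [
    LabeledTrace("t1", None, True, True),
    LabeledTrace("t2", "hallucinated feature", False, False),
    LabeledTrace("t3", "hallucinated feature", False, True),
    LabeledTrace("t4", "handoff failure", False, False),
    LabeledTrace("t5", None, True, True),
]

print(tally_axial_codes(traces).most_common())
print(judge_agreement(traces))
```

Reporting the two rates separately, instead of a single agreement percentage, keeps a judge that passes almost everything from looking accurate when most traces are fine.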

2025-09-25 · Watch on YouTube