Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar
Hamel Husain and Shreya Shankar argue that manual error analysis on real traces—not AI automation—is the highest-ROI activity for AI product builders.
- Their Maven evals course has trained 2,000+ PMs and engineers across 500 companies, including large portions of OpenAI and Anthropic teams.
- The first and most important step is manual “open coding”: a single domain expert reads 40-100 real LLM traces and writes freeform notes on the first error they spot in each.
- LLMs fail at automated error analysis because they lack product context—ChatGPT will call a hallucinated virtual-tour response “correct” without knowing the feature doesn’t exist.
- After open coding, feed all notes to an LLM to cluster them into 5-6 “axial codes” (categories); this is where LLM automation is appropriate and saves time.
- LLM-as-judge scores should be binary (pass/fail), not 1-5 scales; validate the judge against human labels before trusting it in CI.
- Upfront time investment is 3-4 days; ongoing maintenance drops to ~30 minutes per week once a cron-based sampling pipeline is in place.
- Husain argues “evals” is just data science applied to LLM products—A/B tests, thumbs-up metrics, and error analysis are all part of the same spectrum, not competing approaches.
- Dogfooding fails for most AI products because users (e.g., doctors) are not tolerant power-users like engineers, so it cannot substitute for systematic trace review.
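The weekly sampling pipeline described above (pull a batch of real traces for manual open coding, on a schedule) can be sketched as follows. The function and trace shape here are illustrative assumptions, not the course's actual tooling:

```python
# Minimal sketch of a cron-based trace-sampling step: pull a random
# batch of recent traces (the 40-100 range from the notes) for a
# domain expert to open-code. Could be scheduled e.g. via cron:
#   0 9 * * MON python sample_traces.py
# Trace loading is stubbed; all names are illustrative.
import random

def sample_for_review(traces, n=50, seed=None):
    """Return up to n randomly sampled traces for manual review."""
    rng = random.Random(seed)
    return rng.sample(traces, min(n, len(traces)))

# Stub: in practice these would come from production logs.
traces = [{"id": i, "output": f"response {i}"} for i in range(500)]
batch = sample_for_review(traces, n=50, seed=7)
print(len(batch))  # 50
```

A fixed seed makes the weekly batch reproducible, which helps when several reviewers need to discuss the same traces.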
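The axial-coding step above (feed all open-coding notes to an LLM to cluster them into 5-6 categories) mostly amounts to assembling one prompt. A sketch of that assembly, with wholly illustrative prompt wording and example notes:

```python
# Sketch of the axial-coding step: bundle the expert's freeform
# open-coding notes into a single prompt asking an LLM to cluster
# them into a handful of categories ("axial codes"). The prompt
# text is an assumption, not the course's actual prompt.

def axial_coding_prompt(notes, n_categories=6):
    """Build a clustering prompt from open-coding notes."""
    numbered = "\n".join(f"{i + 1}. {note}" for i, note in enumerate(notes))
    return (
        f"Here are {len(notes)} error notes from manual review of LLM "
        f"traces. Cluster them into at most {n_categories} categories "
        "(axial codes). For each category, give a short name and list "
        "the note numbers it covers.\n\n" + numbered
    )

notes = [
    "Hallucinated a virtual-tour feature that doesn't exist",
    "Ignored the user's stated budget",
    "Repeated the greeting twice",
]
print(axial_coding_prompt(notes))
```

This is the point where LLM automation is appropriate per the episode: the model only groups notes a human already wrote, rather than judging correctness on its own.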
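Validating a binary LLM-as-judge against human labels, as the notes recommend before trusting it in CI, can be as simple as an agreement check. A minimal sketch, assuming pass/fail labels encoded as 1/0 (the function name and data are illustrative):

```python
# Minimal sketch: check a binary LLM judge against expert labels
# before wiring it into CI. Labels are 1 (pass) / 0 (fail).

def judge_agreement(human, judge):
    """Compare binary judge verdicts to human labels.

    Reports overall agreement plus TPR (judge confirms real passes)
    and TNR (judge catches real failures) — with imbalanced data,
    raw agreement alone can hide a judge that misses failures.
    """
    assert len(human) == len(judge)
    tp = sum(1 for h, j in zip(human, judge) if h and j)
    tn = sum(1 for h, j in zip(human, judge) if not h and not j)
    pos = sum(human)
    neg = len(human) - pos
    return {
        "agreement": (tp + tn) / len(human),
        "tpr": tp / pos if pos else None,
        "tnr": tn / neg if neg else None,
    }

human = [1, 1, 0, 1, 0, 0, 1, 1]  # expert pass/fail labels
judge = [1, 1, 0, 0, 0, 1, 1, 1]  # LLM judge verdicts on same traces
print(judge_agreement(human, judge))
# {'agreement': 0.75, 'tpr': 0.8, 'tnr': 0.6666666666666666}
```

A low TNR here would mean the judge waves through failures the expert flagged, which is exactly the hallucination-blindness problem the episode warns about.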
2025-09-25 · Watch on YouTube