OpenAI’s Deep Research Team on Why Reinforcement Learning is the Future for AI Agents
Watch on YouTube ↗

Summary based on the YouTube transcript and episode description.
OpenAI’s Isa Fulford and Josh Tobin explain why end-to-end RL training—not prompt-chained graphs—is the architecture behind Deep Research and future agents.
- Deep Research is a fine-tuned version of o3 trained end-to-end via RL on hard browsing and reasoning tasks, not a hand-coded agent graph.
- Hand-coded operation graphs fall apart in production because humans can’t anticipate all edge cases; RL-trained models adapt dynamically to live web content.
- Sam Altman projects Deep Research will handle a single-digit percentage of all economically valuable tasks globally.
- High-quality training data was the hidden key to success: the team cites data quality as the single biggest determinant of model quality.
- Clarification flow before research starts was an intentional design choice: detailed prompts yield dramatically better 5–30 minute reports.
- Future roadmap: private data sources, fused operator+browser capabilities, and RL recipe scaling to increasingly complex agentic tasks.
- Reinforcement learning is “so back” because large pretrained language models now supply the base (the cake) that RL fine-tuning (the cherry on top) previously lacked.
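The end-to-end RL idea above can be illustrated with a toy sketch. This is not OpenAI's actual recipe (which fine-tunes o3 on hard browsing tasks); it is a minimal REINFORCE-style policy-gradient loop on a hypothetical two-action task, showing how reward signals nudge a pretrained policy toward actions that complete the task, rather than following a hand-coded graph of steps:

```python
import math
import random

def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reward(action):
    # Hypothetical task outcome: action 1 ("browse deeper") succeeds,
    # action 0 ("stop early") earns no reward.
    return 1.0 if action == 1 else 0.0

def train(steps=2000, lr=0.1, seed=0):
    random.seed(seed)
    # "Pretrained" prior slightly favors the wrong action.
    logits = [0.5, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        action = 0 if random.random() < probs[0] else 1
        r = reward(action)
        # REINFORCE update: grad of log pi(action) wrt logit i
        # is (1 if i == action else 0) - probs[i].
        for i in range(2):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)

probs = train()
print(probs)  # probability mass shifts toward the rewarded action
```

After training, the policy heavily favors the action that earned reward, despite the prior pointing the other way. The point of the analogy: the model discovers which behaviors work from outcome feedback, instead of a human enumerating every edge case in advance.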
2025-02-25