Chelsea Finn: Building Robots That Can Do Anything
Chelsea Finn explains how Physical Intelligence trained robots to fold laundry, tidy unseen homes, and follow open-ended prompts using a pre-train/fine-tune recipe borrowed from LLMs.
- Physical Intelligence’s breakthrough: pre-training on all available robot data, then fine-tuning on a small, curated, high-quality dataset (a recipe borrowed from LLM training) unlocked reliable laundry folding after 2-3 months of 0% success rates; see the two-stage training sketch after this list.
- The laundry-folding robot cut its time for five items from 20 minutes to 9 after switching to a 3B-parameter vision-language model (PaliGemma) pre-trained across all robot tasks, roughly 10x larger than the prior 100-300M-parameter models.
- Mobile manipulation data was only 2.4% of the pre-training mix, yet the model generalized to tidying unseen Airbnb kitchens and bedrooms with ~80% task success.
- Diverse environments in the training data closed the generalization gap almost entirely: once enough distinct locations were included, performance in novel homes matched performance in seen homes.
- Early models ignored language instructions 80% of the time; stopping gradients from the randomly initialized diffusion action head from flowing into the VLM backbone preserved its language-following, flipping the rate to 80% compliance (see the stop-gradient sketch below).
- Synthetic prompt relabeling, in which a VLM generates the hypothetical human prompts that could have elicited existing robot episodes, enabled open-ended instruction following (e.g., ‘make me a vegan sandwich, no pickles’) without costly human-robot interaction data collection; a relabeling sketch follows this list.
- Frontier models (GPT/Claude-class) used as high-level planners scored substantially lower than Physical Intelligence’s trained high-level policy on task progress, due to weak visual grounding in physical contexts.
- Finn argues real robot data is irreplaceable for generalization; RL on live robot attempts is the robotics analog to synthetic data in LLM post-training.
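A minimal sketch of the two-stage pre-train/fine-tune recipe, assuming a generic behavior-cloning setup in PyTorch; the datasets, model, and hyperparameters below are illustrative stand-ins, not Physical Intelligence’s actual code:

```python
# Sketch of the recipe: pre-train on everything, then fine-tune on a small
# curated subset. All names and numbers here are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_dataset(n, obs_dim=64, act_dim=8):
    # Stand-in for real robot episodes: (observation, action) pairs.
    return TensorDataset(torch.randn(n, obs_dim), torch.randn(n, act_dim))

policy = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 8))

def train(policy, dataset, epochs, lr):
    opt = torch.optim.AdamW(policy.parameters(), lr=lr)
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    for _ in range(epochs):
        for obs, act in loader:
            loss = nn.functional.mse_loss(policy(obs), act)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Stage 1: pre-train on all robot data, including mediocre demonstrations.
train(policy, make_dataset(100_000), epochs=1, lr=3e-4)
# Stage 2: fine-tune on a small, curated, high-quality subset at a lower LR.
train(policy, make_dataset(2_000), epochs=10, lr=1e-5)
```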
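A sketch of the stop-gradient trick from the language-following bullet, assuming a PyTorch-style vision-language-action policy; the module names and shapes are hypothetical, not the actual codebase:

```python
# Block gradients from the randomly initialized action head so they cannot
# corrupt the pre-trained VLM backbone's representations.
import torch
import torch.nn as nn

class VLAPolicy(nn.Module):
    def __init__(self, feat_dim=512, act_dim=8):
        super().__init__()
        self.vlm_backbone = nn.Linear(768, feat_dim)  # stand-in for PaliGemma
        self.action_head = nn.Sequential(             # randomly initialized
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, vlm_inputs):
        feats = self.vlm_backbone(vlm_inputs)
        # detach() stops action-loss gradients from flowing into the VLM,
        # preserving its language-following behavior during training.
        return self.action_head(feats.detach())

actions = VLAPolicy()(torch.randn(4, 768))  # example forward pass
```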
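A sketch of synthetic prompt relabeling as described in the bullet above; `ask_vlm` is a hypothetical stub for any captioning-capable VLM, since the actual pipeline is not shown here:

```python
# A VLM inspects an existing robot episode and writes the open-ended prompt
# a person might have given to elicit that behavior. Hypothetical names.
from dataclasses import dataclass, replace

@dataclass
class Episode:
    frames: list       # observations recorded during the episode
    instruction: str   # original templated instruction, e.g. "fold the shirt"

def ask_vlm(frames, question: str) -> str:
    raise NotImplementedError("plug in a captioning-capable VLM here")

def relabel(episode: Episode) -> Episode:
    prompt = ask_vlm(
        episode.frames,
        "Write a natural, open-ended request a person might give a home "
        "robot that this behavior would satisfy, including preferences or "
        "constraints (e.g. 'make me a vegan sandwich, no pickles').",
    )
    # Keep the trajectory unchanged; swap in the synthetic human-style prompt.
    return replace(episode, instruction=prompt)
```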
2025-07-22 · Watch on YouTube