Harvard study in Science finds OpenAI o1 outperformed triage doctors on ER diagnosis accuracy using only EHR text data.
Key Takeaways
o1 correctly diagnosed 67% of 76 Boston ER patients, vs. 50-55% for pairs of human doctors working from the same electronic health records.
With more clinical detail available, o1 hit 82% accuracy vs. 70-79% for expert physicians, though that gap was not statistically significant (see the sketch after this list for why gaps of this size can fall short of significance at small sample sizes).
On treatment planning across 5 clinical cases, o1 scored 89% vs. 34% for 46 doctors using conventional search resources.
The AI had no access to visual or behavioral patient cues, making it closer to a second-opinion tool than a full clinical replacement.
Lead authors frame the outcome as a “triadic care model” in which AI joins the doctor and patient rather than replacing the physician.
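For context on the statistical-significance caveat above, here is a minimal, purely illustrative sketch of a two-proportion check at a sample size on the order of 76 cases. It assumes both arms are scored on the same cases and treats the accuracies as independent proportions; the case counts, accuracy split, and choice of test are placeholders, not the study's actual analysis.

```python
# Illustrative only -- not the study's actual statistical analysis.
# Assumptions: both arms scored on the same 76 cases (the real subset size for the
# 82% vs. 70-79% comparison isn't stated here) and treated as independent proportions.
from scipy.stats import fisher_exact

n_cases = 76                            # assumed number of cases per arm
o1_correct = round(0.82 * n_cases)      # ~62 correct at 82% accuracy
doctor_correct = round(0.79 * n_cases)  # ~60 correct at the upper human estimate (79%)

# 2x2 contingency table: rows are arms, columns are [correct, incorrect]
table = [
    [o1_correct, n_cases - o1_correct],
    [doctor_correct, n_cases - doctor_correct],
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"o1 {o1_correct}/{n_cases} vs. doctors {doctor_correct}/{n_cases}: p = {p_value:.2f}")
```

With these placeholder numbers the p-value comes out well above conventional thresholds, which is consistent with the reported lack of statistical significance at this scale of study.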
Hacker News Comment Review
Core methodological skepticism: the benchmark heavily favors LLMs by stripping out the clinical interview, physical exam, and nonverbal cues that define real triage, making human scores artificially low.
Commenters raised benchmark contamination risks, citing a separate paper where an AI beat radiologists on chest X-ray interpretation without actually receiving the images, suggesting leakage can silently inflate AI scores.
Practical concern: if AI becomes routine, clinicians may unconsciously defer to its output rather than reason independently, compounding errors in edge cases the model handles poorly.
Notable Comments
@gpm: flags a preprint where AI beat radiologists on a visual QA benchmark despite having no access to the X-rays, urging caution before trusting similar benchmarks.
@OptionOfT: shares a personal case where multiple models missed severe bilateral hip degeneration on an X-ray, with Gemini describing a non-existent hip; warns against over-relying on current vision models.