OpenAI's o1 correctly diagnosed 67% of ER patients vs. 50-55% by triage doctors


TLDR

  • A Harvard study published in Science finds that OpenAI's o1 outperformed triage doctors on ER diagnostic accuracy using only EHR text data.

Key Takeaways

  • o1 correctly diagnosed 67% of 76 Boston ER patients vs. 50-55% for human doctor pairs given identical electronic health records.
  • With more detail available, o1 hit 82% accuracy vs. 70-79% for expert humans, though that gap was not statistically significant (a rough significance check follows this list).
  • On treatment planning across 5 clinical cases, o1 scored 89% vs. 34% for 46 doctors using conventional search resources.
  • The AI had no access to visual or behavioral patient cues, making it closer to a second-opinion tool than a full clinical replacement.
  • Lead authors frame the outcome as a “triadic care model” in which AI joins the doctor and patient rather than replacing the physician.
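
As a rough illustration of why an 82% vs. 70-79% gap can fail to reach significance at this sample size, the sketch below runs a standard pooled two-proportion z-test. The cohort size of 76 comes from the study; assuming both groups were scored on the same cases and comparing against the top of the human range are simplifications, and the paper's actual statistical analysis may differ.

```python
import math

# Rough significance check, not the paper's analysis. Assumptions:
# both groups were scored on the same 76 ER cases, and human accuracy
# is taken as the top of the reported 70-79% range.
n = 76
p_ai = 0.82   # o1 accuracy with more detail available
p_doc = 0.79  # upper end of the expert-human range

# Pooled two-proportion z-test (equal n, so the pooled rate is a simple mean).
p_pool = (p_ai + p_doc) / 2
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n + 1 / n))
z = (p_ai - p_doc) / se
print(f"z = {z:.2f}")  # ~0.47; |z| >= 1.96 would be needed for p < 0.05
```

Even at the most favorable comparison point, z ≈ 0.47 falls well short of the ≈ 1.96 threshold for p < 0.05, consistent with the authors' non-significance finding; a back-of-the-envelope power calculation suggests it would take thousands of cases per arm to reliably detect a 3-point difference.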

Hacker News Comment Review

  • Core methodological skepticism: the benchmark heavily favors LLMs by stripping out the clinical interview, physical exam, and nonverbal cues that define real triage, making human scores artificially low.
  • Commenters raised benchmark contamination risks, citing a separate paper where an AI beat radiologists on chest X-ray interpretation without actually receiving the images, suggesting leakage can silently inflate AI scores.
  • Practical concern: if AI becomes routine, clinicians may unconsciously defer to its output rather than reason independently, compounding errors in edge cases the model handles poorly.

Notable Comments

  • @gpm: flags a preprint where AI beat radiologists on a visual QA benchmark despite having no access to the X-rays, urging caution before trusting similar benchmarks.
  • @OptionOfT: describes a personal case in which multiple models missed severe bilateral hip degeneration on X-ray, with Gemini describing a non-existent hip; warns against over-relying on current vision models.
