Tools · May 3, 2026

Harvard study finds OpenAI's o1 model matched or exceeded emergency room physician diagnoses in 76 real cases

Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center evaluated how OpenAI's language models performed against attending physicians in diagnostic scenarios, finding one model competitive on accuracy measures — though researchers cautioned against drawing clinical conclusions.

Trust: 54
Hype: Some hype

1 source · cross-referenced

TL;DR
  • A Harvard Medical School study published in Science compared OpenAI's o1 and 4o models to emergency room physicians on 76 real patient cases, with o1 performing nominally better than or equal to both attending physicians at each diagnostic stage.
  • In triage cases with minimal patient information, o1 achieved exact or very close diagnoses in 67% of cases versus 55% and 50% for the two comparison physicians.
  • The study presented models with the raw electronic medical record text available at each diagnostic decision point, with no curation or preprocessing of inputs.
  • Researchers emphasized the findings do not indicate AI is ready to replace physicians in clinical settings and called for prospective real-world trials to evaluate such technologies.
  • Study co-author Adam Rodman noted the lack of formal accountability frameworks for AI diagnoses in clinical care.

Harvard Medical School and Beth Israel Deaconess Medical Center researchers evaluated how OpenAI's language models performed in diagnostic tasks against emergency room physicians. The study, published this week in Science, focused on 76 real patient cases from Beth Israel's emergency department, where researchers compared diagnoses generated by OpenAI's o1 and 4o models to those from two attending physicians.

Two independent attending physicians, blinded to which diagnoses came from humans versus AI, evaluated all outputs. Across diagnostic decision points, the o1 model performed nominally better than or at parity with both comparison physicians. At the initial triage stage—where patient information is most limited and clinical urgency highest—o1 achieved exact or closely matching diagnoses in 67% of cases, compared to 55% and 50% for the two physicians respectively.

The research team presented models with the unprocessed electronic medical record text available at the time of each diagnostic decision, preventing researchers from selectively curating inputs to favor AI performance. Lead author Arjun Manrai stated the o1 model "eclipsed both prior models and our physician baselines" across tested benchmarks, though this performance was measured within the study's retrospective design rather than live clinical practice.

The researchers were explicit that their findings do not support clinical deployment or claims that AI is ready to make real emergency medicine decisions. They called for prospective trials in actual patient care settings before drawing conclusions about real-world utility. Study co-author Adam Rodman told reporters that no formal accountability framework currently exists for AI diagnoses in clinical contexts, and that patients expect human physicians to guide them through life-threatening clinical decisions.

Sources
  1. TechCrunch — In Harvard study, AI offered more accurate diagnoses than emergency room doctors

Stories may contain errors. Dispatch is assembled with AI assistance and curated by human editors; despite the trust-score filter, mistakes happen. We correct publicly — every article links to its revision history. Nothing here is financial, legal, or medical advice. Verify before relying on any claim.

© 2026 Dispatch. No ads. No sponsorships. No paid placement. Reader-supported via Ko-fi.

Built by a person who cares about honest AI news.