Harvard study finds OpenAI's o1 model matched or exceeded emergency room physician diagnoses in 76 real cases
Researchers at Harvard Medical School and Beth Israel Deaconess Medical Center evaluated how OpenAI's language models performed against attending physicians in diagnostic scenarios, finding one model competitive on accuracy measures — though researchers cautioned against drawing clinical conclusions.
1 source · cross-referenced
- A Harvard Medical School study published in Science compared OpenAI's o1 and GPT-4o models to emergency room physicians on 76 real patient cases, with o1 performing nominally better than or equal to both attending physicians at each diagnostic stage.
- In triage cases with minimal patient information, o1 achieved exact or very close diagnoses in 67% of cases versus 55% and 50% for the two comparison physicians.
- The study presented the models with the unprocessed electronic medical record text available at each diagnostic decision point, with no curation of the inputs.
- Researchers emphasized the findings do not indicate AI is ready to replace physicians in clinical settings and called for prospective real-world trials to evaluate such technologies.
- Study co-author Adam Rodman noted the lack of formal accountability frameworks for AI diagnoses in clinical care.
Harvard Medical School and Beth Israel Deaconess Medical Center researchers evaluated how OpenAI's language models performed in diagnostic tasks against emergency room physicians. The study, published this week in Science, focused on 76 real patient cases from Beth Israel's emergency department, comparing diagnoses generated by OpenAI's o1 and GPT-4o models to those from two attending physicians.
Two independent attending physicians, blinded to which diagnoses came from humans versus AI, evaluated all outputs. Across diagnostic decision points, the o1 model performed nominally better than or at parity with both comparison physicians. At the initial triage stage, where patient information is most limited and clinical urgency is highest, o1 achieved exact or closely matching diagnoses in 67% of cases, compared with 55% and 50% for the two physicians, respectively.
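The headline percentages are simple proportions of cases whose blinded grades counted as exact or very close. The study's actual grading rubric and data are not reproduced here; the sketch below uses an invented three-point grade, field names, and toy records purely to illustrate how such per-diagnoser, per-stage accuracy figures could be tallied.

```python
# Hypothetical illustration only: how "exact or very close" grades from blinded
# reviewers could be turned into the per-diagnoser accuracy figures quoted above.
# The grading scale, field layout, and records are invented for this example.
from collections import defaultdict

# One row per (case, diagnoser, decision stage): grade 2 = exact match,
# 1 = very close, 0 = miss, as judged by the blinded reviewers.
graded = [
    ("case-001", "o1", "triage", 2),
    ("case-001", "physician_a", "triage", 1),
    ("case-001", "physician_b", "triage", 0),
    # ... one row per case, diagnoser, and decision point
]


def accuracy_by_diagnoser(rows, stage):
    """Fraction of cases at a stage graded exact (2) or very close (1)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _case_id, diagnoser, row_stage, grade in rows:
        if row_stage != stage:
            continue
        totals[diagnoser] += 1
        if grade >= 1:
            hits[diagnoser] += 1
    return {d: hits[d] / totals[d] for d in totals}


print(accuracy_by_diagnoser(graded, "triage"))
```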
The research team presented the models with the unprocessed electronic medical record text available at the time of each diagnostic decision, which prevented researchers from selectively curating inputs to favor AI performance. Lead author Arjun Manrai said the o1 model "eclipsed both prior models and our physician baselines" across the tested benchmarks, though that framing describes performance under the study's specific experimental constraints.
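To make the setup concrete, here is a minimal sketch of what feeding unprocessed record text to a model at each decision point could look like. The prompt wording, record snippets, model identifier, and helper function are assumptions for illustration only; they are not the study's protocol or code, and the client usage simply reflects the OpenAI Python library's general chat-completions interface.

```python
# Hypothetical sketch only: not the study's code. It illustrates the idea of
# giving a model just the raw EMR text that existed at a given decision point.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def diagnose(raw_emr_text: str, model: str = "o1") -> str:
    """Return the model's ranked diagnoses for one unprocessed record snapshot."""
    prompt = (
        "Diagnostic reasoning exercise. Below is the patient record exactly as "
        "available at this decision point. List your most likely diagnoses, "
        "most probable first.\n\n" + raw_emr_text
    )
    response = client.chat.completions.create(
        model=model,  # model identifier is an assumption, not taken from the paper
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Invented example snapshots for a single case; later information never leaks
# into earlier decision points.
decision_points = {
    "triage": "58M, chest pressure x2h, diaphoretic. Vitals: HR 104, BP 92/60.",
    "post_initial_workup": "Triage note plus ECG interpretation and first troponin result.",
}
predictions = {stage: diagnose(text) for stage, text in decision_points.items()}
```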
The researchers were explicit that their findings do not support clinical deployment or claims that AI is ready to make real emergency medicine decisions. They called for prospective trials in actual patient care settings before drawing conclusions about real-world utility. Study co-author Adam Rodman told reporters that no formal accountability framework currently exists for AI diagnoses in clinical contexts, and that patients expect human physicians to guide them through life-threatening clinical decisions.
- May 2, 2026 · TechCrunch