
A Harvard-led team says a cutting-edge OpenAI “reasoning” model just pulled ahead of human doctors on several text-based diagnostic and management tasks, including a test built from real emergency department records. In one key experiment, the model landed the exact or very-close diagnosis about 67% of the time at triage, compared with roughly 55% and 50% for two physician reviewers. The gap shrank as more chart data was added, and the researchers are quick to say this is a milestone in test performance, not a green light to let an algorithm run the ER alone.
What the Science paper actually did
The research team put OpenAI’s o1-preview model through six clinical reasoning experiments that went well beyond the usual short hypotheticals. The lineup included curated case challenges, management planning tasks, and a real-world check using 76 randomly selected emergency department charts from Beth Israel Deaconess Medical Center. Evaluators who scored both the model’s and the clinicians’ answers were blinded to which was which, and across most tasks the model matched or beat average physician performance. As reported in Science, the study was designed to push evaluation closer to realistic chart-based review and away from tidy, classroom-style vignettes.
Local authors, cautious celebration
Several senior authors are based in Boston, including Adam Rodman of Beth Israel Deaconess and Arjun Manrai of Harvard, both listed among the paper’s leads. They told reporters the results are a strong signal of how far these reasoning models have come. At the same time, in comments to The Boston Globe, Rodman stressed that impressive retrospective scores are not the same as safe bedside care and called for carefully controlled prospective trials. Co-author Peter Brodeur summed up the next step in plain terms, saying the goal is to run trials that follow AI and clinicians working together over time.
Real-world pilots show promise and caveats
Early deployments in actual clinics suggest these tools can help when they are used as assistants rather than replacements. In 2025, OpenAI and Penda Health reported a prospective rollout of an electronic medical record copilot that was linked to roughly 16% fewer diagnostic errors and 13% fewer treatment errors across nearly 40,000 primary care visits. In a separate randomized trial in Pakistan, doctors who completed structured AI literacy training and then used a large language model scored substantially higher on diagnostic reasoning vignettes than clinicians relying on standard resources. Those pilots hint at real gains, but they also underscore that training, workflow design, and auditing matter more than any single accuracy percentage.
Limits: text-only inputs, automation bias and missing outcomes
Experts are also highlighting what this new work does not show. The Science experiments used only chart text, with no imaging, physical exam, or bedside interaction. Scoring depended on clinical judgment about what counted as a “very close” diagnosis, and the study did not track whether patients actually did better or worse. Independent commentators told the Science Media Centre that the paper should spark randomized clinical trials, not overnight rollouts. A practical worry many raised is automation bias: the risk that clinicians will lean too heavily on confident AI recommendations unless systems and training explicitly teach them how to question the machine.
Regulatory and safety road map
Regulators are already sketching out the guardrails. In January 2026, the U.S. Food and Drug Administration updated its Clinical Decision Support guidance to clarify when software counts as a regulated medical device and when it qualifies as non-device clinical decision support that falls outside device regulation. That line determines whether a large language model assistant needs premarket review, a Predetermined Change Control Plan, and ongoing post-market monitoring. Any hospital considering a pilot will need technical safeguards, monitoring pipelines, and structured clinician training to stay on the right side of both safety expectations and legal requirements.
Where this leaves clinicians and patients
For Boston clinicians, and for health systems everywhere, the message is practical rather than sci-fi. The Science paper raises expectations for what reasoning models can do with text alone, but it also makes clear that the real test is still ahead: prospective, randomized, closely monitored trials that look at patient outcomes, workflow impact, and unintended harms. For now, the defensible approach is a human-in-the-loop model in which AI offers a second opinion inside tightly designed workflows, clinicians verify its suggestions, and clinicians remain the ones making the final call rather than handing the keys to a chatbot.