
A Harvard-led team says a cutting-edge OpenAI “reasoning” model just pulled ahead of human doctors on several text-based diagnostic and management tasks, including a test built from real emergency department records. In one key experiment, the model landed the exact or very-close diagnosis about 67% of the time at triage, compared with roughly 55% and 50% for two physician reviewers. The gap shrank as more chart data was added, and the researchers are quick to say this is a milestone in test performance, not a green light to let an algorithm run the ER alone.
What the Science paper actually did
The research team put OpenAI’s o1-preview model through six clinical reasoning experiments that went well beyond the usual short hypotheticals. The lineup included curated case challenges, management planning tasks, and a real-world check using 76 randomly selected emergency department charts from Beth Israel Deaconess Medical Center. Evaluators who scored both the model’s and the clinicians’ answers were blinded to which was which, and across most tasks the model matched or beat average physician performance. As reported in Science, the study was designed to push evaluation closer to realistic chart-based review and away from tidy, classroom-style vignettes.
Local authors, cautious celebration
Several senior authors are based in Boston, including Adam Rodman of Beth Israel Deaconess and Arjun Manrai of Harvard, both listed among the paper’s leads. They told reporters the results are a strong signal of how far these reasoning models have come. At the same time, in comments to The Boston Globe, Rodman stressed that impressive retrospective scores are not the same as safe bedside care and called for carefully controlled prospective trials. Co-author Peter Brodeur summed up the next step in plain terms, saying the goal is to run trials that follow AI and clinicians working together over time.
Real-world pilots show promise and caveats
Early deployments in actual clinics suggest these tools can help when they are used as assistants rather than replacements. In 2025, OpenAI and Penda Health reported a prospective rollout of an electronic medical record copilot that was linked to roughly 16% fewer diagnostic errors and 13% fewer treatment errors across nearly 40,000 primary care visits. In a separate randomized trial in Pakistan, doctors who completed structured AI literacy training and then used a large language model scored substantially higher on diagnostic reasoning vignettes than clinicians relying on standard resources. Those pilots hint at real gains, but they also underscore that training, workflow design, and auditing matter more than any single accuracy percentage.
Limits: text-only inputs, automation bias and missing outcomes
Experts are also highlighting what this new work does not show. The Science experiments used only chart text, with no imaging, physical exam, or bedside interaction. Scoring depended on clinical judgment about what counted as a “very close” diagnosis, and the study did not track whether patients actually did better or worse. Independent commentators told the Science Media Centre that the paper should spark randomized clinical trials, not overnight rollouts. A practical worry many raised is automation bias: the risk that clinicians will lean too heavily on confident AI recommendations unless systems and training explicitly teach them how to question the machine.
Regulatory and safety road map
Regulators are already sketching out the guardrails. In January 2026, the U.S. Food and Drug Administration updated its Clinical Decision Support guidance to clarify when software counts as a regulated medical device and when it qualifies as non-device clinical decision support that falls outside device regulation. That line determines whether a large language model assistant needs premarket review, a Predetermined Change Control Plan, and ongoing post-market monitoring. Any hospital considering a pilot will need technical safeguards, monitoring pipelines, and structured clinician training to stay on the right side of both safety expectations and legal requirements.
Where this leaves clinicians and patients
For Boston clinicians, and for health systems everywhere, the message is practical rather than sci-fi. The Science paper raises expectations for what reasoning models can do with text alone, but it also makes clear that the real test is still ahead: prospective, randomized, closely monitored trials that look at patient outcomes, workflow impact, and unintended harms. For now, the defensible approach is a human-in-the-loop model in which AI offers a second opinion inside tightly designed workflows, clinicians verify its suggestions, and clinicians remain the ones making the final call rather than handing the keys to a chatbot.