WSU Study Says ChatGPT Flunks Basic Reality Check

Washington State University researchers put ChatGPT under the microscope and found that the popular chatbot can be both wrong and fickle when asked to judge scientific hypotheses. In tests on 719 hypotheses pulled from academic business journals, the system delivered the right answer roughly three-quarters of the time overall, but it struggled badly when the statements were actually false. The findings have reignited worries about leaning on conversational AI for complex or high-stakes decisions.

As reported by News4SanAntonio, Washington State University associate professor Mesut Cicek said the real issue was not just getting things wrong, but doing so unpredictably. The team sometimes watched identical prompts produce opposite true-or-false answers across ten separate runs, a kind of intellectual whiplash that undercuts trust in the tool.

The full study, published in the Rutgers Business Review, lays out how the researchers assembled 719 hypotheses from nine leading marketing and management journals, then asked ChatGPT whether each one was supported by the published research. They ran the experiment twice, first in 2024 with the free ChatGPT-3.5 and again in 2025 with the free ChatGPT-5 mini, repeating every question 10 times to track stability. Reported accuracy rose modestly from 76.5% in 2024 to 80% in 2025, but after adjusting for chance, the authors say the model is effectively correct only about 60% of the time. On false hypotheses, meaning those whose results were statistically insignificant, accuracy in 2025 dropped to just 16.4%.
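The article does not say how the chance adjustment was computed. One common approach, shown here only as an assumption, rescales raw accuracy so that a coin-flip baseline maps to zero and a perfect score maps to one. With a 50% baseline for true-or-false questions (also an assumption), the reported 80% works out to roughly the 60% figure quoted above. The function name `chance_adjusted` is hypothetical.

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale accuracy so `chance` maps to 0.0 and a perfect score maps to 1.0.

    Assumes a 50% guessing baseline for true/false questions by default;
    the study's actual adjustment method is not described in the article.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


# Raw accuracy figures as reported in the article.
for year, raw in [(2024, 0.765), (2025, 0.80)]:
    print(year, f"raw={raw:.1%}", f"adjusted={chance_adjusted(raw):.0%}")
```

Under these assumptions, a model that only guessed would score zero after adjustment, which is why the adjusted number looks so much less flattering than the raw one.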

The uneven results line up with earlier coverage of AI hallucinations and wobbly citations, where chatbots confidently deliver different answers to the same question or attribute information to the wrong outlet. The Columbia Journalism Review has documented similar variability in how language models reference publishers and stitch together their explanations.

What the Researchers Tested

To keep things clean, the WSU team limited their sample to open-access journal articles so paywalls would not interfere with the experiment. From those papers they pulled 719 testable hypothesis statements and posed a straightforward question to ChatGPT: was each hypothesis supported by the research or not? The entire set was run in 2024 on the free ChatGPT-3.5, then rerun in 2025 on the free ChatGPT-5 mini, with every hypothesis asked 10 times to see how often the system would stick to its own answer. The authors spell out the full methodology and results in the Rutgers Business Review.
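The repeat-10-times design described above can be sketched in a few lines. This is a minimal illustration, not the researchers' code: the function name `stability` and the sample answers are hypothetical, and each answer is reduced to a simple True (supported) or False (not supported).

```python
from collections import Counter


def stability(answers):
    """Return the most frequent answer and the share of runs that agreed with it."""
    if not answers:
        raise ValueError("need at least one answer")
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical results for two hypotheses, each asked 10 times.
runs = {
    "H1": [True] * 10,               # same verdict every time
    "H2": [True] * 6 + [False] * 4,  # verdict flips between runs
}

for name, answers in runs.items():
    verdict, agreement = stability(answers)
    print(f"{name}: majority={verdict}, agreement={agreement:.0%}")
```

An agreement share of 100% means the chatbot never changed its verdict on that hypothesis. Anything lower means identical prompts produced conflicting answers, which is the inconsistency the researchers flagged.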

Why Managers and Schools Should Care

The researchers argue that for now, AI tools like ChatGPT should be treated as brainstorming partners, not final authorities. Businesses and educators, they say, need to double-check any AI-generated claims before acting on them. As News4SanAntonio reported, the authors also call for training and oversight so staff and students understand where chatbots are likely to stumble.

For all the slick language and confident tone, the study’s takeaway is blunt: these systems can sound right while being wrong. Human judgment, the authors argue, is still non-negotiable.
