
Washington State University researchers put ChatGPT under the microscope and found that the popular chatbot can be both wrong and fickle when asked to judge scientific hypotheses. In tests on 719 hypotheses pulled from academic business journals, the system delivered the right answer roughly three-quarters of the time overall, but it struggled badly when the statements were actually false. The findings have reignited worries about leaning on conversational AI for complex or high-stakes decisions.
As reported by News4SanAntonio, Washington State University associate professor Mesut Cicek said the real issue was not just getting things wrong, but doing so unpredictably. The team sometimes watched identical prompts produce opposite true-or-false answers across ten separate runs, a kind of intellectual whiplash that undercuts trust in the tool.
The full study, published in the Rutgers Business Review, lays out how the researchers assembled 719 hypotheses from nine leading marketing and management journals, then asked ChatGPT whether each one was supported by the published research. They ran the experiment twice, first in 2024 with the free ChatGPT-3.5 and again in 2025 with the free ChatGPT-5 mini, repeating every question 10 times to track stability. Reported accuracy rose modestly from 76.5% in 2024 to 80% in 2025, but because a true-or-false question can be answered correctly by guessing, the authors adjusted for chance and say the model is effectively correct only about 60% of the time. On false hypotheses, meaning those the published research found statistically insignificant, the 2025 model answered correctly just 16.4% of the time.
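The gap between 80% and roughly 60% is easier to see with a bit of arithmetic. The sketch below is an illustration, not the authors' published procedure: it assumes a coin-flip baseline of 50% for a true-or-false question and a common rescaling that counts only the accuracy earned above that baseline. The function name and the baseline value are assumptions made for this example.

```python
def chance_adjusted_accuracy(raw_accuracy: float, chance_rate: float = 0.5) -> float:
    """Rescale accuracy so that pure guessing scores 0 and perfection scores 1.

    chance_rate=0.5 is an assumed baseline for a two-way true/false question;
    the study may have used a different correction.
    """
    if not 0.0 <= chance_rate < 1.0:
        raise ValueError("chance_rate must be in [0, 1)")
    return (raw_accuracy - chance_rate) / (1.0 - chance_rate)


# Reported raw accuracy figures from the article.
print(round(chance_adjusted_accuracy(0.765), 3))  # 2024 run
print(round(chance_adjusted_accuracy(0.80), 3))   # 2025 run
```

Under that assumed baseline, the 2025 figure of 80% works out to about 0.60, which is in line with the "about 60%" the authors describe.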
The uneven results line up with earlier coverage of AI hallucinations and wobbly citations, where chatbots confidently deliver different answers to the same question or attribute information to the wrong outlet. The Columbia Journalism Review has documented similar variability in how language models reference publishers and stitch together their explanations.
What the Researchers Tested
To keep things clean, the WSU team limited their sample to open-access journal articles so paywalls would not interfere with the experiment. From those papers they pulled 719 testable hypothesis statements and posed a straightforward question to ChatGPT: was each hypothesis supported by the research or not? The entire set was run in 2024 on the free ChatGPT-3.5, then rerun in 2025 on the free ChatGPT-5 mini, with every hypothesis asked 10 times to see how often the system would stick to its own answer. The authors spell out the full methodology and results in the Rutgers Business Review.
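One simple way to picture the stability check is to count how often the 10 repeated answers for a single hypothesis agree with one another. The sketch below is a hypothetical illustration of that idea: the function name, the majority-share measure, and the sample answers are all assumptions, not data or code from the study.

```python
from collections import Counter


def consistency(answers: list[bool]) -> tuple[bool, float]:
    """Return the most common answer and the share of runs that gave it."""
    if not answers:
        raise ValueError("answers must not be empty")
    counts = Counter(answers)
    majority_answer, majority_count = counts.most_common(1)[0]
    return majority_answer, majority_count / len(answers)


# Made-up example: one hypothesis asked 10 times, with 7 "supported"
# answers and 3 "not supported" answers.
runs = [True] * 7 + [False] * 3
answer, share = consistency(runs)
print(answer, share)
```

A share of 1.0 would mean the chatbot gave the same verdict every time, while anything lower signals the kind of flip-flopping the researchers describe.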
Why Managers and Schools Should Care
The researchers argue that for now, AI tools like ChatGPT should be treated as brainstorming partners, not final authorities. Businesses and educators, they say, need to double-check any AI-generated claims before acting on them. As News4SanAntonio reported, the authors also call for training and oversight so staff and students understand where chatbots are likely to stumble.
For all the slick language and confident tone, the study’s takeaway is blunt: these systems can sound right while being wrong. Human judgment, the authors argue, is still non-negotiable.









