
MIT Study Explores Human Perception Challenges in Evaluating Large Language Models' Performance

Published on July 23, 2024

As artificial intelligence becomes part of everyday life, MIT researchers have shed light on a crucial way in which large language models (LLMs) do not work the way people do. These models are remarkably broad in application, from drafting emails to aiding in medical diagnostics, and that generalist nature makes them unusually hard to test and evaluate. The study, which will be presented at the International Conference on Machine Learning, takes a different route from the traditional benchmarking of these models.

A key contribution of the research, as reported by MIT News, is a human generalization function: a way to model how people adjust their expectations of an LLM after working with it. The researchers stress a simple but important point: humans ultimately decide when to deploy these models, so understanding the human side of the equation is as crucial as understanding the models themselves. They aim to bridge that gap, noting that a model misaligned with the human generalization function might fail unexpectedly.

The study considers users such as a graduate student or a clinician who must decide when and how to use a language model. The framework seeks to measure how well an LLM's actual capabilities align with a user's beliefs about those capabilities. "These tools are exciting because they are general-purpose, but because they are general-purpose, they will be collaborating with people, so we have to take the human in the loop into account," Ashesh Rambachan, an assistant professor of economics at MIT and a principal investigator in the Laboratory for Information and Decision Systems (LIDS), told MIT News.

The researchers ran a survey to measure how people generalize from interactions with LLMs and with other humans, and it revealed notable gaps in human perception. Participants were markedly less successful at predicting an LLM's performance from its prior answers to related questions than at predicting a person's. "Language models that get better can almost trick people into thinking they will perform well on related questions when, in actuality, they don’t," Rambachan said, according to MIT News. The finding underscores that an LLM's performance does not carry over between related questions the way human expertise does.

Contributing to the challenge is the relative novelty of LLMs. We have far less experience with them than with fellow humans, which may bias our expectations. The researchers suggest, however, that our ability to predict LLM performance could improve simply through more interaction with the models. Rambachan adds that, as these models are trained and updated, the human generalization function should be taken into account to measure performance more accurately.

With this study, the MIT researchers aim to provide a framework for improving LLMs in real-world applications by addressing a critical gap in our understanding of them. Alex Imas, a professor of behavioral science and economics at the University of Chicago’s Booth School of Business who was not involved in the study, told MIT News: "The paper uncovers a critical issue with deploying LLMs for general consumer use. If people don't have the right understanding of when LLMs will be accurate and when they will fail, then they will be more likely to see mistakes and perhaps be discouraged from further use." Imas also points to the insight the research offers into the reasoning behind LLMs' problem-solving abilities. The study’s findings could serve as a benchmark for future LLM development, potentially shaping how we interact with artificial intelligence and what we expect from it.
