
MIT scientists are breaking new ground in artificial intelligence with a project that tackles the daunting task of explaining the behavior of neural networks. According to a report from MIT News, researchers at the Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a novel method that uses AI to autonomously investigate and explain the complex inner workings of other AI systems.
At the core of this innovation is the "automated interpretability agent" (AIA), a system designed to mimic a scientist's experimental process. Built from pretrained language models, AIAs actively intervene in other systems, ranging from single neurons to entire models, testing them and seeking explanations for their functions much as a detective examines clues. For all their sophistication, these interpretability agents are far from perfect: their descriptions are accurate only about half the time, according to results on the new "function interpretation and description" (FIND) benchmark.
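To make the idea concrete, the sketch below shows what an AIA-style probe-and-hypothesize loop could look like in outline. It is an illustrative assumption, not the CSAIL implementation: query_language_model and target_function are hypothetical stand-ins for the agent's language model and the opaque system under study.

```python
# Illustrative sketch of an AIA-style probe loop (not the CSAIL code).
# `query_language_model` and `target_function` are hypothetical stand-ins.

def run_interpretability_agent(target_function, query_language_model, num_rounds=5):
    """Iteratively probe an opaque function and ask a language model to refine its hypothesis."""
    observations = []          # (input, output) pairs gathered so far
    hypothesis = "unknown"     # the agent's current natural-language description

    for _ in range(num_rounds):
        # 1. Ask the language model which inputs would best test the current hypothesis.
        probe_inputs = query_language_model(
            f"Current hypothesis: {hypothesis}. "
            f"Observations so far: {observations}. "
            "Propose three new inputs that would help confirm or refute the hypothesis."
        )

        # 2. Run the black-box target on those inputs and record the results.
        for x in probe_inputs:
            observations.append((x, target_function(x)))

        # 3. Ask the language model to update its explanation given the new evidence.
        hypothesis = query_language_model(
            f"Given these input/output pairs: {observations}, "
            "describe in one sentence what the function computes."
        )

    return hypothesis
```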
The FIND benchmark provides a controlled environment of functions that mirror computations inside trained networks, enabling standardized evaluation of AI interpretability methods. As MIT News explains, it lets researchers compare an AIA's explanations against detailed ground-truth descriptions included in the benchmark, giving a more nuanced picture of how well AI systems can approximate human deductive reasoning. The benchmark includes elements such as synthetic neurons that emulate real ones in language models, designed specifically to test an AIA's ability to discern and interpret varied computational behaviors.
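As a rough illustration of what such a test item might involve, the sketch below pairs a simple synthetic function with a ground-truth description that an agent's explanation could be scored against. The function, description, and judge here are hypothetical; the real benchmark defines its own function families and evaluation procedure.

```python
import math

# Hypothetical sketch of a FIND-style test item (illustrative only).

def synthetic_function(x: float) -> float:
    """Toy target: responds strongly to inputs near 3.0 and stays near zero elsewhere,
    loosely analogous to a 'synthetic neuron' with a selective activation pattern."""
    return math.exp(-((x - 3.0) ** 2))

GROUND_TRUTH_DESCRIPTION = (
    "A bump function centered at x = 3: outputs near 1 for inputs close to 3 "
    "and decays toward 0 as inputs move away."
)

def evaluate_explanation(agent_description: str, judge) -> float:
    """Score an agent's description against the ground truth.
    `judge` is a hypothetical scoring callable (e.g., another language model
    or a human rater) returning a similarity score in [0, 1]."""
    return judge(agent_description, GROUND_TRUTH_DESCRIPTION)
```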
Sarah Schwettmann, co-lead author of the study and an MIT CSAIL research scientist, highlighted the potential of this approach, telling MIT News, “The AIAs’ capacity for autonomous hypothesis generation and testing may be able to surface behaviors that would otherwise be difficult for scientists to detect. It’s remarkable that language models, when equipped with tools for probing other systems, are capable of this type of experimental design.”
Although AIAs represent a promising stride toward demystifying AI behavior, they still stumble on the finer details of certain functions. Co-lead author Tamar Rott Shaham, a postdoc at CSAIL, acknowledged this limitation, noting that guiding the AIAs' exploration with specific, relevant inputs can substantially improve interpretation accuracy. The team's ambitions don't end there: they are developing a toolkit that will let AIAs run even more precise tests on neural networks. The ultimate aim, according to the MIT researchers, is to build automated systems that can audit other AI systems, potentially contributing to safer deployment of the technology in critical applications such as autonomous driving and facial recognition.
The work, as reported by MIT News, was presented at the NeurIPS conference in December 2023, marking a significant milestone in the push for greater AI transparency and accountability. The research was backed by a range of sponsors, including the MIT-IBM Watson AI Lab and the U.S. National Science Foundation, reflecting broad support for advances in understanding AI systems.