
Robots are stepping up their game with a little help from smarter AI models, courtesy of MIT's Improbable AI Lab. The team, part of the Computer Science and Artificial Intelligence Laboratory (CSAIL), has introduced a framework that makes it easier, and more transparent, for robots to plan and execute complex tasks, whether in your kitchen or on the manufacturing floor. The framework, known as Compositional Foundation Models for Hierarchical Planning (HiP), combines three AI models trained separately on language, vision, and action data, a departure from previous approaches that relied heavily on paired data. The approach is more economical and could also streamline future robotics applications, according to an MIT news release.
Chores that humans might see as rudimentary, like fetching groceries or washing dishes, are a longer and more convoluted process for robots. They need explicit instructions for each step: pick up the first dirty dish, for example, or scrub that plate with a sponge. To help, HiP produces a detailed plan that draws on the expertise of each of its foundation models. According to MIT News, these models have been trained on vast swaths of data, much like OpenAI's GPT-4, the foundation model on which ChatGPT and Bing Chat were built.
"Foundation models do not have to be monolithic," said Jim Fan, an NVIDIA AI researcher not involved in the paper, in praise of the project. "This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent," Fan added. Robots armed with HiP could potentially take on anything from organizing your bookshelf to intricate construction tasks, like arranging materials in a specific sequence.
The CSAIL team tested HiP on robot manipulation tasks, where it often outperformed traditional frameworks. In one scenario, a robot had to adapt when some of the colored blocks required for a construction task were missing. It worked around the gap by using white blocks, painted specifically for the task, in place of the missing colored ones. These tests suggest that HiP can not only improvise but also coordinate its constituent models to complete tasks that require nuanced adjustments.
HiP plans through a hierarchical process. At the base, a large language model drafts an abstract plan, which a video diffusion model then refines by adding physical-world context. Lastly, an action model carries out the plan in real time. Anurag Ajay, a Ph.D. student at MIT and a CSAIL affiliate, summarizes the goal: "All we want to do is take existing pre-trained models and have them successfully interface with each other." The key is to combine models trained on different data sets so that each contributes what it knows best to the robot's decision-making.
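For readers who think in code, the three-stage handoff can be pictured with a toy Python sketch. Everything here is a hypothetical stand-in rather than HiP's actual code or API: the function names, the string-based "frames," and the rule-based stubs are assumptions made purely for illustration, whereas the real system uses learned models at every stage.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One subgoal with its imagined frames and the actions derived from them."""
    subgoal: str
    frames: List[str]
    actions: List[str]


def language_planner(task: str) -> List[str]:
    # Hypothetical stub for the language model: split a task into ordered subgoals.
    return [part.strip() for part in task.split(" then ") if part.strip()]


def video_planner(subgoal: str, observation: str) -> List[str]:
    # Hypothetical stub for the video model: "imagine" a start and end frame.
    return [observation, f"after '{subgoal}'"]


def action_planner(frames: List[str]) -> List[str]:
    # Hypothetical stub for the action model: one action per frame transition.
    return [f"move from [{a}] to [{b}]" for a, b in zip(frames, frames[1:])]


def plan(task: str, observation: str) -> List[Step]:
    """Chain the three stages: subgoals, then imagined frames, then actions."""
    steps = []
    for subgoal in language_planner(task):
        frames = video_planner(subgoal, observation)
        steps.append(Step(subgoal, frames, action_planner(frames)))
        observation = frames[-1]  # the next subgoal starts where this one ended
    return steps


if __name__ == "__main__":
    for step in plan("pick up the dirty dish then scrub it with a sponge",
                     "dish in sink"):
        print(step.subgoal, "->", step.actions)
```

The point of the sketch is the interface, not the stubs: each stage consumes only the previous stage's output, so any one model could be swapped out without retraining the others.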
These advances hint at a future in which robots become a more integral part of everyday life, taking on more complex tasks with greater independence. The work is supported by major funders, including the National Science Foundation and the U.S. Defense Advanced Research Projects Agency, a sign of institutional interest in this more collaborative, multimodal approach to AI and robotics. More than a theoretical framework, HiP is likely to move from laboratory tests to real-world scenarios, opening up new possibilities across industries.









