Published on March 21, 2025
MIT and NVIDIA Unveil Groundbreaking 'HART' AI Tool, Blending Speed and Quality in Image Generation

Source: Unsplash/Steve Johnson

In a significant leap forward for artificial intelligence technology, MIT and NVIDIA researchers have unveiled a new tool designed to quickly generate high-quality images by blending the strengths of two different types of AI systems. Fusing the intricacy of diffusion models with the speed of autoregressive models, the "HART" tool stands out for its ability to produce realistic visuals efficiently enough to run on conventional laptops or even smartphones.

Previously established image-generation systems had their respective strengths and weaknesses: diffusion models are known for their accuracy but lack speed, while autoregressive models are fast but error-prone. HART is designed to outpace both by quickly capturing the bigger picture and then refining the details, working about nine times faster than state-of-the-art diffusion models, according to a statement obtained by MIT News.

"If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART," Haotian Tang SM ’22, PhD ’25, co-lead author of the research, told MIT News. This approach uses an autoregressive model to render an image's basic layers before deploying a small diffusion model to perfect the details.

One of the critical aspects of HART is that it integrates the diffusion process as a finishing touch rather than as the primary method, preserving speed without sacrificing detail. "We can achieve a huge boost in terms of reconstruction quality," Tang told MIT News. HART's architecture pairs an autoregressive transformer model with 700 million parameters with a diffusion model of just 37 million parameters, achieving significant computational efficiency while maintaining high-quality output.
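To make the division of labor concrete, here is a minimal toy sketch of the two-stage idea described above. It is not HART's actual code: the function names, the 16-token "image," and the noise-halving loop are all illustrative stand-ins, chosen only to show a fast coarse pass followed by a small, cheap refinement pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_draft(n_tokens):
    # Toy stand-in for the large (700M-parameter) autoregressive
    # transformer: it sketches the "big picture" one discrete visual
    # token at a time, which is fast but can leave errors in detail.
    tokens = [rng.integers(0, 256) for _ in range(n_tokens)]
    return np.array(tokens, dtype=float)

def diffusion_refine(coarse, steps=8):
    # Toy stand-in for the small (37M-parameter) diffusion model: it
    # only has to clean up the residual detail on top of the coarse
    # draft, so it can be tiny and run for few steps.
    residual = rng.normal(0.0, 1.0, size=coarse.shape)
    for _ in range(steps):
        residual *= 0.5  # each step removes part of the remaining noise
    return coarse + residual

coarse = autoregressive_draft(16)  # quick big-picture pass
image = diffusion_refine(coarse)   # cheap "smaller brush strokes" pass
print(image.shape)
```

The design point the sketch illustrates is that because diffusion is only the finishing touch, the heavy lifting stays with the fast autoregressive pass, which is how the combined system can outrun a pure diffusion model.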

Beyond image generation for design and simulations, which has implications for training self-driving cars and aiding video game designers, the HART tool could be integrated with language models, opening up new opportunities in fields such as robotics and multimodal AI applications. With promising scalability and generalizability, the team at MIT and NVIDIA is also looking to extend HART's capabilities to video generation and audio prediction tasks. Funding for this research came from various sources, including the MIT-IBM Watson AI Lab and the National Science Foundation, with NVIDIA providing GPU infrastructure support.
