CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

01 Nov 2025 · 3 min read


AI-generated image, based on the article abstract

Quick Insight

How Smart Glasses Could Soon Answer All Your Everyday Questions

Imagine looking at a coffee shop through your smart glasses and instantly getting the menu, the Wi‑Fi password, or even a quick fact about the art on the wall. Scientists have created a new benchmark called CRAG‑MM that lets AI systems practice exactly this kind of real‑world chat. The test set includes more than 6,000 pictures taken from a wearer’s point of view, plus thousands of back‑and‑forth questions that mimic what you might actually ask while walking down the street. Think of it like a “training ground” where AI learns to pull the right info from the web or a knowledge graph, even when the photo is blurry or the topic is obscure. So far, even the best commercial tools only answer correctly about one‑third of the time, showing there’s huge room for growth. In a recent competition, clever teams boosted performance by nearly 30%, proving the challenge is both tough and exciting. This breakthrough could soon turn your wearable into a truly helpful companion, making everyday moments a little smarter and a lot more connected. Stay tuned—the future of on‑the‑go knowledge is just around the corner.


Short Review

Advancing Multi-Modal RAG for Wearable AI: A Deep Dive into CRAG-MM

This article introduces CRAG-MM, a groundbreaking benchmark designed to address the critical need for comprehensive evaluation in Multi-Modal Retrieval-Augmented Generation (MM-RAG) systems, particularly within the context of wearable devices. Recognizing the transformative potential of smart glasses and similar technologies for real-time information seeking, the authors developed CRAG-MM to simulate complex, multi-turn conversations based on egocentric visual data. The benchmark comprises a diverse dataset featuring 6.5K image-question-answer triplets and 2K visual multi-turn conversations across 13 domains, including 6.2K egocentric images that mimic wearable captures. It meticulously incorporates real-world challenges such as varying image quality, diverse question types, entity popularity, and information dynamism. Through three distinct tasks—single-source augmentation, multi-source augmentation, and multi-turn conversations—paired with dedicated retrieval corpora and APIs, CRAG-MM provides a robust framework. Initial evaluations reveal that current RAG approaches, including state-of-the-art industry solutions, achieve limited truthfulness, underscoring significant room for improvement in this vital field.
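To make the dataset's structure concrete, the sketch below models a single CRAG-MM-style example in Python. The class and field names are illustrative assumptions derived from the dimensions described above (egocentric capture, domain, image quality, dynamism, conversational turns), not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

# All names below are illustrative assumptions, not the official CRAG-MM schema.

class ImageQuality(Enum):
    NORMAL = "normal"
    BLURRY = "blurry"        # e.g., motion blur from a head-mounted camera
    LOW_LIGHT = "low-light"

class Dynamism(Enum):
    STATIC = "static"        # answer rarely changes (e.g., who painted a mural)
    DYNAMIC = "dynamic"      # answer drifts over time (e.g., today's menu)

@dataclass
class Turn:
    question: str
    answer: str              # ground-truth reference answer

@dataclass
class CragMMExample:
    image_path: str          # first-person capture mimicking a wearable device
    domain: str              # one of the 13 domains, e.g., "food" or "shopping"
    image_quality: ImageQuality
    dynamism: Dynamism
    turns: List[Turn] = field(default_factory=list)  # one turn or a conversation

# A hypothetical single-turn instance:
example = CragMMExample(
    image_path="images/cafe_0001.jpg",
    domain="food",
    image_quality=ImageQuality.BLURRY,
    dynamism=Dynamism.DYNAMIC,
    turns=[Turn(question="What pastries does this cafe sell?",
                answer="Croissants, muffins, and seasonal fruit tarts.")],
)
```

A multi-turn conversation would simply carry several Turn objects whose later questions depend on earlier answers, which is what makes conversational context tracking part of the evaluation.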

Critical Evaluation of the CRAG-MM Benchmark

Strengths of the CRAG-MM Benchmark

The CRAG-MM benchmark stands out for its exceptional comprehensiveness and real-world relevance, filling a significant gap in MM-RAG research. Its focus on wearable AI scenarios and egocentric images directly addresses emerging technological needs, providing a practical foundation for future development. The dataset's diversity, encompassing various image quality issues, question types, and conversational turns, ensures a rigorous and fair evaluation of MM-LLMs. Furthermore, the inclusion of both Knowledge Graph and web-sourced retrieval content, alongside distinct tasks for single-source, multi-source, and multi-turn QA, offers a multifaceted assessment. The benchmark's early impact, evidenced by its role in KDD Cup 2025 and the subsequent performance improvements by winning solutions, highlights its immediate value to the scientific community.
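As a rough illustration of how the single-source and multi-source tasks might differ in practice, the sketch below combines a hypothetical image-KG lookup with a hypothetical web-search call. All function names here (`kg_lookup_by_image`, `web_search`, `generate`) are placeholders standing in for whatever mock retrieval APIs the benchmark actually ships, not its real interface.

```python
from typing import List

def kg_lookup_by_image(image_path: str) -> List[str]:
    """Placeholder for the benchmark's image-based knowledge-graph API."""
    raise NotImplementedError

def web_search(query: str, k: int = 5) -> List[str]:
    """Placeholder for the benchmark's webpage-search API (top-k snippets)."""
    raise NotImplementedError

def generate(image_path: str, prompt: str) -> str:
    """Placeholder for an MM-LLM call conditioned on the image and prompt."""
    raise NotImplementedError

def answer(image_path: str, question: str, multi_source: bool) -> str:
    # Single-source augmentation: only image-KG evidence is available.
    evidence = kg_lookup_by_image(image_path)
    if multi_source:
        # Multi-source augmentation: web snippets join the evidence pool,
        # and the generator must reconcile the two sources.
        evidence += web_search(question)
    prompt = f"Question: {question}\nEvidence:\n" + "\n".join(evidence)
    return generate(image_path, prompt)
```

A multi-turn variant would additionally thread prior turns into the prompt so the generator can resolve follow-up references such as "How much does it cost?".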

Weaknesses and Challenges Revealed by CRAG-MM

While CRAG-MM itself is a strong contribution, the initial findings it reports expose significant weaknesses in current MM-RAG capabilities. The low truthfulness scores (32-45%) for both straightforward baselines and state-of-the-art solutions reveal that existing models struggle considerably with the complexities inherent in multi-modal, multi-turn conversations. Specific challenges include high hallucination rates, sensitivity to image-quality degradation, unreliable entity recognition, and weak reasoning across conversational turns. These limitations suggest that current MM-RAG systems are not yet equipped to reliably handle the nuanced information-retrieval demands of real-world wearable applications.
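For context on the 32-45% figure: the original text-only CRAG benchmark scored truthfulness by rewarding correct answers (+1), tolerating honest abstentions (0), and penalizing hallucinations (-1). Assuming CRAG-MM follows a similar convention, which the article does not spell out, a minimal scorer looks like this.

```python
from typing import List, Literal

Judgment = Literal["correct", "missing", "hallucination"]

def truthfulness_score(judgments: List[Judgment]) -> float:
    """Average per-answer score: +1 correct, 0 abstained, -1 hallucinated.

    Mirrors the original text-only CRAG scoring; CRAG-MM's exact
    rubric may differ, so treat this as an illustrative sketch.
    """
    points = {"correct": 1.0, "missing": 0.0, "hallucination": -1.0}
    return sum(points[j] for j in judgments) / len(judgments)

# Hallucinations are costly under this scheme: answering 60% correctly but
# hallucinating the remaining 40% scores 0.60 - 0.40 = 0.20, worse than
# answering 45% correctly and abstaining on the rest (0.45).
print(truthfulness_score(["correct", "correct", "missing", "hallucination"]))  # 0.25
```

Under such a metric, knowing when to say "I don't know" matters as much as answering correctly, which is why hallucination mitigation features so prominently in the identified challenges.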

Implications for Future MM-RAG Research

CRAG-MM's introduction has profound implications for the future of MM-RAG research. By clearly delineating the current performance ceiling and highlighting specific areas of struggle, the benchmark provides a clear roadmap for innovation. It will undoubtedly serve as a crucial tool for researchers and developers aiming to build more robust, truthful, and context-aware MM-RAG systems. The benchmark's design encourages the development of novel approaches that can better integrate visual and textual information, manage conversational context, and mitigate issues like hallucinations, ultimately accelerating progress towards more intelligent and reliable wearable AI applications.

Conclusion

The CRAG-MM benchmark represents a pivotal contribution to the field of Multi-Modal Retrieval-Augmented Generation. By providing a meticulously designed, comprehensive, and challenging evaluation framework tailored for wearable device scenarios, it not only exposes the current limitations of MM-RAG systems but also galvanizes the research community to innovate. Its early success in fostering competition and driving performance improvements underscores its immediate and lasting value, setting a new standard for advancing multi-modal AI in practical, real-world applications.

Keywords

  • wearable smart glasses interaction
  • multi-modal retrieval-augmented generation (MM-RAG)
  • CRAG-MM benchmark for egocentric vision
  • multi-turn visual question answering
  • image-quality degradation types
  • entity popularity bias in visual QA
  • image-knowledge graph retrieval
  • webpage retrieval for multimodal RAG
  • single-source vs multi-source augmentation
  • truthfulness evaluation metrics for RAG
  • KDD Cup 2025 multimodal QA challenge
  • egocentric image dataset for wearables
  • multi-domain visual conversation dataset

Read the comprehensive article review on Paperium.net: CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

🤖 This analysis and review was primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.
