MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

22 Oct 2025 · 3 min read

AI-generated image, based on the article abstract

Quick Insight

MultiVerse: The New Test That Makes AI See and Talk Like Us

Ever imagined chatting with a robot that can not only talk but also *see* the world around you? Scientists have introduced MultiVerse, a fresh benchmark that puts vision‑and‑language models through real‑life, multi‑turn conversations. Think of it as a friendly quiz where the AI must work through four‑step dialogues about everything from recalling simple facts to solving math puzzles and even writing code. With 647 mini‑conversations drawn from 12 popular tests, the dataset covers 484 different tasks, giving the models a true “talk‑and‑look” challenge. This benchmark matters because it pushes AI closer to the way we naturally interact: showing a picture, asking a follow‑up question, and getting a clear answer. It’s like teaching a child to describe a photo while answering a story‑time question. Early results show even the smartest systems hit only about a 50% success rate, highlighting how much room there is to grow. Understanding and improving this ability will make future assistants more helpful in homes, schools, and workplaces, turning sci‑fi dreams into everyday reality. 🌟


Short Review

Unveiling MultiVerse: A New Benchmark for Multi-Turn VLM Conversations

This research introduces MultiVerse, a novel and comprehensive multi-turn conversation benchmark designed to rigorously evaluate Vision-and-Language Models (VLMs). Addressing the limitations of existing single-turn datasets, MultiVerse comprises 647 diverse dialogues, 484 tasks, and 25 image domains, covering a wide spectrum from factual knowledge to advanced reasoning like mathematics and coding. The study employs a sophisticated checklist-based evaluation method, leveraging GPT-4o to assess VLM performance across 37 key aspects, including perceptual accuracy and linguistic clarity. Key findings reveal that even the most advanced VLMs, such as GPT-4o, achieve only a 50% success rate in these complex interactions, highlighting the benchmark's challenging nature. Crucially, the research demonstrates that providing full dialogue context significantly enhances performance, underscoring the vital role of in-context learning for VLM development.
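
To make the checklist idea concrete, below is a minimal sketch of how a GPT-4o checklist judge might be implemented. The prompt wording, the example checklist items, and the pass-rate scoring are illustrative assumptions drawn from the aspects mentioned in this review (perceptual accuracy, factual correctness, linguistic clarity), not the benchmark's released evaluation code.

```python
# Minimal sketch of a checklist-based judge in the spirit of MultiVerse's
# GPT-4o evaluation. Checklist items, prompt wording, and scoring are
# illustrative assumptions, not the benchmark's released protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_turn(dialogue_so_far: str, model_answer: str, checklist: list[str]) -> float:
    """Ask GPT-4o to verify each checklist item and return the fraction passed."""
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(checklist))
    prompt = (
        "You are grading a vision-language model's reply in a multi-turn dialogue.\n\n"
        f"Dialogue so far:\n{dialogue_so_far}\n\n"
        f"Model reply:\n{model_answer}\n\n"
        "For each checklist item below, answer YES or NO, one per line:\n"
        f"{items}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdicts = resp.choices[0].message.content.strip().splitlines()
    passed = sum("YES" in v.upper() for v in verdicts[: len(checklist)])
    return passed / len(checklist)


# Hypothetical checklist covering a few of the 37 aspects the paper evaluates.
example_checklist = [
    "The reply correctly describes the visual content referenced in the question.",
    "The reply is factually correct.",
    "The reply is clear and grammatically well formed.",
]
```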

Evaluating MultiVerse: A Critical Perspective on VLM Interaction

Strengths

This research introduces a comprehensive benchmark, MultiVerse, which significantly advances the evaluation of Vision-and-Language Models in complex multi-turn interactions. The dataset's construction is notably robust, involving meticulous image collection, GPT-4o-driven dialogue generation, and rigorous manual review for naturalness and correctness. Furthermore, the checklist-based evaluation, leveraging GPT-4o across 37 key aspects, provides a sophisticated and granular assessment of VLM performance, addressing a critical gap in existing benchmarks. The findings compellingly highlight the importance of in-context learning and dialogue history for improving VLM reasoning capabilities.
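
As a rough illustration of the "full dialogue context" condition discussed above, the sketch below resends the image together with all earlier question-answer turns each time a new question is asked, so the model can exploit in-context information. The message layout follows the OpenAI chat API; the function name, placeholders, and the choice of attaching the image only to the first turn are assumptions, not the paper's code.

```python
# Sketch of querying a VLM with the full dialogue history, assuming the
# OpenAI chat API message format. All names and placeholders are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_with_context(image_url: str, history: list[tuple[str, str]], question: str) -> str:
    """history holds (question, answer) pairs from the earlier turns of the dialogue."""
    messages = []
    for i, (q, a) in enumerate(history):
        user_content = [{"type": "text", "text": q}]
        if i == 0:  # attach the image to the first user turn only
            user_content.append({"type": "image_url", "image_url": {"url": image_url}})
        messages.append({"role": "user", "content": user_content})
        messages.append({"role": "assistant", "content": a})

    # The new question, conditioned on the full history built above.
    new_content = [{"type": "text", "text": question}]
    if not history:
        new_content.append({"type": "image_url", "image_url": {"url": image_url}})
    messages.append({"role": "user", "content": new_content})

    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content
```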

Weaknesses

A primary concern lies in the dual reliance on GPT-4o for both generating the multi-turn conversations and acting as the automated evaluator. This dependency could introduce a form of circular bias, where the evaluation criteria may inadvertently favor responses aligned with GPT-4o's own linguistic and reasoning patterns. While the manual review process is thorough, it is resource-intensive and may still carry subjective biases despite efforts to mitigate them. Additionally, the roughly 50% success rate of even the strongest models, while demonstrating the benchmark's difficulty, also underscores the current limitations of VLMs and raises the question of whether the benchmark sits well beyond what today's systems can deliver in practical applications.

Implications

MultiVerse sets a new, higher standard for evaluating Vision-and-Language Models, pushing the boundaries of what is considered effective multi-turn interaction. Its challenging nature and detailed evaluation methodology will undoubtedly guide future VLM development, particularly in enhancing contextual understanding, advanced reasoning, and the effective utilization of dialogue history. The findings underscore the urgent need for models that can robustly handle complex, real-world conversational scenarios, thereby accelerating progress towards more capable and human-like AI systems.

MultiVerse's Impact on Advancing VLM Capabilities

MultiVerse represents a significant contribution to the field of Vision-and-Language Models, providing an indispensable tool for assessing and advancing their multi-turn interaction abilities. By exposing the current limitations of even state-of-the-art models, this benchmark offers a clear roadmap for future research and development. Its emphasis on contextual understanding and in-context learning is crucial for building more intelligent and adaptable VLMs capable of navigating the complexities of real-world applications. This work is poised to accelerate progress towards more sophisticated and human-like conversational AI.

Keywords

  • Vision-and-Language Models (VLMs)
  • multi-turn dialogue systems
  • MultiVerse benchmark dataset
  • VLM evaluation methodologies
  • advanced reasoning VLMs
  • in-context learning for VLMs
  • conversational AI benchmarks
  • perceptual accuracy VLM
  • factual correctness multi-turn
  • dialogue context understanding
  • challenging VLM datasets
  • linguistic clarity assessment
  • real-world VLM applications
  • multi-modal conversational AI
  • GPT-4o VLM performance

🤖 This analysis and review were primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.
