Short Review
Evaluating Large Multimodal Models in Scientific Reasoning
This article introduces PRISMM-Bench, a benchmark designed to rigorously evaluate Large Multimodal Models (LMMs) on multimodal scientific reasoning. The core objective is to assess how reliably LMMs can detect and resolve inconsistencies across text, figures, tables, and equations within scientific papers. The benchmark is built from a curated dataset of 262 real reviewer-flagged inconsistencies, assembled through a multi-stage pipeline of LLM-assisted filtering and human verification. The key finding is strikingly low performance from leading LMMs, ranging from 26.1% to 54.2%, underscoring significant gaps in their ability to understand and reason over complex scientific content.
Critical Assessment of PRISMM-Bench and LMM Performance
Robust Methodology and Real-World Relevance
A key strength of this work is its grounding in real-world inconsistencies sourced directly from peer reviews. Unlike previous benchmarks that rely on synthetic errors or isolate single modalities, PRISMM-Bench captures the subtle, domain-specific challenges inherent in scientific communication. The multi-stage curation process, combining LLM assistance with human verification, supports the dataset's quality and relevance. Furthermore, the JSON-based debiasing of multiple-choice questions mitigates linguistic shortcuts, measuring LMMs' actual reasoning rather than their ability to exploit answer patterns (a minimal sketch of the idea follows below). The evaluation spans 21 leading LMMs, three tasks (identification, remedy, pair matching), and three levels of contextual granularity (Focused, Page, Document), offering a holistic view of model performance.
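To make the debiasing idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each answer option is rendered as a uniformly structured JSON record instead of a fluent sentence, so surface cues such as option length or phrasing no longer correlate with correctness; the field names and helper function below are hypothetical, not taken from PRISMM-Bench.

```python
import json

# Hypothetical illustration of JSON-based answer debiasing: every
# multiple-choice option is serialized with the same schema and key
# order, so a model must compare content rather than writing style.
# The field names ("claim", "location", "modality") are invented here.

options = [
    {"claim": "reported accuracy differs between table and text",
     "location": "Table 2 vs. Section 5",
     "modality": "table/text"},
    {"claim": "axis label contradicts the figure caption",
     "location": "Figure 3",
     "modality": "figure/text"},
]

def render_debiased_options(options):
    """Serialize each choice as a compact JSON string with sorted keys,
    giving all options identical structure and register."""
    return [json.dumps(opt, sort_keys=True) for opt in options]

for letter, rendered in zip("AB", render_debiased_options(options)):
    print(f"{letter}. {rendered}")
```

The design point is that uniform structure strips away the stylistic signal models otherwise latch onto, consistent with the article's observation that LMMs exploit natural-language shortcuts when left unconstrained.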
Current Limitations and Future Challenges for LMMs
Despite the robust evaluation framework, the study exposes substantial weaknesses in current LMM capabilities. The low ceiling reached by even the most advanced proprietary models (54.2%) indicates that LMMs are far from reliably understanding and reasoning over multimodal scientific content. The research also shows that LMMs exploit linguistic biases and natural-language shortcuts when not constrained to structured JSON outputs, suggesting a reliance on superficial cues over deep multimodal grounding. Moreover, performance degrades as contextual granularity expands from focused snippets to full pages and documents, pointing to difficulty in scaling reasoning to document-level understanding. Together, these results suggest that scaling alone does not yield robust scientific reasoning, especially for intricate, cross-modal inconsistencies.
Advancing Trustworthy AI in Scientific Research
This article makes a valuable contribution by exposing the current limitations of LMMs in handling the nuanced, multimodal inconsistencies prevalent in scientific literature. PRISMM-Bench establishes a benchmark for future research and a concrete standard for evaluating LMMs' scientific reasoning. The findings motivate the development of more capable and trustworthy scientific assistants that can genuinely support researchers, and the work offers a clear reference point for anyone working at the intersection of AI and scientific discovery.