Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun

24 Oct 2025

Quick Insight

Detective AI Solves Video Mysteries with Real‑World Clues

Ever wondered how a computer could watch a short film and piece together the story like a seasoned detective? Conan is a new AI system that does exactly that: it watches video frames, picks out the crucial clues, and reasons step by step to reach the right answer. Imagine watching a mystery movie and pausing only at the moments that matter, just as a sleuth would examine fingerprints or a hidden note. The researchers built a library of 91,000 example puzzles so the AI could learn when to keep searching for evidence and when to call the case solved. This "identify, reason, act" approach lets the system avoid wild guesses and stay grounded in what it actually sees. The result? Conan outperforms previous models by more than 10% on tough video-reasoning tests and carries its skills over to longer videos. The work shows how machines can reason a little more like humans, turning raw footage into conclusions that can be traced back to specific frames. The future may bring AI assistants that help us untangle complex visual information in everyday life, one frame at a time. 🌟

Short Review

Overview

The article presents Conan, a framework for evidence-grounded multi-step video reasoning, a task that remains difficult for multimodal large language models (MLLMs). Conan combines three skills: identifying the relevant frames, reasoning over the evidence they contain, and deciding whether to keep searching or to answer. It is trained on a newly constructed dataset, Conan-91K, with an Identification-Reasoning-Action (AIR) framework based on Reinforcement Learning with Verifiable Rewards (RLVR). The reported results show accuracy gains of over 10% on multi-step video reasoning benchmarks, together with robust generalization to long-video understanding tasks.
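The identify, reason, and act cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the authors' implementation: the function `identify_reason_act`, the `score_frame` and `decide` callbacks, the uniform sampling of unseen frames, and every default parameter are hypothetical stand-ins for what the trained model does internally.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    evidence: List[int]  # frame indices judged relevant so far
    action: str          # "explore" (fetch more frames) or "answer"


def identify_reason_act(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical relevance scorer
    decide: Callable[[List[int]], str],   # hypothetical reasoning/action policy
    frames_per_round: int = 4,
    max_rounds: int = 3,
    threshold: float = 0.5,
) -> List[Step]:
    """Toy loop: sample frames, keep relevant ones, then explore or answer."""
    seen = set()
    evidence: List[int] = []
    trace: List[Step] = []
    for _ in range(max_rounds):
        # Identify: spread a small batch over the frames not yet inspected.
        unseen = [i for i in range(num_frames) if i not in seen]
        if not unseen:
            break
        stride = max(1, len(unseen) // frames_per_round)
        batch = unseen[::stride][:frames_per_round]
        seen.update(batch)
        evidence.extend(i for i in batch if score_frame(i) >= threshold)
        # Reason and act: the policy looks at the evidence and picks an action.
        action = decide(evidence)
        trace.append(Step(sorted(evidence), action))
        if action == "answer":
            break
    return trace


if __name__ == "__main__":
    clues = {5, 9}  # assumed ground-truth clue frames, for the demo only
    trace = identify_reason_act(
        16,
        score_frame=lambda i: 1.0 if i in clues else 0.0,
        decide=lambda ev: "answer" if len(ev) >= 2 else "explore",
    )
    for step in trace:
        print(step)
```

In this toy run the first round finds no clues and chooses to explore; the second round samples different frames, finds both clues, and stops. The point of the design is that answering is an explicit action the model takes only once the collected evidence supports it.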

Critical Evaluation

Strengths

One of the primary strengths of Conan is that it treats multi-step reasoning as a whole rather than as isolated skills. Training on the Conan-91K dataset, which is built specifically around evidence reasoning, improves the model's ability to identify relevant frames and use them. The progressive cold-start strategy, combined with the AIR RLVR framework, introduces these skills in stages instead of all at once, which suits the complexity of video data. The experimental results are compelling: Conan is reported to outperform established models, including GPT-4o, in both accuracy and reasoning quality.
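"Verifiable" in RLVR means the reward is computed by a deterministic check against known ground truth rather than by a learned judge. The summary does not say which reward terms Conan uses, so the sketch below is a generic illustration only: the `<answer>` tag format, the `verifiable_reward` function, the frame-grounding term, and the 0.5 weight are all assumptions made for the example.

```python
import re
from typing import Set

# Hypothetical output format: the final choice is wrapped in <answer> tags.
ANSWER_PATTERN = re.compile(r"<answer>\s*([A-D])\s*</answer>")


def verifiable_reward(
    response: str,
    gold_choice: str,
    gold_frames: Set[int],
    cited_frames: Set[int],
    grounding_weight: float = 0.5,  # assumed weight, not from the article
) -> float:
    """Rule-based reward: answer correctness plus precision of cited frames."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        # Malformed output cannot be checked, so it earns nothing.
        return 0.0
    accuracy = 1.0 if match.group(1) == gold_choice else 0.0
    # Fraction of the cited frames that really are annotated evidence frames.
    if cited_frames:
        grounding = len(cited_frames & gold_frames) / len(cited_frames)
    else:
        grounding = 0.0
    return accuracy + grounding_weight * grounding


if __name__ == "__main__":
    print(verifiable_reward("<answer>B</answer>", "B", {3, 7}, {3, 8}))
```

Because every term can be recomputed from the response and the annotations, a reward of this kind leaves no room for the policy to be rewarded for a persuasive but unsupported answer.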

Weaknesses

Despite these advances, the article says little about the limitations of the Conan framework. The model's strength in evidence-grounded reasoning depends on an automatically generated dataset, and automated generation may introduce biases or inaccuracies that affect performance in real-world applications. The training process also has several stages, which may make the work harder to replicate and scale in other settings. Both points warrant further exploration.

Implications

The implications of Conan's findings are significant for the field of multimodal reasoning. By demonstrating enhanced performance in video reasoning tasks, this framework could pave the way for more sophisticated applications in areas such as automated video analysis, surveillance, and interactive media. The ability to generalize effectively to long-video understanding tasks also suggests potential for broader applicability across various domains.

Conclusion

In summary, the article makes a noteworthy contribution to video reasoning with the Conan framework. Its training methodology and reported performance point to a practical way of improving multimodal large language models. Open questions remain, particularly about the robustness of the training data and the scalability of the model, but Conan gives future research on evidence-grounded reasoning a strong reference point.