DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

24 Oct 2025 · 3 min read


AI-generated image, based on the article abstract

Quick Insight

DeepWideSearch: Why Search Bots Still Miss the Mark

Ever wondered why your digital assistant sometimes gives you a vague answer instead of the exact detail you need? DeepWideSearch shines a light on this mystery. Researchers built a new test where a smart agent must both “dig deep” into complex facts and “scan wide” across tons of information—think of a detective who has to read dozens of books while solving a tricky case. The test includes 220 real‑world questions from 15 different fields, from market trends to everyday curiosities. Even the most advanced bots managed to answer correctly only about 2% of the time, revealing a huge challenge in combining deep reasoning with broad searching. The study also uncovered four common slip‑ups: not pausing to reflect, relying too much on what they already “know,” missing key sources, and getting overwhelmed by too much context. This breakthrough shows we still have a long road ahead before AI can truly think like a human researcher. As we keep improving these agents, the day may come when a simple chat can give you a full, accurate picture in an instant. 🌟


Short Review

Overview

The article introduces DeepWideSearch, a pioneering benchmark designed to evaluate the capabilities of information-seeking agents in performing both deep reasoning and wide-scale information collection. This benchmark addresses a critical gap in current agent architectures, particularly in real-world applications such as market analysis and business development. Through the development of two innovative methods, Deep2Wide and Wide2Deep, the authors curated a dataset comprising 220 questions across 15 diverse domains. Experimental results reveal that even state-of-the-art agents achieve a mere 2.39% average success rate, underscoring significant challenges in integrating depth and width in information-seeking tasks.

Critical Evaluation

Strengths

One of the primary strengths of this study is the introduction of a comprehensive benchmark that effectively combines depth and width in information retrieval. The use of two distinct methods for dataset construction enhances the robustness of the evaluation, allowing for a nuanced assessment of agent performance. Additionally, the incorporation of new evaluation metrics, such as Column-F1 and Core Entity Accuracy, provides a more detailed understanding of agent capabilities, particularly in complex tasks.
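To make the metric concrete, here is a minimal sketch of how a column-level F1 score might be computed for a wide-search task, assuming the agent returns a set of table column names that are compared against a gold set by exact string match. The function name `column_f1` and the matching rule are illustrative assumptions; the benchmark's actual definition of Column-F1 may differ (e.g., fuzzy matching or per-cell scoring).

```python
def column_f1(predicted_cols, gold_cols):
    """Illustrative column-level F1: precision/recall over column names.

    Assumes exact-match comparison of column names; this is a sketch,
    not the benchmark's official implementation.
    """
    pred, gold = set(predicted_cols), set(gold_cols)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)              # columns the agent got right
    precision = tp / len(pred)         # fraction of predicted columns that are correct
    recall = tp / len(gold)            # fraction of gold columns that were found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: two of three predicted columns match the gold schema.
score = column_f1(["company", "revenue", "ceo"],
                  ["company", "revenue", "founded"])
print(round(score, 4))  # → 0.6667
```

Set-based F1 like this rewards agents for recovering the full breadth of a requested table, which is exactly the "width" dimension the benchmark stresses.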

Weaknesses

Despite its strengths, the study has notable weaknesses. The low success rate of 2.39% indicates that current agents struggle significantly with the benchmark's demands, revealing a potential overreliance on internal knowledge and insufficient retrieval capabilities. Furthermore, the high computational costs associated with the evaluation process may limit accessibility for broader research applications. The identified failure modes, including lack of reflection and context overflow, suggest that existing architectures may require substantial redesign to meet the benchmark's challenges.

Implications

The implications of this research are profound, as it sets a new standard for evaluating information-seeking agents. By publicly releasing the DeepWideSearch benchmark, the authors aim to catalyze future research focused on developing more capable and robust agents. This could lead to significant advancements in various fields, including artificial intelligence and data science, where effective information retrieval is crucial.

Conclusion

In summary, the article presents a valuable contribution to the field of information retrieval through the introduction of DeepWideSearch. By highlighting the limitations of current agent architectures and proposing a rigorous evaluation framework, it paves the way for future innovations in information-seeking technologies. The findings underscore the need for ongoing research to enhance agent performance, ultimately aiming for more effective solutions in real-world applications.

Keywords

  • DeepWideSearch
  • multi-hop retrieval
  • deep reasoning in search agents
  • wide-scale information collection
  • market analysis tools
  • business development research
  • information-seeking tasks
  • agent architecture limitations
  • error analysis in AI
  • data processing challenges
  • curated question datasets
  • performance evaluation benchmarks
  • search agent success rates
  • context overflow in AI
  • retrieval failure modes

🤖 This analysis and review was primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.

Paperium AI Analysis & Review of Latest Scientific Research Articles
