Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong

20 Oct 2025

Quick Insight

Explore to Evolve: How Smart Web Agents Learn to Gather Knowledge

Ever wondered how a computer could become a better “research detective”? Researchers have developed a new way for web agents not only to hunt for facts but also to stitch them together into clear answers. First, the agent roams the internet like an explorer, picking up reliable clues from websites, files, and even images. Then, using those clues, it builds its own “recipe” for combining information, choosing from a toolbox of simple logical steps, to create a trustworthy question-and-answer pair. Think of it as a chef who gathers fresh ingredients from the market and then invents a new dish by mixing them in just the right way. This “Explore to Evolve” method let the team generate a large collection of real-world examples and train agents that match the performance of top-tier AI models. The result is a new generation of assistants that can truly aggregate information, turning scattered data into clear insights, something even the biggest AI systems still struggle with. Imagine a future where your digital helper can read, compare, and summarize everything you need, just like a seasoned researcher, making everyday decisions easier and better informed. That’s the power of smarter web agents.

Short Review

Advancing Deep Research Web Agents Through Enhanced Information Aggregation

Current deep research web agents often prioritize information seeking and neglect the robust information aggregation that in-depth analysis requires, which limits their ability to synthesize knowledge from diverse sources. To address this, the authors introduce the "Explore to Evolve" paradigm for scalable construction of verifiable training data. An agent first explores the live web proactively to gather evidence, then self-evolves an aggregation program by composing operations drawn from 12 high-level logical types, which yields a verifiable question-answer (QA) pair. This pipeline produced WebAggregatorQA, a dataset of 10,000 samples across 50,000 websites. The WebAggregator foundation models were then developed: the 8B variant matches GPT-4.1, and the 32B variant surpasses GPT-4.1 by over 10% on GAIA-text and closely approaches Claude-3.7-sonnet. On a human-annotated evaluation split of WebAggregatorQA, even leading models such as Claude-3.7-sonnet (28%) and GPT-4.1 (25.8%) struggle, which points to a significant bottleneck in current agent capabilities.
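To make "composing operations into an aggregation program" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the review does not list the 12 logical types, so the operations `filter_by`, `top_k`, and `total` are hypothetical stand-ins, and the evidence values and `example.org` URLs are toy data.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Evidence:
    """One fact collected during exploration (toy structure, assumed)."""
    source_url: str  # where the fact was found
    entity: str      # what the fact is about
    value: float     # the numeric value extracted from the page


# Hypothetical aggregation operations; the real 12 types are not named here.
def filter_by(pred: Callable[[Evidence], bool]) -> Callable[[List[Evidence]], List[Evidence]]:
    return lambda items: [e for e in items if pred(e)]


def top_k(k: int) -> Callable[[List[Evidence]], List[Evidence]]:
    return lambda items: sorted(items, key=lambda e: e.value, reverse=True)[:k]


def total(items: List[Evidence]) -> float:
    return sum(e.value for e in items)


def compose(*steps: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Chain operations left to right into one aggregation program."""
    def program(items: Any) -> Any:
        result = items
        for step in steps:
            result = step(result)
        return result
    return program


def build_qa(question: str, evidence: List[Evidence], program: Callable[[Any], Any]) -> dict:
    """Run the program on the evidence; the answer can be re-derived later."""
    return {
        "question": question,
        "answer": program(evidence),
        "sources": sorted({e.source_url for e in evidence}),
    }


# Toy evidence, invented for illustration only.
evidence = [
    Evidence("https://example.org/a", "A", 12.0),
    Evidence("https://example.org/b", "B", 30.0),
    Evidence("https://example.org/c", "C", 7.0),
    Evidence("https://example.org/d", "D", 21.0),
]
program = compose(filter_by(lambda e: e.value >= 10), top_k(2), total)
qa = build_qa("What is the combined value of the two largest entities with value >= 10?",
              evidence, program)
print(qa["answer"])  # 51.0
```

In this sketch the answer counts as verifiable in a narrow sense: it is the deterministic output of a stored program over stored evidence, so re-running the program on the same evidence reproduces it.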

Critical Evaluation of WebAggregator's Impact

Strengths

The "Explore to Evolve" paradigm is a significant strength: it offers a scalable data generation method for complex information aggregation tasks and directly addresses data scarcity. WebAggregatorQA is a major contribution in its own right, providing a challenging, human-annotated evaluation split that explicitly targets the aggregation bottleneck that existing evaluations often overlook. The detailed methodology for generating diverse QA pairs, including quality control, supports the dataset's utility. The reported results, with the 8B model matching GPT-4.1 and the 32B model surpassing it on GAIA-text, support the approach's effectiveness in improving agent capabilities for complex synthesis.

Weaknesses

Despite these advances, the research underscores how difficult complex aggregation remains. Even when agents successfully retrieve all necessary references, and even with the improvements shown by the WebAggregator models, they still struggle considerably on the WebAggregatorQA benchmark. The low scores of leading models such as Claude-3.7-sonnet and GPT-4.1 indicate that a substantial gap remains before information aggregation is truly robust. The reliance on 12 predefined logical types may also limit generalization to tasks that require more nuanced or novel aggregation strategies. Furthermore, the initial agent used for trajectory sampling is based on GPT-4.1, which introduces a dependency on that foundation model and could narrow the range of aggregation behaviors explored.
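The claim that agents struggle even after retrieving every reference implies separating retrieval failures from aggregation failures. The sketch below shows one way that split could be computed. It is an assumption about how such an analysis might look, not the paper's evaluation code: `EvalRecord`, its field names, and the four toy records are invented for illustration and are not results from the study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EvalRecord:
    """Outcome of one benchmark task (hypothetical schema)."""
    task_id: str
    retrieved_all_refs: bool  # did the agent reach every needed reference?
    answer_correct: bool      # was the final aggregated answer right?


def accuracy(records: List[EvalRecord]) -> float:
    """Fraction of correct answers; 0.0 for an empty list."""
    if not records:
        return 0.0
    return sum(r.answer_correct for r in records) / len(records)


def split_by_retrieval(records: List[EvalRecord]) -> Dict[str, float]:
    """Report overall accuracy and accuracy restricted to full-retrieval runs."""
    full = [r for r in records if r.retrieved_all_refs]
    return {
        "overall": accuracy(records),
        "given_full_retrieval": accuracy(full),
        # Wrong answers despite complete retrieval point at aggregation itself.
        "aggregation_failures": sum(1 for r in full if not r.answer_correct),
    }


# Toy records, invented for illustration only.
toy_records = [
    EvalRecord("t1", True, True),
    EvalRecord("t2", True, False),
    EvalRecord("t3", True, False),
    EvalRecord("t4", False, False),
]
report = split_by_retrieval(toy_records)
print(report)
```

If accuracy stays low within the full-retrieval subset, the shortfall cannot be attributed to missing evidence, which is the reading the review gives of the benchmark results.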

Conclusion

This research makes an important contribution by directly confronting the challenge of information aggregation in deep research web agents. The "Explore to Evolve" paradigm and the WebAggregatorQA dataset provide a scalable framework for both training agents and rigorously evaluating them on complex synthesis tasks. By demonstrating that even state-of-the-art models falter on these tasks, the study shows that retrieval alone is insufficient. The WebAggregator models offer a promising path forward for improving how web agents synthesize knowledge. The work supplies a valuable benchmark for future research and clearly identifies information aggregation as a key area for continued focus in developing capable, autonomous research assistants.