ComProScanner: A multi-agent based framework for composition-property structured data extraction from scientific literature

Aritra Roy, Enrico Grisan, John Buckeridge, Chiara Gattinoni

26 Oct 2025

Quick Insight

How an AI “Librarian” Turns Complex Science Papers into Easy‑to‑Use Data

What if you could pull hidden numbers and formulas from a research article with just a click? ComProScanner makes that possible. This new AI‑powered platform works like a team of tiny librarians, each one scanning a paper, checking the facts, and neatly filing the chemical recipes and their performance numbers into a tidy spreadsheet. Imagine trying to bake a perfect cake without a recipe. ComProScanner gives scientists the exact “ingredients” and “baking times” they need, but for advanced materials like piezoelectric ceramics.

The tool was tested on a hundred journal articles using ten different AI language models, and the best of them reached an accuracy of 0.82 (82%). That means researchers can build large, structured databases far faster than by manual curation, speeding up the discovery of new technologies that could power everything from medical devices to clean‑energy sensors. The resulting datasets are ready to feed machine‑learning models, which lowers the barrier for researchers who are not data‑science experts. It’s a step toward making cutting‑edge science accessible to all.

The next time you hear about a “new material,” remember there’s a quiet AI team turning dense papers into simple, usable knowledge—bringing the future a little closer to our fingertips.

Short Review

Advancing Scientific Data Extraction with ComProScanner

The advent of advanced large language models (LLMs) has significantly transformed the landscape of knowledge extraction from scientific literature. Despite these advances, accessible, automated tools for constructing, validating, and visualizing structured datasets from complex scientific texts remain scarce. This article introduces ComProScanner, an autonomous multi-agent platform designed to fill this gap. Combining LLM agents, Retrieval-Augmented Generation (RAG), and deep learning, ComProScanner extracts, validates, classifies, and visualizes machine-readable chemical compositions and properties, together with synthesis data from journal articles. The framework was evaluated on 100 journal articles with 10 different LLMs, both open-source and proprietary, on the task of extracting highly complex compositions of ceramic piezoelectric materials and their corresponding piezoelectric strain coefficients (d33). DeepSeek-V3-0324 was the top performer, with an overall accuracy of 0.82.
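
To make "machine-readable chemical compositions and properties" concrete, here is a minimal Python sketch of what one extracted record could look like once it is serialized. The class name, field names, and article identifiers are hypothetical and chosen for illustration; this is not ComProScanner's actual schema, and the numeric values are placeholders rather than results from the paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CompositionRecord:
    """Hypothetical record layout, not ComProScanner's actual schema."""
    composition: str        # chemical formula as reported in the article
    d33_pc_per_n: float     # piezoelectric strain coefficient d33, in pC/N
    synthesis_method: str   # free-text synthesis route
    source: str             # identifier of the article the values came from


def records_to_json(records):
    """Serialize a list of records into a JSON array of objects."""
    return json.dumps([asdict(r) for r in records], indent=2)


# Placeholder values for illustration only; they are not taken from the paper.
records = [
    CompositionRecord("BaTiO3", 190.0, "solid-state reaction", "article-001"),
    CompositionRecord("K0.5Na0.5NbO3", 120.0, "solid-state reaction", "article-002"),
]

print(records_to_json(records))
```

A flat layout like this loads directly into a table or data frame, which is the kind of structure that downstream machine-learning workflows typically expect.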

Critical Evaluation of ComProScanner

Strengths

ComProScanner represents a substantial leap forward in automated scientific data extraction, addressing a critical need for structured knowledge in materials science. Its multi-agent framework, integrating LLMs, RAG, and deep learning, offers a robust and adaptable solution for handling the intricacies of scientific text. The platform's comprehensive evaluation framework, which includes custom weight-based, conventional (Precision, Recall, F1-score), and normalized classification metrics, provides a thorough assessment of its performance. The demonstrated high accuracy, particularly with DeepSeek-V3-0324, highlights its effectiveness in extracting complex material compositions and properties. Furthermore, its availability as a user-friendly Python package via PyPI significantly enhances accessibility for researchers, accelerating materials discovery and database construction. The inclusion of extensive visualization capabilities and superior variable parsing compared to existing material-parsers further solidifies its utility.
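
The conventional metrics named above can be sketched in a few lines of Python. The example below assumes that an extraction counts as correct only when the (composition, d33) pair exactly matches a reference pair, which is a simplification made for this illustration. It does not reproduce the paper's custom weight-based or normalized classification metrics, and the function name and sample data are hypothetical.

```python
def precision_recall_f1(extracted, reference):
    """Compute precision, recall, and F1 over sets of extracted items.

    An item is a true positive only if it appears in both sets exactly.
    """
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder (composition, d33) pairs for illustration only.
reference = [("A", 100), ("B", 200), ("C", 300), ("D", 400)]
extracted = [("A", 100), ("B", 200), ("E", 500)]

precision, recall, f1 = precision_recall_f1(extracted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here two of the three extracted pairs are correct and two of the four reference pairs are found, so precision and recall diverge. Reporting both alongside F1 shows whether a model tends to over-extract or to miss entries.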

Weaknesses

Although ComProScanner achieves an overall accuracy of 0.82, there is still room for improvement, especially for applications requiring near-perfect data fidelity. The evaluation, though comprehensive, covered only 100 journal articles, so the framework's scalability and performance on much larger and more diverse corpora warrant further investigation. Whether its performance generalizes beyond piezoelectric ceramics to other material classes and scientific domains, particularly those with different data structures or terminologies, also remains to be shown. Additionally, while some LLMs performed competitively, others such as Gemini underperformed, which suggests that the choice of underlying LLM significantly affects results; this matters for users who are restricted to a particular model or provider.

Conclusion

ComProScanner stands out as a valuable and timely contribution to the field of scientific data management. By providing a simple, user-friendly, and readily usable package, it bridges the gap between unstructured scientific literature and the structured, machine-readable datasets that machine learning and deep learning applications require. Its multi-agent architecture and demonstrated accuracy in extracting complex materials data position it as a powerful tool for researchers. The framework has the potential to accelerate materials discovery and to make both scientific research and the construction of comprehensive scientific databases more efficient.