Short Review
Overview
The article introduces FinAuditing, a benchmark for evaluating large language models (LLMs) in the context of financial auditing. It addresses the complexities of Generally Accepted Accounting Principles (GAAP) and eXtensible Business Reporting Language (XBRL) filings, which complicate automation and verification. The benchmark delineates three subtasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR), each targeting a distinct aspect of structured auditing. The findings reveal significant performance gaps in current LLMs, with accuracy drops of 60-90% on hierarchical multi-document structures, underscoring the need for stronger financial reasoning systems.
Critical Evaluation
Strengths
One of the primary strengths of the article is its comprehensive approach to addressing the limitations of existing benchmarks in financial auditing. By introducing FinAuditing, the authors provide a structured framework that not only evaluates LLMs on semantic, relational, and numerical consistency but also aligns with the complexities of real-world financial data. The use of real US-GAAP-compliant XBRL filings enhances the relevance and applicability of the benchmark, making it a valuable resource for future research and development in financial intelligence systems.
Weaknesses
Despite its strengths, the article exhibits some weaknesses. The performance evaluation shows that even state-of-the-art models struggle significantly with the subtasks defined in FinAuditing. This raises questions about the current capabilities of LLMs in handling structured financial data and suggests that existing training regimes do not adequately prepare models for such tasks. Furthermore, the reliance on metrics like Hit Rate and Macro F1 may not fully capture the nuances of financial reasoning, potentially limiting the benchmark's diagnostic value.
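To make the metrics concern concrete: Macro F1 averages per-class F1 scores with equal weight, so a rare error class counts as much as a dominant one, and Hit Rate only checks whether a gold item appears in the top-k retrieved candidates, ignoring how the rest of the ranking looks. The sketch below is illustrative only (the labels and retrieval lists are hypothetical, not drawn from the benchmark) and is not the authors' evaluation code:

```python
def macro_f1(y_true, y_pred):
    """Macro F1: compute F1 per class, then average with equal class weight."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

def hit_rate_at_k(gold_items, ranked_lists, k):
    """Hit Rate@k: fraction of queries whose gold item appears in the top k."""
    hits = sum(1 for g, r in zip(gold_items, ranked_lists) if g in r[:k])
    return hits / len(gold_items)
```

Because both metrics collapse a prediction to hit-or-miss, two models with very different error profiles (e.g., one that is off by a rounding unit versus one that confuses entire taxonomy branches) can score identically, which is the nuance the review suggests these metrics may miss.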
Implications
The implications of this research are profound, as it highlights the urgent need for improved financial reasoning capabilities in LLMs. The findings suggest that without addressing the systematic limitations identified, the deployment of LLMs in financial auditing could lead to significant errors and misinterpretations. This benchmark sets the stage for future advancements in developing trustworthy, structure-aware financial intelligence systems that align with regulatory standards.
Conclusion
In summary, the article presents a notable advancement in the evaluation of LLMs for financial auditing through the introduction of FinAuditing. While it identifies key performance gaps and establishes a foundation for future research, it also underscores the challenges that remain in achieving reliable financial reasoning. The benchmark's availability on Hugging Face further enhances its potential impact, encouraging ongoing exploration and development in this area of financial technology.
Readability
The article is structured to facilitate understanding, with clear definitions and a logical flow of information. Each section builds upon the previous one, making it accessible to a professional audience. The use of concise paragraphs and straightforward language enhances engagement, ensuring that readers can easily grasp the complexities of financial auditing and the role of LLMs within it.