AMO-Bench: Large Language Models Still Struggle in High School Math Competitions

31 Oct 2025     3 min read


AI-generated image, based on the article abstract

Quick Insight

When AI Meets Olympiad Math: The New AMO‑Bench Challenge

Ever wondered if a chatbot could ace the toughest high‑school math contests? Researchers have unveiled AMO‑Bench, a fresh set of 50 brand‑new problems that rival the difficulty of the International Mathematical Olympiad. These puzzles are so fresh that even the smartest large language models can’t rely on memorized answers—they have to think from scratch. The result? The best AI today solves just over half of them, while most linger below 40% accuracy. Imagine a sprinter who can dash 100 meters in a flash but still struggles to climb a steep mountain—that’s the gap we’re seeing in AI math reasoning. This new benchmark shines a light on how far we still have to go, and it gives scientists a clear track to train smarter, more reasoning‑capable machines. Every new problem solved brings us a step closer to AI that can truly reason like a human, turning today’s curiosity into tomorrow’s everyday tools. 🌟


Short Review

Unveiling Advanced Mathematical Reasoning Gaps in Large Language Models with AMO-Bench

This insightful article introduces AMO-Bench, a novel and highly challenging benchmark designed to rigorously evaluate the advanced mathematical reasoning capabilities of Large Language Models (LLMs). Addressing the performance saturation observed in existing benchmarks, AMO-Bench comprises 50 entirely original, human-crafted problems, meticulously validated by experts to meet or exceed International Mathematical Olympiad (IMO) difficulty standards. The study's primary goal is to expose the current limitations of top-tier LLMs in complex mathematical problem-solving. Key findings reveal that even the most advanced models struggle significantly, with the best performer achieving only 52.4% accuracy, underscoring a substantial gap in their reasoning abilities. Furthermore, the research highlights a promising scaling trend, indicating that increased test-time compute can enhance performance, suggesting a clear path for future LLM development.

Critical Evaluation of AMO-Bench for LLM Assessment

Strengths

The development of AMO-Bench represents a significant stride in LLM evaluation. Its core strength lies in its rigorous construction pipeline: all 50 problems are original and expert-validated to IMO difficulty standards, mitigating data memorization and benchmark saturation. This meticulous approach supports a genuinely challenging and fair assessment of advanced reasoning. Requiring only a final answer, combined with robust parser-based or LLM-based grading, yields highly reliable scoring (99.2% grading accuracy) and enables efficient, automatic evaluation. Moreover, the detailed analysis of performance across diverse LLMs, including the observed scaling trends with increased output length and inference budget, provides invaluable insight into both the potential and the current limitations of these models.
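To make the final-answer grading idea concrete, here is a minimal, hypothetical sketch of parser-based answer checking with an optional LLM-judge fallback. The normalization rules, the `equivalent` helper, and the `llm_judge` hook are illustrative assumptions for this review, not the benchmark's actual implementation.

```python
# Minimal sketch of parser-based final-answer grading (illustrative, not the authors' code).
# Assumes answers are short closed-form expressions; the normalization rules and the
# LLM-judge fallback are hypothetical.
from fractions import Fraction


def normalize(ans: str) -> str:
    """Canonicalize a final-answer string before comparison."""
    ans = ans.strip().lower()
    ans = ans.replace(" ", "").replace("\\left", "").replace("\\right", "")
    return ans.rstrip(".")          # drop trailing punctuation


def equivalent(pred: str, ref: str) -> bool:
    """True if the predicted answer matches the reference answer."""
    p, r = normalize(pred), normalize(ref)
    if p == r:                      # exact match after cleanup
        return True
    try:                            # numeric/rational equivalence, e.g. "0.5" vs "1/2"
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        return False


def grade(pred: str, ref: str, llm_judge=None) -> bool:
    """Parser-based check first; optionally defer to an LLM judge on mismatch."""
    if equivalent(pred, ref):
        return True
    if llm_judge is not None:       # hypothetical fallback for symbolic answers
        return llm_judge(pred, ref)
    return False


if __name__ == "__main__":
    print(grade("1/2", "0.5"))      # True via rational comparison
    print(grade("42.", "42"))       # True after normalization
```

A real pipeline would likely add symbolic comparison for radicals and other closed forms before falling back to an LLM judge, which is where the reported 99.2% grading reliability becomes meaningful.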

Weaknesses

While AMO-Bench excels in its specific domain, a few considerations warrant discussion. The focus on Olympiad-level problems, though crucial for probing advanced reasoning, might not encompass the full spectrum of mathematical challenges LLMs face in real-world applications beyond competitive math. Additionally, although the final-answer format enables robust automatic grading, it bypasses evaluation of the step-by-step reasoning process, which may obscure how models arrive at their solutions or where their logic breaks down; a proof-based evaluation could reveal such failures. Lastly, although the 50 problems are highly curated and individually demanding, the small dataset size may limit the breadth of mathematical domains covered at this advanced level.

Implications

AMO-Bench carries profound implications for the future of AI research and LLM development. By clearly demonstrating the significant room for improvement in mathematical reasoning, it establishes a new, higher bar for evaluating advanced AI capabilities. The observed scaling trend, where performance improves with increased test-time compute, offers a clear and actionable direction for researchers to enhance LLM architectures and inference strategies. Furthermore, the finding that open-source models are progressively closing the performance gap with proprietary ones is highly encouraging, fostering greater accessibility and innovation within the AI community. This benchmark is poised to accelerate advancements in building more robust and intelligent language models.
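As a rough illustration of the test-time scaling idea, the sketch below estimates accuracy when k sampled answers per problem are aggregated by majority vote. The toy sampler and its 40% per-sample success rate are invented stand-ins for a real model; only the general shape, with accuracy improving as the inference budget k grows, reflects the trend described above.

```python
# Minimal sketch of probing test-time compute scaling via repeated sampling (illustrative only).
# Each "sample" here is a stand-in for a full model generation followed by answer extraction.
import random
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent final answer among the sampled answers."""
    return Counter(answers).most_common(1)[0][0]


def accuracy_at_k(reference_answers, sample_answer, k, seed=0):
    """Fraction of problems solved when aggregating k samples per problem."""
    rng = random.Random(seed)
    correct = 0
    for ref in reference_answers:
        samples = [sample_answer(ref, rng) for _ in range(k)]
        correct += majority_vote(samples) == ref
    return correct / len(reference_answers)


def toy_sampler(ref, rng):
    """Hypothetical model: right 40% of the time, otherwise a random wrong answer."""
    return ref if rng.random() < 0.4 else str(rng.randint(0, 9))


if __name__ == "__main__":
    refs = [str(i) for i in range(50)]   # 50 problems, mirroring AMO-Bench's size
    for k in (1, 4, 16, 64):
        print(f"k={k:2d}  accuracy={accuracy_at_k(refs, toy_sampler, k):.3f}")
```

Under these toy assumptions, accuracy climbs steadily with k, which is the qualitative pattern researchers can exploit when budgeting inference compute for hard reasoning benchmarks.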

Conclusion

AMO-Bench stands as a pivotal contribution to the field of artificial intelligence, offering a much-needed, highly rigorous benchmark for assessing advanced mathematical reasoning in LLMs. Its innovative design and comprehensive evaluation methodology effectively highlight the current limitations of even the best models, while simultaneously pointing towards promising avenues for future research and development. This work not only provides a critical tool for measuring progress but also inspires the next generation of LLMs capable of tackling truly complex intellectual challenges, ultimately pushing the boundaries of what AI can achieve in scientific problem-solving.

Keywords

  • AMO-Bench benchmark
  • advanced mathematical reasoning for LLMs
  • Olympiad-level math problems
  • International Mathematical Olympiad difficulty standard
  • original math problem generation
  • automatic answer grading for language models
  • performance saturation on AIME benchmarks
  • scaling trend with test-time compute
  • LLM accuracy on high-difficulty math
  • mathematical reasoning evaluation metrics
  • large language model math capabilities
  • cross-validated expert math problems
  • reasoning ability improvement for LLMs

🤖 This analysis and review was primarily generated and structured by an AI. The content is provided for informational and quick-review purposes.
