
FrontierMath benchmark shows AI struggling with advanced mathematics

Leon Oliver Wolf


July 16, 2025 - 1 min read

Epoch AI's FrontierMath benchmark reveals that current AI models perform poorly on advanced mathematical problems that take expert mathematicians hours to days to solve. The benchmark comprises several hundred unpublished, expert-level mathematics problems across four difficulty tiers, with Tier 4 representing research-level mathematics.

Recent results show the best-performing model, o4-mini, achieving only 6.3% accuracy on Tier 4 problems, while most leading models, including Claude Sonnet 4, Grok-3, and GPT-4.1, score 0%. Even on the easier Tiers 1-3, which span undergraduate through early graduate material, performance remains limited.

The benchmark also raises questions about its own longevity. Fields medalists Terence Tao and Timothy Gowers characterized Tier 3 problems as "exceptionally challenging," with Tao predicting they will "resist AIs for several years." However, UCLA professor Igor Pak suggests some problems "will be solved within 5, 10 years," while AI "might not be able to solve" others "for the next 50 years."

This leads to a foundational question: what will be the gold standard for measuring AI mathematical reasoning, and will there ever be one that endures as AI capabilities advance?

Source: Epoch AI
