
GPT-5 leads in key math reasoning benchmarks

Francisco Ríos
AI · OpenAI · GPT-5


August 9, 2025 - 2 min read

The latest benchmarking data from the Epoch AI Benchmarking Hub places GPT-5 at the forefront of high-level mathematical reasoning across multiple test categories. The evaluations include the OTIS Mock AIME 2024–2025, GPQA Diamond, and FrontierMath Private benchmarks, datasets designed to measure competition-grade problem-solving skills, complex reasoning chains, and domain-specific knowledge. GPT-5’s medium, high, and mini variants consistently rank in the top tier, surpassing most previous OpenAI models and performing competitively against the best offerings from Anthropic, Google, and xAI.

On the OTIS Mock AIME benchmark, GPT-5 (medium) leads all models tested with a score of 0.872, with the high variant close behind at 0.866. The mini and nano versions also score strongly, placing in the upper range and outperforming comparably sized models from competitors, which suggests that GPT-5’s reasoning architecture scales across model sizes without compromising accuracy. The results mark a clear generational leap over GPT-4-class models.

Despite its strong showing, GPT-5 does not lead in every benchmark. On the GPQA Diamond leaderboard, which emphasizes graduate-level question answering, Anthropic’s Claude Opus and Claude Sonnet models hold the top three spots, with GPT-5 medium ranking fourth. These cases show that while GPT-5 sets a high standard in many categories, specialized models from other providers can still edge ahead on specific tasks, particularly where architectural optimizations target narrow problem types.

The FrontierMath Private benchmark confirms GPT-5’s leadership in targeted, high-difficulty math scenarios: the medium and high variants tie for the top score at 0.248, more than double most competitor results. Such consistent leadership across diverse benchmarks indicates not only raw problem-solving power but also robustness across question types and formats, and positions GPT-5 as the most capable OpenAI model to date for advanced mathematics.
