Traditionally, generative AI models have been evaluated through academic tests or coding challenges. What has been left out, however, is how these models might support people in their day-to-day work. OpenAI's newest benchmark, GDPval, aims to fill this gap by testing models on real-life professional tasks, giving a clearer picture of performance on economically valuable work.
How does it work?
Models complete real-world tasks; domain experts then blindly compare each model’s output with an expert human deliverable and judge which is better. Runs use one-shot prompts and standardised instructions.
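To make the grading concrete, here is a minimal Python sketch of how such blind pairwise judgments could be tallied into a win rate. The record structure and field names are illustrative assumptions, not GDPval's actual data format.

```python
from collections import Counter

# Each record is one blind comparison: a domain expert saw two
# anonymised deliverables (model vs. human) and picked the better one.
# Field names here are illustrative, not GDPval's actual schema.
comparisons = [
    {"task_id": "legal-brief-017", "verdict": "model"},
    {"task_id": "care-plan-203", "verdict": "human"},
    {"task_id": "cad-drawing-044", "verdict": "tie"},
]

def win_rate(records: list[dict]) -> dict[str, float]:
    """Share of comparisons where the model output was rated
    better than ('model') or on par with ('tie') the expert's work."""
    counts = Counter(r["verdict"] for r in records)
    total = sum(counts.values())
    return {
        "wins": counts["model"] / total,
        "wins_or_ties": (counts["model"] + counts["tie"]) / total,
    }

print(win_rate(comparisons))
```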
What does it cover?
GDPval covers 44 occupations across 9 major industries. It includes 1,320 specialist tasks, e.g. drafting a legal brief, producing an engineering drawing, handling a customer-support exchange, or compiling a nursing care plan. Each task comes with realistic context and materials, developed in cooperation with experienced professionals from diverse backgrounds to maximise representativeness.
Why does this matter?
Tests such as GDPval provide a more realistic evaluation of AI models by moving closer to day-to-day work: they assess documents, spreadsheets, presentations and diagrams rather than answers to quiz questions.
What are some early results?
Some initial findings suggest that AI models can match, and in some cases exceed, expert performance on well-defined and repetitive tasks, and, most importantly, at a lower price. In blind tests on the GDPval gold set, industry experts often rated outputs from leading AI models as better than or on par with human-produced work, though results varied by task. Claude Opus 4.1 performed particularly well on aesthetics (e.g. formatting and layout), while GPT-5 stood out on accuracy (e.g. retrieving domain-specific knowledge). Most striking is the speed: frontier models can complete GDPval tasks roughly 100x faster and 100x cheaper than industry experts. However, these figures cover only model run time and API pricing, leaving aside human oversight, iteration and the integration of outputs into real workflows.
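As a back-of-the-envelope illustration of how such a figure is derived, the sketch below compares assumed expert effort against assumed model run time and API cost. All numbers are hypothetical placeholders, not data from the article.

```python
# Back-of-the-envelope comparison of model vs. expert cost and time.
# All numbers below are hypothetical placeholders, not GDPval data:
# they only show how a "100x faster / 100x cheaper" figure follows
# from model run time and API pricing alone.
expert_hours_per_task = 7.0     # assumed expert effort per task
expert_hourly_rate = 90.0       # assumed expert rate, USD/hour
model_minutes_per_task = 4.0    # assumed model run time
model_api_cost_per_task = 6.0   # assumed API cost per task, USD

speedup = (expert_hours_per_task * 60) / model_minutes_per_task
savings = (expert_hours_per_task * expert_hourly_rate) / model_api_cost_per_task

print(f"~{speedup:.0f}x faster, ~{savings:.0f}x cheaper")
# Note: this excludes human oversight, iteration, and workflow
# integration, which the article flags as unmeasured costs.
```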
What does this mean for workers?
AI can take on routine, repetitive tasks, freeing up time for professionals to focus on creative, judgement-based work. That makes it a valuable complementary tool, one that supports rather than replaces human expertise.
The message from GDPval is clear: treat AI as a complement to human expertise. Let generative AI models handle the repeatable tasks, and keep human oversight for risk, context and ethics. This also means investing in people: building AI skills and training teams to use and integrate these tools into everyday work. With structured training, clear guidelines and routines that embed these tools into daily workflows, AI can augment roles, improve service quality and build trust.
Read the full article here: https://openai.com/index/gdpval/