Cybench, a yardstick AI outgrew

May 29, 2026 - 2 min read

Two years ago, GPT-4o was the best-performing model on Cybench, with a 12.5% success rate. Today, Claude Mythos Preview reportedly solves every task: 100%, pass@1, no hints.

Cybench is a benchmark built around professional Capture the Flag competitions: adversarial puzzles in which models exploit web vulnerabilities, crack cryptographic schemes, reverse-engineer binaries, and gain shell access. For a sense of difficulty, the hardest tasks took expert human teams nearly 25 hours to solve.

Claude 3.7 Sonnet crossed 17%. The Opus 4.x line pushed through the 60s, 70s, and 90s over successive releases. Opus 4.6 hit ~100% at pass@30, meaning it eventually solved each task when given up to 30 tries. Then Mythos hit 100% on the first attempt.
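The gap between pass@30 and pass@1 is easier to see with a little arithmetic. The sketch below uses the commonly cited unbiased pass@k estimator from code-generation evaluations; the article does not say how Cybench or Anthropic actually compute their figures, so treat the function name `pass_at_k` and the sample counts as illustrative assumptions, not reported data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k attempts succeeds,
    given that c of n recorded attempts on a task succeeded.

    Hypothetical helper: this is the standard unbiased estimator,
    1 - C(n - c, k) / C(n, k), not a method confirmed by the article.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Made-up example: a task solved in only 2 of 30 attempts.
rare_success_at_1 = pass_at_k(n=30, c=2, k=1)    # 2/30, roughly 0.067
rare_success_at_30 = pass_at_k(n=30, c=2, k=30)  # 1.0

# A benchmark-level score is then the mean of the per-task values.
print(f"pass@1  = {rare_success_at_1:.3f}")
print(f"pass@30 = {rare_success_at_30:.3f}")
```

Under these assumptions, a model that cracks a task only occasionally still scores 100% on that task at pass@30 but under 7% at pass@1, which is why a perfect pass@1 result is the much stronger claim.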

Mythos reportedly found thousands of high-severity vulnerabilities in real production systems autonomously, including a 27-year-old OpenBSD RCE bug and a 17-year-old FreeBSD flaw. That's what makes the benchmark score feel quaint. Anthropic's own Opus 4.6 system card already declared Cybench saturated and "no longer useful for tracking capability progression." The eval outlived its usefulness right as the capability crossed into genuinely concerning territory.

The capability cuts both ways. Anthropic gated Mythos behind Project Glasswing specifically to put it in the hands of defenders first, but the capability now exists regardless. An AI benchmark going from 12.5% to 100% in only two years is impressive, and also a clear indication that AI capability progress is moving faster than our ability to measure it.


