A super interesting experiment has been conducted by the nof1.ai team: Alpha Arena, the first live benchmark designed to evaluate large language models as autonomous trading agents in real financial markets. Each model is given $10,000 of actual capital and follows identical prompts. What makes it stand out is that, unlike static academic tests, Alpha Arena immerses the AIs in a dynamic and unpredictable setting, allowing for an authentic assessment of their ability to generate and manage trades autonomously. The team argues that markets serve as the ultimate intelligence test, demanding adaptation, foresight, and disciplined risk management, among other skills.
The competition features six state-of-the-art LLMs: Claude Sonnet 4.5, DeepSeek V3.1 Chat, Gemini 2.5 Pro, GPT-5, Grok 4, and Qwen 3 Max. Each operates independently with full transparency, as all decisions and trades are publicly traceable. The goal is to maximize risk-adjusted returns over the duration of what the nof1.ai team called Season 1, which ran from October 18 to November 3, 2025. In this season, all models received the same market inputs, prompts, and trading universe (limited to six major cryptocurrencies) and could only buy, sell, hold, or close positions.
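To make the setup concrete, here is a minimal Python sketch of what those rules imply: each turn a model picks one asset from the shared universe and one of the four allowed actions, and the season is scored on risk-adjusted returns. Everything in the snippet is an illustrative assumption (the class names, the specific tickers, and the use of a Sharpe-style ratio as a stand-in for the scoring metric); nof1.ai has not published its harness or its exact formula in this form.

```python
import math
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    BUY = "buy"
    SELL = "sell"
    HOLD = "hold"
    CLOSE = "close"

# Assumed tickers for the six-cryptocurrency universe of Season 1.
UNIVERSE = ("BTC", "ETH", "SOL", "BNB", "XRP", "DOGE")

@dataclass
class TradeDecision:
    asset: str         # must be one of UNIVERSE
    action: Action     # the only four moves allowed in Season 1
    size_usd: float    # position size, bounded by remaining capital
    confidence: float  # self-reported confidence in [0, 1]
    rationale: str     # free-text reasoning, publicly traceable

def validate(d: TradeDecision, capital: float) -> list[str]:
    """Collect any ways a decision breaks the Season 1 constraints."""
    errors = []
    if d.asset not in UNIVERSE:
        errors.append(f"{d.asset} is outside the trading universe")
    if d.size_usd > capital:
        errors.append("position size exceeds available capital")
    if not 0.0 <= d.confidence <= 1.0:
        errors.append("confidence must lie in [0, 1]")
    return errors

def sharpe_ratio(equity_curve: list[float]) -> float:
    """Simplified daily Sharpe ratio as a stand-in for 'risk-adjusted return':
    volatile swings drag the score down even when the peaks are impressive."""
    returns = [b / a - 1.0 for a, b in zip(equity_curve, equity_curve[1:])]
    mean = sum(returns) / len(returns)
    std = math.sqrt(sum((r - mean) ** 2 for r in returns) / len(returns))
    return 0.0 if std == 0.0 else mean / std * math.sqrt(365)  # crypto trades every day
```

The point of the scoring stand-in is that dividing average return by its volatility rewards consistency: under a metric like this, a portfolio that spikes and then collapses can score worse than one that grinds out small, steady gains, which matters for interpreting the results below.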
The authors of the experiment shared some early analysis on their blog. Some agents exhibited consistent directional biases (Grok and Gemini tended toward short positions, whereas Claude Sonnet 4.5 maintained a predominantly long stance). Qwen demonstrated the largest position sizing and the highest self-reported confidence levels, while GPT-5 and Gemini took smaller, more conservative positions. Holding times and trade frequencies also varied: Gemini was the most active, while Grok tended to hold positions for longer durations. Several models struggled with internal consistency, occasionally misreading their own prior outputs or invalidating previously stated exit plans, signaling weaknesses in temporal reasoning and execution discipline.
The first two days were fairly stable, but divergence soon emerged. Qwen, DeepSeek, Grok and Claude began generating consistent gains, while Gemini and GPT-5 quickly fell behind, suffering losses from which they never recovered. By October 27, DeepSeek reached the peak performance of the season at over $23,000, closely followed by Qwen. The Chinese models kept dominating for a short while after that, but then the big collapse began: on October 29, all the LLMs saw a dramatic drop in the value of their holdings, the start of a losing streak that none of them escaped. By the experiment's end, Qwen 3 Max held the strongest remaining position, while GPT-5 ended as the worst performer.
Beyond the numbers, the experiment raises interesting questions about the maturity of autonomous AI agents in finance. One of the most common arguments in favor of AI agents is their supposed ability to manage money effectively, and yet these results suggest that such confidence may be premature. The systems demonstrated potential but also extreme volatility: success one day, collapse the next. Had the test lasted longer, it is quite possible that some portfolios would have approached zero. In the end, it is clear that, at least under the conditions of Season 1 of this experiment, AI agents are not really good at trading. It may seem so at first, looking at the huge highs of models like DeepSeek and Qwen, but in this kind of activity consistency is key. Ultimately, Alpha Arena makes one thing clear: despite their sophistication, today's LLMs remain far from mastering real-world trading... though future seasons, under different experimental conditions, might reveal whether that limitation is temporary or fundamental.