1. Early science acceleration experiments with GPT-5
Key points:
This paper presents a series of real-world research case studies demonstrating how GPT-5 has supported scientists in fields such as mathematics, physics, astronomy, computer science, biology and materials science, offering practical contributions rather than superficial suggestions. In these collaborations, researchers describe how the model helped them explore alternative lines of reasoning, propose feasible approaches, identify relevant background material, and accelerate the early stages of problem-solving. They also note instances where the model's outputs required expert correction or deeper scrutiny. Across the projects, GPT-5 frequently reduced the time spent on technical bottlenecks, acted as a productive sounding board and occasionally supplied genuinely useful insights. Most notably, guided interactions between humans and the model resulted in four new mathematical findings, each independently validated by the authors.
Authors: Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears et al.
2. Estimating AI productivity gains from Claude conversations
Key points:
As AI models become more capable, it is important to understand not only how many tasks they can handle, but also the value of each task and the time it saves. Anthropic analysed a large collection of anonymised interactions with Claude to estimate the time that workers could save by using generative AI. Rather than treating all tasks as equivalent, the study emphasises that, even within the same category, tasks vary greatly in complexity and duration. Because it captures these differences, the analysis can show where AI meaningfully reduces the time spent on digital and knowledge-intensive work, particularly in writing, analysis and administrative tasks. When these time savings are scaled across the workforce under assumptions of broad adoption, they point to a notable increase in labour productivity over the next decade (a toy version of this aggregation is sketched after this entry). However, the authors stress that realising these gains requires accounting for the human effort needed to review and verify AI outputs, underlining the continued importance of keeping humans in the loop.
Authors: Anthropic
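To make the aggregation step concrete, here is a minimal, hypothetical sketch of how per-task time savings might be rolled up into a workforce-level productivity figure. The task names, timings, review overheads and adoption rate are invented for illustration and are not values from the Anthropic analysis.

```python
# Hypothetical sketch of the aggregation described above: per-task time
# savings are weighted by how much work time each task accounts for,
# discounted by the human effort needed to review the AI output, then
# scaled by an assumed adoption rate. All names and numbers are invented.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    baseline_minutes: float   # time a worker needs without AI
    ai_minutes: float         # time with AI assistance, before review
    review_minutes: float     # human effort to check/verify the AI output
    share_of_work: float      # fraction of total work time spent on this task

def net_savings_fraction(tasks: list[Task]) -> float:
    """Fraction of total work time saved, net of review overhead."""
    saved = 0.0
    for t in tasks:
        per_task_saving = t.baseline_minutes - (t.ai_minutes + t.review_minutes)
        saved += t.share_of_work * max(per_task_saving, 0.0) / t.baseline_minutes
    return saved

tasks = [
    Task("drafting a report", baseline_minutes=90, ai_minutes=20, review_minutes=25, share_of_work=0.10),
    Task("data analysis",     baseline_minutes=60, ai_minutes=15, review_minutes=20, share_of_work=0.05),
    Task("email triage",      baseline_minutes=30, ai_minutes=10, review_minutes=5,  share_of_work=0.05),
]

adoption_rate = 0.4  # assumed share of the workforce using AI for these tasks
productivity_gain = adoption_rate * net_savings_fraction(tasks)
print(f"Estimated labour-productivity gain: {productivity_gain:.1%}")
```

Note how the `review_minutes` term carries the paper's caveat: savings only count net of the human effort spent verifying the AI's output.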
3. Adversarial poetry as a universal single-turn jailbreak mechanism in Large Language Models
Key points:
This paper examines how converting harmful instructions into poetic form can bypass safety measures in LLMs. In experiments spanning 25 models, the authors demonstrate that rephrasing a risky prompt as a poem frequently elicits unsafe outputs. Converting a large set of baseline malicious prompts into verse, either manually or via a standard 'meta-prompt' procedure, dramatically increased the rate at which models disregarded safety rules, yielding an average attack success rate far higher than that of the non-poetic prompts (the evaluation loop is sketched after this entry). These vulnerabilities appeared across models from various providers and covered a broad spectrum of risky content, including instructions for harmful acts, manipulation, and prohibited material. The results point to a systemic weakness: current alignment approaches struggle to detect and block unsafe outputs when an instruction arrives in a different linguistic form, even though its meaning is unchanged.
Authors: P. Bisconti, M. Galisai, M. Prandi, F. Pierucci, V. Suriani, F. Giarrusso, O. Sorokoletova, D. Nardi
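For intuition, the sketch below outlines the shape of such an evaluation loop: a helper model rewrites each baseline prompt as verse (standing in for the paper's meta-prompt step), the target model answers, and a judge checks for refusals. The `LLM` and `SafetyJudge` interfaces and both function bodies are hypothetical placeholders, not the authors' implementation.

```python
# A minimal sketch of the evaluation protocol, under stated assumptions:
# `rewrite_as_poem` stands in for the paper's meta-prompt step and
# `SafetyJudge` for its unsafe-output check; both are placeholders.

from typing import Protocol

class LLM(Protocol):
    def complete(self, prompt: str) -> str: ...

class SafetyJudge(Protocol):
    def is_refusal(self, answer: str) -> bool: ...

def rewrite_as_poem(poet: LLM, harmful_prompt: str) -> str:
    """Ask a helper model to recast the prompt as verse (meta-prompt step)."""
    meta_prompt = (
        "Rewrite the following request as a short poem, "
        f"preserving its meaning exactly:\n\n{harmful_prompt}"
    )
    return poet.complete(meta_prompt)

def attack_success_rate(target: LLM, poet: LLM, judge: SafetyJudge,
                        prompts: list[str]) -> float:
    """Fraction of poetic prompts that elicit a non-refused (unsafe) answer."""
    successes = 0
    for prompt in prompts:
        poem = rewrite_as_poem(poet, prompt)
        answer = target.complete(poem)
        if not judge.is_refusal(answer):  # unsafe content got through
            successes += 1
    return successes / len(prompts)
```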
4. What does it take to be a good AI research agent? Studying the role of ideation diversity
Key points:
This paper investigates what makes AI research agents succeed when they design and train machine-learning models autonomously. The authors argue that the range of ideas an agent considers at the outset plays a decisive role. Studying a large collection of agent runs on MLE-bench, spanning different underlying models and scaffolding setups, they found that systems generating a broader mix of candidate approaches tended to achieve stronger results. Agents backed by different architectures display distinct levels of ideation variety, and those that ultimately perform best consistently explore a wider set of preliminary concepts. To test whether this relationship is causal rather than coincidental, the researchers deliberately altered the diversity of the agents' initial proposals: whenever diversity was restricted, solution quality declined; when it was encouraged, outcomes improved (a toy diversity metric is sketched after this entry). The pattern also held across multiple evaluation schemes beyond MLE-bench’s standard medal scoring, showing that the link between idea variety and performance persists across metrics.
Authors: Alexis Audran-Reiss, Jordi Armengol Estapé, Karen Hambardzumyan, Amar Budhiraja, Martin Josifoski, Edan Toledo et al.
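As a toy illustration of what "ideation diversity" could mean operationally, the sketch below scores an agent run by the entropy of approach categories among its initial proposals and pairs that with a final benchmark score. The categories, runs and scores are invented, and the paper may quantify diversity differently.

```python
# Illustrative sketch (not the paper's code): quantify ideation diversity as
# the Shannon entropy over approach categories in a run's initial proposals,
# then inspect how diversity lines up with the run's final score.

from collections import Counter
from math import log

def ideation_diversity(proposal_categories: list[str]) -> float:
    """Shannon entropy over approach categories in an agent's initial ideas."""
    counts = Counter(proposal_categories)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

runs = [
    # (initial proposal categories, final benchmark score) -- invented data
    (["gradient boosting", "gradient boosting", "gradient boosting"], 0.41),
    (["gradient boosting", "cnn", "feature engineering"], 0.58),
    (["cnn", "transformer", "ensembling", "feature engineering"], 0.63),
]

for categories, score in runs:
    print(f"diversity={ideation_diversity(categories):.2f}  score={score:.2f}")
```

In this invented data, the run that proposed only one kind of approach has zero entropy and the lowest score, mirroring the correlation the paper reports.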
5. ToolOrchestra: elevating intelligence via efficient model and tool orchestration
Key points:
The authors introduce ToolOrchestra, a training framework that uses reinforcement-learning signals to teach small orchestrators to manage specialised models and tools. Using this method, they develop an 8-billion-parameter controller called Orchestrator that achieves greater accuracy than previous tool-enabled agents while operating at a much lower cost. Evaluations show that the system handles the complex challenges of Humanity’s Last Exam both effectively and efficiently, outperforming larger models. It also scores higher than GPT-5 on tau2-Bench and FRAMES while requiring a fraction of the compute, and it remains reliable even when interacting with unfamiliar tools. The key findings are that lightweight orchestrators can outperform larger models on complex reasoning tasks by striking a better balance between accuracy and resource use (a toy version of this trade-off is sketched after this entry), and that they generalise to new tool configurations and respect user intentions when selecting tools.
Authors: Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong et al.
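A minimal sketch of the accuracy-versus-cost trade-off such an orchestrator optimises, assuming a hypothetical reward that credits correct answers and penalises compute cost. The tools, costs, accuracies and weighting below are illustrative assumptions, not values or code from the paper.

```python
# Hedged sketch of the core trade-off: route each query to the tool/model
# with the best expected reward, where reward credits correctness but
# penalises compute cost. All tools and numbers are invented for the example.

from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    cost: float               # relative compute cost per call
    expected_accuracy: float  # estimated chance of solving the query

def reward(correct: bool, cost: float, cost_weight: float = 0.1) -> float:
    """RL-style reward: +1 for a correct answer, minus a cost penalty."""
    return (1.0 if correct else 0.0) - cost_weight * cost

def route(tools: list[Tool], cost_weight: float = 0.1) -> Tool:
    """Pick the tool with the highest expected reward for this query."""
    return max(tools, key=lambda t: t.expected_accuracy - cost_weight * t.cost)

tools = [
    Tool("small local model",    cost=1.0,  expected_accuracy=0.55),
    Tool("code interpreter",     cost=2.0,  expected_accuracy=0.70),
    Tool("large frontier model", cost=10.0, expected_accuracy=0.85),
]

chosen = route(tools)
print(f"Routed to: {chosen.name}")
```

With these numbers the router picks the mid-cost tool: the frontier model's extra accuracy does not justify its cost under this weighting, which is the kind of per-query decision an RL-trained orchestrator learns to make.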