1. PAN: a world model for general, interactable, and long-horizon world simulation
Key points:
This paper introduces PAN (Predictive Action Network), a model designed to simulate future world states over extended time spans while letting users steer the progression of the scene via natural-language instructions. PAN combines a latent-dynamics module built on a large language model with a video-generation decoder that turns abstract predictions into coherent visual sequences. Trained on extensive datasets pairing actions with videos from a variety of environments, the model can predict outcomes conditioned on user-specified actions while maintaining visual and logical consistency over long horizons. Experiments show that PAN significantly outperforms prior world models and video generators in forecasting accuracy, multi-step reasoning, and action-conditioned simulation.
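To make the predict-then-decode pipeline concrete, here is a minimal sketch of such a rollout loop. All names (`encode`, `latent_dynamics`, `video_decoder`) are illustrative assumptions, not the authors' actual interfaces:

```python
def simulate(initial_frame, actions, encode, latent_dynamics, video_decoder):
    """Hypothetical PAN-style rollout: one natural-language action per step.

    `encode`, `latent_dynamics`, and `video_decoder` are stand-ins for the
    paper's components; their exact signatures are assumptions.
    """
    state = encode(initial_frame)   # abstract latent description of the scene
    clips = []
    for action in actions:          # user-issued natural-language instructions
        # LLM-based latent-dynamics module predicts the next world state
        state = latent_dynamics(state, action)
        # video decoder renders the abstract prediction as a coherent clip
        clips.append(video_decoder(state))
    return clips
```

The key design point this sketch illustrates is the separation of concerns: the dynamics run in latent space, and pixels are produced only at decode time.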
Authors: PAN Team (Institute of Foundation Models): Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad, et al.
2. AA-Omniscience: evaluating cross-domain knowledge reliability in large language models
Key points:
This study introduces AA-Omniscience, a benchmark for evaluating how reliably large language models recall factual knowledge and recognise the limits of what they know. The dataset comprises approximately 6,000 expert-sourced questions spanning 42 topics across six economically significant fields. The questions are paired with an evaluation metric, the Omniscience Index, which penalises incorrect answers and overconfidence while rewarding abstention under uncertainty. Using this framework, the authors evaluate a range of leading LLMs and find that only a small subset reaches a net-positive reliability score, and even the strongest model manages only a modest one. Performance varies sharply across domains: a model that excels in one area may be unreliable in others. These findings highlight a persistent issue in current evaluation practice: generic benchmarks often mask weaknesses in factual grounding and uncertainty management. The study emphasises the need for domain-specific assessments and improved calibration techniques before LLMs are deployed in high-stakes environments, where accuracy and self-awareness are paramount.
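As a rough illustration of how such a metric behaves, the sketch below scores correct answers +1, incorrect answers -1, and abstentions 0, scaled to a percentage; the exact Omniscience Index formula may differ, so treat this as an assumption:

```python
def omniscience_style_index(results):
    """Score a run of answers labelled 'correct', 'incorrect', or 'abstain'.

    Assumed scoring: +1 correct, -1 incorrect, 0 abstain, scaled to
    [-100, 100]. This illustrates the abstention-rewarding idea, not the
    benchmark's exact formula.
    """
    correct = sum(r == "correct" for r in results)
    incorrect = sum(r == "incorrect" for r in results)
    return 100.0 * (correct - incorrect) / len(results)

# Abstaining when unsure beats guessing and being wrong:
print(omniscience_style_index(["correct", "correct", "abstain", "incorrect"]))    # 25.0
print(omniscience_style_index(["correct", "correct", "incorrect", "incorrect"]))  # 0.0
```

Under a rule of this shape, a model goes net-negative as soon as wrong answers outnumber right ones, which is consistent with the finding that only a small subset of models score above zero.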
Authors: Declan Jackson, William Keating, George Cameron, Micah Hill-Smith
3. AI agents, productivity, and higher-order thinking: early evidence from software development
Key points:
This study explores how the introduction of AI coding agents has reshaped software productivity and the cognitive demands placed on workers. Using data from a development platform whose users (ranging from experienced engineers to product and design staff) could generate code via natural-language instructions, the analysis documents significant behavioural and output changes once the agent was adopted as the primary means of code generation. Experienced developers wrote clearer prompts and were more likely to include planning steps, likely helping the agent's outputs align with their intentions; they were also more likely to accept the system's suggestions. Platform-wide activity rose by almost 40%, particularly after the agent became the platform's default code-generation method. The paper argues that agentic tools not only speed up production but also shift how people approach planning, evaluating solutions, and writing clear instructions: in this setting, software development is increasingly about conceptualising what software should do rather than physically assembling it.
Authors: Suproteem K. Sarkar
4. GPT-5.1-Codex-Max system card
Key points:
OpenAI has unveiled GPT-5.1-Codex-Max, a new agentic coding model designed for extended software engineering workflows. Built on an upgraded reasoning stack, it introduces a mechanism that compresses and reorganises session history, allowing the model to keep working effectively even across extremely large contexts. The system card also outlines OpenAI's safety framework, including built-in restrictions against harmful actions, secure execution environments, and robust oversight tools. Capability evaluations under OpenAI's Preparedness Framework show that, while the model does not reach the highest classification, it performs strongly in cybersecurity and is highly capable in biological domains when deployed with appropriate safeguards. It does not demonstrate top-tier capability in areas associated with AI self-improvement.
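A minimal sketch of what such a history-compression loop can look like is below; the helper names and the summarise-oldest-first strategy are assumptions for illustration, as the system card does not specify the mechanism at this level of detail:

```python
def compact_history(messages, summarise, count_tokens, budget):
    """Keep an agent's context under `budget` tokens by folding the oldest
    turns into summaries.

    `summarise` and `count_tokens` are hypothetical helpers; the real
    mechanism is not publicly documented at this level of detail.
    """
    while len(messages) > 2 and sum(count_tokens(m) for m in messages) > budget:
        # replace the two oldest turns with one condensed summary turn
        merged = summarise(messages[0], messages[1])
        messages = [merged] + messages[2:]
    return messages
```

The point is the invariant rather than the strategy: the agent always sees a context that fits its window while retaining a compressed record of everything it has done so far.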
Authors: OpenAI
5. Depth Anything 3: recovering the visual space from any views
Key points:
The paper introduces Depth Anything 3 (DA3), a model that recovers scene geometry from an arbitrary collection of images, whether or not the camera viewpoints are known. Rather than relying on numerous specialised modules, the authors demonstrate that a general-purpose transformer trained with a single depth-ray prediction objective can handle a wide range of 3D perception tasks. DA3 is trained through a teacher–student setup and evaluated on a newly assembled benchmark suite covering pose recovery, multi-view geometry, and rendering quality. On these tests the model delivers substantial gains: camera pose accuracy improves by around 44% and geometric reconstruction accuracy by over 25% relative to the best previous model. DA3 also surpasses its predecessor at monocular depth estimation, while being trained exclusively on publicly available datasets.
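The depth-ray idea is easy to state concretely: if the network predicts a per-pixel camera ray (origin plus unit direction) and a depth along that ray, the 3D scene point for every pixel follows from one multiply-add. The sketch below assumes simple array shapes for illustration:

```python
import numpy as np

def points_from_depth_rays(origins, directions, depth):
    """Lift per-pixel depth-ray predictions to a 3D point map.

    origins, directions: (H, W, 3) ray origins and unit directions
    depth: (H, W) predicted depth along each ray
    returns: (H, W, 3) reconstructed 3D points
    Shapes and the exact parameterisation are assumptions for illustration.
    """
    return origins + depth[..., None] * directions
```

A single target of this form spans monocular depth (rays from one known camera), multi-view geometry, and pose recovery, which is how one plain transformer can stand in for several specialised modules.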
Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang