
This Week's 5 Most Notable AI Research Papers - Week 47



Gaia Cavaglioni
November 21, 2025 - 6 min read

1. PAN: a world model for general, interactable, and long-horizon world simulation

Key points:

  1. PAN defines a general world-model supporting long-horizon, action-conditioned simulation.
  2. It fuses latent reasoning (via LLM backbone) with visual realism (video-diffusion decoder).
  3. Experiments show superior performance in forecasting and simulative reasoning across diverse environments.
  4. The architecture points toward AI that plans and imagines, not just reacts.

This paper introduces PAN (Predictive Action Network), a world model designed to simulate future events over extended time spans while letting users steer the scene's progression through natural-language instructions. PAN combines a latent-dynamics module built on a large language model with a video-generation decoder that turns abstract predictions into coherent visual sequences. Trained on extensive datasets pairing actions with videos from a variety of environments, the model can predict the outcomes of user-specified actions while maintaining visual and logical consistency over long horizons. Experiments demonstrate that PAN significantly outperforms prior world models and video generators in forecasting accuracy, multi-step reasoning, and action-driven simulation.
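The simulate-then-render loop described above can be sketched in a few lines. Everything here is a stand-in, not the paper's actual modules: a toy latent transition plays the role of the LLM backbone, and a trivial "decoder" stands in for the video-diffusion model that renders latents into frames.

```python
def latent_step(state, action):
    """Hypothetical action-conditioned latent transition (stands in for the LLM backbone)."""
    return [s + 0.1 * a for s, a in zip(state, action)]

def decode_frames(state, n_frames=2):
    """Hypothetical decoder: one latent state -> a short clip of 'frames'."""
    return [[round(x, 3) for x in state] for _ in range(n_frames)]

def rollout(initial_state, actions):
    """Long-horizon simulation: condition each step on a user-supplied action."""
    state, clips = initial_state, []
    for action in actions:
        state = latent_step(state, action)   # reason in latent space
        clips.append(decode_frames(state))   # render latents to pixels
    return clips

clips = rollout([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(clips))  # one clip per action
```

The key structural idea this illustrates is the separation of concerns: reasoning about *what happens next* lives in the latent dynamics, while *what it looks like* is delegated to the decoder.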

Read the full article here

Authors: PAN Team, Institute of Foundation Models - Jiannan Xiang, Yi Gu, Zihan Liu, Zeyu Feng, Qiyue Gao, Yiyan Hu, Benhao Huang, Guangyi Liu, Yichi Yang, Kun Zhou, Davit Abrahamyan, Arif Ahmad et al.

2. AA-Omniscience: evaluating cross-domain knowledge reliability in large language models

Key points:

  1. New benchmark evaluates both recall and calibration (Omniscience Index: –100 to +100).
  2. Only a handful of frontier models achieve positive scores; many still underperform.
  3. Domain shifts reveal major reliability weaknesses.
  4. High broad-capability scores do not guarantee trustworthy knowledge.
  5. Domain-aware evaluation and calibration must be central to governance and deployment.

This study introduces AA-Omniscience, a benchmark designed to evaluate the reliability with which large language models recall factual knowledge and recognise their limitations. The dataset comprises approximately 6,000 expert-sourced questions spanning 42 topics across six economically significant fields. These questions are paired with an evaluation metric, the Omniscience Index, which rewards abstention when uncertain while penalising incorrect answers and excessive confidence. Using this framework, the authors evaluate a range of leading LLMs and find that only a small subset achieves a net-positive reliability score, with even the strongest model achieving only a modest value. Performance varies sharply across domains, revealing that a model that excels in one area may be unreliable in others. These findings highlight a persistent issue in current evaluation standards: generic benchmarks often mask weaknesses in factual grounding and uncertainty management. The study emphasises the necessity of domain-specific assessments and improved calibration techniques for the safe use of LLMs in high-stakes environments, where accuracy and self-awareness are paramount.
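The scoring idea can be illustrated with a hedged sketch of a calibration-aware metric in the spirit of the Omniscience Index: correct answers score +1, wrong answers −1, and abstentions 0, so guessing under uncertainty costs more than declining to answer. The exact weighting in the paper may differ; this shows only the shape of the incentive.

```python
def omniscience_index(outcomes):
    """outcomes: list of 'correct' | 'incorrect' | 'abstain'.
    Returns a score scaled to [-100, +100]."""
    points = {"correct": 1, "incorrect": -1, "abstain": 0}
    return 100 * sum(points[o] for o in outcomes) / len(outcomes)

# A model that abstains when unsure beats one that guesses and misses:
print(omniscience_index(["correct"] * 6 + ["abstain"] * 4))    # 60.0
print(omniscience_index(["correct"] * 6 + ["incorrect"] * 4))  # 20.0
```

Under any scoring of this shape, accuracy alone cannot produce a positive score; knowing when *not* to answer is rewarded directly.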

Read the full article here

Authors: Declan Jackson, William Keating, George Cameron, Micah Hill-Smith

3. AI agents, productivity, and higher-order thinking: early evidence from software development

Key points:

  1. Coding agents boosted weekly merge output by nearly 40%.
  2. More senior users were notably more receptive to agent-generated code.
  3. Use expanded beyond engineering roles to multidisciplinary teams.
  4. Programming moved from detailed typing to giving structured instructions.
  5. Higher-order reasoning and oversight skills are becoming more critical.

This study examines how the introduction of AI coding agents has reshaped software productivity and the cognitive demands placed on workers. Using data from a development platform whose users (ranging from experienced engineers to product and design staff) could generate code via natural-language instructions, the analysis reveals significant behavioural and output changes once the agent became the primary means of code generation. Experienced developers wrote clearer prompts and were more likely to include planning, likely helping the agent's outputs align with their intentions; they were also more likely to accept the system's suggestions. Weekly merged output across the platform rose by almost 40%, particularly after the agent became the platform's default code-generation method. The paper argues that agentic tools not only speed up production but also shift how people plan, evaluate solutions, and write clear instructions: software development becomes increasingly about conceptualising what the software should do rather than typing it out by hand.

Read the full article here

Authors: Suproteem K. Sarkar

4. GPT-5.1-Codex-Max system card

Key points:

  1. First model trained natively to work across multiple context windows via a compaction process.
  2. Trained on real workflows (PRs, code review, etc.).
  3. Improved reasoning efficiency: fewer tokens, better performance.
  4. Safety-first: sandboxing, log tracking and optional network access.
  5. Evaluated for cybersecurity and biology; self-improvement remains limited.

OpenAI has unveiled GPT-5.1-Codex-Max, a new agentic coding model designed specifically for extended software engineering workflows. Relying on an upgraded reasoning stack, it introduces a method that compresses and reorganises information, enabling the model to continue working effectively even when handling extremely large contexts. The system card also outlines OpenAI’s safety framework, which includes built-in restrictions against harmful actions, secure execution environments and robust oversight tools. Capability tests carried out under OpenAI’s Preparedness Framework demonstrate that, while not achieving the highest classification, the model performs well in cybersecurity and is highly capable in biological domains with appropriate safeguards. However, it does not demonstrate top-tier capability in areas associated with AI self-improvement.
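A minimal sketch of the compaction idea described above (assumed mechanics, not OpenAI's implementation): when the working context exceeds a budget, older entries are collapsed into a short summary so a session can continue across what would otherwise be multiple context windows.

```python
def summarise(entries):
    """Hypothetical summariser; a real system would use the model itself."""
    return f"[summary of {len(entries)} earlier steps]"

def compact(context, budget, keep_recent=2):
    """Replace old entries with a summary once the context exceeds budget."""
    if len(context) <= budget:
        return context
    old, recent = context[:-keep_recent], context[-keep_recent:]
    return [summarise(old)] + recent

ctx = ["step 1", "step 2", "step 3", "step 4", "step 5"]
print(compact(ctx, budget=3))
# ['[summary of 3 earlier steps]', 'step 4', 'step 5']
```

The trade-off this makes explicit: recent steps stay verbatim while older history survives only in compressed form, which is what lets long-horizon agentic sessions keep going without unbounded context growth.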

Read the full article here

Authors: OpenAI

5. Depth Anything 3: recovering the visual space from any views

Key points:

  1. A unified transformer backbone can reconstruct 3D geometry from any number of images.
  2. Simplifying the target to depth-ray improves generalisation and training efficiency.
  3. DA3 outperforms prior state-of-the-art models by large margins on pose and geometry tasks.
  4. It also surpasses its predecessor on single-image depth estimation.
  5. The work paves the way for general-purpose 3D perception models supporting robotics, AR/VR and beyond.

The paper introduces Depth Anything 3 (DA3), a model designed to recover scene geometry from any collection of images, whether or not the camera viewpoints are known. Instead of relying on numerous specialised modules, the authors show that a general-purpose transformer paired with a single depth-ray prediction objective can handle a wide range of 3D perception tasks. DA3 is trained in a teacher–student setup and evaluated on a newly assembled suite of tests covering pose recovery, multi-view geometry and rendering quality. Across these tests the model improves substantially on the prior state of the art: camera-pose estimation is around 44% more accurate and spatial reconstruction over 25% more precise than the best previous model. DA3 also surpasses its predecessor at single-image depth estimation, despite being trained only on publicly available datasets.
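A simple geometric reading of the depth-ray objective (an illustration, not DA3's actual code): if the model predicts, for each pixel, a viewing ray and a depth along it, a 3D point cloud follows directly.

```python
def point_from_depth_ray(origin, ray_dir, depth):
    """3D point = camera origin + depth * unit ray direction."""
    norm = sum(c * c for c in ray_dir) ** 0.5
    return tuple(o + depth * c / norm for o, c in zip(origin, ray_dir))

# A pixel looking straight down the z-axis, 2.5 units away:
print(point_from_depth_ray((0, 0, 0), (0, 0, 1), 2.5))  # (0.0, 0.0, 2.5)
```

This is why a single depth-ray target can subsume several classical outputs: depth maps, point clouds and (via the rays) camera geometry all fall out of the same prediction.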

Read the full article here

Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang





Tags: PAN, AI agents, LLMs