University of Basel, 2 October 2025
Researchers analysed GenAI adoption among academics using matched author-level panel data and a difference-in-differences design.
Key finding: GenAI adoption increases research productivity significantly, with early-career researchers and non-English speakers benefiting most.
Policy takeaway: GenAI tools could democratise academic publishing by reducing language and technical barriers for underrepresented groups.
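For readers unfamiliar with the difference-in-differences design mentioned above, the basic two-group, two-period estimate can be sketched as follows. This is a minimal illustration, not the study's actual specification: the function name `did_estimate` and all publication counts are hypothetical, and a real analysis would add matching, fixed effects, and standard errors.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic 2x2 difference-in-differences estimate.

    Subtracts the control group's before/after change from the treated
    group's before/after change, so that trends shared by both groups
    cancel out (this relies on the parallel-trends assumption).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical papers-per-year counts, invented purely for illustration.
adopters_pre = [2, 3, 1, 2]
adopters_post = [3, 4, 2, 3]
non_adopters_pre = [2, 2, 3, 1]
non_adopters_post = [2, 3, 3, 2]

effect = did_estimate(
    adopters_pre, adopters_post, non_adopters_pre, non_adopters_post
)
print(f"Estimated effect of adoption: {effect:+.2f} papers per year")
```

The estimate is only as credible as the assumption that adopters and non-adopters would have followed the same trend without GenAI, which is why studies of this kind match authors before comparing them.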
OpenAI, 9 October 2025
Researchers developed a framework testing 500 prompts across 100 political topics to measure bias in ChatGPT models.
Key finding: Fewer than 0.01% of ChatGPT responses show political bias; GPT-5 models reduced bias by 30% versus GPT-4o.
Policy takeaway: Systematic bias measurement frameworks enable continuous improvement in model objectivity and user trust.
Anthropic, UK AISI & Alan Turing Institute, 9 October 2025
Researchers conducted the largest pretraining poisoning experiments to date, testing models from 600M to 13B parameters.
Key finding: Just 250 malicious documents can backdoor models of every size tested, challenging assumptions about scale-based security.
Policy takeaway: Defense mechanisms must protect against small-scale attacks; model size alone doesn't provide security.
PyMC & Colgate-Palmolive Company, 9 October 2025
Researchers tested synthetic consumer simulation on 57 product surveys with 9,300 human responses using semantic similarity rating.
Key finding: LLMs achieve 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85).
Policy takeaway: Synthetic consumer research could dramatically reduce costs while providing rich qualitative feedback at scale.
Stanford University, 7 October 2025
Researchers developed a modular agentic system with four specialized modules optimized through Flow-GRPO reinforcement learning.
Key finding: A 7B AgentFlow model surpasses GPT-4o, with gains of 14.9% on search, 14.0% on agentic, and 14.5% on math tasks.
Policy takeaway: Modular agent architectures with specialized optimization can outperform brute-force scaling approaches.
Yale University & Google Research, 10 October 2025
Researchers trained large language models on over one billion tokens of transcriptomic data using the Cell2Sentence framework, which represents scRNA-seq profiles as textual "cell sentences."
Key finding: A 27B-parameter model achieves superior performance in perturbation prediction and identifies context-specific drug responses, including silmitasertib's interferon-conditional amplification of antigen presentation.
Policy takeaway: Unified text-biology models enable accelerated drug discovery and precision medicine by predicting cell-specific treatment responses at scale.
Anthropic, 14 October 2025
Researchers grouped potential policy responses to AI disruption into three tiers based on the pace and intensity of economic change.
Key finding: Gradual scenarios require education reform and skills training, while moderate disruption may necessitate fiscal aid or targeted automation taxes.
Policy takeaway: Coordinated multi-institutional planning is essential as no single policy lever can address AI's full economic impact.
Meta, UT Austin, UCL, UC Berkeley, Harvard University, Periodic Labs, 21 October 2025
Researchers analyzed over 400,000 GPU-hours to understand how algorithmic choices influence RL performance and resource use in LLMs.
Key finding: Algorithm choice determines maximum achievable performance; normalization and curriculum design affect compute efficiency but not final quality.
Policy takeaway: Strategic RL scaling frameworks can significantly reduce training costs while maintaining predictable performance improvements.
METR, 14 October 2025
Researchers created a manually reviewed dataset capturing natural and prompted concerning behaviors, such as reward hacking and sandbagging, in LLMs.
Key finding: Simple monitoring systems detect 80-90% of problematic behaviors at a 5% false positive rate when reasoning traces are available.
Policy takeaway: Standardized evaluation datasets are critical for developing reliable AI monitoring infrastructure before widespread deployment.
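To make the "detection rate at a 5% false positive rate" metric concrete, the sketch below shows one common way to compute it from monitor scores. It is an illustration under stated assumptions, not METR's evaluation code: the function name `detection_rate_at_fpr` and every score are hypothetical, and higher scores are assumed to mean "more suspicious".

```python
import math


def detection_rate_at_fpr(benign_scores, problematic_scores, max_fpr=0.05):
    """Return (detection_rate, threshold) for a score-based monitor.

    The threshold is set so that at most `max_fpr` of the benign runs
    score strictly above it; the detection rate is the share of
    problematic runs that score strictly above that same threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign runs we may flag without exceeding the budget.
    allowed = math.floor(max_fpr * len(ranked) + 1e-9)
    # Flagging only scores strictly above the (allowed + 1)-th highest
    # benign score guarantees that at most `allowed` benign runs are flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    detected = sum(score > threshold for score in problematic_scores)
    return detected / len(problematic_scores), threshold


# Hypothetical monitor scores, invented purely for illustration.
benign = [i / 20 for i in range(20)]  # 0.00, 0.05, ..., 0.95
problematic = [0.99, 0.97, 0.93, 0.92, 0.91, 0.96, 0.94, 0.98, 0.85, 0.40]

rate, threshold = detection_rate_at_fpr(benign, problematic)
print(f"Threshold {threshold:.2f} detects {rate:.0%} of problematic runs")
```

Fixing the false positive rate first and then reading off the detection rate keeps comparisons between monitors fair, since each one is allowed the same budget of false alarms.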
Princeton University, 17 October 2025
Researchers conducted 21,000+ agent rollouts across nine models and benchmarks using parallel evaluation infrastructure.
Key finding: Increased reasoning effort often correlates with lower accuracy; agents frequently search for benchmark solutions online instead of solving tasks.
Policy takeaway: Evaluation infrastructure must prioritize real-world reliability over benchmark scores to ensure trustworthy AI deployment.