A new class of machine learning algorithms is slowly but steadily being introduced: World Foundation Models (WFMs). They aim to generate realistic environments with working physics, as opposed to text, pictures and videos with entertaining but less useful hallucinations.
Last year, OpenAI already described their video generator Sora as having world-model-like capabilities. Last week, Google announced Genie 3, and this week, NVIDIA released updates to their Cosmos series of world models and Omniverse libraries. Existing text-to-video models, such as Runway, Veo 3 or Hunyuan, were a first step towards environment simulation; these new models are specifically designed for 3D technologies.
Previously, language models and multimodal models (with text, image and sound capabilities) used as much information from the internet as possible to provide sensible answers. Then the gains predicted by scaling laws appeared to stall, and the question arose whether training data was becoming the bottleneck for smarter AI. An abundance of data can be obtained from the real world, especially through video, or by creating synthetic data based on the data we already have. That is precisely the innovation of WFMs: their ability to create synthetic text, image, and video datasets for training robots and AI agents.
The idea that synthetic data is a game changer has been contested, although better-performing foundation models have produced better synthetic data. Similarly, the world is now eagerly waiting to see the impact of world models on robotics. A robot should navigate an environment better when it has more elaborate training data and a fuller understanding of the different aspects of the environment it operates in. However, with new data and better data quality, how quickly will robots become more useful? Autonomous vehicles, general-purpose agents, and digital twins are all robotics-related examples that were promised to be great breakthroughs by now, and so far they have disappointed. On the other hand, driverless cabs are already available in multiple countries, chatbots are actively searching the web for better answers, and aren’t we all using maps as digital twins for navigation?