Robotics, Vision, and Multimodal Systems are defining the Next Phase of AI

Robert Praas, Francisco Ríos, Pierre-Alexandre Balland

December 12, 2025 - 2 min read

The task descriptions of Hugging Face models reveal what the models are meant to be used for, and tracking them over the years shows some fascinating developments. Robotics has been the fastest-growing category since 2022. This year, text and image generation are the most common tasks for models. The two increasingly converge in multimodal “any-to-any” models, which climbed from 41st to 6th place in only four years.

More task-specific categories, such as multiple-choice or document QA models, are being overtaken by more flexible models: unified architectures that can handle multiple input and output types within a single model family.

The rise of robotics aligns with what we also see in recent conference signals, including our NeurIPS affiliation and topic analysis. At the same time, NLP continues to be the largest category on the hub, reflecting the fact that language tasks still anchor most model releases, even as new modalities and application areas expand.

Computer vision keeps expanding too, but in a more distributed way. Instead of one single vision task dominating, growth shows up across many specialised categories, reflecting a widening set of use cases and model types within the vision ecosystem.

Note that the charts limited to the top 10 and top 25 tasks show a task like "any-to-any" in 10th or 25th place in earlier years, even though its actual rank was lower at the time, as the top 50 view shows.
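
The effect described in the note can be illustrated with a minimal Python sketch. It assumes that a chart limited to the top N tasks pins any task ranked below the cutoff to the last visible slot; that is our reading of the note, not a confirmed description of how the charts were built. The function name `displayed_rank` and all ranks other than the 41st place mentioned above are hypothetical and chosen only for illustration.

```python
# Hypothetical full ranking for one earlier year (rank 1 = most common task).
# Only the 41st place for "any-to-any" comes from the text; the rest is made up.
full_ranks = {
    "text-generation": 1,
    "text-to-image": 2,
    "robotics": 18,
    "any-to-any": 41,
}


def displayed_rank(true_rank: int, top_n: int) -> int:
    """Rank as drawn in a chart limited to the top_n tasks.

    Assumption: a task ranked below the cutoff is pinned to the last
    visible slot instead of being dropped from the chart.
    """
    return min(true_rank, top_n)


for top_n in (10, 25, 50):
    shown = {task: displayed_rank(rank, top_n) for task, rank in full_ranks.items()}
    print(f"top {top_n}: {shown}")
```

Under this assumption, a task that really sits in 41st place appears as 10th in the top 10 chart and 25th in the top 25 chart, and only the top 50 view shows its true position.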


