Academic AI research is increasingly built on open-source foundations. The models and datasets researchers cite reveal who is driving innovation and where the ecosystem is heading. A year-end analysis of Hugging Face Daily Papers offers a lens into this landscape.
Daily Papers, curated primarily by AK (Ahsen Khaliq) and the research community, showcases cutting-edge AI research on Hugging Face. By tracking which models and datasets appear in these papers, we can measure the impact of open-source AI in academic work and the preferences shaping the open-source community. The data captures 6,524 papers featured in 2025, referencing 9,154 unique models and 3,206 unique datasets.
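The counting behind these figures can be sketched in a few lines. The paper records and repo IDs below are hypothetical; the only assumption carried over from the Hub is that a repo ID's namespace (the part before the slash) identifies its provider.

```python
from collections import Counter

# Hypothetical paper records -- illustrative repo IDs, not the real dataset.
papers = [
    {"models": ["Qwen/Qwen2.5-7B", "nvidia/NV-Embed-v2"], "datasets": ["mteb/banking77"]},
    {"models": ["Qwen/Qwen2.5-7B"], "datasets": ["mteb/sts12-sts"]},
]

def tally(papers, kind):
    """Count references per provider (the namespace before the slash)."""
    counts = Counter()
    for paper in papers:
        for repo_id in paper[kind]:
            counts[repo_id.split("/")[0]] += 1
    return counts

model_refs = tally(papers, "models")
dataset_refs = tally(papers, "datasets")
```

Aggregating at the namespace level is what lets individual accounts, labs, and consortia all surface in the same ranking.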
Examining the top 20 providers, who together have contributed 447 datasets and 2,176 models, reveals a surprisingly diverse ecosystem. No single organisation dominates. OpenMed leads with an overall score of 10.4%, thanks to a strong showing in model references at 11.7%. It is followed by Qwen (representing the Alibaba consortium) at 9.7%, Unsloth (9.7%), MTEB (8.5%), NVIDIA (8.5%), and OpenGVLab (7.5%). The long tail is extensive: beyond the leaders, organisations contribute hundreds of models and datasets that researchers actively use.
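For concreteness, a share such as OpenMed's 11.7% of model references is just a provider's count divided by the total. How the "overall score" combines model and dataset shares is not specified here, so the aggregation below (an unweighted mean of the two) is an illustrative assumption, as are the counts.

```python
def shares(counts):
    """Convert raw reference counts into percentage shares of the total."""
    total = sum(counts.values())
    return {provider: 100 * n / total for provider, n in counts.items()}

# Illustrative counts only -- not the real Daily Papers tallies.
model_counts = {"OpenMed": 117, "Qwen": 97, "NVIDIA": 86}
dataset_counts = {"OpenMed": 9, "Qwen": 5, "NVIDIA": 11}

model_shares = shares(model_counts)
dataset_shares = shares(dataset_counts)

# Assumed aggregation: unweighted mean of a provider's two shares.
overall = {p: (model_shares[p] + dataset_shares[p]) / 2 for p in model_counts}
```

A weighted combination (e.g. by total reference volume) would be equally plausible; the point is only that the ranking is derived from per-provider shares of citations.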
This distribution challenges the typical tech narrative. Private companies, academic institutions, individual developers and international consortia all participate meaningfully. NVIDIA's presence spans not only compute infrastructure but also open models. OpenGVLab contributes vision-focused research from Shanghai, individual contributors such as Mungert (based in the UK) appear alongside large labs, and the Japan-based LLM-jp consortium brings together an active community of over 1,000 researchers from universities and corporations, demonstrating global participation.
The dataset landscape, however, shows more concentration. MTEB, a framework for evaluating embeddings and retrieval systems across languages, dominates with 35.9% of references. It is followed by a more distributed set of contributors, including Common Pile (researchers collecting public-domain data for LM training), NVIDIA, and GreenNode (an AI infrastructure company in Southeast Asia).
A striking pattern emerges: researchers cite models far more frequently than datasets, with roughly four times as many model references as dataset references among the top 20 shown in the sunburst chart. Yet specific dataset frameworks like MTEB achieve outsized influence, capturing over a third of all dataset citations. This suggests that foundational evaluation tools matter enormously for standardising research practices.
The medical AI sector shows particularly strong open-source adoption. OpenMed, dedicated to healthcare AI, maintains a significant presence; its focus on biomedical named entity recognition and clinical reasoning reflects academic medicine's embrace of open tools.
What's most notable isn't who leads, but how distributed leadership is. In this community-curated view of open-source AI research, Microsoft is absent and Google barely makes an appearance, ranking 20th by model occurrences (56), and only through its DeepMind lab. Instead, impact is fragmented across dozens of contributors. This isn't a sign of weakness; it's a sign of resilience. When many organisations build foundational tools, the ecosystem becomes harder to control and easier to extend.
The international dimension reinforces this resilience, although the picture could change as foundation models become more expensive to train. For now, academic AI research draws from a genuinely diverse technical base, one where a non-profit medical AI initiative or an AI infrastructure start-up in Southeast Asia can be as important as a major tech company. This diversity may well be open source's most important contribution.
Note: The outer ring displays the top 10 most-referenced datasets/models for the top 20 providers, sized by citation count. The CSV export includes all items from the top 20 providers.