
AI labs are chasing ever-larger models trained on proprietary data troves, driving up costs, energy demands and legal risks around copyright.
Founded in 2024 by Pierre-Carl Langlais, Ivan Yamshchikov and Anastasia Stasenko, Paris-based Pleias builds energy-efficient, high-performing language models trained exclusively on open datasets. Its flagship release is Common Corpus, the largest open training dataset at 2 trillion tokens.
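Common Corpus is distributed through Hugging Face, so it can be inspected in streaming mode without downloading the full corpus. Below is a minimal sketch using the `datasets` library; the dataset id "PleIAs/common_corpus" and the record fields are assumptions based on Hugging Face naming conventions, not confirmed details.

```python
# Minimal sketch: sample a few records from Common Corpus via streaming,
# so the 2-trillion-token corpus is never downloaded in full.
from datasets import load_dataset

# "PleIAs/common_corpus" is an assumed dataset id for illustration.
corpus = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

for record in corpus.take(3):
    # Field names vary by dataset; print the keys to see what each record carries.
    print(record.keys())
```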
Compliant with EU legislation, these models excel at enterprise tasks such as document processing and retrieval-augmented generation (RAG). They run efficiently on modest hardware, including consumer GPUs and CPUs, while preserving traceability and compliance, and they reason well across languages without the typical drop in performance on non-English content. Pleias's small language models (SLMs) with fewer than 1 billion parameters lead open-source benchmarks for efficiency.
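Because these SLMs fit comfortably in CPU memory, they can be served without a GPU. A minimal sketch of CPU-only inference with the `transformers` pipeline follows; the model id "PleIAs/Pleias-Pico" is a hypothetical placeholder used for illustration, not a confirmed checkpoint name.

```python
# Minimal sketch: CPU-only text generation with a sub-1B-parameter model.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="PleIAs/Pleias-Pico",  # assumed id; substitute the actual checkpoint
    device=-1,                   # -1 pins the pipeline to CPU
)

result = generator("Summarize this contract clause:", max_new_tokens=64)
print(result[0]["generated_text"])
```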
Funding supports scaling multilingual synthetic-data pipelines such as SYNTH, as well as vertical models for sectors ranging from healthcare to government.
Sources: Pleias | Hugging Face
Founders: Pierre-Carl Langlais, Ivan Yamshchikov, Anastasia Stasenko