Cohere For AI (C4AI), the non-profit research lab established by AI startup Cohere, has released a new open-source large language model named Aya. Covering 101 languages, Aya more than doubles the number of languages supported by previous open-source models. But why does this matter? And what could it mean for the future of global communication and AI research?
Let’s unpack the release of Aya and its accompanying dataset.
The initiative aims to address long-standing gaps in AI language accessibility and cultural representation. By massively expanding multilingual capabilities, Aya opens up AI research for dozens of underserved languages and the communities that speak them.
With training data contributed by over 3,000 independent researchers across 119 countries, Aya covers an unusually wide range of languages. It adds support for more than 50 languages that earlier models did not cover, including Amharic, Uzbek, Somali and many more.
Benchmark testing shows Aya significantly outperforming other open-source multilingual models such as mT0 and Bloomz. It scored over 75% in human evaluations against competing models and 80-90% in simulated win-rate comparisons.
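To make the win-rate figures concrete, here is a minimal sketch of how a pairwise win rate can be computed from per-prompt preference judgments. The function name, the label scheme and the sample judgments are hypothetical illustrations, not C4AI's evaluation code or data, and counting ties as non-wins is an assumption (some evaluations drop or split ties instead).

```python
from collections import Counter


def win_rate(judgments, model="A"):
    """Return the fraction of pairwise comparisons won by `model`.

    `judgments` holds one label per prompt: "A", "B", or "tie".
    Ties count as non-wins here (an assumption for this sketch).
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    return counts[model] / len(judgments)


# Hypothetical judgments for 10 prompts, not real evaluation data.
sample = ["A", "A", "B", "A", "tie", "A", "A", "B", "A", "A"]
rate = win_rate(sample)
print(f"Model A win rate: {rate:.0%}")
```

Under this reading, a 75% score means the model's answer was preferred in three out of every four comparisons.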
C4AI has also released the Aya Datasets, the largest human-annotated, multilingual collection for AI training to date. The datasets help the model perform well even in rare languages, for which training data is scarce.
The implications are profound. For one, Aya brings linguistic inclusivity to the forefront of AI research, offering a voice and digital presence to over 50 languages previously unserved by state-of-the-art models. This inclusivity is not merely symbolic; it’s a stride toward equal representation in the digital arena, ensuring that AI’s reach is as universal as its intended purpose.
Moreover, Aya’s dataset, with approximately 204,000 rare human-curated annotations by fluent speakers in 67 languages, is an important step towards preserving the linguistic diversity that defines our human heritage.
By open-sourcing both the model and datasets under the permissive Apache 2.0 license, C4AI aims to catalyze AI innovation globally. "We expect Aya's capabilities to continue improving through ongoing collaboration," explained a C4AI team member. "Researchers worldwide can now leverage this tool to progress multilingual AI."
If you want to join this open science initiative and make sure your language is represented, you can visit the Aya Project website. You can also access the Aya model directly via Cohere's Playground, or download the model and dataset.