Aya, developed by Cohere For AI, is a state-of-the-art multilingual large language model (LLM) that emerged from a global open science initiative to advance machine learning in diverse languages. The project's core objective is to reduce the linguistic bias prevalent in current natural language processing (NLP) technologies, where a disproportionate focus on English has left many of the world's languages underrepresented. The ultimate goal is to build a series of state-of-the-art multilingual generative language models that leverage the collective wisdom and contributions of people from around the globe.
Aya 101 is the first model released in the series. It is a 13-billion-parameter model released under the Apache 2.0 license. Accompanying Aya is a large multilingual instruction dataset with 513 million data points across 114 languages, aimed at addressing the needs of underserved languages. The initiative involved over 3,000 researchers from 119 countries and seeks to democratize AI technology globally, particularly for languages that have been largely ignored by existing models.
Here are some key features of the Aya model:
The following are the technical specifications of the model.
The Aya model supports a broad range of languages, classified into higher-, mid-, and lower-resourced categories. This includes widely spoken languages like English, Chinese, and Spanish, and less commonly represented languages such as Afrikaans, Amharic, and Azerbaijani. The model also covers languages with various scripts and families, highlighting its comprehensive multilingual capabilities.
Aya supports 101 languages with varying levels of 'resourcedness'. Some of the higher-resourced languages it supports are:
Full list of languages Aya 101 supports:
Afrikaans · Albanian · Amharic · Arabic · Armenian · Azerbaijani · Basque · Belarusian · Bengali · Bulgarian · Burmese · Catalan · Cebuano · Chichewa · Chinese · Corsican · Czech · Danish · Dutch · English · Esperanto · Estonian · Filipino · Finnish · French · Galician · Georgian · German · Greek · Gujarati · Haitian Creole · Hausa · Hawaiian · Hebrew · Hindi · Hmong · Hungarian · Icelandic · Igbo · Indonesian · Irish · Italian · Japanese · Javanese · Kannada · Kazakh · Khmer · Korean · Kurdish · Kyrgyz · Lao · Latin · Latvian · Lithuanian · Luxembourgish · Macedonian · Malagasy · Malay · Malayalam · Maltese · Maori · Marathi · Mongolian · Nepali · Norwegian · Pashto · Persian · Polish · Portuguese · Punjabi · Romanian · Russian · Samoan · Scottish Gaelic · Serbian · Shona · Sindhi · Sinhala · Slovak · Slovenian · Somali · Sotho · Spanish · Sundanese · Swahili · Swedish · Tajik · Tamil · Telugu · Thai · Turkish · Ukrainian · Urdu · Uzbek · Vietnamese · Welsh · West Frisian · Xhosa · Yiddish · Yoruba · Zulu
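To illustrate how a model covering the languages above might be prompted, here is a minimal Python sketch. It rests on several assumptions that this article does not state: that the weights are published on the Hugging Face Hub under the name `CohereForAI/aya-101`, that the checkpoint is an encoder-decoder (sequence-to-sequence) model, and that the `transformers` library with a PyTorch backend is installed. The helper `build_instruction` is a hypothetical convenience function, not part of any library.

```python
# Assumed Hugging Face checkpoint name; it is not given in this article.
CHECKPOINT = "CohereForAI/aya-101"


def build_instruction(task: str, text: str) -> str:
    """Hypothetical helper: join a task description and input text into one prompt."""
    return f"{task.strip()}: {text.strip()}"


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Load the checkpoint and generate a reply.

    Calling this downloads the full 13-billion-parameter model, so it needs
    substantial disk space and memory.
    """
    # Imported lazily so that defining this function does not require the library.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    # Assumption: the checkpoint loads as a sequence-to-sequence model.
    model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

A call such as `generate(build_instruction("Translate to English", "Aya çok dilli bir dil modelidir."))` would then send a Turkish sentence to the model. Nothing is downloaded until `generate` is actually called, which keeps the sketch cheap to import and inspect.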
To download the model and check its dataset, see the links below: