The Rise of Multilingual LLMs: Cohere Unveils Aya 23

Cohere unveils Aya 23, a family of advanced multilingual models trained on 23 languages, enhancing global AI communication.

Cohere for AI announced Aya 23, a family of large language models focused on multilingual capabilities across 23 languages. Aya 23 was released in two variants – Aya 23 8B and Aya 23 35B. The Aya 23 family builds on Cohere’s Command models, which are pre-trained on a data mixture spanning 23 languages, and on the Aya multilingual instruction-style collection.

Table of Contents

  1. Overview of Aya 23 and its Significance in Multilinguality
  2. Aya Family of Models
  3. Benchmark Results

Overview of Aya 23 and its Significance in Multilinguality

Aya 23 is a state-of-the-art (SOTA) expansion of the Aya 101 model. Its two variants, at 8B and 35B parameters, outperform previous multilingual language models such as Aya 101, Gemma, Mistral and Mixtral across an extensive range of discriminative and generative tasks. Cohere for AI released the open weights for both the 8B and 35B Aya 23 models on Hugging Face for experimentation, research and audit (https://huggingface.co/spaces/CohereForAI/aya-23?ref=cohere-ai.ghost.io).
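As a quick illustration, the sketch below loads one of the released checkpoints with the Hugging Face transformers library and generates a short multilingual reply. The repository id CohereForAI/aya-23-8B and the chat-template usage are assumptions based on Cohere’s usual release pattern, so verify them on the model card before running.

```python
# Minimal sketch: running an Aya 23 checkpoint with Hugging Face transformers.
# Assumption: the repo id "CohereForAI/aya-23-8B" and chat-template support are
# inferred from Cohere's typical release pattern; check the model card first.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CohereForAI/aya-23-8B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # keeps the 8B model within a single modern GPU
    device_map="auto",
)

# A multilingual prompt (Hindi: "How is the weather today?")
messages = [{"role": "user", "content": "आज मौसम कैसा है?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```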

The Aya 23 models are trained on data from 23 languages using a standard decoder-only transformer architecture. Key features include the SwiGLU activation function, rotary positional embeddings (RoPE) for better context handling, and a 256k-vocabulary BPE (Byte-Pair Encoding) tokenizer trained on a balanced subset of the pre-training data. These pre-trained models are then instruction fine-tuned on a mixture of multilingual data.
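To make the architectural terms concrete, here is a minimal PyTorch sketch of a SwiGLU feed-forward block of the kind used in such decoder-only transformers. The layer sizes are illustrative placeholders, not Aya 23’s actual dimensions.

```python
# Minimal sketch of a SwiGLU feed-forward block, the activation named above.
# Dimensions are illustrative placeholders, not Aya 23's actual configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)  # gating branch
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)    # value branch
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU (Swish) applied to the gate, multiplied elementwise with the value branch
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Example usage with placeholder sizes
block = SwiGLUFeedForward(d_model=512, d_hidden=2048)
hidden = block(torch.randn(2, 16, 512))  # (batch, sequence, d_model)
print(hidden.shape)  # torch.Size([2, 16, 512])
```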

Figure: Multilingual benchmark results covering 5 task categories from 8 datasets – Aya 23 outperforms the Aya-101-13B, Bactrian-X-7B, Gemma-1.1-7B-it, Mistral-7B-Inst-v0.2 and Mixtral-8x7B-Inst models.

Aya Family of Models

The Aya initiative consists of three models – Aya 23-8B, Aya 23-35B and Aya 101. Aya 101, launched in February 2024, has the core strength of handling 101 languages. It is a fine-tuned 13B mT5 model, trained on an instruction mixture drawn from a variety of sources:

  1. xP3x Dataset (Crosslingual Public Pool of Prompts eXtended) – xP3x is a mixture of tasks in 277 languages with English prompts; a subset of xP3x focusing on 101 languages was used.
  2. Data Provenance Collection – a curated collection of instruction datasets released under the Data Provenance Initiative.
  3. Aya Collection – 19 translated datasets covering 101 languages (a quick loading sketch follows this list).
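For readers who want to inspect the openly released Aya instruction data, the hedged sketch below pulls a slice of it from Hugging Face. The dataset id CohereForAI/aya_dataset (the human-annotated portion of the Aya effort) and its split names are assumptions; check the dataset card before relying on them.

```python
# Hedged sketch: peeking at the openly released Aya instruction data.
# Assumption: the dataset id "CohereForAI/aya_dataset" and a "train" split
# exist as described on Hugging Face; verify on the dataset card.
from datasets import load_dataset

dataset = load_dataset("CohereForAI/aya_dataset", split="train")
print(dataset)      # summary: number of rows and column names
print(dataset[0])   # one instruction/response pair with its language metadata
```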

Figure: Training data sources used for instruction fine-tuning Aya (yellow shows multilingual templates, blue shows human annotations, orange shows synthetic data and pink shows machine translation-based data).

Aya 23 is an experiment in shifting from breadth to depth. While Aya 101, a 13-billion-parameter model, showcased breadth by covering 101 languages, it ran into a known shortcoming: as more languages are added to a multilingual LLM, performance on individual languages often decreases. This is known as the curse of multilinguality. Aya 23, on the other hand, explores the impact of allocating more capacity to fewer languages, alleviating this issue and delivering better performance than the original Aya 101 and other widely used models such as Gemma, Mistral and Mixtral.

Benchmark Results

Aya 23-35B outperforms all comparison models on discriminative tasks with an average score of 70.8%. On multilingual MMLU (Massive Multitask Language Understanding), Aya 23-35B achieved an average accuracy of 48.2% across all languages and the highest score in 11 out of 14 languages within its class. On MGSM (multilingual mathematical reasoning), the Aya 23 models outperform all in-class baselines, indicating strong mathematical reasoning ability across languages; Aya 23-35B achieved a score of 53.7, ahead of the Mixtral-8x7B-Instruct-v0.1 baseline.

Figures: Multilingual benchmark results and % win rates for Aya 23 against comparison models.

Final Words

Aya 23 greatly improves performance for a subset of 23 languages, advancing multilingual technologies for a multilingual world. It is a significant step towards making AI technology accessible to a wider audience across a wide array of languages. The extensive evaluation demonstrates the strong performance of these models, and by releasing the model weights, Cohere has ensured this work can be built upon.

References

  1. Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model
  2. xP3x – Crosslingual Public Pool of Prompts eXtended
  3. The Data Provenance Initiative
  4. Lifting the Curse of Multilinguality by Pre-training Modular Transformers
  5. RoFormer: Enhanced Transformer with Rotary Position Embedding

Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.