Multilingual Tokenization Efficiency in Large Language Models: A Study on Indian Languages

Authors: Mohamed Azharudeen M, Balaji Dhamodharan

This study evaluates the tokenization efficiency of multilingual Large Language Models (LLMs), including BLOOM, XGLM, LLaMA, Mistral, and Gemma, focusing on Indian languages such as Assamese, Bengali, Gujarati, Hindi, Tamil, and Urdu. Tokenization, a crucial step for handling diverse character sets, is analyzed in terms of computational cost and processing time. BLOOM's tokenizer demonstrated superior efficiency, processing 549,000 characters in 5011.13 seconds at a cost of $11.43, making it the most cost-effective model. This comparative analysis highlights the impact of tokenization on computational cost, offering a baseline for LLM performance in Indian languages. Detailed results are presented in the findings section.
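The abstract's raw measurements (characters processed, wall-clock time, dollar cost) can be reduced to comparable per-model efficiency metrics. The sketch below is a minimal illustration of that reduction, using only the BLOOM figures reported above; the `efficiency` helper is a hypothetical name, and figures for the other models are not given in this excerpt.

```python
# Sketch: deriving throughput and unit-cost metrics from the raw tokenizer
# measurements reported in the abstract. Only BLOOM's figures (549,000
# characters, 5011.13 s, $11.43) appear in this excerpt; rows for XGLM,
# LLaMA, Mistral, and Gemma would come from the paper's findings section.

def efficiency(chars: int, seconds: float, cost_usd: float) -> dict:
    """Convert raw measurements into comparable per-model metrics."""
    return {
        "chars_per_sec": chars / seconds,          # processing throughput
        "usd_per_100k_chars": cost_usd / chars * 100_000,  # unit cost
    }

bloom = efficiency(549_000, 5011.13, 11.43)
print(f"BLOOM: {bloom['chars_per_sec']:.1f} chars/s, "
      f"${bloom['usd_per_100k_chars']:.3f} per 100k chars")
```

With every model's measurements expressed this way, the cost-effectiveness ranking claimed in the abstract follows directly from comparing the `usd_per_100k_chars` column.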
