This study evaluates the tokenization efficiency of multilingual Large Language Models (LLMs), including BLOOM, XGLM, LLaMA, Mistral, and Gemma, on Indian languages such as Assamese, Bengali, Gujarati, Hindi, Tamil, and Urdu. Tokenization, a crucial preprocessing step for handling diverse character sets, is analyzed in terms of computational cost and processing time. BLOOM's tokenizer demonstrated the highest efficiency, processing 549,000 characters in 5011.13 seconds at a cost of $11.43, making it the most cost-effective of the models compared. This comparative analysis highlights the impact of tokenization on computation and offers a baseline for LLM performance in Indian languages. Detailed results are presented in the findings section.
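The reported figures for BLOOM imply per-character throughput and cost metrics that can be derived directly. A minimal sketch, using only the numbers stated above (549,000 characters, 5011.13 seconds, $11.43); the normalization to cost per million characters is an illustrative choice, not a metric defined in the paper:

```python
# Deriving normalized efficiency metrics from the figures reported
# for BLOOM's tokenizer. Inputs are taken verbatim from the abstract;
# the per-million-character normalization is our own convention.
chars = 549_000        # characters processed
seconds = 5011.13      # total processing time
cost_usd = 11.43       # total cost in USD

throughput = chars / seconds                      # characters per second
cost_per_million = cost_usd / chars * 1_000_000   # USD per million characters

print(f"Throughput: {throughput:.2f} chars/s")            # ~109.56 chars/s
print(f"Cost: ${cost_per_million:.2f} per million chars") # ~$20.82 per million chars
```

Normalizing this way makes the per-model comparison independent of corpus size, which is useful when the evaluated languages have different text volumes.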
Lattice | Volume 5 Issue 2