Marru(convert): Structure Data Creation from Unstructured Text for Fine-tuning Large Language Models in Indian Languages

Author(s): Mohamed Azharudeen M, Balaji Dhamodharan

In this paper, we present a methodology for converting unstructured text into a structured question-and-answer format, specifically targeting 11 Indian languages. The scarcity of question-and-answer datasets for these languages poses a significant challenge for fine-tuning Large Language Models (LLMs) for specific tasks. We employed a ternary quantized model based on the LLaMA-2 architecture to achieve this conversion efficiently. Our model, BitNet 1.58, leverages a unique computation paradigm, reducing memory consumption and enhancing computational efficiency. The dataset was created in the Alpaca format, and the model was trained on 47 million data tokens over 3 epochs. Evaluation challenges were addressed using Grice’s Maxims [3] and AI-assisted evaluation techniques. The research demonstrates significant potential for improving data quality and expanding usability across more languages.
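As a rough illustration of the target format mentioned above, the sketch below builds a single Alpaca-style record (the standard `instruction` / `input` / `output` fields) from a question-and-answer pair. The helper name `to_alpaca_record`, the Hindi sample text, and the choice to leave `input` empty are assumptions made for this example; the abstract does not describe the authors' actual conversion code or list the 11 languages.

```python
import json


def to_alpaca_record(question: str, answer: str, context: str = "") -> dict:
    """Build one Alpaca-style record from a question-and-answer pair.

    Hypothetical helper for illustration only, not the authors' pipeline.
    """
    return {
        "instruction": question.strip(),
        "input": context.strip(),  # optional supporting passage; may be empty
        "output": answer.strip(),
    }


# Illustrative Hindi pair: "What is the capital of India?" / "New Delhi."
record = to_alpaca_record(
    question="भारत की राजधानी क्या है?",
    answer="नई दिल्ली।",
)

# ensure_ascii=False keeps Indic scripts readable instead of \u escapes.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Writing one such JSON object per line gives a JSONL file, a common input layout for fine-tuning tools; whether the authors used that layout is not stated here.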

Access this Lattice Journal:

