Fine-Tuning LLMs for Domain-Specific Tasks using Unsloth

Discover how to fine-tune language models using Unsloth with this hands-on guide, designed to help you create efficient, domain-specific AI solutions.

Fine-tuning Large Language Models (LLMs) has always been resource-intensive, requiring significant computational power and expertise. Unsloth revolutionizes this landscape by making model customization 2x faster while using 70% less memory, without compromising accuracy. This hands-on guide explores how organizations can harness Unsloth’s efficient architecture to adapt models like Llama-3 and Mistral for specialized tasks. We’ll implement a practical project that fine-tunes a Llama model specifically for mental health counseling, demonstrating Unsloth’s capabilities.

Table of Contents

  1. Introduction to Unsloth
  2. Practical Implementation
  3. Understanding Key LoRA Settings

Let’s start by understanding what Unsloth is.

Introduction to Unsloth

Unsloth stands at the forefront of LLM fine-tuning optimization, offering groundbreaking efficiency without sacrificing accuracy. Built on OpenAI’s Triton language and featuring a manual backprop engine, it achieves up to 5x faster training speeds in its open-source version and an impressive 30x acceleration with Unsloth Pro. Compatible with modern NVIDIA GPUs and supporting both Linux and Windows (via WSL), Unsloth enables 4-bit and 16-bit QLoRA/LoRA fine-tuning through bitsandbytes. This powerful tool maintains 100% accuracy while dramatically reducing computational overhead, making advanced model customization accessible to a broader range of developers.

Practical Implementation

Step 1 : Install Unsloth and Update the Library

This step installs and updates the unsloth library to the latest version for compatibility.
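A minimal way to do this in a pip-based environment (on Colab or Jupyter, prefix each command with `!`):

```shell
# Install Unsloth, then upgrade to the latest release
pip install unsloth
pip install --upgrade --no-cache-dir unsloth
```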

Step 2 : Import Required Libraries and Load the Model

The model is loaded with the FastLanguageModel class from unsloth. The model is chosen in its 4-bit quantized form for efficiency, reducing memory usage and computation.
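A sketch of this step, assuming the 4-bit Llama-3.1 8B Instruct checkpoint from Unsloth's Hugging Face namespace (the model name and sequence length here are illustrative choices, and a CUDA GPU is required):

```python
from unsloth import FastLanguageModel

max_seq_length = 2048  # context length used for training

# Load a 4-bit quantized checkpoint; dtype=None lets Unsloth pick
# bfloat16 or float16 based on the GPU's capabilities.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=None,
    load_in_4bit=True,
)
```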

Step 3 : Apply Parameter-Efficient Fine-Tuning (PEFT)

LoRA (Low-Rank Adaptation) is applied to the model layers, enabling efficient fine-tuning with fewer parameters, reducing computational costs.
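This step might look as follows; the rank, alpha, and target modules below are common defaults from Unsloth's examples, not requirements:

```python
from unsloth import FastLanguageModel

# Wrap the base model with LoRA adapters: only these low-rank
# matrices are trained, while the 4-bit base weights stay frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # rank of the LoRA decomposition
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,                         # scaling factor
    lora_dropout=0,                        # 0 is the optimized fast path
    bias="none",
    use_gradient_checkpointing="unsloth",  # reduces memory for long contexts
    random_state=3407,
)
```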

Step 4 : Prepare Chat Template for Tokenizer

The tokenizer is configured with a specific chat template (“llama-3.1”), ensuring the conversations are formatted appropriately for model input.
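Using Unsloth's chat-template helper, this is a one-liner:

```python
from unsloth.chat_templates import get_chat_template

# Attach the Llama-3.1 chat template so conversations are rendered
# with the header/eot special tokens the model expects.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
```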

Step 5 : Load and Standardize Dataset

We will use the Hugging Face dataset mental_health_counseling_conversations_sharegpt in this guide.

Ensure that the dataset is loaded (e.g., using the load_dataset function). Here, we load a ShareGPT-format dataset as an example and standardize it for training. You can substitute your own dataset if needed.
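A sketch of loading and standardizing, assuming the dataset name from this guide (the exact Hugging Face Hub path may need an `owner/` namespace prefix):

```python
from datasets import load_dataset
from unsloth.chat_templates import standardize_sharegpt

# Load the counseling conversations dataset from the Hub.
dataset = load_dataset(
    "mental_health_counseling_conversations_sharegpt", split="train"
)

# Convert ShareGPT-style {"from": ..., "value": ...} turns into the
# {"role": ..., "content": ...} format the chat template expects.
dataset = standardize_sharegpt(dataset)
```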

Step 6 : Format the Dataset for Training

This function formats the dataset for training. Here, formatted_ids are tokenized inputs, and labels are defined as tokenized outputs (you can customize how the labels are created based on your task).
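An equivalent, commonly used pattern with Unsloth renders each conversation into a `text` column via the chat template and lets the trainer handle tokenization (a sketch, assuming the `tokenizer` and `dataset` from the previous steps):

```python
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # Render each conversation into a single training string.
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)
```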

Step 7 : Set Up the Trainer for Fine-Tuning

SFTTrainer is used to fine-tune the model with the dataset. This example includes typical training arguments such as batch size, number of epochs, and logging strategy. The num_train_epochs and evaluation_strategy can be adjusted depending on your dataset and model.
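A representative configuration, with hyperparameters taken from typical Unsloth examples (adjust batch size and epochs to your hardware and dataset):

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        num_train_epochs=1,             # increase for larger runs
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",             # memory-efficient optimizer
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)
```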

Step 8 : Train on Responses Only (Optional)

If you want to focus the fine-tuning on the model’s responses only, this step refines the training process by emphasizing user-model interactions.
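With Unsloth this is done by masking everything except the assistant turns, so the loss is computed only on the model's responses; the marker strings below are the Llama-3.1 header tokens:

```python
from unsloth.chat_templates import train_on_responses_only

trainer = train_on_responses_only(
    trainer,
    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
)
```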

Step 9 : Inspect Tokenized Input

This step decodes and prints the tokenized input for a specific example from the dataset, which can be helpful for debugging and verifying tokenization.

This decodes the tokenized labels and handles special tokens (e.g., padding) in the labels.
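Both checks might look like this (the example index 5 is arbitrary):

```python
# Decode the full tokenized input of one training example.
print(tokenizer.decode(trainer.train_dataset[5]["input_ids"]))

# Labels use -100 for masked (non-response) positions; substitute a
# space token so the remaining response-only text can be decoded.
space = tokenizer(" ", add_special_tokens=False).input_ids[0]
print(tokenizer.decode(
    [space if tok == -100 else tok
     for tok in trainer.train_dataset[5]["labels"]]
))
```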

Step 10 : Train the Model

This command triggers the actual training process using the specified arguments and dataset. The trainer_stats will contain metrics about the training progress.
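The training call itself is a single line:

```python
# Run fine-tuning; returns loss curves and runtime statistics.
trainer_stats = trainer.train()
print(trainer_stats.metrics)  # e.g. train_runtime, train_loss
```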

Step 11 : Prepare for Inference

This step prepares the model for inference, enabling optimizations for faster response generation. It’s important for reducing the latency of generating responses after fine-tuning.
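In Unsloth this is a single helper call:

```python
from unsloth import FastLanguageModel

# Switch on Unsloth's native inference path (claimed ~2x faster).
FastLanguageModel.for_inference(model)
```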

Step 12 : Generate Responses with the Fine-Tuned Model

Here, a user query is input into the model, and the model generates a response. You can modify the query to test different inputs.
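A sketch of generation with the fine-tuned model; the query and sampling settings are illustrative:

```python
messages = [
    {"role": "user",
     "content": "I am not feeling well, I am experiencing extreme anxiety. What should I do?"},
]

# Render the conversation and append the assistant header so the
# model continues as the assistant.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    use_cache=True,
    temperature=0.7,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```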

Step 13 : Save the Fine-Tuned Model

The fine-tuned model and tokenizer are saved to disk for future use, allowing you to reload them for inference later.
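Saving (and later reloading) the LoRA adapter might look like this, with `"lora_model"` as an example output directory:

```python
# Saves the LoRA adapter weights plus the tokenizer files.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")

# To reload later:
# model, tokenizer = FastLanguageModel.from_pretrained(
#     "lora_model", load_in_4bit=True
# )
```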

Step 14 : Testing the Fine-Tuned Model

Input : I am not feeling well, I am experiencing extreme anxiety what to do

Result:

Understanding Key LoRA Settings

LoRA (Low-Rank Adaptation) parameters are crucial for optimizing model fine-tuning in Unsloth. These key settings control everything from training efficiency to model performance, helping you achieve the perfect balance between computational resources and output quality.

| Parameter | Default | Purpose | Impact |
| --- | --- | --- | --- |
| r | — | Rank of the LoRA decomposition | Higher = better quality, more compute |
| lora_alpha | — | Scaling factor | Higher = faster convergence, risk of overfitting |
| lora_dropout | — | Regularization | Higher = prevents overfitting, slower training |
| learning_rate | 2e-4 | Update speed | Higher = faster learning, risk of instability |
| weight_decay | 0.01 | Weight penalty | Higher = reduces overfitting |
| grad_accumulation | 1 | Batch processing | Higher = more stability, less memory |

Final Words

In this guide, we’ve explored how to fine-tune a language model using Unsloth, demonstrating the power of efficient training techniques for domain-specific applications. Whether you’re working on mental health counseling or any other field, this hands-on approach provides the insights needed to optimize language models. By following these steps, you can create highly specialized, performant models tailored to real-world tasks and improve your AI’s practical utility.

References

  1. Unsloth’s Github Repository
  2. Unsloth’s Official Documentation
Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.