Fine-tuning Large Language Models (LLMs) has always been resource-intensive, requiring significant computational power and expertise. Unsloth revolutionizes this landscape by making model customization 2x faster while using 70% less memory, without compromising accuracy. This hands-on guide explores how organizations can harness Unsloth’s efficient architecture to adapt models like Llama-3 and Mistral for specialized tasks. We’ll implement a practical project that fine-tunes a Llama model specifically for mental health counseling, demonstrating Unsloth’s capabilities.
Table of Contents
- Introduction to Unsloth
- Practical Implementation
- Understanding Key LoRA Settings
Let’s start by understanding what Unsloth is.
Introduction to Unsloth
Unsloth is an optimization library for LLM fine-tuning. Its kernels are written in OpenAI’s Triton language, and it uses a manually derived backpropagation engine. The project reports training speedups of up to 5x for the open-source version and up to 30x for Unsloth Pro. Because it relies on no approximation methods, the project states that there is no loss in accuracy compared with standard fine-tuning. Unsloth runs on modern NVIDIA GPUs under Linux and Windows (via WSL) and supports 4-bit and 16-bit QLoRA/LoRA fine-tuning through bitsandbytes. The reduced memory and compute requirements make model customization accessible to a broader range of developers.
Practical Implementation
Step 1 : Install Unsloth and Update the Library
%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
This step installs and updates the unsloth library to the latest version for compatibility.
Step 2 : Import Required Libraries and Load the Model
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None
load_in_4bit = True
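fourbit_models = [
    "unsloth/Llama-3.2-1B-bnb-4bit",
]
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = fourbit_models[0],
    max_seq_length = max_seq_length,
    dtype = dtype,  # None lets Unsloth pick the dtype automatically
    load_in_4bit = load_in_4bit,
    token = "Enter your Token",  # Hugging Face access token
)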
The model is loaded with the FastLanguageModel class from unsloth. The model is chosen in its 4-bit quantized form for efficiency, reducing memory usage and computation.
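To get a rough feel for why the 4-bit form matters, the weight memory can be estimated from the parameter count and the bits stored per parameter. The sketch below is back-of-the-envelope arithmetic only: the helper name weight_memory_gib and the parameter count of roughly 1.2 billion are assumptions for illustration, and real 4-bit checkpoints carry some extra overhead (quantization constants, layers kept in higher precision) that is ignored here.

```python
def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumption: a model with roughly 1.2 billion parameters.
N_PARAMS = 1_200_000_000

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
print(f"reduction factor: {fp16_gib / int4_gib:.1f}x")
```

The estimate covers weights only; activations, gradients for the adapter, and optimizer state add to the total during training.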
Step 3 : Apply Parameter-Efficient Fine-Tuning (PEFT)
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
LoRA (Low-Rank Adaptation) is applied to the model layers, enabling efficient fine-tuning with fewer parameters, reducing computational costs.
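The saving can be made concrete with a little arithmetic. For a weight matrix of shape d_out x d_in, a LoRA adapter of rank r adds two small matrices, A (r x d_in) and B (d_out x r), and with standard LoRA (use_rslora = False) the update is scaled by lora_alpha / r. The sketch below uses a hypothetical 2048 x 2048 projection; the dimensions and helper names are assumptions for illustration, not values read from the model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


def lora_scaling(lora_alpha: float, r: int) -> float:
    """Standard LoRA scales the low-rank update by lora_alpha / r."""
    return lora_alpha / r


# Hypothetical square projection, used only for illustration.
D_IN = D_OUT = 2048
R = 16
LORA_ALPHA = 32

full = D_IN * D_OUT
added = lora_param_count(D_IN, D_OUT, R)

print(f"full weight:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({100 * added / full:.2f}% of full)")
print(f"update scaling: {lora_scaling(LORA_ALPHA, R)}")
```

Only the adapter parameters receive gradients; the original weight stays frozen, which is where most of the memory saving during training comes from.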
Step 4 : Prepare Chat Template for Tokenizer
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
The tokenizer is configured with a specific chat template (“llama-3.1”), ensuring the conversations are formatted appropriately for model input.
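To see roughly what such a template produces, here is a simplified stand-in written in plain Python. It is an approximation of the Llama 3 chat layout for illustration only: render_llama3_chat is a hypothetical helper, and the template that ships with the tokenizer may add more (for example a system header), so its output will not match this sketch exactly.

```python
def render_llama3_chat(messages, add_generation_prompt=False):
    """Simplified approximation of the Llama 3 chat layout (illustration only)."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model it is its turn to speak.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


example = [
    {"role": "user", "content": "I feel anxious."},
    {"role": "assistant", "content": "I am sorry to hear that."},
]
print(render_llama3_chat(example))
```

The role headers are what later steps rely on to tell user turns and assistant turns apart.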
Step 5 : Load and Standardize Dataset
We will use the Hugging Face dataset mental_health_counseling_conversations_sharegpt in this guide.
from unsloth.chat_templates import standardize_sharegpt
from datasets import load_dataset
dataset = load_dataset("Sulav/mental_health_counseling_conversations_sharegpt", split = "train")
dataset = standardize_sharegpt(dataset)  # Convert ShareGPT turns to role/content format
load_dataset downloads the training split from the Hugging Face Hub. The dataset is stored in ShareGPT format, and standardize_sharegpt converts it to the role/content layout that the chat template expects. You can substitute your own dataset, as long as it follows the same conversation structure.
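Conceptually, standardizing a ShareGPT conversation means renaming its keys and roles. The sketch below illustrates the idea in plain Python; it is not Unsloth's implementation, and it assumes the common ShareGPT convention of "from"/"value" keys with "human" and "gpt" speakers. The names ROLE_MAP and to_role_content are hypothetical.

```python
# Assumed ShareGPT speaker names mapped to chat-template roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def to_role_content(conversation):
    """Convert one ShareGPT-style conversation to role/content messages."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]


sharegpt_convo = [
    {"from": "human", "value": "I cannot sleep at night."},
    {"from": "gpt", "value": "That sounds exhausting. When did it start?"},
]
print(to_role_content(sharegpt_convo))
```

If your own dataset already uses role/content messages, this conversion step is unnecessary.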
Step 6 : Format the Dataset for Training
def formatting_prompts_func(examples):
    # With batched=True, `examples` holds a list of conversations
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)
This function renders each conversation into a single string with the chat template and stores the result in a new text column. No tokenization happens here (tokenize = False); the trainer tokenizes the text field later.
Step 7 : Set Up the Trainer for Fine-Tuning
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none",
    ),
)
SFTTrainer fine-tunes the model on the formatted dataset. The arguments above set the batch size (2 per device, with 4 gradient accumulation steps), the learning rate and its schedule, and per-step logging. Training is capped at max_steps = 60, which keeps this demonstration short; for a full run, replace max_steps with num_train_epochs and tune it for your dataset and model.
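Two of these settings are easier to reason about with numbers. The sketch below computes the effective batch size and the learning rate at a given step under a linear schedule with warmup, using the values from this guide. The helper names are hypothetical, a single GPU is assumed, and the schedule formula is the usual linear warmup followed by linear decay to zero.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Sequences contributing to each optimizer step."""
    return per_device * grad_accum * num_devices


def linear_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(per_device=2, grad_accum=4)
print(f"effective batch size: {batch}")
print(f"sequences seen in 60 steps: {batch * 60}")

for step in (0, 5, 30, 60):
    print(f"step {step:2d}: lr = {linear_lr(step, 2e-4, 5, 60):.2e}")
```

With these settings the run touches at most a few hundred sequences, which is why 60 steps should be treated as a smoke test and not as a full training run.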
Step 8 : Train on Responses Only (Optional)
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)
This optional step restricts the loss to the assistant’s responses. train_on_responses_only uses the two header strings to locate user and assistant turns, then masks the user-turn tokens so that they do not contribute to the loss.
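The masking idea can be sketched on plain token-ID lists. This is an illustration of the technique under simplifying assumptions, not Unsloth's implementation: the marker sequences and token IDs are made up, and mask_to_responses is a hypothetical helper. Positions labeled -100 are the ones a standard cross-entropy loss ignores.

```python
IGNORE_INDEX = -100  # label value ignored by the loss


def starts_with_marker(ids, marker, start):
    """True if `marker` occurs in `ids` beginning at position `start`."""
    return ids[start:start + len(marker)] == marker


def mask_to_responses(input_ids, instruction_marker, response_marker):
    """Keep labels only for tokens that follow a response marker."""
    labels = [IGNORE_INDEX] * len(input_ids)
    in_response = False
    i = 0
    while i < len(input_ids):
        if starts_with_marker(input_ids, response_marker, i):
            in_response = True
            i += len(response_marker)  # the marker itself stays masked
            continue
        if starts_with_marker(input_ids, instruction_marker, i):
            in_response = False
            i += len(instruction_marker)
            continue
        if in_response:
            labels[i] = input_ids[i]
        i += 1
    return labels


# Toy IDs: [1, 2] marks a user turn, [1, 3] marks an assistant turn.
toy_ids = [0, 1, 2, 10, 11, 9, 1, 3, 20, 21, 9]
print(mask_to_responses(toy_ids, [1, 2], [1, 3]))
```

In the toy sequence only the three tokens after the assistant marker keep their labels; everything else is masked.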
Step 9 : Inspect Tokenized Input
tokenizer.decode(trainer.train_dataset[5]["input_ids"])
This step decodes and prints the tokenized input for a specific example from the dataset, which can be helpful for debugging and verifying tokenization.
space = tokenizer(" ", add_special_tokens=False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])
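Here, label positions set to -100 (the value the loss function ignores) are replaced with a space token before decoding, so the output shows only the text the model is actually trained on. If Step 8 was applied, that should be the assistant’s responses alone.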
Step 10 : Train the Model
trainer_stats = trainer.train()
This command triggers the actual training process using the specified arguments and dataset. The trainer_stats will contain metrics about the training progress.
Step 11 : Prepare for Inference
from unsloth.chat_templates import get_chat_template
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
This step prepares the model for inference, enabling optimizations for faster response generation. It’s important for reducing the latency of generating responses after fine-tuning.
Step 12 : Generate Responses with the Fine-Tuned Model
messages = [
    {"role": "user", "content": "I am not feeling well, I am experiencing extreme anxiety, what to do?"}
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")
outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)
Here, a user query is input into the model, and the model generates a response. You can modify the query to test different inputs.
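The two sampling parameters in the generate call, temperature and min_p, shape which tokens can be picked. The sketch below shows the general idea in plain Python: temperature rescales the logits before the softmax, and min-p sampling drops every token whose probability falls below min_p times the probability of the most likely token. The helper names and the example numbers are assumptions for illustration; the library's internal implementation may differ in detail.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Higher temperature flattens the distribution; lower sharpens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Made-up next-token probabilities for four candidate tokens.
probs = [0.7, 0.2, 0.08, 0.02]
filtered = min_p_filter(probs, min_p=0.1)
print(filtered)

logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 1.5))
```

A fairly high temperature combined with a min_p cutoff allows varied wording while still excluding very unlikely tokens.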
Step 13 : Save the Fine-Tuned Model
model.save_pretrained("lora_model") # Save the LoRA adapter weights locally
tokenizer.save_pretrained("lora_model") # Save the tokenizer for later use
The LoRA adapter weights and the tokenizer are saved to the lora_model directory, so you can reload them for inference later. Note that this saves only the adapters, not a merged copy of the full model.
Step 14 : Test the Fine-Tuned Model
if False: # Set to True to reload the saved adapters in a fresh session
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # Directory saved in Step 13
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)
messages = [
    {"role": "user", "content": "I am not feeling well, I am experiencing extreme anxiety what to do"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True) # Stream tokens as they are generated
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 150, use_cache = True, temperature = 1.2, min_p = 0.1)
Input : I am not feeling well, I am experiencing extreme anxiety what to do
Result:
I am sorry to hear that you are not feeling well. Anxiety is a normal response to stress and can be a sign of a deeper issue. It is important to take care of yourself and seek professional help if you are not feeling well. You can start by practicing relaxation techniques such as deep breathing, meditation, or yoga. You can also try talking to a friend or family member about your feelings. If you are still not feeling well after trying these things, you may want to consider seeing a mental health professional. They can help you understand the root of your anxiety and provide you with the tools to manage it effectively. They can also provide you with coping strategies and support to help you through the process. I hope this helps.
Understanding Key LoRA Settings
LoRA (Low-Rank Adaptation) parameters are crucial for optimizing model fine-tuning in Unsloth. These key settings control everything from training efficiency to model performance, helping you achieve the perfect balance between computational resources and output quality.
Parameter | Value in this guide | Purpose | Impact |
r | 16 | Rank of the low-rank update matrices | Higher = more capacity, more memory and compute |
lora_alpha | 32 | Scaling factor for the LoRA update | Higher = stronger updates, risk of overfitting |
lora_dropout | 0.05 | Regularization | Higher = less overfitting, slower training |
learning_rate | 2e-4 | Update speed | Higher = faster learning, risk of instability |
weight_decay | 0.01 | Weight penalty | Higher = less overfitting |
gradient_accumulation_steps | 4 | Batches accumulated per optimizer step | Higher = larger effective batch without extra memory |
Final Words
In this guide, we’ve explored how to fine-tune a language model using Unsloth, demonstrating the power of efficient training techniques for domain-specific applications. Whether you’re working on mental health counseling or any other field, this hands-on approach provides the insights needed to optimize language models. By following these steps, you can create highly specialized, performant models tailored to real-world tasks and improve your AI’s practical utility.