In the rapidly changing field of **artificial intelligence**, fast **model inference** is essential. Large Language Models (LLMs) are powerful tools for text generation, question answering, and other tasks. But because of their size (billions of parameters), inference is often slow, which makes optimization necessary. This practical guide explores quantization, a technique that can increase LLM inference speed without significantly compromising accuracy.

## Table of Contents

- Understanding LLM Inference Optimization
- Understanding Quantization
- Step-by-Step Guide to Implementing Quantization
- Measuring Performance Improvements
- Common Challenges and Solutions
- Alternative Approaches for LLM Inference Optimization

Let’s start by understanding LLM inference optimization.

## Understanding LLM Inference Optimization

**LLM inference** is the process of running a trained LLM on an input, such as a prompt, to produce an output. How quickly and accurately a model does this largely determines how useful it is in practice. LLMs are often far too big to run on consumer hardware: because they have billions of parameters, they typically require **GPUs** with a lot of VRAM to run inference at an acceptable speed.

As a result, a growing body of research aims to reduce the size and computational cost of these models. The methods used to achieve this are collectively called LLM inference optimization. **Quantization** is a key method in this area and is the focus of this blog.

## Understanding Quantization

**Quantization** is the process of reducing the precision of the numbers used to represent model parameters. It maps a wide range of values onto a smaller, more manageable set, usually by converting 32-bit floating-point numbers to lower-precision formats (e.g., 16-bit floats or 8-bit integers).

This reduction in precision produces smaller models, faster computation (lower-precision arithmetic is generally faster on hardware that supports it, which reduces inference latency), and lower power consumption. Together, these make it possible to deploy LLMs in resource-constrained environments.
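
To make the size argument concrete, here is a minimal sketch that estimates the memory needed for the weights alone at different precisions. The 7-billion-parameter count is a hypothetical example, not a model used in this blog, and the estimate ignores activations and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

num_params = 7_000_000_000  # hypothetical 7-billion-parameter model
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB")
# float32: 28.0 GB
# float16: 14.0 GB
# int8: 7.0 GB
```

Halving the number of bits per parameter halves the weight storage, which is why 8-bit models can fit on hardware that a 32-bit version cannot.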

Fig.1 Example Matrix

As a simplified illustration, a random matrix of weights with three-decimal precision, like [[0.123, 0.567], [0.987, 0.543]], can be quantized by rounding each value to one-decimal precision, producing a simpler matrix like [[0.1, 0.6], [1.0, 0.5]]. A comparable example is shown in Fig. 1. In practice, quantization maps weights to low-bit numbers rather than rounding decimals, but the idea is the same: it speeds up inference and lowers the model’s memory footprint, which is especially advantageous for LLMs with millions or billions of parameters.
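
The example matrix can be worked through in a few lines of plain Python. The sketch below first reproduces the rounding illustration and then applies symmetric 8-bit quantization, which is one common scheme and not necessarily the one a given framework uses. The helper names `quantize_int8` and `dequantize` are made up for this illustration.

```python
weights = [[0.123, 0.567], [0.987, 0.543]]

# 1) The rounding illustration from the text: keep one decimal place.
rounded = [[round(w, 1) for w in row] for row in weights]
print(rounded)  # [[0.1, 0.6], [1.0, 0.5]]

# 2) Symmetric int8 quantization: map [-max|w|, +max|w|] onto [-127, 127].
def quantize_int8(matrix):
    max_abs = max(abs(w) for row in matrix for w in row)
    scale = max_abs / 127  # size of one integer step in float units
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in matrix]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [[v * scale for v in row] for row in q]

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
print(q_weights)  # [[16, 73], [127, 70]]
print(restored)   # close to the original weights, off by at most scale / 2
```

Only the integers and a single `scale` value need to be stored, so each weight takes one byte instead of four.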

### Types of Quantization

#### 1. Post-Training Quantization

In Post-Training Quantization (PTQ), a pre-trained model is quantized **without retraining**. This method is quick and easy to implement but may lead to a slight drop in model accuracy. PTQ is ideal when rapid deployment is essential and a slight accuracy loss is acceptable.

#### 2. Quantization-Aware Training

Quantization-Aware Training (QAT) trains the model with quantization in mind. By learning to handle the reduced precision during training, the model typically retains more accuracy than with PTQ. Although QAT creates highly efficient models for inference, it demands greater processing power during training.
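
A common way to train "with quantization in mind" is fake quantization: during the forward pass, weights are rounded to the integer grid and immediately converted back to floats, so the model sees the rounding error while it learns. The sketch below shows only that forward step in plain Python. The function names are hypothetical, and a real training framework would also handle gradients (typically by treating the rounding as an identity in the backward pass).

```python
def fake_quantize(w: float, scale: float, qmin: int = -127, qmax: int = 127) -> float:
    """Quantize to the integer grid and immediately dequantize back to float."""
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

def linear_forward(weights, bias, x, scale):
    """Forward pass of one neuron that sees quantized weights during training."""
    return sum(fake_quantize(w, scale) * xi for w, xi in zip(weights, x)) + bias

weights = [0.123, -0.567, 0.987]
x = [1.0, 2.0, 3.0]
scale = max(abs(w) for w in weights) / 127  # symmetric int8 step size

y_float = sum(w * xi for w, xi in zip(weights, x))
y_fake_quant = linear_forward(weights, 0.0, x, scale)
print(f"float output: {y_float:.4f}, fake-quantized output: {y_fake_quant:.4f}")
```

Because the loss is computed on the fake-quantized output, training can adjust the weights so that the rounding hurts less after the final conversion to integers.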

#### 3. Dynamic Quantization

Dynamic quantization converts the weights to lower precision ahead of time and quantizes the activations on the fly during inference, leaving training untouched. Because it needs no retraining or calibration data, it offers moderate efficiency improvements with very little effort.
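
The following plain-Python sketch illustrates the idea under simple assumptions (symmetric int8, a single output neuron). The weights are quantized once, while the activation scale is computed from each input at inference time. The helper names are hypothetical and this is not how any particular library implements it internally.

```python
def symmetric_scale(values, qmax=127):
    """Step size that maps the largest magnitude in `values` to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(values, scale, qmax=127):
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

# Ahead of time: weights are quantized once and stored as integers.
weights = [0.123, -0.567, 0.987, 0.543]
w_scale = symmetric_scale(weights)
w_int8 = quantize(weights, w_scale)

def dynamic_linear(x):
    # At inference time: the activation scale depends on this particular input.
    x_scale = symmetric_scale(x)
    x_int8 = quantize(x, x_scale)
    acc = sum(wi * xi for wi, xi in zip(w_int8, x_int8))  # integer accumulation
    return acc * w_scale * x_scale  # rescale the result back to float

x = [1.5, -0.25, 0.75, 2.0]
exact = sum(w * xi for w, xi in zip(weights, x))
approx = dynamic_linear(x)
print(f"float result: {exact:.4f}, dynamically quantized result: {approx:.4f}")
```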

### Benefits and Trade-offs

Quantization can result in faster processing and less memory use. It may, however, also lead to a minor loss of accuracy, which requires careful handling.

## Step-by-Step Guide to Implementing Quantization

### TensorFlow Model

In this section, we quantize the MobileNetV2 model and compare its size before and after quantization. MobileNetV2 is a popular deep learning architecture for tasks like image classification, and it is well known for being computationally efficient. Quantization further reduces the model size and improves inference speed. The size was 8.45 MB before quantization and around 2.39 MB after, a significant reduction.

**Step 1: Import Dependencies**

```
import os # For file operations
import tensorflow as tf # For working with the MobileNet model and TF Lite conversion
```

**Step 2: Load the MobileNetV2 Model**

`model = tf.keras.applications.MobileNetV2(weights='imagenet', input_shape=(96, 96, 3), include_top=False)`

We load a pre-trained MobileNetV2 model with ImageNet weights, specifying an input shape of 96x96x3 and excluding the top layers.

**Step 3: Convert to Non-Quantized TFLite Model**

```
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
tflite_model_path = '/content/mobilenetv2_float.tflite'
with open(tflite_model_path, 'wb') as f:
    f.write(tflite_model)
```

We convert the Keras model to a TensorFlow Lite model without quantization and save it to a file.

**Step 4: Convert to Quantized TFLite Model**

```
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
quantized_model_path = '/content/mobilenetv2_quantized.tflite'
with open(quantized_model_path, 'wb') as f:
    f.write(quantized_tflite_model)
```

We create another converter, this time applying the default optimizations (which include quantization), convert the model, and save it to a file.

**Step 5: Define Utility Function for Model Size and Print Results**

```
def get_model_size(file_path):
    return os.path.getsize(file_path) / (1024 * 1024)  # Size in MB

print(f"Non-quantized model size: {get_model_size(tflite_model_path):.2f} MB")
print(f"Quantized model size: {get_model_size(quantized_model_path):.2f} MB")
```

**Output:-**

Fig.2 Model Size Reduction using Tensorflow

To use 16-bit quantization (float16 weights) instead, keep `converter.optimizations = [tf.lite.Optimize.DEFAULT]` and additionally set `converter.target_spec.supported_types = [tf.float16]` before calling `converter.convert()`.
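
As a sketch of the float16 variant, the helper below wraps the conversion in a function. It keeps the default optimization flag and adds `tf.float16` to the converter's supported types, which is how the TensorFlow Lite converter is configured for float16 weights. The function name and the output path in the usage comment are assumptions made for illustration.

```python
def convert_to_float16_tflite(keras_model):
    """Convert a Keras model to a TFLite flatbuffer with float16 weights."""
    # Imported here so the helper can be defined before TensorFlow is loaded.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Restrict the target types to float16 so weights are stored in half precision.
    converter.target_spec.supported_types = [tf.float16]
    return converter.convert()

# Usage, reusing `model` from Step 2 (the output path is an assumption):
# with open('/content/mobilenetv2_float16.tflite', 'wb') as f:
#     f.write(convert_to_float16_tflite(model))
```

Float16 roughly halves the size of a float32 model, so the saving is expected to be smaller than with 8-bit quantization.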

### PyTorch Implementation

In this section, we quantize the BERT model and compare its size, speed, and throughput before and after quantization. BERT is a popular open-source language model for natural language processing (NLP). Here, we use a simple text summarization task to compare the quantized and non-quantized models.

The same sample text is given to both models, and their outputs are compared on several factors. The model size was 417.72 MB before quantization and around 173.08 MB after, a significant reduction. Inference with the quantized model was also around 27% faster than with the non-quantized model.

**Step 1: Import Dependencies**

```
import torch # For PyTorch operations and quantization
from transformers import BertModel, BertTokenizer # To work with BERT models
import time # For measuring inference time
import os # For file operations
import numpy as np # For numerical operations
from sklearn.metrics.pairwise import cosine_similarity # For calculating cosine similarity
```

We import necessary libraries for working with BERT, measuring time and file sizes, and performing numerical operations.

**Step 2: Define Utility Functions**

Model Size Calculation:-

```
def get_model_size(file_path):
    return os.path.getsize(file_path) / (1024 * 1024)  # Size in MB

# Inference time measurement
def measure_inference_time(model, inputs, num_iterations=50):
    model.eval()  # Set to evaluation mode
    total_time = 0.0
    with torch.no_grad():  # No gradients are needed for inference
        for _ in range(num_iterations):
            start_time = time.perf_counter()  # Monotonic, high-resolution timer
            _ = model(**inputs)
            total_time += time.perf_counter() - start_time
    return total_time / num_iterations  # Average seconds per forward pass
```

This function measures the average inference time of a model over multiple iterations.

Simple Summarization:-

```
def simple_summarize(model, tokenizer, text):
    # Split the text into sentences; whitespace is normalized first so that
    # sentences separated by newlines are split as well
    sentences = [s.strip() for s in ' '.join(text.split()).split('. ') if s.strip()]
    sentence_embeddings = []
    # Get embeddings for each sentence
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True, max_length=128)
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into one vector per sentence
        sentence_embeddings.append(outputs.last_hidden_state.mean(dim=1).squeeze().numpy())
    # Calculate the mean embedding
    mean_embedding = np.mean(sentence_embeddings, axis=0)
    # Find the sentence with the highest cosine similarity to the mean embedding
    similarities = [cosine_similarity(mean_embedding.reshape(1, -1), embedding.reshape(1, -1))[0][0] for embedding in sentence_embeddings]
    most_representative_sentence = sentences[np.argmax(similarities)]
    return most_representative_sentence
```

This function creates a simple summary of a text by finding the sentence most similar to the mean embedding of all sentences.

**Step 3: Load Tokenizer and Prepare Input**

```
# Load tokenizer and prepare input
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = """
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.
It is named after the engineer Gustave Eiffel, whose company designed and built the tower.
Constructed from 1887 to 1889 as the entrance arch to the 1889 World’s Fair, it was initially criticized by some of France’s leading artists and intellectuals for its design, but it has become a global cultural icon of France and one of the most recognizable structures in the world.
The Eiffel Tower is the most-visited paid monument in the world; 6.91 million people ascended it in 2015.
The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris.
"""
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
```

We load the BERT tokenizer and prepare our input text for processing.

**Step 4: Load and Benchmark Original Model**

```
print("Loading and benchmarking original model...")
model = BertModel.from_pretrained('bert-base-uncased')
torch.save(model.state_dict(), "bert_original.pth")
original_time = measure_inference_time(model, inputs)
original_summary = simple_summarize(model, tokenizer, text)
```

We load the original BERT model, save it, measure its inference time, and generate a summary.

**Step 5: Quantize and Benchmark Quantized Model**

```
print("Quantizing and benchmarking quantized model...")
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # Quantize only the linear layers
    dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "bert_quantized.pth")
quantized_time = measure_inference_time(quantized_model, inputs)
quantized_summary = simple_summarize(quantized_model, tokenizer, text)
```

We apply dynamic quantization to the model's `torch.nn.Linear` layers, converting their weights to 8-bit integers (`qint8`). We then save the quantized model, measure its inference time, and generate a summary.

**Step 6: Compare and Print Results**

```
print("\nResults:")
print(f"Original model size: {get_model_size('bert_original.pth'):.2f} MB")
print(f"Quantized model size: {get_model_size('bert_quantized.pth'):.2f} MB")
print(f"Original model inference time: {original_time*1000:.2f} ms")
print(f"Quantized model inference time: {quantized_time*1000:.2f} ms")
print(f"Speed improvement: {((original_time - quantized_time) / original_time * 100):.2f}%")
print(f"Size reduction: {((get_model_size('bert_original.pth') - get_model_size('bert_quantized.pth')) / get_model_size('bert_original.pth') * 100):.2f}%")
print("\nSummarization Results:")
print(f"Original model summary: {original_summary}")
print(f"Quantized model summary: {quantized_summary}")
if original_summary == quantized_summary:
    print("Both models produced the same summary.")
else:
    print("The summaries differ between the original and quantized models.")
```

**Output:-**

Fig.3. Comparison between different parameters of Quantized and Non-Quantized Model

Fig.4 Summary Output of Quantized and Original Model

## Measuring Performance Improvements

After quantizing a model, it’s crucial to track key metrics like **latency, throughput, and accuracy** to assess its performance. Quantization typically reduces model size and speeds up inference, but it can also impact accuracy. To gauge these effects, measure latency (inference time), throughput (samples processed per second), and compare the quantized model’s accuracy against the original. Running benchmarks for both models on the same dataset allows you to quantify improvements in performance while ensuring accuracy degradation remains within acceptable limits. This process helps ensure the model is both efficient and reliable post-quantization.
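
A minimal, framework-independent sketch of such a benchmark is shown below. The `benchmark` helper and the stand-in workload are hypothetical. In practice, `fn` would wrap a forward pass of the original or the quantized model on the same inputs.

```python
import time

def benchmark(fn, num_iterations=50, warmup=5):
    """Return (average latency in seconds, throughput in calls per second)."""
    for _ in range(warmup):  # Warm-up runs are excluded from the timing
        fn()
    start = time.perf_counter()
    for _ in range(num_iterations):
        fn()
    elapsed = time.perf_counter() - start
    latency = elapsed / num_iterations
    throughput = num_iterations / elapsed
    return latency, throughput

# Stand-in workload used only to make the sketch runnable.
latency, throughput = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"latency: {latency * 1000:.3f} ms, throughput: {throughput:.1f} calls/s")
```

Running the same helper on both models, on the same machine and inputs, keeps the comparison fair.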

For the dynamically quantized BERT model in the PyTorch example, latency improved by 27.42% and throughput by 37.78%. Both the quantized and the original models produced the same text summary, which suggests little change in accuracy on this sample.
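
The reported percentages are related by simple arithmetic, which the sketch below reproduces from the numbers quoted in this blog. It assumes requests are processed one at a time, so throughput is the inverse of latency. The function names are made up for this illustration.

```python
def size_reduction_pct(original_mb: float, quantized_mb: float) -> float:
    """Percentage by which the saved model shrank."""
    return (original_mb - quantized_mb) / original_mb * 100

def throughput_gain_pct(latency_reduction_pct: float) -> float:
    """Throughput gain implied by a latency reduction, one request at a time."""
    remaining = 1 - latency_reduction_pct / 100  # Fraction of the old latency left
    return (1 / remaining - 1) * 100

print(f"Size reduction: {size_reduction_pct(417.72, 173.08):.2f}%")  # 58.57%
print(f"Throughput gain: {throughput_gain_pct(27.42):.2f}%")         # 37.78%
```

This is why a 27.42% drop in latency shows up as a larger 37.78% gain in throughput.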

Fig.5. Latency and Throughput improvement

## Common Challenges and Solutions

While quantization is a powerful way to optimize LLM inference, some challenges may arise:

### Addressing Accuracy Loss

Accuracy loss can be mitigated by keeping sensitive layers in higher precision (mixed precision), by fine-tuning the quantized model, or by using quantization-aware training.
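
Before applying any mitigation, it helps to measure how far the quantized outputs have drifted. The sketch below compares two output vectors with cosine similarity and maximum absolute error. The vectors are hypothetical placeholders, not results from the models in this blog.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def max_abs_error(a, b):
    """Largest element-wise difference between the two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical outputs of the original and quantized model for the same input.
original_out = [0.12, -0.48, 0.91, 0.33]
quantized_out = [0.11, -0.47, 0.93, 0.34]

print(f"cosine similarity: {cosine_similarity(original_out, quantized_out):.4f}")
print(f"max abs error: {max_abs_error(original_out, quantized_out):.4f}")
```

A similarity close to 1 on a representative set of inputs suggests the quantized model is behaving like the original. A noticeable drop is a signal to try one of the mitigations above.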

### Hardware Considerations

Ensure your hardware can effectively utilize quantized models, considering both CPU and GPU capabilities.
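
For the PyTorch workflow used earlier, one quick check is to list the quantized backends that the installed build reports. The sketch below reads `torch.backends.quantized.supported_engines`. The wrapper function is hypothetical and returns an empty list when PyTorch is not installed.

```python
def available_quantized_engines():
    """List the quantized backends PyTorch reports, or [] if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return []
    # PyTorch includes a 'none' placeholder entry, which is filtered out here.
    return [e for e in torch.backends.quantized.supported_engines if e != "none"]

print(available_quantized_engines())
```

An empty result means quantized operators are unlikely to run efficiently in that environment, so the speedups measured above may not carry over.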

## Alternative Approaches for LLM Inference Optimization

Several other methods, like Model Pruning, Knowledge Distillation, Caching and Memory Management, Low-Rank Decomposition, Tensor Parallelism, Hardware Acceleration, Dynamic Batching, and others, aid in optimizing LLM inference in addition to quantization. Combining these methods results in a faster LLM that can even operate in contexts with limited resources.

## Conclusion

Optimizing LLM inference through quantization is a powerful strategy that can substantially improve performance at the cost of a small reduction in accuracy. The quantized model tested in this blog showed a speed improvement of 27% and a size reduction of 58%. By understanding the principles of quantization and following the implementation steps outlined in this guide, you can apply this technique effectively in your own applications.