Image-to-Text Generation with PaliGemma Multimodal Model: A Hands-on Guide

Explore Google's PaliGemma for seamless integration of visual and textual data in AI applications.

Artificial Intelligence is advancing rapidly, and Google has added to that momentum with the PaliGemma model. This vision-language model (VLM) takes an image and a text prompt as input and generates text as output, which makes it suitable for applications that need to understand visual and textual information together. In this article, we will explore the features of PaliGemma, its uses across various fields, and how to use it for tasks such as answering questions about an image.

Table of Contents

  1. Understanding PaliGemma
  2. Key Features and Capabilities
  3. Use Cases of PaliGemma
  4. Using PaliGemma for Achieving Different Tasks

Let us first look at what PaliGemma is and where it can be used, and then walk through a hands-on implementation.

Understanding PaliGemma

PaliGemma is part of Google’s Gemma family of lightweight, state-of-the-art open models built for diverse AI applications. At its core, PaliGemma combines a Vision Transformer image encoder with a Transformer decoder, for a total of about 3 billion parameters (a 4B version is now available as well). Its design is inspired by the PaLI-3 model.

PaliGemma combines the strengths of the SigLIP vision model and the Gemma language model, offering superior performance in tasks involving images and text. Because of this architecture, the model processes both images and text simultaneously. This model can handle complex tasks such as image and video captioning, visual question answering, object detection and reading text within images.
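To make the idea of processing images and text together more concrete, the sketch below counts how many image tokens the vision encoder would pass to the language model alongside the text prompt. It assumes the SigLIP encoder splits a square image into 14x14-pixel patches and that each patch becomes one token; the function name and its defaults are illustrative and not part of any library.

```python
def num_image_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (assumes one token per patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2


# Input resolutions assumed for illustration
for size in (224, 448, 896):
    print(size, num_image_tokens(size))
```

Under these assumptions, a higher input resolution means many more image tokens, which is one reason larger-resolution checkpoints need more memory and time per image.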

Key Features and Capabilities

Combination of Images and Text

PaliGemma easily understands the relationship between visual information and textual information. It can analyze images, answer questions about their content, and generate captions in multiple languages. 

Open Source and Adaptable

PaliGemma is built on two open components: the SigLIP vision model and the Gemma language model. This makes it easy to adapt to various tasks through fine-tuning, a process in which the model is trained on specific datasets to improve its performance for particular applications.

Multimodal Magic

PaliGemma can also be fine-tuned for object detection and segmentation, identifying and outlining objects within an image. It can also decipher text embedded within pictures, making it valuable for tasks like document image analysis.
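For detection, the model is generally described as returning boxes as text rather than as numeric arrays. The sketch below parses such a string into pixel coordinates. It assumes each box is written as four location tokens in the order y_min, x_min, y_max, x_max, with values binned into 1024 steps, followed by a label, and that multiple detections are separated by semicolons. The sample string and the helper name are made up for illustration, so check the format against the model card before relying on it.

```python
import re

# Assumed format: "<locYYYY><locXXXX><locYYYY><locXXXX> label ; ..."
_BOX = re.compile(r"((?:<loc\d{4}>){4})\s*([^;<]+)")


def parse_detections(text: str, width: int, height: int):
    """Convert location tokens into (label, (x_min, y_min, x_max, y_max)) in pixels."""
    results = []
    for tokens, label in _BOX.findall(text):
        # Each token holds a coordinate normalized to 1024 bins (assumption)
        y0, x0, y1, x1 = (int(v) / 1024 for v in re.findall(r"\d{4}", tokens))
        box = (round(x0 * width), round(y0 * height),
               round(x1 * width), round(y1 * height))
        results.append((label.strip(), box))
    return results


# Hypothetical model output for a 640x480 image
sample = "<loc0256><loc0128><loc0768><loc0512> cat ; <loc0000><loc0512><loc0512><loc1023> dog"
print(parse_detections(sample, width=640, height=480))
```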

Pretrained and Fine-tuned Models

There are two main categories in PaliGemma. One is the general-purpose models that are pre-trained and can be fine-tuned for various tasks. The other is the research-oriented models that are already fine-tuned to specific datasets. This makes them ideal for exploring the capabilities of VLMs in specific domains.

Integration and Accessibility

To facilitate its widespread use, PaliGemma is accessible through platforms such as Kaggle, Colab notebooks, and Hugging Face. This ensures that developers and researchers can easily experiment with the model.

Use Cases of PaliGemma

PaliGemma’s robust capabilities make it a powerful tool across various domains:

  1. Healthcare: The model can assist in medical imaging analysis, helping doctors interpret complex scans and identify abnormalities.
  2. Education: It can be used to develop educational tools that provide visual and textual explanations, enhancing learning experiences.
  3. Media and Entertainment: The model can automate content creation, such as generating captions for videos and images.
  4. Accessibility: Because the model can understand and describe visual content, it can help in developing tools that give visually impaired individuals better access to information.

Using PaliGemma for Achieving Different Tasks

PaliGemma can be used to summarize an image, caption an image, answer a query regarding an image, and read the text in the image. We will be using PaliGemma to do some of these tasks. 
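The examples in this article use free-form questions as prompts. The "mix" checkpoints are also commonly described as responding to short task prefixes, and the sketch below builds such prompts for the tasks listed above. The prefix strings and the helper are assumptions for illustration; verify them against the model card for the checkpoint you use.

```python
# Task prefixes as commonly described for the "mix" checkpoints (assumptions)
TASK_TEMPLATES = {
    "caption": "caption {lang}",
    "vqa": "answer {lang} {question}",
    "ocr": "ocr",
    "detect": "detect {target}",
}


def build_prompt(task: str, **fields: str) -> str:
    """Fill in the template for a task; the language defaults to English."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    fields.setdefault("lang", "en")
    return TASK_TEMPLATES[task].format(**fields)


print(build_prompt("caption"))
print(build_prompt("vqa", question="How many candies are there?"))
print(build_prompt("ocr"))
```

The resulting string would be passed as the text argument to the processor in place of a free-form question.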

To begin with, we need to install the required packages, one of which is installed directly from its GitHub URL, and then import all the required libraries.

!pip install -q -U accelerate bitsandbytes git+

from huggingface_hub import notebook_login
import torch
import numpy as np
from PIL import Image
import requests
from transformers import AutoTokenizer, PaliGemmaForConditionalGeneration, PaliGemmaProcessor

We will use notebook_login to log in to our Hugging Face account. When prompted, we have to paste our Hugging Face access token.

notebook_login()

Let us now set up the PaliGemma model. We first check whether a GPU is available and fall back to the CPU otherwise. We then load the pre-trained PaliGemma model in bfloat16, which halves the memory needed for the weights compared with float32, and load the matching processor.

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "google/paligemma-3b-mix-224"
# Load the weights in bfloat16 and move the model to the selected device
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
processor = PaliGemmaProcessor.from_pretrained(model_id)
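The choice of tensor data type mainly affects how much memory the weights occupy. As a rough back-of-the-envelope sketch, the snippet below estimates weight memory for a model with about 3 billion parameters at several precisions. It ignores activations, the generation cache, and framework overhead, so real usage will be higher. The 8-bit and 4-bit rows are included only as an assumption about what quantization (for example with the bitsandbytes package installed earlier) would save.

```python
# Approximate storage cost per parameter for each precision
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Estimated memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(3e9, dtype):.2f} GiB")
```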

Next, we will give an input text that will have our query and the image URL or path.

input_text = "Explain how the sales are varying by each month"
img_url = ""
# Download the image and open it with PIL
input_image = Image.open(requests.get(img_url, stream=True).raw)

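Next, the processor tokenizes the input text, preprocesses the image, and returns tensors, which we move to the same device as the model.

inputs = processor(text=input_text, images=input_image,
                 padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
# Assumed reconstruction of a truncated line: cast floating-point inputs to the model's dtype
inputs = inputs.to(dtype=model.dtype)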

We will then print the output for the given query.

# Disable gradient tracking since we are only running inference
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

print(processor.decode(output[0], skip_special_tokens=True))
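With decoder-style generation, the returned sequence typically starts with the prompt tokens, so the decoded text may repeat the question before the answer. A minimal sketch of trimming the prompt is shown below using plain lists of toy token IDs; the helper name is hypothetical, and the same slicing idea applies to tensors.

```python
def strip_prompt(output_ids, prompt_length: int):
    """Return only the tokens that were generated after the prompt."""
    if prompt_length > len(output_ids):
        raise ValueError("prompt_length exceeds the length of the output")
    return output_ids[prompt_length:]


# Toy token IDs standing in for real model output
prompt_ids = [101, 7, 8, 9]
output_ids = prompt_ids + [42, 43, 2]
print(strip_prompt(output_ids, len(prompt_ids)))
```

In the code above, this would correspond to taking the prompt length from inputs["input_ids"].shape[-1] and decoding output[0][input_len:] instead of output[0]. Note also that max_length counts the prompt tokens, including the image tokens, so the number of new tokens the model can produce is smaller than 496; generate also accepts max_new_tokens if you prefer to limit only the generated part.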

In the next task, we will ask the model to count the candies in an image.

The model looks at the image and reports how many candies it finds in total.

input_text = "How many candies are there?"
img_url = ""
input_image = Image.open(requests.get(img_url, stream=True).raw)
inputs = processor(text=input_text, images=input_image,
                 padding="longest", do_convert_rgb=True, return_tensors="pt").to(device)
inputs = inputs.to(dtype=model.dtype)  # assumed reconstruction of a truncated line
with torch.no_grad():
    output = model.generate(**inputs, max_length=496)

print(processor.decode(output[0], skip_special_tokens=True))

The printed text contains the model's answer to the question.

Thus, with PaliGemma, a single model can caption an image, describe it, and answer questions about it. By combining the SigLIP vision model and the Gemma language model, PaliGemma processes images and text together.


Google’s PaliGemma is a major advancement in vision-language models, capable of seamlessly integrating and processing both visual and textual data. This integration opens up a wide array of new AI applications across various industries. By making PaliGemma accessible and focusing on responsible AI practices, Google is enabling innovative solutions that could significantly benefit society.


Shreepradha Hegde

Shreepradha is an accomplished Associate Lead Consultant at AIM, showcasing expertise in AI and data science, specifically Generative AI. With a wealth of experience, she has consistently demonstrated exceptional skills in leveraging advanced technologies to drive innovation and insightful solutions. Shreepradha's dedication and strategic mindset have made her a valuable asset in the ever-evolving landscape of artificial intelligence and data science.
