Automatically producing captions for images is a problem that lies at the heart of scene understanding, one of the fundamental aims of computer vision. Caption generation models must not only be sophisticated enough to solve core computer vision challenges, such as detecting which objects are in an image, but they must also be able to capture and describe the relationships between those objects in natural language. In this hands-on article, we will use BLIP (Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation) and the Mistral 7B large language model to caption an image.
Table of contents
- Understanding Image Captioning
- Overview of the VLP and BLIP model
- Image Captioning with Mistral 7B LLM and BLIP
Let’s start by understanding the core of the experiment, which is image captioning, and how it relates to scene understanding.
Understanding Image Captioning
Image captioning is one of the primary goals of computer vision: automatically generating natural-language descriptions for images. It requires not only recognizing the salient objects in an image and understanding their interactions, but also verbalizing them in natural language, which makes it a very challenging task. This is why the model must first understand the scene in a particular image before it can describe it. Some key aspects of image captioning include:
- Analyzing the visual features and contents of an image using computer vision techniques.
- Generating a natural language description of the image using language models and sequence-to-sequence learning.
- Incorporating attention mechanisms to focus on the most relevant parts of the image when generating each word in the caption.
- Training models on large datasets of images paired with human-written captions.
Source: COCO dataset
The image above is part of the famous COCO (Common Objects in Context) dataset, which contains a vast collection of images, each annotated with detailed information. Here we can see that, for a single image, there can be several different captions, and every one of them can be true. Let’s take an example. In the above image, the first caption is “The people are posing for a group photo.”; this statement is true to the image, but it doesn’t explain the exact scene. This illustrates how difficult it is to make a model understand a particular scene.
Overview of the VLP and BLIP model
To understand Bootstrapping Language-Image Pre-training (BLIP), we must first go through the Vision-Language Pre-training (VLP) framework. The framework aims to bridge the modality gap between vision and language: with it, a model learns how to perceive the visual world and how to understand and describe it using natural language.
Vision-language pre-training (VLP) aims to improve the performance of downstream vision and language tasks by pre-training the model on large-scale image-text pairs. Due to the prohibitive cost of acquiring human-annotated text, most methods use image and alt-text pairs crawled from the web. Despite the use of simple rule-based filters, noise is still prevalent in these web texts.
Source: BLIP White Paper
BLIP is a new VLP (Vision-language pre-training) framework which enables a wider range of downstream tasks than existing methods. It introduces two contributions from the model and data perspective, respectively:
- Multimodal mixture of Encoder-Decoder (MED): A new model architecture for effective multi-task pre-training and flexible transfer learning. A MED can operate either as an unimodal encoder, an image-grounded text encoder, or an image-grounded text decoder. The model is jointly pre-trained with three vision-language objectives: image-text contrastive learning, image-text matching, and image-conditioned language modelling.
- Captioning and Filtering (CapFilt): A new dataset bootstrapping method for learning from noisy image-text pairs. It fine-tunes a pre-trained MED into two modules: a captioner to produce synthetic captions given web images and a filter to remove noisy captions from both the original web texts and the synthetic texts.
Image Captioning with Mistral 7B and BLIP
In this project, we will use BLIP and the Mistral 7B model to understand the scene in an image and express it in natural language. We will use the LangChain framework to create a pipeline through which the user inputs an image and gets a caption as the output. Let’s start by setting up the project and installing the dependencies and prerequisites. Here is the code snippet.
!pip install -U transformers accelerate ctransformers langchain torch
!pip install pydantic==1.10.10
The “transformers” module from Hugging Face is used for natural language processing tasks, providing pre-trained models and a wide range of tools for working with text data. The “accelerate” module helps optimize PyTorch code for better performance on GPUs and TPUs. Since we will be using a GGUF model, we need a Python binding for the Transformer models implemented in C/C++; for that, the “ctransformers” module is used.
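The original snippets do not show their imports. A minimal set that the code in the rest of the article relies on would look roughly like the following sketch (this is my reconstruction, not the notebook’s exact import list):
import requests
from PIL import Image                                 # image loading from a URL response
from pydantic import BaseModel, Field                 # input schema for the custom tool (pydantic v1, as pinned above)
from langchain.agents import initialize_agent, tool   # agent construction and the tool decorator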
For the project, we will work with the quantized Mistral 7B model published on the Hugging Face Hub. Here is the model path.
MODEL_PATH = 'TheBloke/Mistral-7B-Instruct-v0.2-GGUF'
Through the ctransformers module, we can configure the model and then download it with a particular configuration. Here is the list of configuration options.
config = {
    "max_new_tokens": 2048,
    "context_length": 4096,
    "repetition_penalty": 1.1,
    "temperature": 0.6,
    "top_k": 50,
    "top_p": 0.9,
    "stream": True,
    "gpu_layers": 90  # for GPU utilization
}
In the above code snippet, we set max_new_tokens and context_length so that we can control the model’s text generation. For sampling, we set the temperature as well as the top-k and top-p parameters. In top-k sampling, the model considers only the k most likely tokens at each step and samples from this restricted set. Similarly, in top-p sampling, the model considers the smallest possible set of tokens whose cumulative probability exceeds the specified threshold.
For the experiment, we will use the T4 GPU that Google Colab provides for a few hours without a subscription. To offload work to the GPU, the gpu_layers parameter needs to be set in the configuration to the number of layers that should run on the GPU; in our case it is 90, which depends on the GPU being used. One can experiment with the configuration to get better results for the task at hand.
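The loading step itself is not reproduced in the article, but with LangChain’s CTransformers wrapper it would look roughly like the sketch below. The GGUF file name is an assumption on my part; any quantization level available in the TheBloke/Mistral-7B-Instruct-v0.2-GGUF repository can be substituted.
from langchain.llms import CTransformers

# Download and load the quantized Mistral 7B model with the configuration above.
# The model_file name is an assumption; pick any GGUF file from the repository.
llm = CTransformers(
    model=MODEL_PATH,
    model_file="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    model_type="mistral",
    config=config
)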
Now, we will build our custom tool for the agent. The idea is to build an agent backed by Mistral 7B for processing user requests and generating responses, while the tool gives the agent the ability to process an image and understand it using the BLIP model. Here are the code snippets.
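The tool relies on a BLIP processor and model being loaded beforehand. A minimal way to load them is sketched below; the Salesforce/blip-image-captioning-large checkpoint is an assumption, and the base variant works as well.
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load the BLIP captioning checkpoint (the checkpoint name is an assumption).
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-large")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-large")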
class ImageDescriptorInput(BaseModel):
    """
    Input data schema for image descriptor tool.
    """
    target_url: str = Field(description="URL of the target image that is to be described")

@tool("image_descriptor", return_direct=True, args_schema=ImageDescriptorInput)
def image_descriptor(target_url: str) -> str:
    """
    Function to generate a textual description of an image.

    Args:
        target_url (str): URL of the target image.

    Returns:
        str: Description of the image.
    """
    target_data = Image.open(requests.get(target_url, stream=True).raw).convert('RGB')
    inputs = blip_processor(target_data, return_tensors="pt")
    output = blip_model.generate(**inputs)
    return blip_processor.decode(output[0], skip_special_tokens=True)

tools = [image_descriptor]
In the above code, a Python function image_descriptor is defined that generates a textual description of an image. The function takes a URL of an image as input, retrieves the image data from the URL, and converts it into a format suitable for processing. It then uses a pre-trained model (blip_model) along with a processor (blip_processor) to generate a textual description of the image.
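Before wiring the tool into an agent, it can be sanity-checked on its own. The call below reuses one of the image URLs that appears later in the article and should print a short BLIP caption.
# Quick standalone check of the tool, passing the input as a dict matching the schema.
print(image_descriptor.run({"target_url": "https://images.pexels.com/photos/531035/pexels-photo-531035.jpeg"}))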
Next, we will build our multimodal agent. For that, we require our custom tool, the LLM, stop words that end each exchange between the user and the system, and a memory component to store the chat history. Here are the code snippets; for the complete code, visit the Google Colab notebook in the references.
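As an illustrative sketch (not the notebook’s exact code), the memory component could be a LangChain ConversationBufferWindowMemory; the custom output parser referenced below as parser is part of the full notebook and is not reproduced here.
from langchain.memory import ConversationBufferWindowMemory

# Chat history buffer for the conversational agent; memory_key must match
# the placeholder expected by the conversational agent prompt.
memory = ConversationBufferWindowMemory(
    memory_key="chat_history",
    k=5,
    return_messages=True
)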
agent = initialize_agent(
    agent="chat-conversational-react-description",
    tools=tools,
    llm=llm,
    verbose=True,
    early_stopping_method="generate",
    memory=memory,
    agent_kwargs={"output_parser": parser}
)
Now we are all set to query the agent by giving it an image as input. Here is the code snippet, along with the image that needs to be captioned.
description = agent("Explain this image: https://images.pexels.com/photos/531035/pexels-photo-531035.jpeg")
print(description['output'])
Below is the output of the agent. Here, the agent uses the image_descriptor tool, through which it utilizes the BLIP model for vision-language understanding. Once that step is done, the Mistral model processes the output of the BLIP model and provides the final answer, which is the caption for the image.
Let’s use another image, but this time one whose surroundings are more difficult to understand. Here are the image and the code snippet.
description = agent("Explain this image: https://images.pexels.com/photos/632522/pexels-photo-632522.jpeg")
print(description['output'])
Let’s have a look at the output. From the image below, we can observe that the assistant initially has trouble depicting the exact scene and understanding the landscape. In the end, the assistant is able to depict the scene and give a caption in which it recognizes the village and the country.
Conclusion
This article explains the workings of a vision-language model and how to create a custom tool with LangChain. The assistant uses the custom tool to process an image and understand the scene in it, including the objects and their relationships.
References