In June 2024, Microsoft released Florence-2, an open-source vision foundation model that introduces a unified, prompt-based representation for a variety of vision-language and computer vision tasks. Florence-2 takes text prompts as task instructions and generates its results in textual form. These tasks include image captioning, object detection, grounding, OCR and segmentation. This article explains Florence-2 in detail with a hands-on implementation.
Table of Contents
- Understanding Florence-2
- Florence-2 Model Architecture
- FLD-5B Data Engine
- Using Florence-2-Large for Image Caption Generation
Understanding Florence-2
Florence-2 is an open-source vision-language model that demonstrates exceptional zero-shot and fine-tuning performance across various computer vision tasks such as captioning, object detection, OCR, grounding and segmentation. The model is trained on FLD-5B, a dataset of 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. FLD-5B contains different types of annotations, including bounding boxes, masks and captions.
The FLD-5B annotations enable Florence-2 to generate detailed and accurate image descriptions, identify and localise objects in an image, perform segmentation, and locate specific objects or concepts within an image based on the given text.
Florence-2 Unified Architecture
Florence-2 comes in two parameter sizes, 0.23B (Base) and 0.77B (Large), making it significantly smaller than other powerful vision models and allowing it to run on devices with limited processing power. The model employs a unified architecture based on a sequence-to-sequence learning paradigm, treating both images and text as sequences, which allows it to handle different tasks under a common framework.
Florence-2 uses a combination of spatial hierarchy and semantic granularity. Spatial hierarchy refers to the arrangement of objects and their relative positions within an image. For instance, spatial hierarchy concerning an “image of a living room” can be based on identifying furniture objects such as the couch, coffee table, and chairs along with their relative positions. Semantic granularity, on the other hand, refers to the level of detail in the meaning assigned to objects or concepts. For instance, “Dog” is a more general term, whereas, “Siberian Husky” provides more specific information about the breed of dog.
Florence-2 Model Architecture and Data Engine
Florence-2 uses a unified sequence-to-sequence architecture to handle multiple tasks with a single model. It consists of the following components (a conceptual sketch follows the list):
Image Encoder – Takes the input image and processes it through a vision encoder (Florence-2 uses the DaViT vision transformer) that extracts visual features capturing the image’s content, shapes, edges, colours, etc. The output of the image encoder is a numerical representation of these features, namely a sequence of visual token embeddings.
Text Encoder – Text prompts describing specific tasks are tokenised and embedded, converting the prompt into numerical representations that capture the semantics of the language used.
Multi-Modal Fusion – The encoded image and the text representations are then combined using techniques such as attention mechanisms that allow the model to understand how the visual information relates to the task or concept specified in the text prompt.
Decoder – This component takes the fused representation and generates a text output based on the task and prompt.
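To make this flow concrete, below is a minimal, illustrative PyTorch sketch of the sequence-to-sequence idea. It uses toy dimensions, random tensors and a generic nn.Transformer as a stand-in; it is not Florence-2’s actual DaViT encoder, tokeniser or trained weights.
import torch
import torch.nn as nn

d_model = 64                                  # toy embedding size (illustrative only)

# Image encoder stand-in: the image becomes a sequence of visual token embeddings
visual_tokens = torch.randn(1, 49, d_model)   # e.g. a 7x7 grid of patch features

# Text prompt stand-in: embedded tokens of a task prompt such as '<CAPTION>'
prompt_tokens = torch.randn(1, 4, d_model)

# Multi-modal fusion: visual and prompt tokens are concatenated into one sequence
fused = torch.cat([visual_tokens, prompt_tokens], dim=1)   # shape (1, 53, d_model)

# Encoder-decoder transformer: the decoder generates the output text autoregressively;
# here we pass dummy embeddings of "tokens generated so far" just to show the shapes
seq2seq = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
decoder_inputs = torch.randn(1, 10, d_model)
out = seq2seq(src=fused, tgt=decoder_inputs)  # (1, 10, d_model), later projected to vocabulary logits
print(out.shape)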
FLD-5B Data Engine
To train Florence-2, a comprehensive, large-scale, high-quality multitask dataset, FLD-5B, was built. It includes 126M images with 500M text annotations, 1.3B region-text annotations and 3.6B text-phrase-region annotations across different tasks.
Text Annotations in FLD-5B dataset
Using Florence-2-Large for Image Caption Generation
Let’s implement image caption generation with Florence-2 using the <CAPTION> and <MORE_DETAILED_CAPTION> task prompts.
Step 1: Installing the required libraries –
- timm – PyTorch Image Models library provides efficient implementations of popular computer vision models, primarily for image classification.
- flash_attn – This is the library used for implementing fast and memory-efficient attention.
- einops – Makes tensor manipulation code more readable, concise and efficient.
!pip install timm flash_attn einops
Step 2: Importing the libraries – AutoProcessor handles preprocessing of the image and text inputs for the model, AutoModelForCausalLM loads the pre-trained model, PIL (the Python Imaging Library) provides extensive support for opening, manipulating and saving image files, and the Requests library allows users to send HTTP requests in Python.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
Step 3: Loading the pre-trained model and processor using the Hugging Face model_id. The trust_remote_code=True argument allows the custom model code from the repository to be downloaded and executed, acknowledging the potential security risks of doing so.
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
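Optionally, if a CUDA GPU is available, the model can instead be loaded in half precision and moved to the GPU for faster inference. This is an optional variation (the rest of the walkthrough keeps the defaults above); if used, the tensors returned by the processor in Step 4 would also need to be moved to the same device and dtype.
import torch

# Optional: load the model in float16 on a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype, trust_remote_code=True).eval().to(device)
# Note: the processor outputs in run_example would then need .to(device, torch_dtype) as well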
Step 4: Defining the run_example prediction function – This function takes two arguments, task_prompt (specifying the task) and optional text_input (additional prompt text) and constructs the final prompt by combining them. The processor function is used to prepare the model inputs.
The model.generate function takes input_ids (processed text input), pixel_values (processed image input), max_new_tokens (maximum tokens to generate – limiting the caption length), num_beams (controls the beam search decoding strategy for generating text).
The generated token IDs are decoded back into a human-readable format using processor.batch_decode. Any task-specific post-processing is done using processor.post_process_generation. The parsed_answer (generated caption) is then printed.
def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    print(parsed_answer)
Step 5: Downloading an image and running the function to generate captions – The code defines a URL for an image and downloads it using the requests.get function. It then opens the downloaded image as a PIL Image object, which is used for caption generation via the run_example function.
<CAPTION> generates a short caption for the image, whereas <MORE_DETAILED_CAPTION> generates a more elaborate version of the caption.
Image (1)
url = "https://www.looper.com/img/gallery/the-ending-of-harry-potter-explained/intro.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
image
task_prompt = '<CAPTION>'
run_example(task_prompt)
Output (1.1)
{'<CAPTION>': 'harry potter and the deathly hallows part 2'}
task_prompt = '<MORE_DETAILED_CAPTION>'
run_example(task_prompt)
Output (1.2)
{'<MORE_DETAILED_CAPTION>': 'The image is a still from the movie Harry Potter and the Deathly Hallows Part 2. It shows three characters, Hermione Granger, Ron Weasley, and Harry Potter, crouching down in a dimly lit alleyway. Hermione is on the left side of the image, wearing a denim jacket and holding a wand. Ron is in the middle, wearing glasses and a brown jacket, and Ron is behind her. All three characters are looking at Hermione with a serious expression on their faces. The alleyway appears to be made of stone and there is a stone wall on the right side.'}
Image (2)
url = "https://amueller.github.io/word_cloud/_images/a_new_hope.png?download=true"
image = Image.open(requests.get(url, stream=True).raw)
image
task_prompt = '<MORE_DETAILED_CAPTION>'
run_example(task_prompt)
Output (2)
{'<MORE_DETAILED_CAPTION>': 'The image is a black and white word cloud in the shape of a skull. The word cloud is made up of various words related to the Star Wars universe, such as "Luke Skywalker", "The Empire Strikes Back", "Red Leader", "Han Solo", "Death Star", and "Vader-see". The words are arranged in a circular pattern around the skull, creating a sense of depth and dimension. The background is completely black, making the words stand out.'}
The generated captions are accurate descriptions of the given images.
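Besides the two prompts used above, the Florence-2 model card also lists an intermediate <DETAILED_CAPTION> task prompt, which the same run_example function handles without any changes; for example:
task_prompt = '<DETAILED_CAPTION>'
run_example(task_prompt)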
Final Words
Florence-2 understands both spatial hierarchy and semantic granularity, thanks to the FLD-5B data engine, and presents a unified approach to generating accurate and informative image descriptions, performing object detection and segmentation, and answering visual question prompts with efficiency and speed. This marks a significant step towards more powerful and versatile vision-language models that bridge visual information and human language understanding.
References
- Link to Colab Notebook
- Florence-2 Research Paper
- Florence-2 Large HuggingFace Repo
- Florence-2 Base HuggingFace Repo