In June 2024, Microsoft released Florence-2, an open-source vision foundation model that introduces a unified, prompt-based representation for a variety of vision-language and computer vision tasks. Florence-2 takes text prompts as task instructions and generates its results in textual form. These tasks include image captioning, object detection, grounding, OCR and segmentation. This article explains Florence-2 in detail with a hands-on implementation.
Table of Contents
- Understanding Florence-2
- Florence-2 Model Architecture
- FLD-5B Data Engine
- Using Florence-2-Large for Image Caption Generation
Understanding Florence-2
Florence-2 is an open-source vision-language model that demonstrates exceptional zero-shot and fine-tuning performance across various computer vision tasks such as captioning, object detection, OCR, grounding and segmentation. The model is trained on FLD-5B, a dataset of 5.4 billion comprehensive visual annotations on 126 million images, built using an iterative strategy of automated image annotation and model refinement. FLD-5B contains different types of annotations, including bounding boxes, masks and captions.
The FLD-5B annotations enable Florence-2 to generate detailed and accurate image descriptions, identify and localise objects in an image, perform segmentation, and locate specific objects or concepts within an image based on the given text.
Florence-2 Unified Architecture
Florence-2 comes in two parameter sizes, 0.23B (Base) and 0.77B (Large), making it significantly smaller than other powerful vision models and allowing it to run on devices with limited processing power. The model employs a unified architecture based on a sequence-to-sequence learning paradigm, treating both images and text as sequences, which allows it to handle different tasks under a common framework.
Florence-2 uses a combination of spatial hierarchy and semantic granularity. Spatial hierarchy refers to the arrangement of objects and their relative positions within an image. For instance, spatial hierarchy concerning an “image of a living room” can be based on identifying furniture objects such as the couch, coffee table, and chairs along with their relative positions. Semantic granularity, on the other hand, refers to the level of detail in the meaning assigned to objects or concepts. For instance, “Dog” is a more general term, whereas, “Siberian Husky” provides more specific information about the breed of dog.
Florence-2 Model Architecture and Data Engine
Florence-2 uses a unified sequence-to-sequence architecture to handle multiple tasks with a single model. It consists of the following components (a conceptual sketch follows the list):
Image Encoder – Takes the input image and processes it through a vision encoder (Florence-2 uses the DaViT vision transformer) that extracts visual features capturing the image’s content, shapes, edges, colours, etc. The output of the image encoder is a numerical representation of these features, namely a sequence of visual token embeddings.
Text Encoder – Text prompts describing specific tasks are tokenised and embedded, converting the prompt into numerical representations that capture the semantics of the language used.
Multi-Modal Fusion – The encoded image and the text representations are then combined using techniques such as attention mechanisms that allow the model to understand how the visual information relates to the task or concept specified in the text prompt.
Decoder – This component takes the fused representation and generates a text output based on the task and prompt.
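To make this flow concrete, below is a minimal, illustrative PyTorch sketch of the sequence-to-sequence idea. It uses toy dimensions, random tensors and a generic nn.Transformer as a stand-in; it is not Florence-2’s actual DaViT encoder, tokeniser or trained weights.
import torch
import torch.nn as nn

d_model = 64                                  # toy embedding size (illustrative only)

# Image encoder stand-in: the image becomes a sequence of visual token embeddings
visual_tokens = torch.randn(1, 49, d_model)   # e.g. a 7x7 grid of patch features

# Text prompt stand-in: embedded tokens of a task prompt such as '<CAPTION>'
prompt_tokens = torch.randn(1, 4, d_model)

# Multi-modal fusion: visual and prompt tokens are concatenated into one sequence
fused = torch.cat([visual_tokens, prompt_tokens], dim=1)   # shape (1, 53, d_model)

# Encoder-decoder transformer: the decoder generates the output text autoregressively;
# here we pass dummy embeddings of "tokens generated so far" just to show the shapes
seq2seq = nn.Transformer(d_model=d_model, nhead=4, batch_first=True)
decoder_inputs = torch.randn(1, 10, d_model)
out = seq2seq(src=fused, tgt=decoder_inputs)  # (1, 10, d_model), later projected to vocabulary logits
print(out.shape)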
FLD-5B Data Engine
To train Florence-2, a comprehensive, large-scale, high-quality multitask dataset, FLD-5B, was built. It includes 126M images with 500M text annotations, 1.3B region-text annotations and 3.6B text-phrase-region annotations across different tasks.
Text Annotations in FLD-5B dataset
Using Florence-2-Large for Image Caption Generation
Let’s implement image caption generation with Florence-2 using the <CAPTION> and <MORE_DETAILED_CAPTION> task prompts.
Step 1: Installing the required libraries –
- timm – PyTorch Image Models library provides efficient implementations of popular computer vision models, primarily for image classification.
- flash_attn – This is the library used for implementing fast and memory-efficient attention.
- einops – Makes tensor manipulation code more readable, concise and efficient.
!pip install timm flash_attn einops
Step 2: Importing the libraries – AutoProcessor handles preprocessing of the image and text inputs for the model, AutoModelForCausalLM loads the pre-trained model, PIL (the Python Imaging Library) provides extensive support for opening, manipulating and saving image files, and the Requests library allows users to send HTTP requests in Python.
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import requests
Step 3: Loading the pre-trained model and processor using the Hugging Face model_id. The trust_remote_code=True argument allows the custom model code from the repository to be downloaded and executed, acknowledging the potential security risks of doing so.
model_id = 'microsoft/Florence-2-large'
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
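Optionally, if a CUDA GPU is available, the model can instead be loaded in half precision and moved to the GPU for faster inference. This is an optional variation (the rest of the walkthrough keeps the defaults above); if used, the tensors returned by the processor in Step 4 would also need to be moved to the same device and dtype.
import torch

# Optional: load the model in float16 on a GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if device == "cuda" else torch.float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch_dtype, trust_remote_code=True).eval().to(device)
# Note: the processor outputs in run_example would then need .to(device, torch_dtype) as well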
Step 4: Defining the run_example prediction function – This function takes two arguments, task_prompt (specifying the task) and optional text_input (additional prompt text) and constructs the final prompt by combining them. The processor function is used to prepare the model inputs.
The model.generate function takes input_ids (processed text input), pixel_values (processed image input), max_new_tokens (maximum tokens to generate – limiting the caption length), num_beams (controls the beam search decoding strategy for generating text).
The generated token IDs are decoded back into a human-readable format using processor.batch_decode. Any task-specific post-processing is done using processor.post_process_generation. The parsed_answer (generated caption) is then printed.
def run_example(task_prompt, text_input=None):
    if text_input is None:
        prompt = task_prompt
    else:
        prompt = task_prompt + text_input
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(generated_text, task=task_prompt, image_size=(image.width, image.height))
    print(parsed_answer)
Step 5: Downloading an image and running the function to generate captions – The code defines a URL for an image and downloads it using the requests.get function. It then opens the downloaded image as a PIL Image object, which is used for caption generation via the run_example function.
<CAPTION> generates a short caption for the image, whereas <MORE_DETAILED_CAPTION> generates a more elaborate version of the caption.
Image (1)
url = "https://www.looper.com/img/gallery/the-ending-of-harry-potter-explained/intro.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)
image
task_prompt = '<CAPTION>'
run_example(task_prompt)
Output (1.1)
{'<CAPTION>': 'harry potter and the deathly hallows part 2'}
task_prompt = '<MORE_DETAILED_CAPTION>'
run_example(task_prompt)
Output (1.2)
{'<MORE_DETAILED_CAPTION>': 'The image is a still from the movie Harry Potter and the Deathly Hallows Part 2. It shows three characters, Hermione Granger, Ron Weasley, and Harry Potter, crouching down in a dimly lit alleyway. Hermione is on the left side of the image, wearing a denim jacket and holding a wand. Ron is in the middle, wearing glasses and a brown jacket, and Ron is behind her. All three characters are looking at Hermione with a serious expression on their faces. The alleyway appears to be made of stone and there is a stone wall on the right side.'}
Image (2)
url = "https://amueller.github.io/word_cloud/_images/a_new_hope.png?download=true"
image = Image.open(requests.get(url, stream=True).raw)
image
task_prompt = '<MORE_DETAILED_CAPTION>'
run_example(task_prompt)
Output (2)
{'<MORE_DETAILED_CAPTION>': 'The image is a black and white word cloud in the shape of a skull. The word cloud is made up of various words related to the Star Wars universe, such as "Luke Skywalker", "The Empire Strikes Back", "Red Leader", "Han Solo", "Death Star", and "Vader-see". The words are arranged in a circular pattern around the skull, creating a sense of depth and dimension. The background is completely black, making the words stand out.'}
The generated captions are accurate descriptions of the given images.
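Besides the two prompts used above, the Florence-2 model card also lists an intermediate <DETAILED_CAPTION> task prompt, which the same run_example function handles without any changes; for example:
task_prompt = '<DETAILED_CAPTION>'
run_example(task_prompt)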
Final Words
Florence-2 understands both spatial hierarchy and semantic granularity, thanks to the FLD-5B data engine, and presents a unified approach to generating accurate and informative image descriptions, performing object detection and segmentation, and answering visual question prompts with efficiency and speed. This marks a significant step towards more powerful and versatile vision-language models that bridge visual information and human language understanding.
References
- Link to Colab Notebook
- Florence-2 Research Paper
- Florence-2 Large HuggingFace Repo
- Florence-2 Base HuggingFace Repo