Hands-on Guide to Vision Language Tasks using Microsoft’s Florence-2

Explore Microsoft's Florence-2: Unifying vision and language tasks with prompt-based AI integration.

Microsoft released Florence-2, an open-source vision foundation model, in June 2024. It introduces a unified, prompt-based representation for a range of computer vision and vision-language tasks: the model takes a text prompt as the task instruction and returns its result as text. Supported tasks include image captioning, object detection, grounding, OCR and segmentation. This article explains Florence-2 in detail and walks through a hands-on implementation.

Table of Contents

  1. Understanding Florence-2
  2. Florence-2 Model Architecture 
  3. FLD-5B Data Engine
  4. Using Florence-2-Large for Image Caption Generation

Understanding Florence-2

Florence-2 is an open-source vision-language model that demonstrates strong zero-shot and fine-tuning performance across computer vision tasks such as captioning, object detection, OCR, grounding and segmentation. The model is trained on the FLD-5B dataset, which consists of 5.4 billion visual annotations on 126 million images and was built with an iterative strategy of automated image annotation and model refinement. FLD-5B combines several annotation types, such as bounding boxes, masks and captions.

These annotations enable Florence-2 to generate detailed and accurate image descriptions, identify and localise objects in an image, segment them, and locate the objects or concepts that a given piece of text refers to (grounding).

Florence-2 Unified Architecture

Florence-2 comes in two sizes, 0.23B parameters (Base) and 0.77B parameters (Large). This makes it significantly smaller than many other powerful vision models and allows it to run on devices with limited processing power. The model employs a unified architecture built on a sequence-to-sequence learning paradigm: both images and text are treated as sequences, so different tasks can be handled under a common framework.
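To make the prompt-based interface concrete, here is a minimal sketch of how a task is selected purely through the prompt string. The prompt tokens are the ones we understand the Florence-2 model card to list, so verify them against the repo before relying on them. The dictionary keys, the needs-text flag and the build_prompt helper are illustrative names introduced here, not part of the model's API.

```python
# Task name -> (task prompt token, whether the task expects extra text input).
# The tokens are assumed from the Florence-2 model card; the keys are our own.
TASK_PROMPTS = {
    "caption": ("<CAPTION>", False),
    "detailed_caption": ("<DETAILED_CAPTION>", False),
    "more_detailed_caption": ("<MORE_DETAILED_CAPTION>", False),
    "object_detection": ("<OD>", False),
    "ocr": ("<OCR>", False),
    "phrase_grounding": ("<CAPTION_TO_PHRASE_GROUNDING>", True),
    "referring_segmentation": ("<REFERRING_EXPRESSION_SEGMENTATION>", True),
}


def build_prompt(task, text_input=None):
    """Return the prompt string that tells the model which task to perform."""
    token, needs_text = TASK_PROMPTS[task]
    if not needs_text:
        # Captioning, detection and OCR are driven by the task token alone.
        return token
    if not text_input:
        raise ValueError(f"task '{task}' needs a text input, e.g. a phrase to locate")
    # Grounding-style tasks append the phrase directly after the task token.
    return token + text_input


print(build_prompt("caption"))
print(build_prompt("phrase_grounding", "a green car"))
```

The same model weights serve every entry in the table; only the prompt changes, which is what the unified sequence-to-sequence framing amounts to in practice.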

Florence-2 is designed to capture both spatial hierarchy and semantic granularity. Spatial hierarchy refers to the arrangement of objects and their relative positions within an image. For instance, in an image of a living room, the spatial hierarchy covers furniture objects such as the couch, coffee table and chairs, along with their positions relative to one another. Semantic granularity, on the other hand, refers to the level of detail in the meaning assigned to objects or concepts. For instance, “Dog” is a general term, whereas “Siberian Husky” provides more specific information about the breed of dog.

Florence-2 Model Architecture and Data Engine

Florence-2 uses a unified sequence-to-sequence architecture for tackling tasks through a single model. The model uses the following components:

Image Encoder – It takes an input image and processes it through a vision backbone (DaViT, according to the Florence-2 paper) that extracts visual features capturing the image’s content, shapes, edges, colours, etc. The output of the image encoder is a numerical representation of these features in the form of a sequence of visual token embeddings.

Text Encoder – Text prompts describing specific tasks are tokenized and embedded separately from the image. This step converts the text prompts into a numerical representation that captures the semantics of the language used.

Multi-Modal Fusion – The encoded image and the text representations are then combined using techniques such as attention mechanisms that allow the model to understand how the visual information relates to the task or concept specified in the text prompt. 

Decoder – This component takes the fused representation and generates a text output based on the task and prompt. 
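Because the decoder only emits text, region outputs such as bounding boxes also have to be expressed as tokens. The Florence-2 paper describes adding location tokens for coordinates quantized into 1,000 bins. The sketch below illustrates that idea under stated assumptions: the floor-based binning, the bin-centre decoding, the `<loc_N>` spelling and all function names are our own illustration, not the library's actual implementation.

```python
NUM_BINS = 1000  # number of coordinate bins, as described in the Florence-2 paper


def quantize_box(box, image_size, bins=NUM_BINS):
    """Map pixel coordinates (x1, y1, x2, y2) to integer bin indices."""
    width, height = image_size

    def to_bin(value, extent):
        index = int(value / (extent / bins))  # assumed floor-style binning
        return max(0, min(bins - 1, index))   # clamp to a valid bin

    x1, y1, x2, y2 = box
    return [to_bin(x1, width), to_bin(y1, height),
            to_bin(x2, width), to_bin(y2, height)]


def box_to_text(label, box, image_size):
    """Render a labelled box as the kind of text a decoder could emit."""
    return label + "".join(f"<loc_{i}>" for i in quantize_box(box, image_size))


def dequantize_box(indices, image_size, bins=NUM_BINS):
    """Map bin indices back to pixel coordinates at the bin centres."""
    width, height = image_size
    x1, y1, x2, y2 = indices
    return [(x1 + 0.5) * width / bins, (y1 + 0.5) * height / bins,
            (x2 + 0.5) * width / bins, (y2 + 0.5) * height / bins]


print(box_to_text("couch", (100, 50, 900, 450), (1000, 500)))
# -> couch<loc_100><loc_100><loc_900><loc_900>
```

Quantization trades a small amount of localisation precision (at most one bin width) for the ability to treat detection, grounding and captioning as the same text generation problem.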

Florence-2 Architecture

FLD-5B Data Engine

Florence-2 was trained on FLD-5B, a comprehensive, large-scale, high-quality multitask dataset. It contains 126M images annotated with 500M text annotations, 1.3B text-region annotations and 3.6B text-phrase-region annotations across different tasks, which adds up to 5.4B annotations in total.
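As a quick sanity check, the annotation counts quoted above can be added up and related to the number of images. The figures come from the text; the variable names are ours.

```python
# Dataset figures as quoted in the article.
num_images = 126_000_000
annotations = {
    "text": 500_000_000,
    "text-region": 1_300_000_000,
    "text-phrase-region": 3_600_000_000,
}

total_annotations = sum(annotations.values())
annotations_per_image = total_annotations / num_images

print(f"Total annotations: {total_annotations / 1e9:.1f}B")          # 5.4B
print(f"Average annotations per image: {annotations_per_image:.1f}")  # 42.9
```

On average each image carries roughly 43 annotations, which reflects how densely the data engine labels a single image across tasks.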

Florence-2 Data Engine

Text Annotations in FLD-5B dataset

Using Florence-2-Large for Image Caption Generation

Let’s implement image caption generation with Florence-2 using the <CAPTION> and <MORE_DETAILED_CAPTION> task prompts.

Step 1: Installing the required libraries – 

  1. timm – The PyTorch Image Models library, which provides efficient implementations of popular computer vision models, primarily for image classification.
  2. flash_attn – A library that implements fast and memory-efficient attention.
  3. einops – A library that makes tensor manipulation code more readable, efficient and concise.
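In a notebook, installation is usually a single pip install line. The sketch below does the same from plain Python for the three packages listed above. It assumes that transformers, torch, Pillow and requests are already available, as they typically are on Colab; the helper names are illustrative.

```python
import subprocess
import sys

# The three packages listed in Step 1.
REQUIRED_PACKAGES = ["timm", "flash_attn", "einops"]


def build_pip_command(packages):
    """Build a pip command that targets the current Python interpreter."""
    return [sys.executable, "-m", "pip", "install", *packages]


def install_requirements(packages=REQUIRED_PACKAGES):
    """Install the packages; raises CalledProcessError if pip fails."""
    subprocess.check_call(build_pip_command(packages))


# Call install_requirements() once per environment.
# Note: flash_attn may need to compile and generally expects a CUDA setup,
# so its installation can take several minutes.
```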

Step 2: Importing the libraries – AutoProcessor handles preprocessing of the image and text inputs for the model, AutoModelForCausalLM loads the pre-trained model, PIL (the Python Imaging Library) provides extensive support for opening, manipulating and saving image files, and the Requests library allows users to send HTTP requests from Python. 

Step 3: Loading the pre-trained model – The model and its processor are loaded using the HuggingFace model_id and model parameters. The trust_remote_code = True argument allows the custom model code shipped in the repository to run locally, so setting it acknowledges the potential security risk of executing downloaded code. 
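Steps 2 and 3 can be sketched together as follows. This is a minimal sketch that assumes the model id microsoft/Florence-2-large and a half-precision model on GPU; the load_florence2 helper and its defaults are our own naming, and the original notebook may structure this differently.

```python
MODEL_ID = "microsoft/Florence-2-large"  # assumed Hugging Face repo id


def load_florence2(model_id=MODEL_ID, device=None):
    """Load the Florence-2 model and its processor onto the chosen device."""
    # Imported inside the function so the heavy dependencies are only
    # loaded when a model is actually requested.
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # Half precision saves memory on GPU; CPU inference stays in float32.
    dtype = torch.float16 if device == "cuda" else torch.float32

    # trust_remote_code=True executes the modelling code from the repo,
    # so only use it with repositories you trust.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


# Usage (downloads the weights on first call):
# model, processor = load_florence2()
```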

Step 4: Defining the run_example prediction function – This function takes two arguments, task_prompt (specifying the task) and an optional text_input (additional prompt text), and constructs the final prompt by combining them. The processor is then used to prepare the model inputs. 

The model.generate function takes input_ids (processed text input), pixel_values (processed image input), max_new_tokens (the maximum number of tokens to generate, which limits the caption length) and num_beams (which controls the beam search decoding strategy for generating text). 

The generated token IDs are decoded back into a human-readable format using processor.batch_decode. Any post-processing specific to the task is done using processor.post_process_generation. The parsed_answer (generated caption) is then printed. 
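The prediction function described above could look like the following sketch. Unlike the two-argument version in the text, it receives the model, processor and image explicitly so that it is self-contained. The default values for max_new_tokens and num_beams are assumptions on our part, not values stated in the article.

```python
def run_example(model, processor, image, task_prompt, text_input=None,
                max_new_tokens=1024, num_beams=3):
    """Run one Florence-2 task on a PIL image and return the parsed answer."""
    # The final prompt is the task token, optionally followed by extra text.
    prompt = task_prompt if text_input is None else task_prompt + text_input

    # Tokenize the prompt and preprocess the image into tensors, then move
    # them to the model's device and floating-point precision.
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(model.device, model.dtype)

    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=max_new_tokens,  # upper bound on output length
        num_beams=num_beams,            # beam width for beam search
    )

    # Keep special tokens: the post-processing step parses them.
    generated_text = processor.batch_decode(
        generated_ids, skip_special_tokens=False
    )[0]
    parsed_answer = processor.post_process_generation(
        generated_text,
        task=task_prompt,
        image_size=(image.width, image.height),
    )
    print(parsed_answer)
    return parsed_answer
```

Returning parsed_answer in addition to printing it makes the function easier to reuse, for example when comparing the outputs of several task prompts on the same image.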

Step 5: Downloading an image and running the function to generate captions – The code defines a URL for an image and downloads it using the requests.get function. It then opens the downloaded data as a PIL image object, which is passed to the run_example function for caption generation. 

<CAPTION> generates a short caption for the image, whereas <MORE_DETAILED_CAPTION> generates a more elaborate version of the caption. 
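A sketch of this final step is shown below. The helper names are ours, and caption_fn stands for any callable that maps an image and a task prompt to a caption, such as a thin wrapper around the run_example function from Step 4. No image URL survives in the text, so any URL you pass in is your own choice.

```python
from io import BytesIO

# The two caption task prompts used in this walkthrough.
CAPTION_TASKS = ("<CAPTION>", "<MORE_DETAILED_CAPTION>")


def load_image_from_url(url, timeout=30):
    """Download an image over HTTP and return it as an RGB PIL image."""
    # Imported here so caption_image below works without these packages.
    import requests
    from PIL import Image

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return Image.open(BytesIO(response.content)).convert("RGB")


def caption_image(image, caption_fn, tasks=CAPTION_TASKS):
    """Run every caption task on the image and collect the results by task."""
    return {task: caption_fn(image, task) for task in tasks}
```

With a loaded model and processor, usage would be along the lines of caption_image(image, lambda img, task: run_example(model, processor, img, task)), which returns one entry per task prompt so the short and detailed captions can be compared side by side.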

Image (1)

Output (1.1)

Output (1.2)

Image (2)

Output (2)

The generated captions accurately describe the given images. 

Final Words

Trained on the data produced by the FLD-5B data engine, Florence-2 captures both spatial hierarchy and semantic granularity. It offers a unified approach to generating accurate and informative image descriptions, performing object detection and segmentation, and answering visual question prompts efficiently. This marks a significant step towards more powerful and versatile vision-language models that can bridge visual information and human language understanding. 


  1. Link to Colab Notebook
  2. Florence-2 Research Paper
  3. Florence-2 Large HuggingFace Repo
  4. Florence-2 Base HuggingFace Repo


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
