A Hands-on Guide to Multilingual Visual Document Retrieval with VDR-2B-Multi-V1

vdr-2b-multi-v1 brings multilingual embeddings, faster inference, and reduced VRAM usage to visual document retrieval. This article walks through its architecture, training, and applications.

The vdr-2b-multi-v1 model marks a significant step in visual document retrieval, enabling efficient multilingual search without OCR, data extraction pipelines, or chunking. It encodes visually rich document pages into dense single-vector embeddings, which makes it well suited to multilingual and domain-specific applications.

Trained on a large dataset of multilingual query-image pairs, vdr-2b-multi-v1 significantly improves cross-lingual retrieval, inference speed, and resource efficiency. This blog explores its architecture, training methodology, evaluation results, and the possibilities it unlocks.

Table of Contents

  1. What is vdr-2b-multi-v1?
  2. Key Features and Innovations
  3. Training Dataset and Methods
  4. Hands-on Implementation
  5. Evaluation and Results
  6. Applications and Implications

What is vdr-2b-multi-v1?

The vdr-2b-multi-v1 model is a multilingual embedding framework optimized for visual document retrieval across diverse languages and domains. Built on MrLight/dse-qwen2-2b-mrl-v1, it incorporates advancements in Matryoshka Representation Learning (MRL), low VRAM usage, and faster inference times. The model supports five languages—Italian, Spanish, English, French, and German—and enables cross-lingual document retrieval, such as querying German documents with Italian text.

Key Features and Innovations

Multilingual Embedding for Cross-Lingual Retrieval

vdr-2b-multi-v1 excels in cross-lingual search scenarios, outperforming previous models in multilingual benchmarks.

Low VRAM and Faster Inference

With only 768 image tokens compared to the base model's 2560, the model delivers 3x faster inference and significantly reduced VRAM usage.

Matryoshka Representation Learning (MRL)

MRL enables dimensional reduction, maintaining 98% of embedding quality with 3x smaller vectors, improving retrieval speed and storage efficiency.
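
As an illustration, here is a minimal sketch of how MRL-style truncation is typically applied downstream: keep the leading dimensions of the embedding and re-normalize. The 1536-to-512 sizes below are illustrative assumptions, not figures taken from the model card.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, target_dim: int) -> np.ndarray:
    """Keep the first `target_dim` dimensions of an MRL-trained embedding
    and re-normalize so cosine similarity remains meaningful."""
    truncated = embedding[:target_dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink an illustrative 1536-dim vector to 512 dims (3x smaller).
full = np.random.rand(1536).astype(np.float32)
small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```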

High-Quality Training Dataset

Built on 500k samples, the dataset includes multilingual query-image pairs, curated for high diversity and quality.

[Figure: Inference comparison]

Training Dataset and Methods

Dataset Overview

The training dataset comprises 500k query-image pairs across five languages, curated using public PDFs and advanced layout analysis models.

| Language | Filtered Queries | Unfiltered Queries |
|----------|------------------|--------------------|
| English  | 53,512           | 94,225             |
| Spanish  | 58,738           | 102,685            |
| Italian  | 54,942           | 98,747             |
| German   | 58,217           | 100,713            |
| French   | 55,270           | 99,797             |

Synthetic Query Generation

Queries were generated using Gemini-1.5-pro and Qwen2-VL-72B, which were tasked with producing both general and specific queries for improved information retrieval.

Filtering and Hard-Negative Mining

A meticulous cleaning and filtering process ensured high-quality queries. Hard negatives were mined using voyage-3, refining the dataset to improve model robustness.
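
The exact mining recipe is not published in detail; the sketch below only illustrates the general idea of hard-negative mining from pre-computed embeddings (the function and array names are hypothetical, and voyage-3 is simply the embedder used in the original pipeline).

```python
import numpy as np

def mine_hard_negatives(query_embs, doc_embs, positive_idx, k=5):
    """For each query, return the k highest-scoring documents that are NOT
    its known positive - these become the 'hard negatives'.

    query_embs:   (num_queries, dim) L2-normalized query embeddings
    doc_embs:     (num_docs, dim)    L2-normalized document embeddings
    positive_idx: positive_idx[i] is the index of query i's true document
    """
    scores = query_embs @ doc_embs.T                # cosine similarity matrix
    hard_negatives = []
    for i, pos in enumerate(positive_idx):
        ranked = np.argsort(-scores[i])             # best-scoring docs first
        negatives = [j for j in ranked if j != pos][:k]
        hard_negatives.append(negatives)
    return hard_negatives
```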

Hands-on Implementation

Step 1: Install Required Libraries 
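
The steps below are a minimal sketch of one way to run the model locally, assuming the publicly released llamaindex/vdr-2b-multi-v1 checkpoint is used through the sentence-transformers library; exact package versions and arguments should be checked against the model card.

```
pip install -U sentence-transformers torch pillow
```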

Step 2: Load Model
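
A hedged sketch of loading the model with sentence-transformers; the keyword arguments mirror common usage for this checkpoint and may differ in newer releases.

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the multilingual visual document retrieval model.
# trust_remote_code is needed because the checkpoint ships custom
# image-handling code; bfloat16 keeps VRAM usage low on a GPU.
model = SentenceTransformer(
    "llamaindex/vdr-2b-multi-v1",
    device="cuda",            # switch to "cpu" if no GPU is available
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```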

Step 3: Generate Embeddings
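
The snippet below assumes the model's encode call accepts a raw text string for queries and a PIL image for document pages; whether a query prompt or a prompt_name argument is required should be verified against the model card. The file name is a placeholder.

```python
from PIL import Image

# A text query and a document page image (file name is hypothetical).
query = "What is the projected annual growth rate?"
page = Image.open("document_page.png")

query_embedding = model.encode(query)   # dense vector for the query
page_embedding = model.encode(page)     # dense vector for the page image
```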

Step 4: Print Embeddings
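
Finally, inspect the vectors and score the query against the page; `model.similarity` computes cosine similarity in recent sentence-transformers releases.

```python
# Inspect the embedding dimensions and compute a relevance score.
print("Query embedding shape:", query_embedding.shape)
print("Page embedding shape:", page_embedding.shape)

similarity = model.similarity(query_embedding, page_embedding)
print("Query-page similarity:", similarity)
```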

[Figure: Console output]

[Figure: Hugging Face demo output]

Evaluation and Results

ViDoRe Benchmark Results

vdr-2b-multi-v1 demonstrates significant performance gains over the base model across all tested languages and page types:

| Metric           | Base Model (dse-qwen2-2b-mrl-v1) | vdr-2b-multi-v1 | Improvement |
|------------------|----------------------------------|-----------------|-------------|
| French (Visual)  | 90.8                             | 93.3            | +2.2%       |
| German (Visual)  | 90.0                             | 95.7            | +6.3%       |
| Italian (Visual) | 94.0                             | 96.3            | +2.0%       |
| Spanish (Visual) | 94.7                             | 96.9            | +2.2%       |
| English (Visual) | 98.5                             | 99.1            | +0.6%       |

Faster Inference

The English-only version (vdr-2b-v1) matches the base model’s performance on the ViDoRe benchmark using only 30% of the image tokens.

| Model Variant                | Inference Speed | VRAM Usage |
|------------------------------|-----------------|------------|
| Base Model (2560 tokens)     | Baseline        | High       |
| vdr-2b-multi-v1 (768 tokens) | 3x faster       | Low        |

[Figure: Multilingual capabilities comparison]

Applications and Implications

Multilingual Visual Document Retrieval

Search for multilingual documents using queries in another language, enabling seamless cross-lingual access to resources such as legal documents, instruction manuals, and scientific papers.
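
As a sketch of that workflow, the snippet below reuses the `model` loaded in the implementation section to rank a few German manual pages against an Italian query; the file names are hypothetical, and the encode call is assumed to accept PIL images as shown earlier.

```python
from PIL import Image

# Cross-lingual retrieval sketch: an Italian query against German document pages.
query = "Istruzioni di montaggio per la libreria"   # Italian: "assembly instructions for the bookcase"
page_files = ["manual_de_p1.png", "manual_de_p2.png", "manual_de_p3.png"]  # hypothetical pages
pages = [Image.open(p) for p in page_files]

query_emb = model.encode(query)
page_embs = model.encode(pages)

# Rank the pages by cosine similarity to the query and report the best match.
scores = model.similarity(query_emb, page_embs)      # shape: (1, num_pages)
best = scores.argmax().item()
print(f"Best matching page: {page_files[best]}, score: {scores[0, best].item():.3f}")
```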

Domain-Specific Applications

From healthcare to government archives, the model’s adaptability ensures effective retrieval in specialized domains.

Synthetic Data Generation and Sim2Real Transfer

The synthetic query-generation and filtering approach behind the model can also be reused to build high-quality synthetic datasets, accelerating research in domains where data scarcity is a concern.

Final Words

The vdr-2b-multi-v1 model advances the state of visual document retrieval, combining multilingual capability, efficiency, and scalability. By building on a carefully curated dataset and modern training techniques, it delivers strong retrieval performance with modest resource requirements.

References

vdr-2b-multi-v1's Hugging Face repository

Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.
