A Hands-on Guide to Multilingual Visual Document Retrieval with VDR-2B-Multi-V1

vdr-2b-multi-v1 brings multilingual embeddings, faster inference, and reduced VRAM usage to visual document retrieval. This article walks through its architecture, training, and applications.

The vdr-2b-multi-v1 model marks a significant step in visual document retrieval, enabling efficient multilingual search without OCR, data extraction pipelines, or chunking. It encodes visually rich document pages into dense single-vector embeddings, which makes it well suited to multilingual and domain-specific applications.

Trained on a large dataset of multilingual query-image pairs, vdr-2b-multi-v1 significantly improves cross-lingual retrieval, inference speed, and resource efficiency. This blog explores its architecture, training methodology, evaluation results, and the possibilities it unlocks.

Table of Contents

  1. What is vdr-2b-multi-v1?
  2. Key Features and Innovations
  3. Training Dataset and Methods
  4. Hands-on Implementation
  5. Evaluation and Results
  6. Applications and Implications

What is vdr-2b-multi-v1?

The vdr-2b-multi-v1 model is a multilingual embedding framework optimized for visual document retrieval across diverse languages and domains. Built on MrLight/dse-qwen2-2b-mrl-v1, it incorporates advancements in Matryoshka Representation Learning (MRL), low VRAM usage, and faster inference times. The model supports five languages—Italian, Spanish, English, French, and German—and enables cross-lingual document retrieval, such as querying German documents with Italian text.

Key Features and Innovations

Multilingual Embedding for Cross-Lingual Retrieval

vdr-2b-multi-v1 excels in cross-lingual search scenarios, outperforming previous models in multilingual benchmarks.

Low VRAM and Faster Inference

With only 768 image tokens compared to the base model's 2560, the model delivers 3x faster inference and significantly reduced VRAM usage.

Matryoshka Representation Learning (MRL)

MRL enables dimensional reduction, maintaining 98% of embedding quality with 3x smaller vectors, improving retrieval speed and storage efficiency.
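
As an illustration, here is a minimal sketch of how MRL-style truncation is typically applied downstream: keep the leading dimensions of the embedding and re-normalize. The 1536-to-512 sizes below are illustrative assumptions, not figures taken from the model card.

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, target_dim: int) -> np.ndarray:
    """Keep the first `target_dim` dimensions of an MRL-trained embedding
    and re-normalize so cosine similarity remains meaningful."""
    truncated = embedding[:target_dim]
    return truncated / np.linalg.norm(truncated)

# Example: shrink an illustrative 1536-dim vector to 512 dims (3x smaller).
full = np.random.rand(1536).astype(np.float32)
small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```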

High-Quality Training Dataset

Built on 500k samples, the dataset includes multilingual query-image pairs, curated for high diversity and quality.

[Figure: Inference comparison]

Training Dataset and Methods

Dataset Overview

The training dataset comprises 500k query-image pairs across five languages, curated using public PDFs and advanced layout analysis models.

| Language | Filtered Queries | Unfiltered Queries |
|----------|------------------|--------------------|
| English  | 53,512           | 94,225             |
| Spanish  | 58,738           | 102,685            |
| Italian  | 54,942           | 98,747             |
| German   | 58,217           | 100,713            |
| French   | 55,270           | 99,797             |

Synthetic Query Generation

Queries were generated using Gemini-1.5-pro and Qwen2-VL-72B, which were tasked with producing both general and specific queries for improved information retrieval.

Filtering and Hard-Negative Mining

A meticulous cleaning and filtering process ensured high-quality queries. Hard negatives were mined using voyage-3, refining the dataset to improve model robustness.
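
The exact mining recipe is not published in detail; the sketch below only illustrates the general idea of hard-negative mining from pre-computed embeddings (the function and array names are hypothetical, and voyage-3 is simply the embedder used in the original pipeline).

```python
import numpy as np

def mine_hard_negatives(query_embs, doc_embs, positive_idx, k=5):
    """For each query, return the k highest-scoring documents that are NOT
    its known positive - these become the 'hard negatives'.

    query_embs:   (num_queries, dim) L2-normalized query embeddings
    doc_embs:     (num_docs, dim)    L2-normalized document embeddings
    positive_idx: positive_idx[i] is the index of query i's true document
    """
    scores = query_embs @ doc_embs.T                # cosine similarity matrix
    hard_negatives = []
    for i, pos in enumerate(positive_idx):
        ranked = np.argsort(-scores[i])             # best-scoring docs first
        negatives = [j for j in ranked if j != pos][:k]
        hard_negatives.append(negatives)
    return hard_negatives
```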

Hands-on Implementation

Step 1: Install Required Libraries 
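
The steps below are a minimal sketch of one way to run the model locally, assuming the publicly released llamaindex/vdr-2b-multi-v1 checkpoint is used through the sentence-transformers library; exact package versions and arguments should be checked against the model card.

```
pip install -U sentence-transformers torch pillow
```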

Step 2: Load Model
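
A hedged sketch of loading the model with sentence-transformers; the keyword arguments mirror common usage for this checkpoint and may differ in newer releases.

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the multilingual visual document retrieval model.
# trust_remote_code is needed because the checkpoint ships custom
# image-handling code; bfloat16 keeps VRAM usage low on a GPU.
model = SentenceTransformer(
    "llamaindex/vdr-2b-multi-v1",
    device="cuda",            # switch to "cpu" if no GPU is available
    trust_remote_code=True,
    model_kwargs={"torch_dtype": torch.bfloat16},
)
```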

Step 3: Generate Embeddings
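
The snippet below assumes the model's encode call accepts a raw text string for queries and a PIL image for document pages; whether a query prompt or a prompt_name argument is required should be verified against the model card. The file name is a placeholder.

```python
from PIL import Image

# A text query and a document page image (file name is hypothetical).
query = "What is the projected annual growth rate?"
page = Image.open("document_page.png")

query_embedding = model.encode(query)   # dense vector for the query
page_embedding = model.encode(page)     # dense vector for the page image
```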

Step 4: Print Embeddings
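
Finally, inspect the vectors and score the query against the page; `model.similarity` computes cosine similarity in recent sentence-transformers releases.

```python
# Inspect the embedding dimensions and compute a relevance score.
print("Query embedding shape:", query_embedding.shape)
print("Page embedding shape:", page_embedding.shape)

similarity = model.similarity(query_embedding, page_embedding)
print("Query-page similarity:", similarity)
```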

[Figure: Console output]

[Figure: Hugging Face demo output]

Evaluation and Results

ViDoRe Benchmark Results

vdr-2b-multi-v1 demonstrates significant performance gains over the base model across all tested languages and page types:

| Metric           | Base Model (dse-qwen2-2b-mrl-v1) | vdr-2b-multi-v1 | Improvement |
|------------------|----------------------------------|-----------------|-------------|
| French (Visual)  | 90.8                             | 93.3            | +2.2%       |
| German (Visual)  | 90.0                             | 95.7            | +6.3%       |
| Italian (Visual) | 94.0                             | 96.3            | +2.0%       |
| Spanish (Visual) | 94.7                             | 96.9            | +2.2%       |
| English (Visual) | 98.5                             | 99.1            | +0.6%       |

Faster Inference

The English-only version (vdr-2b-v1) matches the base model’s performance on the ViDoRe benchmark using only 30% of the image tokens.

| Model Variant                | Inference Speed | VRAM Usage |
|------------------------------|-----------------|------------|
| Base Model (2560 tokens)     | Baseline        | High       |
| vdr-2b-multi-v1 (768 tokens) | 3x faster       | Low        |

[Figure: Multilingual capabilities comparison]

Applications and Implications

Multilingual Visual Document Retrieval

Search for multilingual documents using queries in another language, enabling seamless cross-lingual access to resources such as legal documents, instruction manuals, and scientific papers.
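
As a sketch of that workflow, the snippet below reuses the `model` loaded in the implementation section to rank a few German manual pages against an Italian query; the file names are hypothetical, and the encode call is assumed to accept PIL images as shown earlier.

```python
from PIL import Image

# Cross-lingual retrieval sketch: an Italian query against German document pages.
query = "Istruzioni di montaggio per la libreria"   # Italian: "assembly instructions for the bookcase"
page_files = ["manual_de_p1.png", "manual_de_p2.png", "manual_de_p3.png"]  # hypothetical pages
pages = [Image.open(p) for p in page_files]

query_emb = model.encode(query)
page_embs = model.encode(pages)

# Rank the pages by cosine similarity to the query and report the best match.
scores = model.similarity(query_emb, page_embs)      # shape: (1, num_pages)
best = scores.argmax().item()
print(f"Best matching page: {page_files[best]}, score: {scores[0, best].item():.3f}")
```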

Domain-Specific Applications

From healthcare to government archives, the model’s adaptability ensures effective retrieval in specialized domains.

Synthetic Data Generation and Sim2Real Transfer

The synthetic query-generation and filtering approach behind the model can also be reused to build high-quality synthetic datasets, accelerating research in domains where data scarcity is a concern.

Final Words

The vdr-2b-multi-v1 model advances the state of visual document retrieval, combining multilingual capability, efficiency, and scalability. By building on a carefully curated dataset and modern training techniques, it delivers strong retrieval performance with modest resource requirements.

References

vdr-2b-multi-v1's Hugging Face repository

Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.
