A Hands on Guide to Compact Vision Language Models using SmolDocling

SmolDocling, a 256M VLM, enables efficient document conversion using DocTags to preserve structure while reducing computation.

Document conversion is challenging and has traditionally relied on complex model ensembles or resource-heavy VLMs. SmolDocling, a 256M-parameter VLM, offers efficient, end-to-end multimodal conversion. It processes full document pages and preserves elements such as text, tables, and layout using DocTags, a novel markup format. This article explores SmolDocling’s architecture, capabilities, and implementation in detail.

Table of Contents

What is SmolDocling?
Understanding the Architecture
Key Features and Capabilities
Step by Step Implementation Guide
Real World Applications

Let’s start by understanding what SmolDocling is.

What is SmolDocling?

Traditionally, document conversion has relied on either resource-intensive large VLMs or intricate ensemble-based methods. Ensemble approaches, which combine specialized models for tasks such as layout analysis and OCR, are hard to generalize and fine-tune. Single-shot conversion is possible with large VLMs like GPT-4o and Qwen2.5-VL, but they require a significant amount of processing power.

DocTags format

SmolDocling addresses these constraints with a compact architecture that balances accuracy and efficiency. It bridges the gap between computationally costly large VLMs and specialized ensemble pipelines by sharply lowering computational overhead while remaining competitive with much larger models.

Understanding the Architecture

SmolDocling builds on SmolVLM-256M, which pairs a 93M-parameter SigLIP base vision encoder with a 135M-parameter SmolLM-2 language model. The visual features are compressed via pixel shuffling, and the language model then predicts DocTags autoregressively. DocTags, a lossless markup system, represents document elements such as text, lists, and charts together with their position on the page. According to its authors, this design allows SmolDocling to rival or surpass VLMs up to 27 times larger while significantly reducing computational demands, making it a resource-efficient solution for document understanding.
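To make the compression step concrete, here is a minimal sketch of pixel shuffling (space-to-depth) on a grid of patch embeddings, written in plain Python. The numbers in the usage lines (a 512-pixel tile, 16-pixel patches, and a shuffle ratio of 4) are assumptions based on published descriptions of SmolVLM rather than values stated in this article, and both function names are hypothetical.

```python
def visual_token_count(image_size, patch_size, shuffle_ratio):
    """Number of visual tokens left after pixel shuffling a square image tile."""
    patches_per_side = image_size // patch_size
    if patches_per_side % shuffle_ratio:
        raise ValueError("patch grid must be divisible by the shuffle ratio")
    return (patches_per_side // shuffle_ratio) ** 2


def pixel_shuffle(grid, ratio):
    """Merge each ratio x ratio block of feature vectors into one longer vector.

    grid is a list of rows; each cell is a list of numbers (one patch embedding).
    The result has fewer cells, and each cell holds ratio * ratio times more values.
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid dimensions must be divisible by the ratio")
    shuffled = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        shuffled.append(row)
    return shuffled


# Assumed values (not from this article): 512 px tile, 16 px patches, ratio 4.
tokens = visual_token_count(512, 16, 4)
print("visual tokens per tile:", tokens)

# Toy 4x4 grid whose one-element "embedding" is simply the patch index.
toy_grid = [[[row * 4 + col] for col in range(4)] for row in range(4)]
toy_shuffled = pixel_shuffle(toy_grid, 2)
print(toy_shuffled[0][0])
```

The sequence the language model has to attend over shrinks by the square of the ratio, while each remaining token carries the concatenated features of the patches it replaced.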

SmolDocling/SmolVLM architecture.

Key Features and Capabilities

SmolDocling introduces several key innovations that enhance document conversion:

End-to-end document parsing: Processes entire pages instead of handling elements separately.
Multimodal element recognition: Extracts tables, charts, equations, lists, and code snippets with precise formatting.
High-accuracy text recognition (OCR-free): Achieves superior F1-scores compared to leading VLMs in structured text recognition.
Compact model size: Reduces memory and computation requirements, making it deployable on standard hardware.
Optimized training with DocTags: Captures document layout and spatial relationships between elements efficiently.
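To illustrate how a tag-based format can carry both content and position, the sketch below parses a small DocTags-like string with a regular expression. The sample string is hand-written for illustration and is not real model output. The tag names, the four `<loc_N>` coordinates per element, and the helper name `parse_elements` are all assumptions modeled on how DocTags is generally described. For real output, the `docling_core` classes used in Step 9 are the appropriate parser.

```python
import re

# Hand-written, hypothetical DocTags-like string (not real model output).
SAMPLE_DOCTAGS = (
    "<doctag>"
    "<section_header_level_1><loc_50><loc_40><loc_450><loc_60>"
    "Quarterly Report</section_header_level_1>"
    "<text><loc_50><loc_80><loc_450><loc_120>"
    "Revenue grew in the third quarter.</text>"
    "</doctag>"
)

# One element: an opening tag, four location tokens, the body, a matching closing tag.
ELEMENT_PATTERN = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<body>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_elements(doctags):
    """Return a list of dicts with the tag name, bounding box, and text of each element."""
    elements = []
    for match in ELEMENT_PATTERN.finditer(doctags):
        bbox = tuple(int(match.group(key)) for key in ("x1", "y1", "x2", "y2"))
        elements.append(
            {"tag": match.group("tag"), "bbox": bbox, "text": match.group("body").strip()}
        )
    return elements


elements = parse_elements(SAMPLE_DOCTAGS)
for element in elements:
    print(element["tag"], element["bbox"], element["text"])
```

Because the element type and its location travel together in a single token stream, one autoregressive model can emit layout and content in one pass.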

Step by Step Implementation Guide

Want to test SmolDocling? Follow these steps:

Step 1: Install Dependencies

# pillow and requests are imported in Step 2, so install them alongside the model libraries.
!pip install torch transformers docling_core pillow requests
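Before moving on, it can help to confirm that the packages are importable in the current environment. The following is a small sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the module list reflects the imports used in this guide (`PIL` is the import name of the Pillow package).

```python
import importlib.util

# Import names used later in this guide.
REQUIRED_MODULES = ["torch", "transformers", "docling_core", "PIL", "requests"]


def missing_modules(module_names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```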

Step 2: Import Required Libraries

We begin by importing the necessary Python libraries for processing images, making HTTP requests, and handling AI models.

# Standard library
from io import BytesIO

# Third-party libraries
import requests
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

Step 3: Set Up Device Configuration

Detect whether CUDA (GPU acceleration) is available and set the computation device accordingly.

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

Step 4: Load Image from URL

Download an image from a specified URL with appropriate headers to avoid request blocks.

url = "https://upload.wikimedia.org/wikipedia/commons/3/35/NOTA_Infographic.png"
headers = {'User-Agent': 'Mozilla/5.0'}
# A timeout keeps the request from hanging indefinitely on a stalled connection.
response = requests.get(url, headers=headers, timeout=30)

if response.status_code == 200:
    image = Image.open(BytesIO(response.content)).convert("RGB")
else:
    raise ValueError(f"Failed to download image, status code: {response.status_code}")
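A 200 status code does not guarantee that the payload is an image, since some servers return an HTML error page with a success status. As an optional safeguard, the sketch below inspects the leading bytes of the download before it is handed to an image library. The helper names are hypothetical and only PNG and JPEG signatures are covered.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def detect_image_format(data):
    """Return 'png' or 'jpeg' based on the file signature, or None if unrecognized."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpeg"
    return None


def require_image_bytes(data):
    """Raise ValueError unless the bytes start with a known image signature."""
    image_format = detect_image_format(data)
    if image_format is None:
        raise ValueError("Downloaded content does not look like a PNG or JPEG image")
    return image_format


# Fake payloads for demonstration only.
fake_png = PNG_SIGNATURE + b"rest-of-file"
fake_html = b"<html><body>Access denied</body></html>"

print(detect_image_format(fake_png))
print(detect_image_format(fake_html))
```

In the workflow above, `require_image_bytes(response.content)` could be called just before `Image.open(...)`.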

Step 5: Initialize Model and Processor

Load the SmolDocling-256M-preview processor and model to process the document image.

processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")

model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

Step 6: Create Input Messages

Prepare messages for the model, including both the image and text instructions.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
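If you plan to try more than one instruction, the message structure can be wrapped in a small helper. This is a sketch with a hypothetical function name. The default instruction is the one used in this step, and any other instruction string should be checked against the model card rather than assumed to be supported.

```python
def build_messages(instruction="Convert this page to docling."):
    """Build a single-turn chat message holding one image slot and one text instruction."""
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        },
    ]


messages = build_messages()
print(messages[0]["content"][1]["text"])
```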

Step 7: Prepare Inputs for Model Processing

Format the input message with the chat template and process it into tensors.

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

Step 8: Generate Structured Output

Use the model to generate structured document tags from the input image.

generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]

doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
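Because special tokens are kept during decoding and generation is capped by `max_new_tokens`, it can be useful to check whether the output looks complete before converting it. The sketch below is a diagnostic only. The `<end_of_utterance>` token and the closing `</doctag>` tag are assumptions about the output format, and the helper names are hypothetical.

```python
END_OF_UTTERANCE = "<end_of_utterance>"  # assumed end-of-turn token
CLOSING_TAG = "</doctag>"  # assumed closing tag of a complete page


def clean_doctags(raw_output):
    """Strip surrounding whitespace and a trailing end-of-turn token, if present."""
    text = raw_output.strip()
    if text.endswith(END_OF_UTTERANCE):
        text = text[: -len(END_OF_UTTERANCE)].rstrip()
    return text


def looks_complete(doctags):
    """Return True if the output ends with the closing tag, False if it may be truncated."""
    return doctags.endswith(CLOSING_TAG)


# Hand-written strings for demonstration, not real model output.
finished = clean_doctags("  <doctag><text>Hi</text></doctag><end_of_utterance>\n")
truncated = clean_doctags("<doctag><text>Hi")

print(looks_complete(finished))
print(looks_complete(truncated))
```

If the check fails on real output, a larger `max_new_tokens` value or a smaller page region may be worth trying.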

Step 9: Populate Docling Document

Convert the extracted document tags into a DocTagsDocument and then a DoclingDocument.

doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
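Once the document is populated, it can be exported, for example with `doc.export_to_markdown()`. The sketch below wraps that call in a hypothetical helper that writes the result to disk. So that it runs without the model, it uses a stand-in class that only mimics the `export_to_markdown()` method. The stand-in and the file name are illustrative assumptions.

```python
import tempfile
from pathlib import Path


def save_markdown(doc, output_path):
    """Write doc.export_to_markdown() to output_path and return the path."""
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(doc.export_to_markdown(), encoding="utf-8")
    return path


class StandInDocument:
    """Stand-in used only so this sketch runs without downloading the model."""

    def export_to_markdown(self):
        return "# Example title\n\nExample paragraph.\n"


with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_markdown(StandInDocument(), Path(tmp_dir) / "page.md")
    saved_text = saved_path.read_text(encoding="utf-8")
    print(saved_text)
```

In the workflow above, the call would be `save_markdown(doc, "page.md")` with the `doc` object created in this step.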

Real World Applications

SmolDocling’s versatility enables its application across various domains, including:

Business document processing: Automating invoice, contract, and report extraction.
Academic research: Digitizing and structuring scientific papers.
Technical documentation conversion: Preserving code snippets, formulas, and tables for software engineering workflows.
Patent and legal document analysis: Extracting structured insights from complex legal texts.

Final Words

SmolDocling demonstrates that compact, efficient models can compete with, and in some cases outperform, much larger counterparts in document conversion tasks. By introducing DocTags, it provides a structured, lossless representation of document content, which suits enterprise document pipelines. Its small footprint makes scalable, high-accuracy document conversion practical on modest hardware.

References

SmolDocling Research Paper

Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.
