Document conversion has traditionally relied on complex model ensembles or resource-heavy vision-language models (VLMs). SmolDocling, a 256M-parameter VLM, offers efficient, end-to-end multimodal conversion. It processes full document pages and preserves elements such as text, tables, and layout using DocTags, a markup format introduced alongside the model, which makes it a compact yet capable solution. This article explores SmolDocling's architecture, capabilities, and implementation in detail.
Table of Contents
- What is SmolDocling?
- Understanding the Architecture
- Key Features and Capabilities
- Step by Step Implementation Guide
- Real World Applications
Let’s start by understanding what SmolDocling is.
What is SmolDocling?
Traditionally, document conversion has relied on either resource-intensive large VLMs or intricate ensemble-based methods. Ensemble approaches, which combine specialized models for tasks such as layout analysis and OCR, are hard to generalize and fine-tune. Single-shot conversion is possible with large VLMs like GPT-4o and Qwen2.5-VL, but they need a significant amount of processing power.
DocTags format
SmolDocling addresses these constraints with an architecture that balances accuracy and efficiency. It bridges the gap between computationally costly large VLMs and specialized ensemble models by drastically lowering computational overhead while remaining competitive with state-of-the-art systems.
Understanding the Architecture
SmolDocling builds on SmolVLM-256M, which pairs a 93M-parameter SigLIP base vision encoder with a 135M-parameter SmolLM-2 language model. Pixel shuffling compresses the encoder's visual features into fewer tokens, and the language model then predicts DocTags autoregressively. DocTags is a markup format that represents document elements such as text, lists, and charts. With this design, SmolDocling is reported to rival or surpass VLMs up to 27 times larger while significantly reducing computational demands, making it a resource-efficient solution for document understanding.
SmolDocling/SmolVLM architecture.
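To make the pixel-shuffling step more concrete, here is a minimal pure-Python sketch of a space-to-depth rearrangement and the token count it implies. The tile size (512), patch size (16), and shuffle ratio (4) are assumptions chosen for illustration, and the channel ordering inside the real SmolVLM implementation may differ. The function names are hypothetical.

```python
def visual_token_count(image_size=512, patch_size=16, shuffle_ratio=4):
    """Visual tokens left for one square tile after pixel shuffling.

    All three defaults are illustrative assumptions, not confirmed settings.
    """
    patches_per_side = image_size // patch_size
    if patches_per_side % shuffle_ratio != 0:
        raise ValueError("patch grid must be divisible by the shuffle ratio")
    return (patches_per_side // shuffle_ratio) ** 2


def pixel_shuffle(grid, ratio):
    """Merge each ratio x ratio block of patch embeddings into one longer vector.

    grid is a list of rows; every cell is a list of floats (one embedding).
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid dimensions must be divisible by ratio")
    shuffled = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            merged = []
            for dy in range(ratio):
                for dx in range(ratio):
                    merged.extend(grid[top + dy][left + dx])
            row.append(merged)
        shuffled.append(row)
    return shuffled


# Toy 4x4 grid of 2-dimensional embeddings: each cell stores [row, column].
toy_grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
shuffled = pixel_shuffle(toy_grid, ratio=2)

print(len(shuffled), len(shuffled[0]), len(shuffled[0][0]))  # 2 2 8
print(visual_token_count())  # 64
```

The sequence gets shorter by a factor of ratio squared while each token gets wider by the same factor, which is why the language model has far fewer visual tokens to attend over.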
Key Features and Capabilities
SmolDocling introduces several key innovations that enhance document conversion:
- End-to-end document parsing: Processes entire pages instead of handling elements separately.
- Multimodal element recognition: Extracts tables, charts, equations, lists, and code snippets with precise formatting.
- High-accuracy text recognition (OCR-free): Achieves superior F1-scores compared to leading VLMs in structured text recognition.
- Compact model size: Reduces memory and computation requirements, making it deployable on standard hardware.
- Optimized training with DocTags: Captures document layout and spatial relationships between elements efficiently.
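To illustrate how a DocTags-style string can carry both content and layout, the sketch below parses a hand-written sample with a regular expression. Everything here is an assumption for illustration: the sample is not real model output, the tag names and the four-token `<loc_…>` bounding-box layout follow my understanding of the format, the 500-unit coordinate grid is assumed, and `parse_elements` is a hypothetical helper, not the docling_core parser.

```python
import re

# Hand-written, illustrative DocTags-style snippet (NOT real model output).
SAMPLE = (
    "<doctag>"
    "<section_header_level_1><loc_50><loc_40><loc_450><loc_70>"
    "Quarterly Report</section_header_level_1>"
    "<text><loc_50><loc_90><loc_450><loc_200>"
    "Revenue grew in the third quarter.</text>"
    "</doctag>"
)

# One element: opening tag, four location tokens, body, matching closing tag.
ELEMENT = re.compile(
    r"<(?P<tag>\w+)>"
    r"<loc_(?P<x1>\d+)><loc_(?P<y1>\d+)><loc_(?P<x2>\d+)><loc_(?P<y2>\d+)>"
    r"(?P<body>.*?)"
    r"</(?P=tag)>",
    re.DOTALL,
)


def parse_elements(doctags, page_width, page_height, grid=500):
    """Extract tag, text, and a pixel bounding box for each located element.

    grid is the assumed resolution of the location tokens.
    """
    elements = []
    for match in ELEMENT.finditer(doctags):
        x1, y1, x2, y2 = (int(match.group(k)) for k in ("x1", "y1", "x2", "y2"))
        elements.append({
            "tag": match.group("tag"),
            "text": match.group("body").strip(),
            "bbox_px": (
                x1 * page_width / grid,
                y1 * page_height / grid,
                x2 * page_width / grid,
                y2 * page_height / grid,
            ),
        })
    return elements


for element in parse_elements(SAMPLE, page_width=1000, page_height=1000):
    print(element["tag"], element["bbox_px"], element["text"])
```

In practice you would let docling_core do this parsing, as shown in the implementation guide; the sketch is only meant to show why a single token sequence can describe what an element is, where it sits, and what it says.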
Step by Step Implementation Guide
Want to test SmolDocling? Follow these steps:
Step 1: Install Dependencies
!pip install torch
!pip install docling_core
!pip install transformers
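Before moving on, it can help to confirm that every module the guide imports is actually available. The later steps also import `requests` and `PIL`, which are not installed explicitly above; they often arrive as dependencies of the other packages, but that is worth verifying in your own environment. The sketch below uses only the standard library, and `missing_modules` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Top-level import names used in this guide.
# "PIL" is the import name of the "pillow" package.
REQUIRED_MODULES = ["torch", "transformers", "docling_core", "requests", "PIL"]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported missing, install the corresponding package with pip before continuing.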
Step 2: Import Required Libraries
We begin by importing the necessary Python libraries for processing images, making HTTP requests, and handling AI models.
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
import requests
from PIL import Image
from io import BytesIO
Step 3: Set Up Device Configuration
Detect whether CUDA (GPU acceleration) is available and set the computation device accordingly.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
Step 4: Load Image from URL
Download an image from a specified URL with appropriate headers to avoid request blocks.
url = "https://upload.wikimedia.org/wikipedia/commons/3/35/NOTA_Infographic.png"
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, stream=True)
if response.status_code == 200:
    image = Image.open(BytesIO(response.content)).convert("RGB")
else:
    raise ValueError(f"Failed to download image, status code: {response.status_code}")
Step 5: Initialize Model and Processor
Load the SmolDocling-256M-preview processor and model to process the document image.
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
).to(DEVICE)
Step 6: Create Input Messages
Prepare messages for the model, including both the image and text instructions.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]
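Each `{"type": "image"}` entry acts as a placeholder for one image that is passed to the processor later. My understanding is that the number of placeholders should match the number of images supplied; treat that as an assumption and check it against the processor documentation. A small sanity check, with hypothetical helper names, could look like this:

```python
def count_image_slots(messages):
    """Count {"type": "image"} entries across all messages."""
    return sum(
        1
        for message in messages
        for part in message.get("content", [])
        if isinstance(part, dict) and part.get("type") == "image"
    )


def check_images_match(messages, images):
    """Raise ValueError if placeholders and supplied images differ in number."""
    slots = count_image_slots(messages)
    if slots != len(images):
        raise ValueError(
            f"{slots} image placeholder(s) in messages "
            f"but {len(images)} image(s) supplied"
        )
    return slots


example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."},
        ],
    },
]

# A string stands in for the PIL image here; only the count matters.
print(check_images_match(example_messages, images=["page-1"]))  # 1
```

Running a check like this before calling the processor turns a confusing downstream error into a clear message.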
Step 7: Prepare Inputs for Model Processing
Format the input message with the chat template and process it into tensors.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
Step 8: Generate Structured Output
Use the model to generate structured document tags from the input image.
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()
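The slice at `prompt_length` is there because the generated sequence starts with the prompt tokens, so decoding it untrimmed would echo the prompt back into the output. The toy example below mirrors `generated_ids[:, prompt_length:]` with plain Python lists; the token IDs are made up and `trim_prompt` is a hypothetical helper.

```python
def trim_prompt(generated_rows, prompt_length):
    """Drop the first prompt_length token IDs from every row of a batch."""
    return [row[prompt_length:] for row in generated_rows]


prompt_ids = [101, 7, 8, 9]   # made-up IDs standing in for the encoded prompt
new_ids = [42, 43, 44]        # made-up IDs standing in for generated DocTags

# The generated row contains the prompt followed by the continuation.
generated = [prompt_ids + new_ids]

trimmed = trim_prompt(generated, prompt_length=len(prompt_ids))
print(trimmed)  # [[42, 43, 44]]
```

Only the continuation is then decoded, which is what the DocTags parser in the next step expects.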
Step 9: Populate Docling Document
Convert the extracted document tags into a DocTagsDocument and then a DoclingDocument.
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)
Output
Real World Applications
SmolDocling’s versatility enables its application across various domains, including:
- Business document processing: Automating invoice, contract, and report extraction.
- Academic research: Digitizing and structuring scientific papers.
- Technical documentation conversion: Preserving code snippets, formulas, and tables for software engineering workflows.
- Patent and legal document analysis: Extracting structured insights from complex legal texts.
Final Words
SmolDocling demonstrates that compact, efficient models can match or outperform much larger counterparts in document conversion tasks. By introducing DocTags, it provides a structured representation of document content that suits enterprise applications. Its approach points toward scalable, high-accuracy document conversion in AI-driven automation workflows.