A Hands on Guide to Compact Vision Language Models using SmolDocling

SmolDocling, a 256M VLM, enables efficient document conversion using DocTags to preserve structure while reducing computation.

Document conversion is challenging, traditionally relying on complex ensembles or resource-heavy VLMs. SmolDocling, a 256M parameter VLM, offers efficient, end-to-end multimodal conversion. 1 It processes full document pages, preserving elements like text, tables, and layouts using DocTags, a novel markup format, providing a compact yet powerful solution. This article explores SmolDocling’s architecture, capabilities, and implementation in detail.

Table of Content

  1. What is SmolDocling?
  2. Understanding the Architecture
  3. Key Features and Capabilities
  4. Step by Step Implementation Guide
  5. Real World Applications

Let’s start by understanding what SmolDocling is.

What is SmolDocling?

Traditionally, document conversion has relied on either resource intensive big VLMs or intricate ensemble based methods. Ensemble approaches, which combine specialized models like layout analysis and OCR, have trouble generalizing and finetuning. Single shot conversion is possible with large VLMs like GPT-4o and Qwen2.5-VL, but they need a significant amount of processing power.

DocTags format

DocTags format

SmolDocling uses an optimized architecture that strikes a compromise between accuracy and efficiency in order to overcome these constraints. This simplified method bridges the gap between computationally costly big VLMs and specialized ensemble models by drastically lowering computational overhead while preserving state of the art performance.

Understanding the Architecture

SmolVLM-256M utilizes a 93M-parameter SigLIP base encoder for efficient image compression via pixel shuffling and a 135M-parameter SmolLM-2 language model for autoregressive DocTag prediction. This design enables high performance with a compact architecture. DocTags, a lossless markup system, accurately represents document elements such as text, lists, and charts. This architecture allows SmolDocling to rival or surpass VLMs up to 27 times larger while significantly reducing computational demands, making it a resource efficient solution for document understanding.

SmolDocling/SmolVLM architecture.

SmolDocling/SmolVLM architecture.

Key Features and Capabilities

SmolDocling introduces several key innovations that enhance document conversion:

  • End to end document parsing: Processes entire pages instead of handling elements separately.
  • Multimodal element recognition: Extracts tables, charts, equations, lists, and code snippets with precise formatting.
  • High accuracy text recognition (OCR-free): Achieves superior F1-scores compared to leading VLMs in structured text recognition.
  • Compact model size: Reduces memory and computation requirements, making it deployable on standard hardware.
  • Optimized training with DocTags: Captures document layout and spatial relationships between elements efficiently.

Step by Step Implementation Guide

Want to test SmolDocling? Follow these steps:

Step 1. Install Dependencies

Step 2: Import Required Libraries

We begin by importing the necessary Python libraries for processing images, making HTTP requests, and handling AI models.

Step 3: Set Up Device Configuration

Detect whether CUDA (GPU acceleration) is available and set the computation device accordingly.

Step 4: Load Image from URL

Download an image from a specified URL with appropriate headers to avoid request blocks.

Step 5: Initialize Model and Processor

Load the SmolDocling-256M-preview processor and model to process the document image.

Step 6: Create Input Messages

Prepare messages for the model, including both the image and text instructions.

Step 7: Prepare Inputs for Model Processing

Format the input message with the chat template and process it into tensors.

Step 8: Generate Structured Output

Use the model to generate structured document tags from the input image.

Step 9: Populate Docling Document

Convert the extracted document tags into a DocTagsDocument and then a DoclingDocument.

Output

Real World Applications

SmolDocling’s versatility enables its application across various domains, including:

  • Business document processing: Automating invoice, contract, and report extraction.
  • Academic research: Digitizing and structuring scientific papers.
  • Technical documentation conversion: Preserving code snippets, formulas, and tables for software engineering workflows.
  • Patent and legal document analysis: Extracting structured insights from complex legal texts.

Final Words

SmolDocling demonstrates that compact, efficient models can outperform larger counterparts in document conversion tasks. By introducing DocTags, it provides a structured, lossless representation of document content, making it ideal for enterprise applications. With its groundbreaking approach, SmolDocling paves the way for scalable, high accuracy document conversion in the era of AI driven automation.

References

SmolDocling Research Paper

Picture of Aniruddha Shrikhande

Aniruddha Shrikhande

Aniruddha Shrikhande is an AI enthusiast and technical writer with a strong focus on Large Language Models (LLMs) and generative AI. Committed to demystifying complex AI concepts, he specializes in creating clear, accessible content that bridges the gap between technical innovation and practical application. Aniruddha's work explores cutting-edge AI solutions across various industries. Through his writing, Aniruddha aims to inspire and educate, contributing to the dynamic and rapidly expanding field of artificial intelligence.

The Chartered Data Scientist Designation

Achieve the highest distinction in the data science profession.

Elevate Your Team's AI Skills with our Proven Training Programs

Strengthen Critical AI Skills with Trusted Generative AI Training by Association of Data Scientists.

Our Accreditations

Get global recognition for AI skills

Chartered Data Scientist (CDS™)

The highest distinction in the data science profession. Not just earn a charter, but use it as a designation.

Certified Data Scientist - Associate Level

Global recognition of data science skills at the beginner level.

Certified Generative AI Engineer

An upskilling-linked certification initiative designed to recognize talent in generative AI and large language models

Join thousands of members and receive all benefits.

Become Our Member

We offer both Individual & Institutional Membership.