Extracting structured data from images is a useful step in building datasets for fine-tuning large language models (LLMs) on domain-specific tasks. This hands-on guide shows how to turn receipt images into JSON datasets using Outlines, a Python library for structured generation with LLMs. Using a vision-language model (VLM) such as Qwen2-VL or Pixtral, you will process images, extract key details such as store names, item lists, and totals, and produce structured outputs ready for fine-tuning. Whether you’re a researcher or developer, this guide gives you practical tools to streamline dataset creation.
Table of Contents
- Introduction to Outlines
- Understanding Outlines’ Architecture
- Practical Implementation
Introduction to Outlines
Outlines is a Python library, built by .txt, that simplifies text generation workflows with large language models (LLMs). Its focus is structured generation: it guarantees that outputs follow a required format, such as valid JSON or text matching a regular expression.
It supports both OpenAI and open-source models through integrations with Transformers, llama.cpp, and others, which makes it suitable for production use.
With prompt templating, JSON schema compliance, and compatibility with the wider ecosystem, Outlines helps developers build reliable LLM applications with minimal overhead during inference.
Understanding Outlines’ Architecture
Outlines uses a structured generation framework to guide language models in producing text that conforms to predefined rules, such as JSON schemas or regex patterns. Unlike traditional LLM workflows that consider all potential tokens at every step, Outlines restricts generation to legal tokens only. This is achieved through integration with finite-state machines or grammar-based automata, ensuring that the output aligns with strict structural requirements.
In structured generation, rules such as regex patterns are compiled into automata. For instance, a regex for decimal numbers (^\d*(\.\d+)?$) is converted into a finite-state machine that defines all permissible token transitions. If the sequence generated so far is “748”, the automaton identifies the valid next steps: another digit, a decimal point, or the end of the sequence. This mapping ensures that only valid transitions occur.
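To make the automaton idea concrete, here is a minimal character-level sketch of the decimal-number pattern. It is an illustration only: the state numbering and the helper names `run` and `allowed_next` are hypothetical, and a real structured-generation library works over the tokenizer’s vocabulary rather than single characters.

```python
import re
from typing import Optional, Set, Tuple

# The pattern from the text, used here only to cross-check the automaton.
DECIMAL = re.compile(r"^\d*(\.\d+)?$")
DIGITS = "0123456789"

# Hand-built automaton for the same pattern (illustrative state numbering).
# State 0: integer part, state 1: just read ".", state 2: fractional part.
TRANSITIONS = {
    0: {**{d: 0 for d in DIGITS}, ".": 1},
    1: {d: 2 for d in DIGITS},
    2: {d: 2 for d in DIGITS},
}
ACCEPTING = {0, 2}  # states in which the sequence may legally end


def run(prefix: str) -> Optional[int]:
    """Return the state reached after reading prefix, or None if it is illegal."""
    state = 0
    for ch in prefix:
        state = TRANSITIONS[state].get(ch)
        if state is None:
            return None
    return state


def allowed_next(prefix: str) -> Tuple[Set[str], bool]:
    """Return the legal next characters and whether the prefix may end here."""
    state = run(prefix)
    if state is None:
        return set(), False
    return set(TRANSITIONS[state]), state in ACCEPTING


chars, can_stop = allowed_next("748")
print(sorted(chars), can_stop)
```

After “748” the automaton allows any digit or a decimal point and also permits stopping. After “748.” only digits are allowed and stopping is not, because the pattern requires at least one digit after the point.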
During each generation step, the model processes the current sequence to produce logits, the unnormalized scores from which next-token probabilities are computed. Outlines modifies these logits so that every illegal token ends up with a probability of zero, typically by setting its logit to negative infinity before the softmax. This filtering narrows the token space to valid options only, from which the next token is sampled. For example, continuing “748” under the decimal number pattern may yield “748.92”, which adheres to the rules defined by the automaton.
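The masking step can be sketched in a few lines of plain Python. The toy vocabulary, the scores, and the helper names `mask_logits` and `softmax` below are made up for illustration; this is not Outlines’ internal code, only the general technique of masking logits before normalizing them.

```python
import math
from typing import Dict, Set


def mask_logits(logits: Dict[str, float], allowed: Set[str]) -> Dict[str, float]:
    """Keep the score of allowed tokens and set every other score to -inf."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Turn logits into probabilities; tokens at -inf get probability 0."""
    peak = max(logits.values())
    if peak == -math.inf:
        raise ValueError("no legal token left to sample from")
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Toy scores for the token that follows "748" (values are invented).
logits = {"5": 1.2, ".": 0.3, "<eos>": 0.1, "abc": 2.5}
allowed = {"5", ".", "<eos>"}  # what the decimal-number automaton permits

masked = mask_logits(logits, allowed)
probs = softmax(masked)
greedy_choice = max(masked, key=masked.get)
print(probs, greedy_choice)
```

Without the mask, the highest-scoring token would be “abc”, which breaks the pattern. With the mask, its probability is zero and greedy selection falls back to the best legal token.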
The structured generation process provides robust outputs for use cases requiring strict formatting, such as JSON APIs or structured document creation. By combining automata-based constraints with dynamic logits processing, Outlines ensures both reliability and precision in text generation.
Practical Implementation
Step 1: Install Required Libraries
Install the dependencies, including Outlines, Transformers, and supporting Python libraries.
We pin Outlines to 0.1.3 because, at the time of writing, the latest release was unstable when using the outlines.generate.json function.
!pip install outlines==0.1.3 torch==2.5.1 transformers accelerate pillow rich
Step 2: Import Necessary Libraries
Import the modules for handling language models, image processing, and structured data representation:
from typing import List, Literal, Optional  # type hints used by the schema in Step 5

import outlines
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from pydantic import BaseModel, Field
from PIL import Image
import requests
from rich import print
Step 3: Initialize the Model
Define the model class and initialize the vision-language model:
This guide uses Qwen2-VL-2B-Instruct, a 2-billion-parameter model that gives decent results. For more accurate output, consider a larger vision-language model such as Qwen2-VL-72B-Instruct, which has 72 billion parameters, or any other model that fits your hardware.
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model_class = Qwen2VLForConditionalGeneration
model = outlines.models.transformers_vision(
    model_name,
    model_class=model_class,
    model_kwargs={
        "device_map": "auto",
        "torch_dtype": torch.bfloat16,
    },
    processor_kwargs={
        "device": "cuda",  # Set to "cpu" if GPU is unavailable
    },
)
Step 4: Image Preprocessing
Load the image and downscale it if needed so that it fits the model’s input size:
def load_and_resize_image(image_path, max_size=1024):
    image = Image.open(image_path)
    width, height = image.size
    # Scale so the longer side is at most max_size; never upscale
    scale = min(max_size / width, max_size / height)
    if scale < 1:
        new_width = int(width * scale)
        new_height = int(height * scale)
        image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
    return image

# Download the image from a URL, save it locally, and resize it
image_path = "Image_URL"  # placeholder: replace with the URL of your receipt image
response = requests.get(image_path)
response.raise_for_status()  # stop early if the download failed
with open("receipt.png", "wb") as f:
    f.write(response.content)
image = load_and_resize_image("receipt.png")
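The resizing arithmetic can be checked without loading an image. The helper below, `scaled_size`, is a hypothetical stand-in that repeats only the size calculation from `load_and_resize_image`, so the effect of `max_size` is easy to inspect for a given resolution.

```python
from typing import Tuple


def scaled_size(width: int, height: int, max_size: int = 1024) -> Tuple[int, int]:
    """Return the dimensions an image would have after the resize step."""
    # Same rule as the loader: fit the longer side into max_size, never upscale.
    scale = min(max_size / width, max_size / height)
    if scale >= 1:
        return width, height
    return int(width * scale), int(height * scale)


print(scaled_size(2048, 4096))  # portrait photo, longer side is the height
print(scaled_size(4096, 2048))  # landscape photo, longer side is the width
print(scaled_size(800, 600))    # already small enough, left unchanged
```

A 2048 x 4096 image is scaled by 0.25 to 512 x 1024, which preserves the aspect ratio. Because `int()` truncates, other input sizes can come out one pixel short on a side, which is harmless here.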
Input: [receipt image]
Step 5: Define Schema with Pydantic
Create data classes for receipt information extraction:
class Item(BaseModel):
    name: str
    quantity: Optional[int]
    price_per_unit: Optional[float]
    total_price: Optional[float]

class ReceiptSummary(BaseModel):
    store_name: str
    store_address: str
    store_number: Optional[int]
    items: List[Item]
    tax: Optional[float]
    total: Optional[float]
    date: Optional[str] = Field(pattern=r'\d{4}-\d{2}-\d{2}', description="Date in the format YYYY-MM-DD")
    payment_method: Literal["cash", "credit", "debit", "check", "other"]
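The date pattern constrains only the shape of the string: four digits, two digits, two digits. A value such as 2024-13-45 satisfies it even though it is not a calendar date. If that matters for your dataset, a small post-check like the following sketch can help; the function name `is_real_date` is hypothetical and the check uses only the standard library.

```python
import re
from datetime import datetime

# Same shape constraint as the schema's date field.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_real_date(value: str) -> bool:
    """True if value has the YYYY-MM-DD shape and names an actual calendar day."""
    if not DATE_PATTERN.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for candidate in ["2024-11-05", "2024-13-45", "05/11/2024"]:
    print(candidate, is_real_date(candidate))
```

The first value passes both checks, the second matches the pattern but fails date parsing, and the third fails the pattern itself.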
Step 6: Generate the Prompt
Prepare a detailed prompt to feed into the model:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {
                "type": "text",
                "text": f""" You are an expert at extracting information from receipts. Please extract the information from the receipt. Be as detailed as possible -- missing or misreporting information is a crime.
Return the information in the following JSON schema:
{ReceiptSummary.model_json_schema()}
"""
            },
        ],
    }
]
processor = AutoProcessor.from_pretrained(model_name)
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
Step 7: Set Up the Generator
Create a structured JSON generator using the Outlines library:
receipt_summary_generator = outlines.generate.json(
    model,
    ReceiptSummary,
    sampler=outlines.samplers.greedy()  # Ensures deterministic output
)
Step 8: Generate the Output
Process the prompt and the image to extract receipt data:
result = receipt_summary_generator(prompt, [image])
print(result)
These steps enable you to load a receipt image, process it using structured generation with Outlines, and extract detailed information in a JSON format.
Output: [printed ReceiptSummary result]
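Structured generation guarantees that the result matches the schema, but not that the numbers were read correctly from the image. A simple arithmetic cross-check can flag suspicious receipts before they enter a dataset. The sketch below is one possible approach, not part of Outlines: the function name `find_inconsistencies` and the tolerance are hypothetical, and it assumes that the total equals the sum of the item totals plus tax, which will not hold for receipts with discounts or tax-inclusive prices.

```python
from typing import Any, Dict, List


def find_inconsistencies(receipt: Dict[str, Any], tolerance: float = 0.01) -> List[str]:
    """Return human-readable warnings for amounts that do not add up."""
    problems = []
    subtotal = 0.0
    for item in receipt.get("items", []):
        qty = item.get("quantity")
        unit = item.get("price_per_unit")
        line = item.get("total_price")
        # Check quantity x unit price only when all three values were extracted.
        if None not in (qty, unit, line) and abs(qty * unit - line) > tolerance:
            problems.append(f"{item['name']}: {qty} x {unit} != {line}")
        if line is not None:
            subtotal += line
    total = receipt.get("total")
    tax_value = receipt.get("tax") or 0.0
    # Assumption: total = item totals + tax (no discounts or included tax).
    if total is not None and abs(subtotal + tax_value - total) > tolerance:
        problems.append(
            f"items {subtotal:.2f} + tax {tax_value:.2f} != total {total:.2f}"
        )
    return problems


# Hand-written sample in the shape of the schema's fields (not model output).
sample = {
    "items": [
        {"name": "Milk", "quantity": 2, "price_per_unit": 1.50, "total_price": 3.00},
        {"name": "Bread", "quantity": 1, "price_per_unit": 2.25, "total_price": 2.25},
    ],
    "tax": 0.42,
    "total": 5.67,
}
print(find_inconsistencies(sample))
```

With the Pydantic object from Step 8, the same check could be run as `find_inconsistencies(result.model_dump())`.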
Step 9: Save the Output to a JSON File
Save the extracted receipt data to a .json file for further use:
import json
from datetime import datetime
from typing import Optional

def save_receipt_to_json(receipt: ReceiptSummary, filepath: Optional[str] = None) -> str:
    """
    Save a ReceiptSummary object to a JSON file.

    Args:
        receipt: ReceiptSummary object to save
        filepath: Optional custom filepath. If None, a filename is generated automatically

    Returns:
        str: Path where the file was saved
    """
    # If no filepath is provided, generate one from the store name and date
    if filepath is None:
        # Keep only alphanumeric characters so the store name is filename-safe
        store_name = "".join(x for x in receipt.store_name if x.isalnum())
        # Use the receipt date if present, otherwise today's date in the same YYYY-MM-DD format
        date_str = receipt.date if receipt.date else datetime.now().strftime("%Y-%m-%d")
        filepath = f"receipt_{store_name}_{date_str}.json"
    # Convert to a JSON-serializable dict
    receipt_dict = json.loads(receipt.model_dump_json())
    # Save to file with readable formatting
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(receipt_dict, f, indent=2, ensure_ascii=False)
    return filepath

# Save the result
output_file = save_receipt_to_json(result)
print(f"Receipt saved to: {output_file}")
This step ensures that the structured data generated by the model is stored in a JSON file, making it easy to share, process, or integrate with other applications.
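Once several receipts have been saved, they can be gathered into a single JSON Lines file, a format many fine-tuning pipelines accept. The sketch below is only one possible layout: the function name `build_jsonl`, the record keys (`source_file`, `instruction`, `output`), and the instruction text are assumptions for illustration, so adapt them to whatever your training framework expects.

```python
import json
from pathlib import Path


def build_jsonl(receipt_dir: str, output_path: str,
                instruction: str = "Extract the receipt details as JSON.") -> int:
    """Collect receipt_*.json files into one JSONL file; return the record count."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        # Sort for a stable, reproducible record order.
        for path in sorted(Path(receipt_dir).glob("receipt_*.json")):
            with open(path, encoding="utf-8") as f:
                receipt = json.load(f)
            record = {
                "source_file": path.name,  # lets you trace a record to its file
                "instruction": instruction,
                "output": json.dumps(receipt, ensure_ascii=False),
            }
            # One JSON object per line, as JSONL requires.
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `build_jsonl(".", "receipts_dataset.jsonl")` would gather every receipt file saved in the current directory. The output name ends in .jsonl, so it is not picked up by the `receipt_*.json` pattern on a later run.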
Final Words
This guide showed how to build a structured JSON dataset from images using Outlines. Because structured generation restricts the model to valid tokens at every step, each output conforms to the schema, which keeps the dataset well-formed for LLM fine-tuning. Saving each result as a JSON file makes the data straightforward to feed into training pipelines, and the same steps can be repeated over larger collections of images.