IRIS is Scale’s OCR capability that transforms document images and PDFs into structured text. It provides a “layout-then-OCR” pipeline that combines layout detection, specialized OCR models, and assembly into structured output (Markdown or JSON) to extract meaningful information from complex documents.

Why Use IRIS?

Document processing often requires more than simple text extraction—you need to understand document structure, handle different content types, and preserve the relationships between elements. IRIS solves these challenges:
  • Layout-Aware Processing: Automatically detects and classifies different regions (text, tables, images, etc.) before applying specialized OCR
  • Multiple OCR Engines: Choose from 15+ OCR models including Tesseract, EasyOCR, PaddleOCR, Surya, and advanced vision-language models like GPT-4o and Gemini
  • Table-Specific Processing: Apply specialized OCR models optimized for tabular data extraction
  • Flexible Architecture: Easily extend with custom OCR models or layout detectors
  • Multi-Language Support: Handle documents in multiple languages, with specialized support for Arabic
  • Complete Pipeline Control: Inspect and configure each step—layout detection, content extraction, and assembly

How IRIS Works

IRIS follows a three-stage pipeline:

1. Layout Detection

Detects and classifies regions in document images with bounding boxes and content types:
  • Text regions for paragraphs and body content
  • Table regions for structured tabular data
  • Image regions for figures and diagrams
  • Other content types as needed
Supported layout detection models:
  • rt_detr_bce: RT-DETR model (best performance)
  • surya_bce: Surya layout detection
  • yolo_bce: YOLO-based detection
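
To make this stage concrete, a layout detector conceptually returns one record per region. The structure below is only an illustration (the class and field names are hypothetical, not the IRIS schema):

from dataclasses import dataclass

@dataclass
class DetectedRegion:
    label: str                                 # e.g. "text", "table", "image"
    bbox: tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels
    confidence: float                          # detector score in [0, 1]

# A single page might come back as a list of such regions:
page_regions = [
    DetectedRegion("text",  (72, 90, 540, 310), 0.97),
    DetectedRegion("table", (72, 330, 540, 620), 0.93),
    DetectedRegion("image", (300, 640, 540, 760), 0.88),
]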

2. OCR Processing

Applies specialized OCR models to each detected region based on content type:
  • Text OCR: Optimized for continuous text (paragraphs, headings, etc.)
  • Table OCR: Specialized models for accurate table extraction
  • Image Handling: Preserves image crops for inclusion in output
Available OCR engines include:
  • Open-source models: EasyOCR, Tesseract, PaddleOCR, Surya
  • Vision-language models: GPT-4o, Gemini (via LiteLLM), Qwen2-VL, Qwen2.5-VL
  • Specialized models: Arabic Nougat (small/base/large), QAARI, MBZUAI-AIN
  • Cloud services: Azure Computer Vision
  • Advanced models: SmolDocling
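
The routing step can be pictured as a simple dispatch on region type. This is a minimal sketch reusing the hypothetical DetectedRegion structure above; the real pipeline also handles cropping, batching, and model loading, and the recognize() method name is an assumption:

def ocr_region(region, page_image, text_engine, table_engine):
    """Route one detected region to the appropriate engine (illustrative only)."""
    crop = page_image.crop(region.bbox)      # cut the region out of the page image
    if region.label == "table":
        return table_engine.recognize(crop)  # table-specialized OCR
    if region.label == "text":
        return text_engine.recognize(crop)   # general text OCR
    return crop                              # keep image crops for the assembler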

3. Assembly

Combines extracted content back into useful formats:
  • MarkdownAssembler: Converts to structured markdown (default)
  • JSONAssembler: Outputs detailed JSON for debugging and inspection
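
A markdown assembler can be thought of as walking the per-region results in reading order and joining them into one document. The sketch below is illustrative only and assumes (region, content) pairs from the previous stage; it is not the IRIS implementation:

def assemble_markdown(region_results):
    """Join per-region OCR outputs into a single markdown string (sketch)."""
    parts = []
    for region, content in region_results:   # assumed (region, ocr_output) pairs
        if region.label in ("text", "table"):
            parts.append(content)            # text and table OCR yield markdown-ready strings
        else:
            parts.append("![figure](crops/figure.png)")  # placeholder link to the saved crop
    return "\n\n".join(parts)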

Core Features

Modular Architecture

Each pipeline component (layout detector, OCR model, assembler) is independently configurable and extensible. Add custom models by implementing simple adapter interfaces.
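
For example, a new OCR engine can be plugged in by wrapping its library behind a small adapter. The sketch below wraps pytesseract purely as an illustration; the actual IRIS adapter base classes and the recognize() method name are assumptions:

import pytesseract  # stand-in third-party engine; any OCR library could be wrapped the same way

class TesseractAdapter:
    """Expose a uniform recognize() interface over a third-party OCR library (sketch)."""

    def __init__(self, lang: str = "eng"):
        self.lang = lang

    def recognize(self, image) -> str:
        # Delegate to the wrapped library and return plain text.
        return pytesseract.image_to_string(image, lang=self.lang)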

Inspection and Debugging

Save intermediate results at each pipeline stage to understand and optimize processing:
  • Layout detection bounding boxes and classifications
  • Individual OCR outputs per region
  • Final assembled output
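
If you are inspecting stages programmatically, a pattern like the following keeps each stage’s output on disk for later review (an illustrative sketch; in IRIS itself this is presumably controlled by the output_dir and save_intermediate options shown in the configuration below):

import json
from pathlib import Path

def dump_stage(output_dir: str, stage: str, payload) -> None:
    """Persist one pipeline stage's output as JSON for later inspection (sketch)."""
    out = Path(output_dir) / f"{stage}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2, default=str))

# e.g. dump_stage("results", "layout", [{"label": "table", "bbox": [72, 330, 540, 620]}])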

Configuration Management

Use command-line arguments or YAML configuration files for reproducible processing:
layout: rt_detr_bce
text_ocr: easyocr
table_ocr: gemini
output_dir: results
save_intermediate: true
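
If you prefer to assemble the configuration in code rather than on the command line, the same YAML can be loaded into a plain dictionary (a minimal sketch using PyYAML; the file name is arbitrary, and how the dict is handed to the pipeline depends on your entry point):

import yaml

with open("iris_config.yaml") as f:   # the YAML shown above, saved to a file
    config = yaml.safe_load(f)

print(config["layout"], config["text_ocr"], config["table_ocr"])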

Performance Optimization

First-time model usage downloads weights; subsequent runs use cached models for faster processing.

Common Use Cases

  • Document Digitization: Convert scanned documents and PDFs to searchable text
  • Form Processing: Extract structured data from forms with mixed content types
  • Table Extraction: Accurately capture tabular data from complex documents
  • Multi-Language Documents: Process documents in various languages including Arabic
  • Research and Analysis: Extract text and data from academic papers, reports, and technical documents
  • Archive Processing: Batch process large document collections

Getting Started

IRIS is integrated with Dex for seamless document processing:

Python Package

Use IRIS programmatically through the Dex SDK:
from dex_sdk.client import DexClient
from dex_core.models.parse_job import IrisParseEngineOptions, IrisParseJobParams

# Initialize client and project
dex_client = DexClient(base_url="your-dex-url")
project = await dex_client.create_project(name="ocr-project")

# Upload and parse document
dex_file = await project.upload_file("document.pdf")
parse_job = await dex_file.start_parse_job(
    IrisParseJobParams(options=IrisParseEngineOptions())
)
result = await parse_job.get_parse_result()
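
Note that the snippet above uses await at the top level, which only works in an async context such as a notebook. In a standalone script, wrap the same calls in an async function and run it with asyncio.run:

import asyncio

from dex_sdk.client import DexClient
from dex_core.models.parse_job import IrisParseEngineOptions, IrisParseJobParams

async def main():
    dex_client = DexClient(base_url="your-dex-url")
    project = await dex_client.create_project(name="ocr-project")

    dex_file = await project.upload_file("document.pdf")
    parse_job = await dex_file.start_parse_job(
        IrisParseJobParams(options=IrisParseEngineOptions())
    )
    return await parse_job.get_parse_result()

result = asyncio.run(main())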

Language Support

IRIS supports multi-language OCR with models optimized for different language families. Specialized support includes:
  • Arabic: Arabic Nougat models, QAARI, MBZUAI-AIN, Gemini
  • Multi-language: EasyOCR, Tesseract, PaddleOCR with broad language coverage
  • Universal: Vision-language models (GPT-4o, Gemini, Qwen) supporting many languages

Key Advantages

Extensibility

Designed for easy extension. Add support for new OCR libraries, layout detectors, or assembly formats through simple adapter classes.

Transparency

Inspect every pipeline stage to understand model behavior, debug issues, and optimize results for your specific documents.

Flexibility

Mix and match components—use different OCR models for text vs. tables, choose layout detectors based on document type, and customize assembly output.

API Integration

Vision-language models (Gemini, GPT-4o) use LiteLLM for unified API access, making it easy to leverage advanced AI capabilities.
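
In practice this means the usual provider credentials are picked up from the environment. For example, LiteLLM reads OPENAI_API_KEY for GPT-4o and GEMINI_API_KEY for Gemini; set whichever providers you actually plan to use (values below are placeholders):

import os

# Credentials for the vision-language OCR engines, read by LiteLLM at call time.
os.environ["OPENAI_API_KEY"] = "sk-..."   # for GPT-4o
os.environ["GEMINI_API_KEY"] = "..."      # for Gemini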

Next Steps

To start using IRIS:
  1. Review the Using IRIS with Dex guide for integration details
  2. Set up your Dex client and SGP credentials
  3. Upload documents and start parse jobs
  4. Configure OCR options for your specific document types
  5. Integrate IRIS into your document processing workflows
IRIS provides the building blocks for robust document OCR while giving you complete control over the processing pipeline.