IRIS is Scale’s OCR capability that transforms document images and PDFs into structured text. It provides a “layout-then-OCR” pipeline that combines layout detection, specialized OCR models, and assembly into structured output (Markdown or JSON) to extract meaningful information from complex documents.

Why Use IRIS?

Document processing often requires more than simple text extraction—you need to understand document structure, handle different content types, and preserve the relationships between elements. IRIS solves these challenges:
  • Layout-Aware Processing: Automatically detects and classifies different regions (text, tables, images, etc.) before applying specialized OCR
  • Multiple OCR Engines: Choose from 15+ OCR models including Tesseract, EasyOCR, PaddleOCR, Surya, and advanced vision-language models like GPT-4o and Gemini
  • Table-Specific Processing: Apply specialized OCR models optimized for tabular data extraction
  • Flexible Architecture: Easily extend with custom OCR models or layout detectors
  • Multi-Language Support: Handle documents in multiple languages, with specialized support for Arabic
  • Complete Pipeline Control: Inspect and configure each step—layout detection, content extraction, and assembly

How IRIS Works

IRIS follows a three-stage pipeline:

1. Layout Detection

Detects and classifies regions in document images with bounding boxes and content types:
  • Text regions for paragraphs and body content
  • Table regions for structured tabular data
  • Image regions for figures and diagrams
  • Other content types as needed
Supported layout detection models:
  • rt_detr_bce: RT-DETR model (best performance)
  • surya_bce: Surya layout detection
  • yolo_bce: YOLO-based detection
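
To make this stage concrete, a layout detector conceptually returns one record per region. The structure below is only an illustration (the class and field names are hypothetical, not the IRIS schema):

from dataclasses import dataclass

@dataclass
class DetectedRegion:
    label: str                                 # e.g. "text", "table", "image"
    bbox: tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max) in pixels
    confidence: float                          # detector score in [0, 1]

# A single page might come back as a list of such regions:
page_regions = [
    DetectedRegion("text",  (72, 90, 540, 310), 0.97),
    DetectedRegion("table", (72, 330, 540, 620), 0.93),
    DetectedRegion("image", (300, 640, 540, 760), 0.88),
]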

2. OCR Processing

Applies specialized OCR models to each detected region based on content type:
  • Text OCR: Optimized for continuous text (paragraphs, headings, etc.)
  • Table OCR: Specialized models for accurate table extraction
  • Image Handling: Preserves image crops for inclusion in output
Available OCR engines include:
  • Open-source models: EasyOCR, Tesseract, PaddleOCR, Surya
  • Vision-language models: GPT-4o, Gemini (via LiteLLM), Qwen2-VL, Qwen2.5-VL
  • Specialized models: Arabic Nougat (small/base/large), QAARI, MBZUAI-AIN
  • Cloud services: Azure Computer Vision
  • Advanced models: SmolDocling
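
The routing step can be pictured as a simple dispatch on region type. This is a minimal sketch reusing the hypothetical DetectedRegion structure above; the real pipeline also handles cropping, batching, and model loading, and the recognize() method name is an assumption:

def ocr_region(region, page_image, text_engine, table_engine):
    """Route one detected region to the appropriate engine (illustrative only)."""
    crop = page_image.crop(region.bbox)      # cut the region out of the page image
    if region.label == "table":
        return table_engine.recognize(crop)  # table-specialized OCR
    if region.label == "text":
        return text_engine.recognize(crop)   # general text OCR
    return crop                              # keep image crops for the assembler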

3. Assembly

Combines extracted content back into useful formats:
  • MarkdownAssembler: Converts to structured markdown (default)
  • JSONAssembler: Outputs detailed JSON for debugging and inspection
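
A markdown assembler can be thought of as walking the per-region results in reading order and joining them into one document. The sketch below is illustrative only and assumes (region, content) pairs from the previous stage; it is not the IRIS implementation:

def assemble_markdown(region_results):
    """Join per-region OCR outputs into a single markdown string (sketch)."""
    parts = []
    for region, content in region_results:   # assumed (region, ocr_output) pairs
        if region.label in ("text", "table"):
            parts.append(content)            # text and table OCR yield markdown-ready strings
        else:
            parts.append("![figure](crops/figure.png)")  # placeholder link to the saved crop
    return "\n\n".join(parts)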

Core Features

Modular Architecture

Each pipeline component (layout detector, OCR model, assembler) is independently configurable and extensible. Add custom models by implementing simple adapter interfaces.
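
For example, a new OCR engine can be plugged in by wrapping its library behind a small adapter. The sketch below wraps pytesseract purely as an illustration; the actual IRIS adapter base classes and the recognize() method name are assumptions:

import pytesseract  # stand-in third-party engine; any OCR library could be wrapped the same way

class TesseractAdapter:
    """Expose a uniform recognize() interface over a third-party OCR library (sketch)."""

    def __init__(self, lang: str = "eng"):
        self.lang = lang

    def recognize(self, image) -> str:
        # Delegate to the wrapped library and return plain text.
        return pytesseract.image_to_string(image, lang=self.lang)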

Inspection and Debugging

Save intermediate results at each pipeline stage to understand and optimize processing:
  • Layout detection bounding boxes and classifications
  • Individual OCR outputs per region
  • Final assembled output
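
If you are inspecting stages programmatically, a pattern like the following keeps each stage’s output on disk for later review (an illustrative sketch; in IRIS itself this is presumably controlled by the output_dir and save_intermediate options shown in the configuration below):

import json
from pathlib import Path

def dump_stage(output_dir: str, stage: str, payload) -> None:
    """Persist one pipeline stage's output as JSON for later inspection (sketch)."""
    out = Path(output_dir) / f"{stage}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload, indent=2, default=str))

# e.g. dump_stage("results", "layout", [{"label": "table", "bbox": [72, 330, 540, 620]}])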

Configuration Management

Use command-line arguments or YAML configuration files for reproducible processing:
layout: rt_detr_bce
text_ocr: easyocr
table_ocr: gemini
output_dir: results
save_intermediate: true
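
If you prefer to assemble the configuration in code rather than on the command line, the same YAML can be loaded into a plain dictionary (a minimal sketch using PyYAML; the file name is arbitrary, and how the dict is handed to the pipeline depends on your entry point):

import yaml

with open("iris_config.yaml") as f:   # the YAML shown above, saved to a file
    config = yaml.safe_load(f)

print(config["layout"], config["text_ocr"], config["table_ocr"])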

Performance Optimization

First-time model usage downloads weights; subsequent runs use cached models for faster processing.

Common Use Cases

  • Document Digitization: Convert scanned documents and PDFs to searchable text
  • Form Processing: Extract structured data from forms with mixed content types
  • Table Extraction: Accurately capture tabular data from complex documents
  • Multi-Language Documents: Process documents in various languages including Arabic
  • Research and Analysis: Extract text and data from academic papers, reports, and technical documents
  • Archive Processing: Batch process large document collections

Getting Started

IRIS is integrated with Dex for seamless document processing:

Python Package

Use IRIS programmatically through the Dex SDK:
from dex_sdk.client import DexClient
from dex_core.models.parse_job import IrisParseEngineOptions, IrisParseJobParams

# Initialize client and project
dex_client = DexClient(base_url="your-dex-url")
project = await dex_client.create_project(name="ocr-project")

# Upload and parse document
dex_file = await project.upload_file("document.pdf")
parse_job = await dex_file.start_parse_job(
    IrisParseJobParams(options=IrisParseEngineOptions())
)
result = await parse_job.get_parse_result()
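
Note that the snippet above uses await at the top level, which only works in an async context such as a notebook. In a standalone script, wrap the same calls in an async function and run it with asyncio.run:

import asyncio

from dex_sdk.client import DexClient
from dex_core.models.parse_job import IrisParseEngineOptions, IrisParseJobParams

async def main():
    dex_client = DexClient(base_url="your-dex-url")
    project = await dex_client.create_project(name="ocr-project")

    dex_file = await project.upload_file("document.pdf")
    parse_job = await dex_file.start_parse_job(
        IrisParseJobParams(options=IrisParseEngineOptions())
    )
    return await parse_job.get_parse_result()

result = asyncio.run(main())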

Language Support

IRIS supports multi-language OCR with models optimized for different language families. Specialized support includes:
  • Arabic: Arabic Nougat models, QAARI, MBZUAI-AIN, Gemini
  • Multi-language: EasyOCR, Tesseract, PaddleOCR with broad language coverage
  • Universal: Vision-language models (GPT-4o, Gemini, Qwen) supporting many languages

Key Advantages

Extensibility

Designed for easy extension. Add support for new OCR libraries, layout detectors, or assembly formats through simple adapter classes.

Transparency

Inspect every pipeline stage to understand model behavior, debug issues, and optimize results for your specific documents.

Flexibility

Mix and match components—use different OCR models for text vs. tables, choose layout detectors based on document type, and customize assembly output.

API Integration

Vision-language models (Gemini, GPT-4o) use LiteLLM for unified API access, making it easy to leverage advanced AI capabilities.
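
In practice this means the usual provider credentials are picked up from the environment. For example, LiteLLM reads OPENAI_API_KEY for GPT-4o and GEMINI_API_KEY for Gemini; set whichever providers you actually plan to use (values below are placeholders):

import os

# Credentials for the vision-language OCR engines, read by LiteLLM at call time.
os.environ["OPENAI_API_KEY"] = "sk-..."   # for GPT-4o
os.environ["GEMINI_API_KEY"] = "..."      # for Gemini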

Next Steps

To start using IRIS:
  1. Review the Using IRIS with Dex guide for integration details
  2. Set up your Dex client and SGP credentials
  3. Upload documents and start parse jobs
  4. Configure OCR options for your specific document types
  5. Integrate IRIS into your document processing workflows
IRIS provides the building blocks for robust document OCR while giving you complete control over the processing pipeline.