Dex is Scale's document understanding service that transforms unstructured documents into actionable, structured data. It is a comprehensive platform that combines advanced OCR, natural language processing, and machine learning to extract meaningful information from PDFs, images, spreadsheets, and more.

Why Use Dex?

Around 80-90% of enterprise data lives within unstructured formats such as PDFs and DOCX files. Dex solves the most common challenges of programmatic document processing:
  • Format Diversity: Process any document type with a single API, including business reports, financial documents, legal contracts, healthcare records, and more.
  • Unstructured Data: Convert complex layouts into structured JSON with semantic understanding, including text, tables, charts, and infographics.
  • Quality Variations: Handle scanned, handwritten, and low-quality documents with high accuracy across multiple languages.
  • Scalability: Process thousands of documents efficiently with built-in scalable infrastructure.
  • Flexibility: Choose from multiple OCR engines and customize extraction with your own tools and workflows.

Core Primitives

Dex is designed as a capability rather than a standalone product, centered around composable primitives that can be used, extended, and combined:

File Management

Upload, retrieve, and securely store confidential documents with fine-grained access control. Supports persistent storage with metadata tracking and secure access patterns.

Parse

Convert documents into machine-readable formats using multiple OCR engines. Dex extracts:
  • Plain text in multiple languages (English, Spanish, Arabic, German, and more)
  • Tables, both small and large (500+ rows)
  • Checkboxes for form processing
  • Images and figures with bounding box information
  • Charts for data visualization analysis
The parsing process includes intelligent chunking strategies:
  • Block-level: Low-level separation of layout blocks
  • Section-level: Semantic separation by detecting titles and subtitles
  • Page-level: Page-by-page analysis and processing
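The three chunking levels can be illustrated with a small, self-contained sketch. It is not the Dex implementation: the `Block` record, its `kind` labels, and both helper functions are hypothetical names chosen for illustration, and the input is a hand-written list of layout blocks.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical layout block, as a parser might emit it."""
    page: int
    kind: str  # "title" or "text" (illustrative labels, not Dex's)
    text: str


def chunk_by_section(blocks):
    """Section-level: start a new chunk whenever a title block appears."""
    sections = []
    for block in blocks:
        if block.kind == "title" or not sections:
            sections.append([])
        sections[-1].append(block.text)
    return [" ".join(texts) for texts in sections]


def chunk_by_page(blocks):
    """Page-level: merge all block text that shares a page number."""
    pages = {}
    for block in blocks:
        pages.setdefault(block.page, []).append(block.text)
    return [" ".join(texts) for _, texts in sorted(pages.items())]


blocks = [
    Block(1, "title", "1. Overview"),
    Block(1, "text", "This agreement covers support services."),
    Block(2, "text", "It remains in force for one year."),
    Block(2, "title", "2. Payment"),
    Block(2, "text", "Invoices are due in 30 days."),
]

block_chunks = [b.text for b in blocks]    # block-level: one chunk per block
section_chunks = chunk_by_section(blocks)  # two sections, split at the titles
page_chunks = chunk_by_page(blocks)        # two pages
```

Note that a section can span a page break (the first section above runs across pages 1 and 2), which is the main reason to prefer section-level chunks when semantic coherence matters more than page fidelity.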

Vector Stores

Vectorize and index parsed documents for semantic search and retrieval. Vector stores enable:
  • Semantic search over document chunks with embedding-based similarity
  • Context management for multi-file processing and large documents
  • Regex search for pattern-based extraction (dates, IDs, emails, etc.)
  • Document summarization for quick overview generation
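To make the difference between semantic search and regex search concrete, here is a minimal sketch using only the Python standard library. The three-dimensional embeddings are hand-made stand-ins for real model output, and `semantic_search` and `regex_search` are hypothetical helper names, not Dex SDK functions.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# (chunk text, toy embedding) pairs; a real store would hold model embeddings.
index = [
    ("Invoice INV-2041 is due on 2024-03-15.", [0.9, 0.1, 0.0]),
    ("Contact billing@example.com for questions.", [0.2, 0.8, 0.1]),
    ("The warranty covers parts for two years.", [0.0, 0.2, 0.9]),
]


def semantic_search(query_vec, k=2):
    """Return the k chunks whose embeddings are closest to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def regex_search(pattern):
    """Return every substring in any chunk that matches the pattern."""
    return [m.group(0) for text, _ in index for m in re.finditer(pattern, text)]


top = semantic_search([1.0, 0.0, 0.0], k=1)
dates = regex_search(r"\d{4}-\d{2}-\d{2}")
emails = regex_search(r"[\w.+-]+@[\w-]+\.[\w.-]+")
```

Semantic search ranks chunks by meaning and always returns something, while regex search returns only exact pattern matches, so the two are complementary: the first finds the relevant passage, the second pulls well-formed values such as dates or email addresses out of it.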

Extract

Extract structured data from parse results or document collections using:
  • Custom schemas defined with Pydantic models
  • Natural language prompts to guide extraction
  • Citations that link extracted data to source locations
  • Confidence scores for quality assessment
  • RAG-enhanced extraction using vector store context
  • Agentic extraction with custom MCP tools for advanced workflows
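As an illustration of what a custom schema might look like, the sketch below defines a Pydantic model for an invoice. The `Invoice` model and its fields are hypothetical, and the example only shows schema definition and validation with Pydantic itself; how the schema is passed to Dex is not shown here and should be taken from the SDK reference.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    """Hypothetical extraction schema; field descriptions can guide extraction."""
    invoice_number: str = Field(description="Identifier printed on the invoice")
    issue_date: date = Field(description="Date the invoice was issued")
    total: float = Field(description="Total amount due")
    currency: Optional[str] = Field(default=None, description="ISO 4217 code, if stated")


# Raw values as an extractor might return them (strings), coerced by Pydantic.
raw = {"invoice_number": "INV-2041", "issue_date": "2024-03-15", "total": "1250.50"}
invoice = Invoice(**raw)

# Required fields that are missing are rejected rather than silently defaulted.
try:
    Invoice(invoice_number="INV-2042")
    missing_fields_rejected = False
except ValidationError:
    missing_fields_rejected = True
```

Typed fields give the extraction a contract: values are coerced to `date` and `float` where possible, and incomplete results fail loudly instead of propagating downstream.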

Ways to Interact with Dex

Dex provides multiple interfaces to support different use cases:
  • REST API: OpenAPI-documented endpoints for direct integration
  • Python SDK: High-level wrapper for rapid development with both sync and async support
  • MCP Server: Model Context Protocol integration for agent-based workflows (coming soon)

Common Use Cases

  • Financial Services: Automate invoice processing, tax document analysis, and financial report extraction.
  • Healthcare: Extract patient information from medical records, insurance claims, and healthcare forms.
  • Legal: Analyze contracts, process discovery documents, and extract key clauses and obligations.
  • Business Operations: Process HR documents, supply chain orders, customer service tickets, and business reports.

Language Support

Dex supports multi-language document processing, with good support for Germanic languages. For non-Germanic languages, there are 35 supported languages. Supported languages include, but are not limited to: Afrikaans 🇿🇦, Albanian 🇦🇱, Arabic 🇸🇦, Armenian 🇦🇲, Belarusian 🇧🇾, Bengali 🇧🇩, Bulgarian 🇧🇬, Catalan 🇪🇸, Chinese 🇨🇳, Croatian 🇭🇷, Czech 🇨🇿, Danish 🇩🇰, Dutch 🇳🇱, English 🇬🇧, Estonian 🇪🇪, Filipino 🇵🇭, Finnish 🇫🇮, French 🇫🇷, German 🇩🇪, Greek 🇬🇷, Gujarati 🇮🇳, Hebrew 🇮🇱, Hindi 🇮🇳, Hungarian 🇭🇺, Icelandic 🇮🇸, Indonesian 🇮🇩, Italian 🇮🇹, Japanese 🇯🇵, Kannada 🇮🇳, Khmer 🇰🇭, Korean 🇰🇷, Lao 🇱🇦, Latvian 🇱🇻, Lithuanian 🇱🇹, Macedonian 🇲🇰, Malay 🇲🇾, Malayalam 🇮🇳, Marathi 🇮🇳, Nepali 🇳🇵, Norwegian 🇳🇴, Persian 🇮🇷, Polish 🇵🇱, Portuguese 🇵🇹, Punjabi 🇮🇳, Romanian 🇷🇴, Russian 🇷🇺, Serbian 🇷🇸, Slovak 🇸🇰, Slovenian 🇸🇮, Spanish 🇪🇸, Swedish 🇸🇪, Tagalog 🇵🇭, Tamil 🇮🇳, Telugu 🇮🇳, Thai 🇹🇭, Turkish 🇹🇷, Ukrainian 🇺🇦, Vietnamese 🇻🇳, Yiddish 🇮🇱.

Key Features

Citations and Traceability

Every extracted field can be associated with its source location (page number, bounding box, text snippet), enabling auditability and human review.

Confidence Scoring

Dex assigns confidence scores to extracted fields based on model outputs, helping you filter and prioritize results for downstream review.
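One way citations and confidence scores might be used together is to route low-confidence fields to a human reviewer along with their source location. The sketch below assumes a hypothetical result shape (`ExtractedField`, `Citation`), a confidence range of 0 to 1, and an arbitrary threshold of 0.85; none of these come from the Dex API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Citation:
    """Where a value was found (hypothetical shape)."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1); units assumed
    snippet: str


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to lie in [0, 1]
    citation: Citation


def split_for_review(fields: List[ExtractedField], threshold: float = 0.85):
    """Separate fields that can be auto-accepted from those needing review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


fields = [
    ExtractedField("total", "1250.50", 0.97,
                   Citation(1, (72.0, 540.0, 180.0, 556.0), "Total due: 1250.50")),
    ExtractedField("due_date", "2024-03-15", 0.62,
                   Citation(1, (72.0, 560.0, 200.0, 576.0), "Due 15/03/24")),
]

accepted, needs_review = split_for_review(fields)

# A reviewer can jump straight to the cited page and snippet.
review_queue = [(f.name, f.citation.page, f.citation.snippet) for f in needs_review]
```

The threshold is a business decision rather than a fixed constant: a lower value reduces manual review at the cost of more unverified values reaching downstream systems.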

Flexible OCR Engine Support

Choose the best OCR engine for your use case: Reducto for general documents, Scale OCR for Arabic text, or integrate your own custom engine.

Access Control

Fine-grained authorization with ReBAC (Relationship-Based Access Control) for projects, files, parse results, and vector stores.
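For readers new to ReBAC, the general idea is that permissions are derived from relationships between subjects and objects rather than from static role lists. The following is a minimal sketch of that idea, assuming a made-up tuple format, a made-up `parent` relation for inheritance, and made-up identifiers; it does not describe how Dex stores or evaluates permissions.

```python
# Relationship tuples: (subject, relation, object). All names are illustrative.
tuples = {
    ("user:alice", "owner", "project:contracts"),
    ("user:bob", "viewer", "project:contracts"),
    ("project:contracts", "parent", "file:nda.pdf"),
    ("file:nda.pdf", "parent", "parse_result:nda-v1"),
}

# Which relations each granted relation implies (an owner can also view).
IMPLIED = {"owner": {"owner", "viewer"}, "viewer": {"viewer"}}


def has_relation(user, relation, obj):
    """Check a relation directly on obj, then on its ancestors via 'parent'."""
    for granted, implied in IMPLIED.items():
        if relation in implied and (user, granted, obj) in tuples:
            return True
    # Inherit from any parent object (assumes the parent graph has no cycles).
    for subject, rel, target in tuples:
        if rel == "parent" and target == obj and has_relation(user, relation, subject):
            return True
    return False


alice_owns_parse = has_relation("user:alice", "owner", "parse_result:nda-v1")
bob_views_file = has_relation("user:bob", "viewer", "file:nda.pdf")
bob_owns_file = has_relation("user:bob", "owner", "file:nda.pdf")
```

Here access to a file or parse result follows from a relationship to the enclosing project, so granting or revoking one tuple changes access to everything beneath it.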

Getting Started

To begin using Dex, you'll need a Scale account with SGP access. Once you're set up, you can:
  1. Install the Dex SDK
  2. Create a project with credentials
  3. Upload and parse documents
  4. Extract structured data using custom schemas
Dex makes document understanding accessible to developers while delivering the power and accuracy required for production applications.