Dex is Scale's document understanding service that transforms unstructured documents into actionable, structured data. It is a comprehensive platform that combines advanced OCR, natural language processing, and machine learning to extract meaningful information from PDFs, images, spreadsheets, and more.
Why Use Dex?
Around 80-90% of enterprise data lives within unstructured formats such as PDFs and DOCX files. Dex solves the most common challenges of programmatic document processing:
Format Diversity: Process any document type with a single API, including business reports, financial documents, legal contracts, healthcare records, and more.
Unstructured Data: Convert complex layouts into structured JSON with semantic understanding, including text, tables, charts, and infographics.
Quality Variations: Handle scanned, handwritten, and low-quality documents with high accuracy across multiple languages.
Scalability: Process thousands of documents efficiently with built-in scalable infrastructure.
Flexibility: Choose from multiple OCR engines and customize extraction with your own tools and workflows.
Core Primitives
Dex is designed as a capability rather than a standalone product, centered around composable primitives that can be used, extended, and combined:
File Management
Upload, retrieve, and securely store confidential documents with fine-grained access control. Supports persistent storage with metadata tracking and secure access patterns.
Parsing
Convert documents into machine-readable formats using multiple OCR engines. Dex extracts:
Plain text in multiple languages (English, Spanish, Arabic, German, and more)
Tables, from small tables to large tabular data (500+ rows)
Checkboxes for form processing
Images and figures with bounding box information
Charts for data visualization analysis
The parsing process includes intelligent chunking strategies:
Block-level: Low-level separation of layout blocks
Section-level: Semantic separation by detecting titles and subtitles
Page-level: Page-by-page analysis and processing
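The three granularities above can be illustrated with a small sketch. It assumes the parser emits a flat, ordered list of layout blocks, each tagged with a page number and a kind; the `Block` class, its fields, and the function names are hypothetical and are not the Dex data model or SDK.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Block:
    """Hypothetical layout block; not the Dex data model."""
    page: int
    kind: str  # "title" or "text"
    text: str


def block_chunks(blocks):
    # Block-level: every layout block is its own chunk.
    return [[b] for b in blocks]


def section_chunks(blocks):
    # Section-level: start a new chunk whenever a title block appears.
    sections = []
    for b in blocks:
        if b.kind == "title" or not sections:
            sections.append([])
        sections[-1].append(b)
    return sections


def page_chunks(blocks):
    # Page-level: group consecutive blocks that share a page number.
    return [list(group) for _, group in groupby(blocks, key=lambda b: b.page)]


blocks = [
    Block(1, "title", "1. Overview"),
    Block(1, "text", "Dex parses documents."),
    Block(2, "text", "Parsing continues on page two."),
    Block(2, "title", "2. Pricing"),
    Block(2, "text", "Contact sales."),
]

print(len(block_chunks(blocks)))                  # 5
print([len(s) for s in section_chunks(blocks)])   # [3, 2]
print([len(p) for p in page_chunks(blocks)])      # [2, 3]
```

Note that a section can span a page break (the first section here covers pages 1 and 2), which is the main reason to choose section-level over page-level chunking when semantic coherence matters.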
Vector Stores
Vectorize and index parsed documents for semantic search and retrieval. Vector stores enable:
Semantic search over document chunks with embedding-based similarity
Context management for multi-file processing and large documents
Regex search for pattern-based extraction (dates, IDs, emails, etc.)
Document summarization for quick overview generation
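To make the difference between embedding-based and pattern-based search concrete, here is a minimal sketch in plain Python. It is not the Dex vector store API: the index layout, the function names, and the three-dimensional "embeddings" are made up for illustration, and real embeddings would come from an embedding model.

```python
import math
import re


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_search(query_vec, index, top_k=2):
    # Rank chunks by similarity between the query and chunk embeddings.
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]


def regex_search(pattern, index):
    # Pattern-based lookup: return every match found in any chunk.
    return [m for text, _ in index for m in re.findall(pattern, text)]


# Toy index of (chunk text, made-up embedding) pairs.
index = [
    ("Invoice INV-1042 is due on 2024-03-15.", [0.9, 0.1, 0.0]),
    ("The contract renews annually.", [0.1, 0.9, 0.1]),
    ("Contact billing@example.com for payment questions.", [0.7, 0.2, 0.3]),
]

hits = semantic_search([1.0, 0.0, 0.0], index, top_k=2)
print([text for _, text in hits])

dates = regex_search(r"\d{4}-\d{2}-\d{2}", index)
emails = regex_search(r"[\w.+-]+@[\w-]+\.[\w.]+", index)
print(dates, emails)
```

Semantic search returns the chunks whose meaning is closest to the query even without shared keywords, while regex search is the better fit for values with a fixed shape such as dates, IDs, and email addresses.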
Extraction
Extract structured data from parse results or document collections using:
Custom schemas defined with Pydantic models
Natural language prompts to guide extraction
Citations that link extracted data to source locations
Confidence scores for quality assessment
RAG-enhanced extraction using vector store context
Agentic extraction with custom MCP tools for advanced workflows
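As an illustration of a custom schema, the sketch below defines a Pydantic model and validates a payload against it. It requires the `pydantic` package. The `Invoice` and `LineItem` models and the `raw` dictionary are hypothetical; how a schema is actually passed to Dex is not shown here, since that depends on the SDK.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class LineItem(BaseModel):
    description: str
    amount: float


class Invoice(BaseModel):
    """Hypothetical target schema for invoice extraction."""
    invoice_number: str = Field(description="Identifier printed on the invoice")
    total: float = Field(description="Grand total including tax")
    due_date: Optional[str] = Field(default=None, description="Due date, if present")
    line_items: List[LineItem] = Field(default_factory=list)


# Stand-in for the JSON an extraction step might return.
raw = {
    "invoice_number": "INV-1042",
    "total": "129.50",  # numeric strings are coerced to float by default
    "line_items": [{"description": "Consulting", "amount": 129.5}],
}

invoice = Invoice(**raw)
print(invoice.total, invoice.due_date, invoice.line_items[0].description)

# A payload missing a required field is rejected rather than silently accepted.
try:
    Invoice(total=10.0)
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Field descriptions double as guidance for the extraction step, and optional fields with defaults let the schema tolerate documents where a value is simply absent.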
Ways to Interact with Dex
Dex provides multiple interfaces to support different use cases:
REST API: OpenAPI-documented endpoints for direct integration
Python SDK: High-level wrapper for rapid development with both sync and async support
MCP Server: Model Context Protocol integration for agent-based workflows (coming soon)
Common Use Cases
Financial Services: Automate invoice processing, tax document analysis, and financial report extraction.
Healthcare: Extract patient information from medical records, insurance claims, and healthcare forms.
Legal: Analyze contracts, process discovery documents, and extract key clauses and obligations.
Business Operations: Process HR documents, supply chain orders, customer service tickets, and business reports.
Language Support
Dex supports multi-language document processing, with good support for Germanic languages. For non-Germanic languages, 35 are supported. Supported languages include, but are not limited to:
Afrikaans: 🇿🇦 - Albanian: 🇦🇱 - Arabic: 🇸🇦 - Armenian: 🇦🇲 - Belarusian: 🇧🇾 - Bengali: 🇧🇩 - Bulgarian: 🇧🇬 - Catalan: 🇪🇸 - Chinese: 🇨🇳 - Croatian: 🇭🇷 - Czech: 🇨🇿 - Danish: 🇩🇰 - Dutch: 🇳🇱 - English: 🇬🇧 - Estonian: 🇪🇪 - Filipino: 🇵🇭 - Finnish: 🇫🇮 - French: 🇫🇷 - German: 🇩🇪 - Greek: 🇬🇷 - Gujarati: 🇮🇳 - Hebrew: 🇮🇱 - Hindi: 🇮🇳 - Hungarian: 🇭🇺 - Icelandic: 🇮🇸 - Indonesian: 🇮🇩 - Italian: 🇮🇹 - Japanese: 🇯🇵 - Kannada: 🇮🇳 - Khmer: 🇰🇭 - Korean: 🇰🇷 - Lao: 🇱🇦 - Latvian: 🇱🇻 - Lithuanian: 🇱🇹 - Macedonian: 🇲🇰 - Malay: 🇲🇾 - Malayalam: 🇮🇳 - Marathi: 🇮🇳 - Nepali: 🇳🇵 - Norwegian: 🇳🇴 - Persian: 🇮🇷 - Polish: 🇵🇱 - Portuguese: 🇵🇹 - Punjabi: 🇮🇳 - Romanian: 🇷🇴 - Russian: 🇷🇺 - Serbian: 🇷🇸 - Slovak: 🇸🇰 - Slovenian: 🇸🇮 - Spanish: 🇪🇸 - Swedish: 🇸🇪 - Tagalog: 🇵🇭 - Tamil: 🇮🇳 - Telugu: 🇮🇳 - Thai: 🇹🇭 - Turkish: 🇹🇷 - Ukrainian: 🇺🇦 - Vietnamese: 🇻🇳 - Yiddish: 🇮🇱
Key Features
Citations and Traceability
Every extracted field can be associated with its source location (page number, bounding box, text snippet), enabling auditability and human review.
Confidence Scoring
Dex assigns confidence scores to extracted fields based on model outputs, helping you filter and prioritize results for downstream review.
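One way to combine citations and confidence scores downstream is to auto-accept high-confidence fields and queue the rest for human review, using the citation to show the reviewer where each value came from. The sketch below assumes scores in the range 0 to 1 and a threshold of 0.8; the `Citation` and `ExtractedField` classes and the `triage` function are hypothetical, not the Dex response format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Citation:
    """Hypothetical source pointer: page, bounding box, and quoted text."""
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    snippet: str


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed to lie in [0, 1]
    citation: Citation


def triage(fields, threshold=0.8):
    # Split fields into auto-accepted and needs-review by confidence.
    accepted = [f for f in fields if f.confidence >= threshold]
    review = [f for f in fields if f.confidence < threshold]
    # Put the least certain fields at the front of the review queue.
    review.sort(key=lambda f: f.confidence)
    return accepted, review


fields = [
    ExtractedField("invoice_number", "INV-1042", 0.97,
                   Citation(1, (72.0, 90.0, 180.0, 104.0), "Invoice INV-1042")),
    ExtractedField("total", "129.50", 0.62,
                   Citation(2, (400.0, 700.0, 470.0, 714.0), "Total: 129.50")),
    ExtractedField("due_date", "2024-03-15", 0.41,
                   Citation(1, (72.0, 120.0, 200.0, 134.0), "Due 2024-03-15")),
]

accepted, review = triage(fields)
for f in review:
    # The citation tells a reviewer exactly where to look.
    print(f"{f.name}={f.value} (conf {f.confidence}) page {f.citation.page}: {f.citation.snippet!r}")
```

The threshold is a policy choice: raising it sends more fields to review, lowering it trades review effort for a higher risk of accepting wrong values.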
Flexible OCR Engine Support
Choose the best OCR engine for your use case: Reducto for general documents, Scale OCR for Arabic text, or integrate your own custom engine.
Access Control
Fine-grained authorization with ReBAC (Relationship-Based Access Control) for projects, files, parse results, and vector stores.
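For readers unfamiliar with ReBAC, the general idea is that permissions are derived from relationships between subjects and resources rather than from fixed roles. The sketch below is a generic illustration, not Dex's authorization model: the relation names (`member`, `viewer`, `parent`), the identifiers, and the rule that access is inherited through `parent` links are all assumptions.

```python
# Relationship tuples: (subject, relation, object). All names are hypothetical.
RELATIONS = {
    ("user:alice", "member", "project:finance"),
    ("user:bob", "viewer", "file:report.pdf"),
    ("project:finance", "parent", "file:report.pdf"),
    ("file:report.pdf", "parent", "parse_result:42"),
}

GRANTING = ("member", "viewer")  # relations that confer view access


def can_view(user, resource, relations=RELATIONS):
    """True if user is related to the resource or to any of its ancestors."""
    seen = set()
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        # Direct relationship between the user and this node grants access.
        if any(s == user and o == current and r in GRANTING
               for s, r, o in relations):
            return True
        # Otherwise walk up to the nodes that are parents of this one.
        frontier.extend(s for s, r, o in relations
                        if r == "parent" and o == current)
    return False


print(can_view("user:alice", "parse_result:42"))  # True, via project -> file
print(can_view("user:bob", "parse_result:42"))    # True, via file
print(can_view("user:bob", "project:finance"))    # False
print(can_view("user:carol", "file:report.pdf"))  # False
```

Because access follows the relationship graph, granting someone membership in a project implicitly covers the files and parse results under it, without listing each resource individually.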
Getting Started
To begin using Dex, you'll need a Scale account with SGP access. Once you're set up, you can:
Install the Dex SDK
Create a project with credentials
Upload and parse documents
Extract structured data using custom schemas
Dex makes document understanding accessible to developers while delivering the power and accuracy required for production applications.