This guide helps you get started with Dex, Scale’s document understanding service, which extracts accurate, structured information from unstructured documents.

Overview

Dex is Scale’s document understanding capability that provides composable primitives for:
  • File Management - Secure file upload, storage, and retrieval
  • Document Parsing - Convert documents (PDF, DOCX, images, and more) into structured JSON, with a choice of OCR engines
  • Vector Stores - Embed, index and search parsed document corpora
  • Data Extraction - Extract specific information using custom schemas, prompts, and RAG-enhanced context
  • Project Management - Organize and isolate data with proper credential management and authorization

Prerequisites

Before using Dex, ensure you have:
  • ✅ A valid Scale account with SGP (Scale GenAI Platform) access
  • ✅ Your SGP account ID and API key set as environment variables:
export SGP_ACCOUNT_ID="your_account_id"
export SGP_API_KEY="your_api_key"
  • ✅ Python 3.8+ installed
  • ✅ Dex SDK installed (see Installation section)
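
Before constructing the client, it can help to confirm that both environment variables are actually set, since os.getenv() silently returns None for a missing variable. The following is a minimal sketch using only the Python standard library; the helper name missing_env_vars is illustrative and not part of the Dex SDK.

```python
import os

# The two variables named in the prerequisites above.
REQUIRED_VARS = ("SGP_ACCOUNT_ID", "SGP_API_KEY")


def missing_env_vars(env=None):
    """Return the names of required variables that are unset or empty.

    Illustrative helper, not part of the Dex SDK.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("SGP credentials found.")
```

Running this check once at startup surfaces a configuration problem before any request is made.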

Installation

Install Dex SDK from CodeArtifact

With access to Scale CodeArtifact, install the Dex SDK (version 0.4.0 or higher recommended) using your configured CodeArtifact credentials. This will install all required dependencies including Pydantic.
Note: Version 0.4.0 introduces pagination, filtering, and enhanced async job support. Version 0.3.2 introduced a new authentication method. See the Changelog for details.

Quick Start

1. Initialize Dex Client

import os
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Initialize the Dex client with SGP credentials
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

2. Create a Project

Projects isolate your data and credentials for tracing, billing, and SGP model calls. Every operation is tied to a project.
# Create a project (the await calls in these steps must run inside an async function; see the Complete Example)
project = await dex_client.create_project(name="My Dex Project")
print(f"Created project: {project.data.id}")
Tip: Keep one project per use case or group of related files for clean traceability.

3. Upload a Document

Upload your document to the project. Dex supports PDFs, images, spreadsheets, and more.
# Upload a file to the project
dex_file = await project.upload_file("path/to/your/document.pdf")
print(f"Uploaded: {dex_file.data.filename} (id: {dex_file.data.id})")
Supported file types: PDF, PNG, JPEG, DOCX, XLSX, CSV, and many more. See File Management for the complete list.

4. Parse the Document

Parse converts your document into a structured format with text, tables, and layout information.
# Parse the document with Reducto OCR
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)
print(f"Parsed: {parse_result.data.parse_metadata.pages_processed} pages")
Note: Parsing is asynchronous. The SDK automatically polls for completion.

5. Extract Structured Data

Define a schema and extract specific information from your document.
# Define your extraction schema using Pydantic
class InvoiceData(BaseModel):
    """Schema for invoice extraction"""
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="Total amount in dollars")
    date: str = Field(description="Invoice date in YYYY-MM-DD format")
    vendor_name: str = Field(description="Name of the vendor")

# Extract data with a clear prompt
extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details from this document.",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

# Access the extracted data
result = extract_result.result
for field_name, field in result.data.items():
    print(f"{field_name}: {field.value} (confidence: {field.confidence:.2f})")
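
When confidence scores are requested, a common follow-up is to route uncertain fields to manual review. The sketch below assumes only what the loop above shows, namely that each field exposes value and confidence. ExtractedField is a hypothetical stand-in for the SDK's field type, and the 0.8 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class ExtractedField:
    # Hypothetical stand-in exposing the two attributes read in the loop above.
    value: Any
    confidence: Optional[float] = None


def fields_needing_review(
    data: Dict[str, ExtractedField], threshold: float = 0.8
) -> List[str]:
    """Return names of fields whose confidence is missing or below threshold."""
    return [
        name
        for name, field in data.items()
        if field.confidence is None or field.confidence < threshold
    ]


# Made-up sample values, for illustration only.
sample = {
    "invoice_number": ExtractedField("INV-001", 0.97),
    "total_amount": ExtractedField(125.5, 0.62),
    "date": ExtractedField("2024-01-31", None),
    "vendor_name": ExtractedField("Acme", 0.80),
}

if __name__ == "__main__":
    print(fields_needing_review(sample))
```

Treating a missing confidence as "needs review" also guards against the formatting error that {field.confidence:.2f} would raise on None.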

Complete Example

Here’s a complete working example you can copy and run:
import os
import asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class InvoiceData(BaseModel):
    invoice_number: str = Field(description="The invoice number")
    total_amount: float = Field(description="Total amount in dollars")

async def main():
    # 1. Initialize client
    dex_client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )

    # 2. Create project
    project = await dex_client.create_project(name="Invoice Processing")

    # 3. Upload document
    dex_file = await project.upload_file("invoice.pdf")

    # 4. Parse document
    parse_result = await dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )

    # 5. Extract data
    extract_result = await parse_result.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice number and total amount.",
        model="openai/gpt-4o",
        generate_citations=True,
    )

    print(extract_result.result.data)

# Run the example
asyncio.run(main())

Next Steps

Now that you’ve completed the basics, explore these topics:

Learn More

  • File Management: Upload, pagination, and supported file types
  • Parse: Parse engines and async job monitoring
  • Chunking: Chunking strategies for documents
  • Vector Stores: Semantic search and RAG-enhanced extraction
  • Extract: Batch extraction and extraction patterns
  • Best Practices: Quick start, retention policies, and optimization

Additional Resources

  • REST API: For non-Python integrations, see the API Reference
  • Support: Questions? Ask in the #dex-help Slack channel
  • Examples: More examples in the Introduction guide

Common Questions

Q: How do I process multiple documents?
A: Upload multiple files to the same project, parse each one, then optionally use vector stores for cross-document search. See Extract.

Q: Can I use a synchronous client?
A: Yes! Use DexSyncClient from dex_sdk for synchronous operations. See Best Practices.

Q: How do I configure data retention policies?
A: Set retention policies when creating a project. See Best Practices.

Q: What OCR engines are available?
A: Reducto (production-ready) and Iris (experimental, for custom needs). See "When to choose Iris?".

Q: How do I list files with pagination? (New in v0.4.0)
A: Use PaginationParams with list_files() to control page size and sorting. See API Reference.

Q: How do I monitor async jobs? (New in v0.4.0)
A: Use start_parse_job() for better control and access to SGP traces. See API Reference.