This guide will help you get started with Scale’s Dex service - a document understanding capability that extracts accurate, structured information from unstructured documents.

Overview

Dex is Scale’s document understanding capability that provides composable primitives for:
  • File Management - Secure file upload, storage, and retrieval with fine-grained access control
  • Document Parsing - Convert any document (PDFs, DOCX, images, etc.) into structured JSON format with multiple OCR engines
  • Vector Stores - Index and search parsed documents with semantic embeddings
  • Data Extraction - Extract specific information using custom schemas, prompts, and RAG-enhanced context
  • Project Management - Organize and isolate data with proper credential management and authorization

Prerequisites

Before using this repository, ensure you have:
  • ✅ A valid Scale account with SGP (Scale GenAI Platform) access
  • ✅ Your SGP account ID and API key set as environment variables:
export SGP_ACCOUNT_ID="your_account_id"
export SGP_API_KEY="your_api_key"
  • ✅ VPN connection to Scale’s internal network via all-traffic (not eng-split-prod)
  • ✅ Python 3.8+ installed
  • ✅ Required Python packages (see Installation section)

Installation

Install Dex SDK from CodeArtifact

With access to Scale CodeArtifact, install the Dex SDK:
pip install sdk/dex_core-xxx.whl sdk/dex_sdk-xxx.whl
This will install all required dependencies including Pydantic.

Quick Start

1. Initialize Dex Client

import os
from datetime import datetime
from pydantic import BaseModel, Field

# Import the Dex SDK
from dex_sdk import DexClient
from dex_sdk.types import (
    ProjectCredentials, SGPCredentials,
    ParseEngine,
    ParseJobRequestParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
    ExtractionParameters,
)

# Initialize the Dex client
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
)

2. Create a Project

Projects isolate your data and credentials for tracing, billing, and SGP model calls. Every operation is tied to a project. Key principle: keep one project per use case or group of files for clean traceability.
# Create a project with SGP credentials
project = await dex_client.create_project(
    name="My Dex Project",
    credentials=ProjectCredentials(
        sgp=SGPCredentials(
            account_id=os.getenv("SGP_ACCOUNT_ID"),
            api_key=os.getenv("SGP_API_KEY"),
        ),
    ),
)
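
The snippets in this guide use top-level await, which works in Jupyter notebooks and other async-aware shells. In a plain Python script, the same calls would need to run inside a coroutine driven by asyncio.run. A minimal sketch of that wrapper follows; run_dex_steps is a hypothetical placeholder standing in for the awaited Dex calls and is not part of the SDK.

```python
import asyncio


async def run_dex_steps() -> str:
    # Hypothetical placeholder: in a real script, the awaited calls from this
    # guide (create_project, upload_file, parse, extract) would go here.
    await asyncio.sleep(0)
    return "finished"


def main() -> str:
    # asyncio.run creates an event loop, runs the coroutine to completion,
    # and closes the loop afterwards.
    return asyncio.run(run_dex_steps())


if __name__ == "__main__":
    print(main())
```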

3. Upload a Document

Dex manages your files with persistent storage. Upload files directly and always store the returned file object — you’ll need it for parsing. Supported file types:
  • Images: PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC
  • PDFs: PDF (Portable Document Format)
  • Spreadsheets: CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW
  • Documents: PPTX, PPT, DOCX, DOC, DOTX, WPD, TXT, RTF
# Upload a file to the project
dex_file = await project.upload_file("path/to/your/document.pdf")
print(f"✅ File uploaded successfully! File ID: {dex_file.id}")

# List all files in the project
file_list = await project.list_files()
print(f"✅ File list: {file_list}")
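
Before uploading, it can help to check a path against the supported file types listed above. The sketch below uses only the standard library. Mapping each format name to a lowercase file extension is an assumption of this example, and file_category is a hypothetical helper, not an SDK function.

```python
from pathlib import Path

# Extensions derived from the supported file types listed in this guide.
# Treating each format name as a lowercase extension is an assumption.
SUPPORTED_EXTENSIONS = {
    "images": {"png", "jpeg", "jpg", "gif", "bmp", "tiff", "pcx", "ppm",
               "apng", "psd", "cur", "dcx", "ftex", "pixar", "heic"},
    "pdfs": {"pdf"},
    "spreadsheets": {"csv", "xlsx", "xlsm", "xls", "xltx", "xltm", "qpw"},
    "documents": {"pptx", "ppt", "docx", "doc", "dotx", "wpd", "txt", "rtf"},
}


def file_category(path: str):
    """Return the category for a path's extension, or None if unsupported."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None
```

A caller could skip or log any path for which file_category returns None before calling project.upload_file.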

4. Parse the Document

Parsing normalizes any document into a unified JSON format with:
  • Chunks – chunked content optimized for embedding
  • Blocks – semantically separated contents with types like Title, Subtitle, Text, Figure, etc.
  • Bounding boxes – page position info for UI overlays
OCR Engines:
  • Default: Reducto
  • Swappable: you can bring your own OCR engine
Note: Parsing is an asynchronous process. The SDK handles the job monitoring for you.
from dex_sdk.types import (
    ParseEngine,
    ParseJobRequestParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Parse the document
parse_result = await dex_file.parse(
    ParseJobRequestParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)
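
To illustrate how bounding boxes can drive UI overlays, here is a small sketch that scales a box to pixel coordinates for a rendered page. The BoundingBox shape (normalized fractions, top-left origin) and the to_pixels helper are assumptions made for illustration; the actual field names and coordinate convention in the Dex parse output may differ.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    # Hypothetical shape: fractions of page width/height, origin at top-left.
    page: int
    left: float
    top: float
    width: float
    height: float


def to_pixels(box: BoundingBox, page_width_px: int, page_height_px: int):
    """Scale a normalized box to (x, y, w, h) in pixels for a rendered page."""
    return (
        round(box.left * page_width_px),
        round(box.top * page_height_px),
        round(box.width * page_width_px),
        round(box.height * page_height_px),
    )
```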

5. Extract Structured Data

Dex provides a highly customizable extraction pipeline for high-accuracy results.
from pydantic import BaseModel, Field
from dex_sdk.types import ExtractionParameters

# Define your extraction schema using Pydantic
class FinancialData(BaseModel):
    """Schema for financial data extraction"""
    taxable_this_period: float = Field(description="Taxable income for this period in dollars")
    tax_exempt_this_period: float = Field(description="Tax-exempt income for this period in dollars")
    # ... add more fields as needed

# Extract data using prompts and schema
system_prompt = "You are a helpful assistant that extracts financial data from documents with high accuracy."
user_prompt = """
From the provided text, extract the following:
1. **Income Summary**
   - Taxable income for this period
   - Tax-exempt income for this period
   - ... add more extraction instructions
"""

# Extract data using the ExtractionParameters class
extract_result = await parse_result.extract(
    ExtractionParameters(
        user_prompt=user_prompt,
        extraction_schema=FinancialData.model_json_schema(),
        model="openai/gpt-4o",
        generate_citations=True,  # Add source citations
        generate_confidence=True,  # Add confidence scores
    )
)

print("✅ Extraction done")

# Access extraction result data
extraction_data = extract_result.data.model_dump()

Advanced Features

Working with Vector Stores

Vector stores enable semantic search and improved extraction for large documents or multi-file processing:
from dex_sdk.types import VectorStoreEngines

# Create a vector store for the project
vector_store = await project.create_vector_store(
    name="Financial Documents Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

# Add parsed files to the vector store
await vector_store.add_parse_results([parse_result.id])
print("Parse results added to vector store")

# Perform semantic search
search_results = await vector_store.search(
    query="What is the total taxable income?",
    top_k=5,
)

# Search within a specific file
file_search_results = await vector_store.search_in_file(
    file_id=dex_file.id,
    query="What is the total taxable income?",
    top_k=5,
    filters=None,  # Optional filters
)

# Extract from vector store with RAG context
extract_result = await vector_store.extract(
    ExtractionParameters(
        user_prompt=user_prompt,
        extraction_schema=FinancialData.model_json_schema(),
        model="openai/gpt-4o",
        generate_citations=True,
        generate_confidence=True,
    )
)
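
As a rough intuition for what a semantic search with top_k does, the toy sketch below ranks chunks by cosine similarity between a query vector and chunk vectors. This is not how Dex or the SGP knowledge base is implemented; the similarity measure, the three-dimensional vectors, and the names cosine_similarity and top_k_chunks are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, top_k=5):
    """Return (chunk_id, score) pairs for the top_k most similar chunks."""
    scored = [(cid, cosine_similarity(query_vec, vec))
              for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; real embedding models produce far larger vectors.
chunks = {
    "income_summary": [0.9, 0.1, 0.0],
    "account_holder": [0.0, 0.2, 0.9],
    "tax_details": [0.7, 0.6, 0.1],
}
results = top_k_chunks([1.0, 0.2, 0.0], chunks, top_k=2)
```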

Chunking Strategies

Choose the right chunking strategy for your use case (BLOCK, VARIABLE):
from dex_sdk.types import ChunkingStrategy

# Block-level chunking (fine-grained)
parse_result = await dex_file.parse(
    ParseJobRequestParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
                strategy=ChunkingStrategy.BLOCK,
            )
        ),
    )
)

Multi-language Support

Dex supports multiple languages out of the box:

parse_result = await dex_file.parse(
    ParseJobRequestParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            ),
            extraction_mode="ocr",
        ),
    )
)

Troubleshooting

Common Issues

1. VPN Connection Problems
# Test VPN connection
import requests

try:
    # A timeout keeps the check from hanging when the VPN is down
    response = requests.get("https://dex.sgp.scale.com/health", timeout=10)
    if response.status_code == 200:
        print("✅ Connected to Dex")
    else:
        print(f"❌ Connection issue (HTTP {response.status_code}) - check VPN")
except requests.RequestException as e:
    print(f"❌ Cannot reach Dex: {e}")
    print("Connect to Scale VPN and try again")
2. Authentication Errors
# Verify credentials are set
import os

sgp_account_id = os.getenv("SGP_ACCOUNT_ID")
sgp_api_key = os.getenv("SGP_API_KEY")

if not sgp_account_id or not sgp_api_key:
    print("❌ Missing credentials")
    print("Set SGP_ACCOUNT_ID and SGP_API_KEY environment variables")
else:
    print("✅ Credentials found")
Common causes:
  • Missing or incorrect SGP_ACCOUNT_ID and SGP_API_KEY
  • Insufficient permissions on Scale account
  • Account doesn’t have SGP access
3. File Upload Issues
# Check file before upload
import os

file_path = "document.pdf"

if not os.path.exists(file_path):
    print(f"❌ File not found: {file_path}")
elif os.path.getsize(file_path) > 100 * 1024 * 1024:  # 100MB
    print(f"❌ File too large: {os.path.getsize(file_path)} bytes")
else:
    print("✅ File ready for upload")
Supported formats:
  • Images: PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC
  • PDFs: PDF (Portable Document Format)
  • Spreadsheets: CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW
  • Documents: PPTX, PPT, DOCX, DOC, DOTX, WPD, TXT, RTF

Ways to Interact with Dex

The Python SDK provides a high-level, developer-friendly interface:
from dex_sdk import DexClient

# Async client
dex_client = DexClient(base_url="https://dex.sgp.scale.com")

# Sync client (for simpler scripts)
from dex_sdk import DexSyncClient
dex_sync_client = DexSyncClient(base_url="https://dex.sgp.scale.com")

REST API

Direct API access for custom integrations:
# Create a project
curl -X POST https://dex.sgp.scale.com/v1/projects \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "My Project", "credentials": {...}}'

# Upload a file
curl -X POST https://dex.sgp.scale.com/v1/projects/{project_id}/files \
  -H "Authorization: Bearer $API_KEY" \
  -F "file=@document.pdf"

# Parse a document
curl -X POST https://dex.sgp.scale.com/v1/projects/{project_id}/parse \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"file_id": "...", "engine": "reducto", ...}'
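
For callers who prefer Python over curl, the project-creation request above could be built with the standard library as sketched below. The URL, method, and headers mirror the curl example; build_create_project_request is a hypothetical helper, and because the credentials payload is elided in the curl example, it is left as a caller-supplied dict here.

```python
import json
import urllib.request

BASE_URL = "https://dex.sgp.scale.com"


def build_create_project_request(
    api_key: str, name: str, credentials: dict
) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/projects request from the curl example."""
    body = json.dumps({"name": name, "credentials": credentials}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/projects",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# The credentials structure is not shown in this guide, so an empty dict is
# used as a placeholder. Sending would be: urllib.request.urlopen(request, timeout=30)
request = build_create_project_request("example-key", "My Project", credentials={})
```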

API Reference

For complete SDK documentation including all methods, parameters, and examples, see the Dex SDK API Reference. Quick reference:
  • DexClient: Project management (create_project, list_projects)
  • Project: File operations, vector store operations
  • DexFile: Document parsing (parse)
  • ParseResult: Data extraction (extract, extract_async)
  • VectorStore: Indexing, semantic search, RAG-enhanced extraction

Best Practices

Choosing the Right Chunking Strategy

  • Block-level: Use for fine-grained analysis when you need precise location information. Not recommended for embeddings due to small chunk size.
  • Section-level: Best for most use cases. Provides semantic grouping based on document structure (titles, subtitles).
  • Page-level: Ideal for page-by-page analysis and when document structure is page-oriented.
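
The difference between the three granularities can be seen on a toy list of parsed blocks. The sketch below is only an illustration of the grouping idea, not Dex's chunking logic; the (page, type, text) tuple shape and the three helper functions are assumptions.

```python
from itertools import groupby

# Hypothetical parsed blocks in reading order: (page number, block type, text).
blocks = [
    (1, "Title", "Income Summary"),
    (1, "Text", "Taxable income: $1,200"),
    (2, "Text", "Tax-exempt income: $300"),
    (2, "Title", "Fees"),
    (2, "Text", "Advisory fee: $25"),
]


def chunk_by_block(blocks):
    # One chunk per block: precise locations, but each chunk is small.
    return [[text] for _, _, text in blocks]


def chunk_by_page(blocks):
    # One chunk per page; assumes blocks are already ordered by page.
    return [[text for _, _, text in group]
            for _, group in groupby(blocks, key=lambda b: b[0])]


def chunk_by_section(blocks):
    # Start a new chunk at every Title block, so a section can span pages.
    chunks = []
    for _, block_type, text in blocks:
        if block_type == "Title" or not chunks:
            chunks.append([])
        chunks[-1].append(text)
    return chunks
```

In this toy input, the section-level grouping keeps "Tax-exempt income" with its "Income Summary" heading even though it sits on the next page, while the page-level grouping splits them.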

Optimizing Extraction Accuracy

  1. Use clear, specific prompts: Provide detailed instructions about what to extract and how to format it in the user_prompt.
  2. Design good schemas: Use descriptive field names and detailed Field() descriptions in your Pydantic models.
  3. Convert schema properly: Always use YourModel.model_json_schema() when passing to extraction_schema.
  4. Enable citations: Set generate_citations=True to help with debugging and provide auditability.
  5. Enable confidence scores: Set generate_confidence=True to filter low-confidence results for human review.
  6. Use vector stores for large documents: RAG-enhanced extraction improves accuracy for documents exceeding context windows.
  7. Let the SDK handle async: The SDK automatically handles job monitoring for both parsing and extraction.
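
Routing low-confidence fields to human review, as suggested in item 5, might look like the sketch below. The per-field dict with "value" and "confidence" keys is a hypothetical shape chosen for illustration; the actual structure of Dex extraction results may differ, and the 0.8 threshold is arbitrary.

```python
def split_by_confidence(fields: dict, threshold: float = 0.8):
    """Split extracted fields into (accepted, needs_review) by confidence."""
    accepted, needs_review = {}, {}
    for name, item in fields.items():
        # A missing confidence score is treated as low confidence.
        confidence = item.get("confidence")
        if confidence is not None and confidence >= threshold:
            accepted[name] = item["value"]
        else:
            needs_review[name] = item["value"]
    return accepted, needs_review


# Hypothetical result shape, for illustration only.
example = {
    "taxable_this_period": {"value": 1200.0, "confidence": 0.97},
    "tax_exempt_this_period": {"value": 300.0, "confidence": 0.55},
}
accepted, needs_review = split_by_confidence(example)
```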

Common Workflows

Single Document Processing:
  1. Upload file → Parse → Extract → Get results
Multi-Document Analysis:
  1. Upload files → Parse all → Create vector store → Add to store → Extract from store
Iterative Refinement:
  1. Parse with default settings → Review results → Adjust chunking/OCR → Re-parse → Extract
Agentic Extraction:
  1. Define custom MCP tools → Parse documents → Extract with agent tools → Get enriched results
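
For the multi-document workflow, per-file steps such as upload or parse could run concurrently with a cap on how many are in flight at once. The sketch below shows that pattern with asyncio only. run_bounded and fake_parse are hypothetical names, fake_parse stands in for a real awaited SDK call, and the right limit for Dex is unknown and would need to be chosen by the caller.

```python
import asyncio


async def run_bounded(items, worker, limit=4):
    """Apply an async worker to every item, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_parse(path: str) -> str:
    # Hypothetical stand-in for an awaited per-file step such as parsing.
    await asyncio.sleep(0)
    return f"parsed:{path}"


results = asyncio.run(run_bounded(["a.pdf", "b.pdf", "c.pdf"], fake_parse, limit=2))
```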

Contributing

When adding new test cases or examples:
  1. Follow the existing notebook structure
  2. Include clear documentation and comments
  3. Test with various document types
  4. Update this README with new features or examples

Next Steps

Now that you’re familiar with Dex basics, explore these advanced topics:

Learn More

Advanced Topics

  • Custom OCR Integration: Integrate your own OCR engine for specialized documents
  • Agentic Extraction: Use custom MCP tools for complex extraction workflows
  • Multi-lingual Processing: Process documents in multiple languages with language-specific OCR
  • Fine-grained Access Control: Implement ReBAC for secure document processing

Support

For issues or questions:
  • Documentation: Check the troubleshooting section above
  • Slack Support: Contact the Dex team at #sgp-document-understanding-capability
  • Feature Requests: Reach out to the Dex team with your use case

Useful Resources

  • Dex Service Documentation: Internal API docs and architecture
  • SGP Platform Docs: Scale GenAI Platform documentation
  • Example Notebooks: Sample code and workflows

License

This repository is for internal Scale use only.