Quick reference guide for common Dex patterns, imports, and operations.

30-Second Quick Start

import os, asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class Schema(BaseModel):
    field: str = Field(description="Description")

async def main():
    client = DexClient(base_url="https://dex.sgp.scale.com", api_key=os.getenv("SGP_API_KEY"), account_id=os.getenv("SGP_ACCOUNT_ID"))
    project = await client.create_project(name="My Project")
    file = await project.upload_file("doc.pdf")
    parsed = await file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    result = await parsed.extract(extraction_schema=Schema, user_prompt="Extract data", model="openai/gpt-4o")
    print(result.result.data)

asyncio.run(main())

Essential Imports

Basic Setup

import os
from dex_sdk import DexClient

Parsing

from dex_sdk.types import (
    ParseEngine,                    # OCR engine selection
    ReductoParseJobParams,          # Reducto parser config
    ReductoChunkingMethod,          # Chunking methods enum
    ReductoChunkingOptions,         # Chunking configuration
    ReductoParseEngineOptions,      # Parser options
    IrisParseJobParams,             # Iris parser config
    IrisParseEngineOptions,         # Iris options
)

Extraction

from pydantic import BaseModel, Field
from dex_sdk.types import ExtractionParameters  # Advanced extraction config

Vector Stores

from dex_sdk.types import (
    VectorStoreEngines,             # Vector store engine enum
    VectorStoreSearchResult,        # Search result type
)

Projects & Configuration

from dex_sdk.types import (
    ProjectConfiguration,           # Project config
    RetentionPolicy,                # Data retention settings
)
from datetime import timedelta      # For retention periods

Common Operations Cheat Sheet

Initialize Client

dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

Create Project

project = await dex_client.create_project(name="My Project")

Create Project with Retention

project = await dex_client.create_project(
    name="My Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),
            result_artifacts=timedelta(days=7),
        )
    )
)

Upload File

dex_file = await project.upload_file("document.pdf")

Parse Document (Default)

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

Extract Data

extract_result = await parse_result.extract(
    extraction_schema=MySchema,
    user_prompt="Extract the data",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

Create Vector Store

vector_store = await project.create_vector_store(
    name="My Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

Add to Vector Store

await vector_store.add_parse_results([parse_result.data.id])

Search Vector Store

results = await vector_store.search(query="search query", top_k=5)

Extract from Vector Store

result = await vector_store.extract(
    extraction_schema=MySchema,
    user_prompt="Extract from all documents",
    model="openai/gpt-4o",
)

Type Import Quick Lookup

| Type | Import From | Used For |
| --- | --- | --- |
| DexClient | dex_sdk | Client initialization |
| DexSyncClient | dex_sdk | Sync client |
| ParseEngine | dex_sdk.types | OCR engine selection |
| ReductoParseJobParams | dex_sdk.types | Reducto configuration |
| IrisParseJobParams | dex_sdk.types | Iris configuration |
| ReductoChunkingMethod | dex_sdk.types | Chunking method enum |
| ReductoChunkingOptions | dex_sdk.types | Chunking config |
| ReductoParseEngineOptions | dex_sdk.types | Parser options |
| IrisParseEngineOptions | dex_sdk.types | Iris parser options |
| ExtractionParameters | dex_sdk.types | Extraction config |
| VectorStoreEngines | dex_sdk.types | Vector store engines |
| VectorStoreSearchResult | dex_sdk.types | Search results |
| ProjectConfiguration | dex_sdk.types | Project config |
| RetentionPolicy | dex_sdk.types | Data retention |
Response Types (accessed via .data, no import needed):
  • ProjectEntity, FileEntity, ParseResultEntity, ExtractionEntity, VectorStoreEntity
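
DexSyncClient appears in the lookup table but is not demonstrated in this guide. A minimal sketch, assuming it mirrors DexClient's constructor and exposes the same operations as blocking calls (unverified; check your installed SDK version):

import os
from dex_sdk import DexSyncClient

# Assumption: DexSyncClient accepts the same constructor arguments as
# DexClient and returns the same wrapper objects, without await.
sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)
project = sync_client.create_project(name="My Project")  # no await
print(project.data.id)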

Comparison Tables

OCR Engine Comparison

| Engine | Best For | Languages | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| Reducto | English & Latin-script documents, tables, figures | Latin scripts + 35+ languages | Fast | High |
| Iris | Non-English, non-Latin scripts (Arabic, Hebrew, CJK, Indic) | Non-Latin scripts | Medium | Very High |
| Custom | Domain-specific needs | Any | Varies | Varies |

Recommendation: Use Reducto for English and Latin-script documents. Use Iris for non-Latin scripts (Arabic, Chinese, Japanese, Korean, Hebrew, Thai, Hindi, etc.).
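
IrisParseJobParams and IrisParseEngineOptions are imported under Parsing above but not shown in use. A sketch for a non-Latin-script document, assuming Iris is selected the same way as Reducto (the ParseEngine.IRIS member and Iris option defaults are assumptions, not confirmed in this guide):

parse_result = await dex_file.parse(
    IrisParseJobParams(
        engine=ParseEngine.IRIS,  # assumed enum member, mirroring ParseEngine.REDUCTO
        options=IrisParseEngineOptions(),  # defaults; Iris-specific fields not documented here
    )
)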

Chunking Method Comparison

| Method | Chunk Size | Best For | Embedding | Location Tracking |
| --- | --- | --- | --- | --- |
| VARIABLE | Auto (optimal) | General use, embeddings | ✅ Excellent | ⚠️ Good |
| BLOCK | Small (~100-500 chars) | Precise locations, UI overlays | ❌ Too small | ✅ Excellent |
| SECTION | Medium (~1000-3000 chars) | Structured documents | ✅ Good | ✅ Good |
| PAGE | Large (full page) | Page-oriented docs | ⚠️ May be large | ✅ Excellent |
| PAGE_SECTIONS | Medium-Large | Hybrid needs | ✅ Good | ✅ Good |
| DISABLED | Very large (entire doc) | Special cases | ❌ Too large | ✅ Excellent |

Recommendation: Use VARIABLE for most cases, especially with embeddings.
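
SECTION is recommended above for structured documents but is not demonstrated elsewhere in this guide; it uses the same call shape as the VARIABLE example:

# Same Reducto parse call, with SECTION chunking for structured documents
section_params = ReductoParseJobParams(
    engine=ParseEngine.REDUCTO,
    options=ReductoParseEngineOptions(
        chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.SECTION)
    ),
)
parse_result = await dex_file.parse(section_params)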

LLM Model Comparison

| Model | Speed | Cost | Best For |
| --- | --- | --- | --- |
| openai/gpt-4o | Medium | $$$ | Highest accuracy, complex extraction |
| openai/gpt-4o-mini | Fast | $ | Good accuracy, simpler extraction |
| anthropic/claude-3.5-sonnet | Medium | $$$ | Complex reasoning, long context |

Recommendation: Start with gpt-4o, optimize to gpt-4o-mini if accuracy is sufficient.
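
One way to check whether gpt-4o-mini is sufficient: run the same extraction with both models and diff the fields. A sketch using only calls shown in this guide (extracted values may be rich field objects, so equality is approximate):

full = await parse_result.extract(
    extraction_schema=MySchema, user_prompt="Extract the data", model="openai/gpt-4o"
)
mini = await parse_result.extract(
    extraction_schema=MySchema, user_prompt="Extract the data", model="openai/gpt-4o-mini"
)
# Fields where the cheaper model disagrees with gpt-4o
diffs = {k: (full.result.data[k], mini.result.data[k])
         for k in full.result.data
         if full.result.data[k] != mini.result.data[k]}
print(f"{len(diffs)} fields differ between gpt-4o and gpt-4o-mini")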

Common Patterns

Pattern: Process Multiple Files

files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
dex_files = await asyncio.gather(*[project.upload_file(f) for f in files])
parsed = await asyncio.gather(*[f.parse(parse_params) for f in dex_files])
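
If large batches trigger RateLimitError (see Common Errors below), cap concurrency with a semaphore; a sketch using only stdlib asyncio, with parse_params as defined above:

# Limit concurrent uploads and parses to stay under rate limits
sem = asyncio.Semaphore(4)  # tune to your account's limits

async def upload_and_parse(path):
    async with sem:
        dex_file = await project.upload_file(path)
        return await dex_file.parse(parse_params)

parsed = await asyncio.gather(*[upload_and_parse(f) for f in files])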

Pattern: Extract with Citations

result = await parse_result.extract(
    extraction_schema=Schema,
    user_prompt="Extract data",
    model="openai/gpt-4o",
    generate_citations=True,  # Get source locations
)

for field_name, field in result.result.data.items():
    if field.citations:
        print(f"{field_name} found on page {field.citations[0].page}")

Pattern: Filter Low Confidence Results

result = await parse_result.extract(
    extraction_schema=Schema,
    user_prompt="Extract data",
    model="openai/gpt-4o",
    generate_confidence=True,
)

high_confidence = {
    k: v for k, v in result.result.data.items()
    if v.confidence and v.confidence > 0.8
}

Pattern: RAG for Large Documents

# 1. Create vector store
vs = await project.create_vector_store(
    name="Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

# 2. Add documents
await vs.add_parse_results([parse_result.data.id])

# 3. Extract with context
result = await vs.extract(
    extraction_schema=Schema,
    user_prompt="Extract from all docs",
    model="openai/gpt-4o",
)

Pattern: Retry with Different Chunking

# Try 1: Variable chunking
result1 = await file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

# If not satisfactory, try block chunking
result2 = await file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.BLOCK,
            )
        ),
    )
)

Common Errors & Quick Fixes

| Error | Cause | Quick Fix |
| --- | --- | --- |
| AuthenticationError | Missing/invalid credentials | Check SGP_API_KEY and SGP_ACCOUNT_ID env vars |
| FileUploadError | Unsupported format or too large | Check file type, reduce size |
| ParsingError | OCR failure | Try a different engine or check document quality |
| ExtractionError | Invalid schema | Validate Pydantic model, check field types |
| ConnectionError | Network issues | Check internet connection, verify base URL |
| RateLimitError | Too many requests | Implement backoff/retry, reduce concurrency |
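
A minimal exponential-backoff wrapper for the RateLimitError row. The SDK's exception import path isn't documented in this guide, so this sketch catches a generic Exception; narrow it once you know the real class:

import asyncio

async def with_backoff(make_call, retries=5, base_delay=1.0):
    # make_call is a zero-arg callable returning a fresh coroutine,
    # e.g. lambda: vector_store.search(query="q", top_k=5)
    for attempt in range(retries):
        try:
            return await make_call()
        except Exception:  # TODO: catch the SDK's RateLimitError specifically
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)

results = await with_backoff(lambda: vector_store.search(query="search query", top_k=5))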

Access Response Data

Remember: SDK methods return wrapper objects; access the underlying data via .data:
# ✅ Correct
project = await client.create_project(name="Test")
project_id = project.data.id                    # Access via .data
project_name = project.data.name

# ❌ Incorrect
project_id = project.id                         # Won't work

Common Response Attributes:
project.data.id                                 # Project ID
project.data.name                               # Project name
project.data.created_at                         # Creation timestamp

dex_file.data.id                                # File ID
dex_file.data.filename                          # Filename
dex_file.data.size_bytes                        # File size

parse_result.data.id                            # Parse result ID
parse_result.data.engine                        # Engine used
parse_result.data.parse_metadata.pages_processed  # Pages count

extract_result.result.data                      # Extracted data dict
extract_result.result.usage_info.total_tokens   # Token usage

Decision Trees

When to Use Vector Stores?

Is your document > 50 pages?
├─ Yes → Use vector store
└─ No → Do you have multiple documents to query together?
   ├─ Yes → Use vector store
   └─ No → Extract directly from parse result
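
The tree can be encoded as a small helper; pages_processed is the parse-metadata attribute shown under Access Response Data above:

# Decide between direct extraction and a vector store
def needs_vector_store(parse_results) -> bool:
    if len(parse_results) > 1:
        return True  # multiple documents queried together
    return parse_results[0].data.parse_metadata.pages_processed > 50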

Which Chunking Method?

What's your primary goal?
├─ Embeddings/Semantic Search → VARIABLE
├─ Precise bounding boxes/UI → BLOCK
├─ Page-by-page processing → PAGE
└─ Structured document navigation → SECTION

Which OCR Engine?

What language/script is your document?
├─ English or Latin scripts (Spanish, French, German, etc.) → Reducto
├─ Non-Latin scripts (Arabic, Hebrew, CJK, Indic, Thai, etc.) → Iris
├─ Domain-specific (medical, legal with custom needs) → Custom
└─ Not sure → Start with Reducto for Latin scripts, Iris for non-Latin

Next Steps