Quick reference guide for common Dex patterns, imports, and operations.

30-Second Quick Start

import os, asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class Schema(BaseModel):
    field: str = Field(description="Description")

async def main():
    client = DexClient(base_url="https://dex.sgp.scale.com", api_key=os.getenv("SGP_API_KEY"), account_id=os.getenv("SGP_ACCOUNT_ID"))
    project = await client.create_project(name="My Project")
    file = await project.upload_file("doc.pdf")
    parsed = await file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    result = await parsed.extract(extraction_schema=Schema, user_prompt="Extract data", model="openai/gpt-4o")
    print(result.result.data)

asyncio.run(main())

Essential Imports

Basic Setup

import os
from dex_sdk import DexClient

Parsing

from dex_sdk.types import (
    ParseEngine,                    # OCR engine selection
    ReductoParseJobParams,          # Reducto parser config
    ReductoChunkingMethod,          # Chunking methods enum
    ReductoChunkingOptions,         # Chunking configuration
    ReductoParseEngineOptions,      # Parser options
    IrisParseJobParams,             # Iris parser config
    IrisParseEngineOptions,         # Iris options
)

Extraction

from pydantic import BaseModel, Field
from dex_sdk.types import ExtractionParameters  # Advanced extraction config

Vector Stores

from dex_sdk.types import (
    VectorStoreEngines,             # Vector store engine enum
    VectorStoreSearchResult,        # Search result type
)

Projects & Configuration

from dex_sdk.types import (
    ProjectConfiguration,           # Project config
    RetentionPolicy,                # Data retention settings
)
from datetime import timedelta      # For retention periods

Common Operations Cheat Sheet

Initialize Client

dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

Create Project

project = await dex_client.create_project(name="My Project")

Create Project with Retention

project = await dex_client.create_project(
    name="My Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),
            result_artifacts=timedelta(days=7),
        )
    )
)

Upload File

dex_file = await project.upload_file("document.pdf")

Parse Document (Default)

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

Extract Data

extract_result = await parse_result.extract(
    extraction_schema=MySchema,
    user_prompt="Extract the data",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

Create Vector Store

vector_store = await project.create_vector_store(
    name="My Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

Add to Vector Store

await vector_store.add_parse_results([parse_result.data.id])

Search Vector Store

results = await vector_store.search(query="search query", top_k=5)

Extract from Vector Store

result = await vector_store.extract(
    extraction_schema=MySchema,
    user_prompt="Extract from all documents",
    model="openai/gpt-4o",
)

Type Import Quick Lookup

| Type | Import From | Used For |
| --- | --- | --- |
| DexClient | dex_sdk | Client initialization |
| DexSyncClient | dex_sdk | Sync client |
| ParseEngine | dex_sdk.types | OCR engine selection |
| ReductoParseJobParams | dex_sdk.types | Reducto configuration |
| IrisParseJobParams | dex_sdk.types | Iris configuration |
| ReductoChunkingMethod | dex_sdk.types | Chunking method enum |
| ReductoChunkingOptions | dex_sdk.types | Chunking config |
| ReductoParseEngineOptions | dex_sdk.types | Parser options |
| IrisParseEngineOptions | dex_sdk.types | Iris parser options |
| ExtractionParameters | dex_sdk.types | Extraction config |
| VectorStoreEngines | dex_sdk.types | Vector store engines |
| VectorStoreSearchResult | dex_sdk.types | Search results |
| ProjectConfiguration | dex_sdk.types | Project config |
| RetentionPolicy | dex_sdk.types | Data retention |
Response Types (accessed via .data, no import needed):
  • ProjectEntity, FileEntity, ParseResultEntity, ExtractionEntity, VectorStoreEntity
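
DexSyncClient appears in the lookup table but is not demonstrated in this guide. A minimal sketch, assuming it mirrors DexClient's constructor and exposes the same operations as blocking calls (unverified; check your installed SDK version):

import os
from dex_sdk import DexSyncClient

# Assumption: DexSyncClient accepts the same constructor arguments as
# DexClient and returns the same wrapper objects, without await.
sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)
project = sync_client.create_project(name="My Project")  # no await
print(project.data.id)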

Comparison Tables

OCR Engine Comparison

| Engine | Best For | Languages | Speed | Accuracy |
| --- | --- | --- | --- | --- |
| Reducto | English & Latin-script documents, tables, figures | Latin scripts + 35+ languages | Fast | High |
| Iris | Non-English, non-Latin scripts (Arabic, Hebrew, CJK, Indic) | Non-Latin scripts | Medium | Very High |
| Custom | Domain-specific needs | Any | Varies | Varies |

Recommendation: Use Reducto for English and Latin-script documents. Use Iris for non-Latin scripts (Arabic, Chinese, Japanese, Korean, Hebrew, Thai, Hindi, etc.).
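
IrisParseJobParams and IrisParseEngineOptions are imported under Parsing above but not shown in use. A sketch for a non-Latin-script document, assuming Iris is selected the same way as Reducto (the ParseEngine.IRIS member and Iris option defaults are assumptions, not confirmed in this guide):

parse_result = await dex_file.parse(
    IrisParseJobParams(
        engine=ParseEngine.IRIS,  # assumed enum member, mirroring ParseEngine.REDUCTO
        options=IrisParseEngineOptions(),  # defaults; Iris-specific fields not documented here
    )
)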

Chunking Method Comparison

| Method | Chunk Size | Best For | Embedding | Location Tracking |
| --- | --- | --- | --- | --- |
| VARIABLE | Auto (optimal) | General use, embeddings | ✅ Excellent | ⚠️ Good |
| BLOCK | Small (~100-500 chars) | Precise locations, UI overlays | ❌ Too small | ✅ Excellent |
| SECTION | Medium (~1000-3000 chars) | Structured documents | ✅ Good | ✅ Good |
| PAGE | Large (full page) | Page-oriented docs | ⚠️ May be large | ✅ Excellent |
| PAGE_SECTIONS | Medium-Large | Hybrid needs | ✅ Good | ✅ Good |
| DISABLED | Very large (entire doc) | Special cases | ❌ Too large | ✅ Excellent |

Recommendation: Use VARIABLE for most cases, especially with embeddings.
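
SECTION is recommended above for structured documents but is not demonstrated elsewhere in this guide; it uses the same call shape as the VARIABLE example:

# Same Reducto parse call, with SECTION chunking for structured documents
section_params = ReductoParseJobParams(
    engine=ParseEngine.REDUCTO,
    options=ReductoParseEngineOptions(
        chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.SECTION)
    ),
)
parse_result = await dex_file.parse(section_params)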

LLM Model Comparison

| Model | Speed | Cost | Best For |
| --- | --- | --- | --- |
| openai/gpt-4o | Medium | $$$ | Highest accuracy, complex extraction |
| openai/gpt-4o-mini | Fast | $ | Good accuracy, simpler extraction |
| anthropic/claude-3.5-sonnet | Medium | $$$ | Complex reasoning, long context |

Recommendation: Start with gpt-4o, optimize to gpt-4o-mini if accuracy is sufficient.
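
One way to check whether gpt-4o-mini is sufficient: run the same extraction with both models and diff the fields. A sketch using only calls shown in this guide (extracted values may be rich field objects, so equality is approximate):

full = await parse_result.extract(
    extraction_schema=MySchema, user_prompt="Extract the data", model="openai/gpt-4o"
)
mini = await parse_result.extract(
    extraction_schema=MySchema, user_prompt="Extract the data", model="openai/gpt-4o-mini"
)
# Fields where the cheaper model disagrees with gpt-4o
diffs = {k: (full.result.data[k], mini.result.data[k])
         for k in full.result.data
         if full.result.data[k] != mini.result.data[k]}
print(f"{len(diffs)} fields differ between gpt-4o and gpt-4o-mini")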

Common Patterns

Pattern: Process Multiple Files

files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
dex_files = await asyncio.gather(*[project.upload_file(f) for f in files])
parsed = await asyncio.gather(*[f.parse(parse_params) for f in dex_files])
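
If large batches trigger RateLimitError (see Common Errors below), cap concurrency with a semaphore; a sketch using only stdlib asyncio, with parse_params as defined above:

# Limit concurrent uploads and parses to stay under rate limits
sem = asyncio.Semaphore(4)  # tune to your account's limits

async def upload_and_parse(path):
    async with sem:
        dex_file = await project.upload_file(path)
        return await dex_file.parse(parse_params)

parsed = await asyncio.gather(*[upload_and_parse(f) for f in files])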

Pattern: Extract with Citations

result = await parse_result.extract(
    extraction_schema=Schema,
    user_prompt="Extract data",
    model="openai/gpt-4o",
    generate_citations=True,  # Get source locations
)

for field_name, field in result.result.data.items():
    if field.citations:
        print(f"{field_name} found on page {field.citations[0].page}")

Pattern: Filter Low Confidence Results

result = await parse_result.extract(
    extraction_schema=Schema,
    user_prompt="Extract data",
    model="openai/gpt-4o",
    generate_confidence=True,
)

high_confidence = {
    k: v for k, v in result.result.data.items()
    if v.confidence and v.confidence > 0.8
}

Pattern: RAG for Large Documents

# 1. Create vector store
vs = await project.create_vector_store(
    name="Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

# 2. Add documents
await vs.add_parse_results([parse_result.data.id])

# 3. Extract with context
result = await vs.extract(
    extraction_schema=Schema,
    user_prompt="Extract from all docs",
    model="openai/gpt-4o",
)

Pattern: Retry with Different Chunking

# Try 1: Variable chunking
result1 = await file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

# If not satisfactory, try block chunking
result2 = await file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.BLOCK,
            )
        ),
    )
)

Common Errors & Quick Fixes

| Error | Cause | Quick Fix |
| --- | --- | --- |
| AuthenticationError | Missing/invalid credentials | Check SGP_API_KEY and SGP_ACCOUNT_ID env vars |
| FileUploadError | Unsupported format or too large | Check file type, reduce size |
| ParsingError | OCR failure | Try a different engine or check document quality |
| ExtractionError | Invalid schema | Validate Pydantic model, check field types |
| ConnectionError | Network issues | Check internet connection, verify base URL |
| RateLimitError | Too many requests | Implement backoff/retry, reduce concurrency |
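
A minimal exponential-backoff wrapper for the RateLimitError row. The SDK's exception import path isn't documented in this guide, so this sketch catches a generic Exception; narrow it once you know the real class:

import asyncio

async def with_backoff(make_call, retries=5, base_delay=1.0):
    # make_call is a zero-arg callable returning a fresh coroutine,
    # e.g. lambda: vector_store.search(query="q", top_k=5)
    for attempt in range(retries):
        try:
            return await make_call()
        except Exception:  # TODO: catch the SDK's RateLimitError specifically
            if attempt == retries - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)

results = await with_backoff(lambda: vector_store.search(query="search query", top_k=5))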

Access Response Data

Remember: SDK methods return wrapper objects; access the underlying data via .data:
# ✅ Correct
project = await client.create_project(name="Test")
project_id = project.data.id                    # Access via .data
project_name = project.data.name

# ❌ Incorrect
project_id = project.id                         # Won't work

Common Response Attributes:
project.data.id                                 # Project ID
project.data.name                               # Project name
project.data.created_at                         # Creation timestamp

dex_file.data.id                                # File ID
dex_file.data.filename                          # Filename
dex_file.data.size_bytes                        # File size

parse_result.data.id                            # Parse result ID
parse_result.data.engine                        # Engine used
parse_result.data.parse_metadata.pages_processed  # Pages count

extract_result.result.data                      # Extracted data dict
extract_result.result.usage_info.total_tokens   # Token usage

Decision Trees

When to Use Vector Stores?

Is your document > 50 pages?
├─ Yes → Use vector store
└─ No → Do you have multiple documents to query together?
   ├─ Yes → Use vector store
   └─ No → Extract directly from parse result
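
The tree can be encoded as a small helper; pages_processed is the parse-metadata attribute shown under Access Response Data above:

# Decide between direct extraction and a vector store
def needs_vector_store(parse_results) -> bool:
    if len(parse_results) > 1:
        return True  # multiple documents queried together
    return parse_results[0].data.parse_metadata.pages_processed > 50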

Which Chunking Method?

What's your primary goal?
├─ Embeddings/Semantic Search → VARIABLE
├─ Precise bounding boxes/UI → BLOCK
├─ Page-by-page processing → PAGE
└─ Structured document navigation → SECTION

Which OCR Engine?

What language/script is your document?
├─ English or Latin scripts (Spanish, French, German, etc.) → Reducto
├─ Non-Latin scripts (Arabic, Hebrew, CJK, Indic, Thai, etc.) → Iris
├─ Domain-specific (medical, legal with custom needs) → Custom
└─ Not sure → Start with Reducto for Latin scripts, Iris for non-Latin

Next Steps