
30-Second Quick Start

import os, asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class Schema(BaseModel):
    field: str = Field(description="Description")

async def main():
    client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )
    project = await client.create_project(name="My Project")
    file = await project.upload_file("doc.pdf")
    parsed = await file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
            ),
        )
    )
    result = await parsed.extract(
        extraction_schema=Schema,
        user_prompt="Extract data",
        model="openai/gpt-4o",
    )
    print(result.result.data)

asyncio.run(main())

Synchronous Client

For synchronous workflows, use DexSyncClient from dex_sdk:
import os

from dex_sdk import DexSyncClient

sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# All methods work synchronously
project = sync_client.create_project(name="My Project")
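
The quick start above translates directly to the sync client; a minimal sketch, assuming the sync wrappers expose the same method names and return shapes as the async ones, just without await:
# Assumes the sync client mirrors the async API shown in the quick start.
file = project.upload_file("doc.pdf")
parsed = file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
        ),
    )
)
result = parsed.extract(extraction_schema=Schema, user_prompt="Extract data", model="openai/gpt-4o")
print(result.result.data)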

Data Retention Policies

Configure automatic data lifecycle management for compliance and cost optimization.

Setting Retention Policies

from datetime import timedelta
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

# Create project with retention policy
project = await dex_client.create_project(
    name="Compliant Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),           # Files expire after 30 days
            result_artifacts=timedelta(days=7),  # Parse/extract results expire after 7 days
        )
    )
)

Updating Retention Policies

await dex_client.update_project(
    project_id=project.data.id,  # wrapper objects expose fields via .data
    updates={
        "configuration": ProjectConfiguration(
            retention=RetentionPolicy(
                files=timedelta(days=90),        # Extend to 90 days
                result_artifacts=timedelta(days=30),  # Extend to 30 days
            )
        )
    }
)

Retention Policy Use Cases

Compliance (GDPR, HIPAA):
retention=RetentionPolicy(
    files=timedelta(days=2555),  # 7 years
    result_artifacts=timedelta(days=365),  # 1 year
)
Cost Optimization:
retention=RetentionPolicy(
    files=timedelta(days=30),    # Raw files for 1 month
    result_artifacts=timedelta(days=7),   # Results for 1 week
)
Security (Minimize Exposure):
retention=RetentionPolicy(
    files=timedelta(days=1),     # Delete files after 1 day
    result_artifacts=timedelta(hours=24),  # Delete results after 24 hours
)

Optimizing Extraction Accuracy

  1. Write Clear Prompts
    # Good: Specific and detailed
    user_prompt = """Extract the following fields:
    - Invoice number: Find in top-right corner or header
    - Total amount: The final amount including tax
    - Date: Invoice date, not payment date"""
    
    # Bad: Vague
    user_prompt = "Extract invoice info"
    
  2. Design Good Schemas
    # Good: Descriptive fields with clear types
    class InvoiceData(BaseModel):
        invoice_number: str = Field(description="The invoice number (format: INV-XXXXX)")
        total_amount: float = Field(description="Total amount in USD, including tax")
        invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format")
    
    # Bad: No descriptions
    class InvoiceData(BaseModel):
        number: str
        amount: float
        date: str
    
  3. Enable Citations and Confidence (see the sketch after this list)
    • Always set generate_citations=True for debugging
    • Use generate_confidence=True to filter low-confidence results
  4. Use Vector Stores for Large Documents
    • Documents > 50 pages benefit from RAG-enhanced extraction
    • Vector stores improve accuracy for cross-document queries
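
A minimal sketch of point 3, assuming generate_citations and generate_confidence are passed as keyword arguments to extract alongside the parameters shown in the quick start (their exact placement may differ by SDK version):
# Hypothetical parameter placement; InvoiceData is the schema from point 2.
result = await parsed.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice fields",
    model="openai/gpt-4o",
    generate_citations=True,    # attach source citations for debugging
    generate_confidence=True,   # attach confidence scores for filtering
)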

Getting SGP Traces for Extraction

Use start_extract_job instead of extract to get async jobs that are linked to SGP traces. Search for traces by job ID to debug extraction latency, token usage, or failures.
import os
from scale_gp_beta import SGPClient
from dex_sdk.types import ExtractionParameters

# Start extract job (returns immediately)
extract_job = await parse_result.start_extract_job(
    ExtractionParameters(
        model="openai/gpt-4o",
        extraction_schema=MySchema.model_json_schema(),
        user_prompt="Extract the data",
    )
)

# Wait for completion
result = await extract_job.get_result()

# Retrieve SGP traces for the extract job
sgp_client = SGPClient(
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

spans = list(sgp_client.spans.search(
    sort_by="created_at",
    sort_order="desc",
    extra_metadata={"job_id": extract_job.data.id},
    parents_only=True,
))

if spans:
    trace_id = spans[0].trace_id
    all_spans = list(sgp_client.spans.search(trace_ids=[trace_id]))
    for span in all_spans:
        print(f"Span: {span.name}, Duration: {span.duration_ms}ms")

Common Workflows

Single Document Processing:
Upload → Parse → Extract → Review Results
Multi-Document Analysis (see the sketch after this list):
Upload All → Parse All → Create Vector Store → Add to Store → Extract → Aggregate
Iterative Refinement:
Parse → Review → Adjust Chunking/Engine → Re-parse → Extract
Production Pipeline:
Upload → Parse → Store in Vector DB → Extract on Demand → Cache Results
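
A minimal sketch of the upload-and-parse stage of the multi-document workflow, assuming upload_file and parse behave as in the quick start (the vector store steps are omitted since their API is not shown here):
import asyncio

# Upload and parse every file concurrently rather than one at a time.
async def parse_all(project, paths: list[str]):
    async def parse_one(path: str):
        file = await project.upload_file(path)
        return await file.parse(
            ReductoParseJobParams(
                engine=ParseEngine.REDUCTO,
                options=ReductoParseEngineOptions(
                    chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
                ),
            )
        )

    return await asyncio.gather(*(parse_one(p) for p in paths))

# Inside an async context:
# parsed_docs = await parse_all(project, ["a.pdf", "b.pdf", "c.pdf"])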

Performance Optimization

Reduce Latency

  1. Use appropriate chunking: Smaller chunks = faster parsing
  2. Limit OCR scope: Only process pages you need
  3. Batch operations: Process multiple files in parallel
  4. Cache parse results: Reuse parsed documents for multiple extractions (see the sketch below)
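
A minimal sketch of points 3 and 4: one cached parse result reused for several concurrent extractions. SummarySchema is a placeholder Pydantic model, and extract is assumed to match the quick start signature:
import asyncio

# Parse once, extract many times: reuse the same parse result instead of
# re-parsing the document for each extraction. Run inside an async context.
summary, invoice = await asyncio.gather(
    parsed.extract(
        extraction_schema=SummarySchema,  # hypothetical schema
        user_prompt="Summarize the document",
        model="openai/gpt-4o",
    ),
    parsed.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice fields",
        model="openai/gpt-4o",
    ),
)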

Reduce Costs

  1. Set retention policies: Auto-delete old data
  2. Choose right model: Use smaller models when possible
  3. Optimize prompts: Shorter prompts = lower token costs
  4. Filter before extraction: Use vector search to find relevant chunks first

Common Errors & Quick Fixes

| Error | Cause | Quick Fix |
| --- | --- | --- |
| AuthenticationError | Missing/invalid credentials | Check SGP_API_KEY and SGP_ACCOUNT_ID env vars |
| FileUploadError | Unsupported format or too large | Check file type, reduce size |
| ParsingError | OCR failure | Try different engine or check document quality |
| ExtractionError | Invalid schema | Validate Pydantic model, check field types |
| ConnectionError | Network issues | Check internet connection, verify base URL |
| RateLimitError | Too many requests | Implement backoff/retry, reduce concurrency |
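
A minimal backoff/retry sketch for RateLimitError; the exception's import location is an assumption, so adjust it to wherever your SDK version exposes it:
import asyncio

from dex_sdk import RateLimitError  # assumed location; adjust as needed

async def extract_with_retry(parsed, max_attempts: int = 5, **extract_kwargs):
    """Retry extract() with exponential backoff when rate limited."""
    for attempt in range(max_attempts):
        try:
            return await parsed.extract(**extract_kwargs)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...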

Access Response Data

Remember: SDK methods return wrapper objects; access the underlying data via .data:
# Correct
project = await client.create_project(name="Test")
project_id = project.data.id
project_name = project.data.name

# Incorrect
project_id = project.id  # Won't work

Common Response Attributes:
project.data.id                                 # Project ID
project.data.name                               # Project name
project.data.created_at                         # Creation timestamp

dex_file.data.id                                # File ID
dex_file.data.filename                          # Filename
dex_file.data.size_bytes                        # File size

parse_result.data.id                            # Parse result ID
parse_result.data.engine                        # Engine used
parse_result.data.parse_metadata.pages_processed  # Pages count

extract_result.result.data                      # Extracted data dict
extract_result.result.usage_info.total_tokens   # Token usage

Type Import Quick Lookup

| Type | Import From | Used For |
| --- | --- | --- |
| DexClient | dex_sdk | Client initialization |
| DexSyncClient | dex_sdk | Sync client |
| ParseEngine | dex_sdk.types | OCR engine selection |
| ReductoParseJobParams | dex_sdk.types | Reducto configuration |
| IrisParseJobParams | dex_sdk.types | Iris configuration |
| ReductoChunkingMethod | dex_sdk.types | Chunking method enum |
| ReductoChunkingOptions | dex_sdk.types | Chunking config |
| ReductoParseEngineOptions | dex_sdk.types | Parser options |
| IrisParseEngineOptions | dex_sdk.types | Iris parser options |
| ExtractionParameters | dex_sdk.types | Extraction config |
| VectorStoreEngines | dex_sdk.types | Vector store engines |
| VectorStoreSearchResult | dex_sdk.types | Search results |
| ProjectConfiguration | dex_sdk.types | Project config |
| RetentionPolicy | dex_sdk.types | Data retention |
| PaginationParams | dex_sdk.types | Pagination config |
| FileListFilter | dex_sdk.types | File filtering |
| JobListFilter | dex_sdk.types | Job filtering |
| ParseResultListFilter | dex_sdk.types | Parse result filtering |
| ExtractionListFilter | dex_sdk.types | Extraction filtering |
| VectorStoreListFilter | dex_sdk.types | Vector store filtering |
| JobStatus | dex_sdk.types | Job status enum |
Response Types (accessed via .data, no import needed):
  • ProjectEntity, FileEntity, ParseResultEntity, ExtractionEntity, VectorStoreEntity
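
Taken together, the imports from the table above look like this:
from dex_sdk import DexClient, DexSyncClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    IrisParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
    IrisParseEngineOptions,
    ExtractionParameters,
    VectorStoreEngines,
    VectorStoreSearchResult,
    ProjectConfiguration,
    RetentionPolicy,
    PaginationParams,
    FileListFilter,
    JobListFilter,
    ParseResultListFilter,
    ExtractionListFilter,
    VectorStoreListFilter,
    JobStatus,
)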

Next Steps