
30-Second Quick Start

import os, asyncio
from pydantic import BaseModel, Field
from dex_sdk import DexClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

class Schema(BaseModel):
    field: str = Field(description="Description")

async def main():
    client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )
    project = await client.create_project(name="My Project")
    file = await project.upload_file("doc.pdf")
    parsed = await file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
            ),
        )
    )
    result = await parsed.extract(
        extraction_schema=Schema,
        user_prompt="Extract data",
        model="openai/gpt-4o",
    )
    print(result.result.data)

asyncio.run(main())

Synchronous Client

For synchronous workflows, use DexSyncClient from dex_sdk:
import os

from dex_sdk import DexSyncClient

sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# All methods work synchronously
project = sync_client.create_project(name="My Project")
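
The quick start above translates directly to the sync client; a minimal sketch, assuming the sync wrappers expose the same method names and return shapes as the async ones, just without await:
# Assumes the sync client mirrors the async API shown in the quick start.
file = project.upload_file("doc.pdf")
parsed = file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
        ),
    )
)
result = parsed.extract(extraction_schema=Schema, user_prompt="Extract data", model="openai/gpt-4o")
print(result.result.data)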

Data Retention Policies

Configure automatic data lifecycle management for compliance and cost optimization.

Setting Retention Policies

from datetime import timedelta
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

# Create project with retention policy
project = await dex_client.create_project(
    name="Compliant Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),           # Files expire after 30 days
            result_artifacts=timedelta(days=7),  # Parse/extract results expire after 7 days
        )
    )
)

Updating Retention Policies

await dex_client.update_project(
    project_id=project.data.id,  # wrapper objects expose fields via .data
    updates={
        "configuration": ProjectConfiguration(
            retention=RetentionPolicy(
                files=timedelta(days=90),        # Extend to 90 days
                result_artifacts=timedelta(days=30),  # Extend to 30 days
            )
        )
    }
)

Retention Policy Use Cases

Compliance (GDPR, HIPAA):
retention=RetentionPolicy(
    files=timedelta(days=2555),  # 7 years
    result_artifacts=timedelta(days=365),  # 1 year
)
Cost Optimization:
retention=RetentionPolicy(
    files=timedelta(days=30),    # Raw files for 1 month
    result_artifacts=timedelta(days=7),   # Results for 1 week
)
Security (Minimize Exposure):
retention=RetentionPolicy(
    files=timedelta(days=1),     # Delete files after 1 day
    result_artifacts=timedelta(hours=24),  # Delete results after 24 hours
)

Optimizing Extraction Accuracy

  1. Write Clear Prompts
    # Good: Specific and detailed
    user_prompt = """Extract the following fields:
    - Invoice number: Find in top-right corner or header
    - Total amount: The final amount including tax
    - Date: Invoice date, not payment date"""
    
    # Bad: Vague
    user_prompt = "Extract invoice info"
    
  2. Design Good Schemas
    # Good: Descriptive fields with clear types
    class InvoiceData(BaseModel):
        invoice_number: str = Field(description="The invoice number (format: INV-XXXXX)")
        total_amount: float = Field(description="Total amount in USD, including tax")
        invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format")
    
    # Bad: No descriptions
    class InvoiceData(BaseModel):
        number: str
        amount: float
        date: str
    
  3. Enable Citations and Confidence (see the sketch after this list)
    • Always set generate_citations=True for debugging
    • Use generate_confidence=True to filter low-confidence results
  4. Use Vector Stores for Large Documents
    • Documents > 50 pages benefit from RAG-enhanced extraction
    • Vector stores improve accuracy for cross-document queries
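
A minimal sketch of point 3, assuming generate_citations and generate_confidence are passed as keyword arguments to extract alongside the parameters shown in the quick start (their exact placement may differ by SDK version):
# Hypothetical parameter placement; InvoiceData is the schema from point 2.
result = await parsed.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice fields",
    model="openai/gpt-4o",
    generate_citations=True,    # attach source citations for debugging
    generate_confidence=True,   # attach confidence scores for filtering
)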

Getting SGP Traces for Extraction

Use start_extract_job instead of extract to get async jobs that are linked to SGP traces. Search for traces by job ID to debug extraction latency, token usage, or failures.
import os
from scale_gp_beta import SGPClient
from dex_sdk.types import ExtractionParameters

# Start extract job (returns immediately)
extract_job = await parse_result.start_extract_job(
    ExtractionParameters(
        model="openai/gpt-4o",
        extraction_schema=MySchema.model_json_schema(),
        user_prompt="Extract the data",
    )
)

# Wait for completion
result = await extract_job.get_result()

# Retrieve SGP traces for the extract job
sgp_client = SGPClient(
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

spans = list(sgp_client.spans.search(
    sort_by="created_at",
    sort_order="desc",
    extra_metadata={"job_id": extract_job.data.id},
    parents_only=True,
))

if spans:
    trace_id = spans[0].trace_id
    all_spans = list(sgp_client.spans.search(trace_ids=[trace_id]))
    for span in all_spans:
        print(f"Span: {span.name}, Duration: {span.duration_ms}ms")

Common Workflows

Single Document Processing:
Upload → Parse → Extract → Review Results
Multi-Document Analysis (see the sketch after this list):
Upload All → Parse All → Create Vector Store → Add to Store → Extract → Aggregate
Iterative Refinement:
Parse → Review → Adjust Chunking/Engine → Re-parse → Extract
Production Pipeline:
Upload → Parse → Store in Vector DB → Extract on Demand → Cache Results
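
A minimal sketch of the upload-and-parse stage of the multi-document workflow, assuming upload_file and parse behave as in the quick start (the vector store steps are omitted since their API is not shown here):
import asyncio

# Upload and parse every file concurrently rather than one at a time.
async def parse_all(project, paths: list[str]):
    async def parse_one(path: str):
        file = await project.upload_file(path)
        return await file.parse(
            ReductoParseJobParams(
                engine=ParseEngine.REDUCTO,
                options=ReductoParseEngineOptions(
                    chunking=ReductoChunkingOptions(chunk_mode=ReductoChunkingMethod.VARIABLE)
                ),
            )
        )

    return await asyncio.gather(*(parse_one(p) for p in paths))

# Inside an async context:
# parsed_docs = await parse_all(project, ["a.pdf", "b.pdf", "c.pdf"])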

Performance Optimization

Reduce Latency

  1. Use appropriate chunking: Smaller chunks = faster parsing
  2. Limit OCR scope: Only process pages you need
  3. Batch operations: Process multiple files in parallel
  4. Cache parse results: Reuse parsed documents for multiple extractions (see the sketch below)
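
A minimal sketch of points 3 and 4: one cached parse result reused for several concurrent extractions. SummarySchema is a placeholder Pydantic model, and extract is assumed to match the quick start signature:
import asyncio

# Parse once, extract many times: reuse the same parse result instead of
# re-parsing the document for each extraction. Run inside an async context.
summary, invoice = await asyncio.gather(
    parsed.extract(
        extraction_schema=SummarySchema,  # hypothetical schema
        user_prompt="Summarize the document",
        model="openai/gpt-4o",
    ),
    parsed.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice fields",
        model="openai/gpt-4o",
    ),
)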

Reduce Costs

  1. Set retention policies: Auto-delete old data
  2. Choose right model: Use smaller models when possible
  3. Optimize prompts: Shorter prompts = lower token costs
  4. Filter before extraction: Use vector search to find relevant chunks first

Common Errors & Quick Fixes

| Error | Cause | Quick Fix |
| --- | --- | --- |
| AuthenticationError | Missing/invalid credentials | Check SGP_API_KEY and SGP_ACCOUNT_ID env vars |
| FileUploadError | Unsupported format or too large | Check file type, reduce size |
| ParsingError | OCR failure | Try different engine or check document quality |
| ExtractionError | Invalid schema | Validate Pydantic model, check field types |
| ConnectionError | Network issues | Check internet connection, verify base URL |
| RateLimitError | Too many requests | Implement backoff/retry, reduce concurrency |
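
A minimal backoff/retry sketch for RateLimitError; the exception's import location is an assumption, so adjust it to wherever your SDK version exposes it:
import asyncio

from dex_sdk import RateLimitError  # assumed location; adjust as needed

async def extract_with_retry(parsed, max_attempts: int = 5, **extract_kwargs):
    """Retry extract() with exponential backoff when rate limited."""
    for attempt in range(max_attempts):
        try:
            return await parsed.extract(**extract_kwargs)
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...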

Access Response Data

Remember: SDK methods return wrapper objects; access the underlying data via .data:
# Correct
project = await client.create_project(name="Test")
project_id = project.data.id
project_name = project.data.name

# Incorrect
project_id = project.id  # Won't work

Common Response Attributes:
project.data.id                                 # Project ID
project.data.name                               # Project name
project.data.created_at                         # Creation timestamp

dex_file.data.id                                # File ID
dex_file.data.filename                          # Filename
dex_file.data.size_bytes                        # File size

parse_result.data.id                            # Parse result ID
parse_result.data.engine                        # Engine used
parse_result.data.parse_metadata.pages_processed  # Pages count

extract_result.result.data                      # Extracted data dict
extract_result.result.usage_info.total_tokens   # Token usage

Type Import Quick Lookup

| Type | Import From | Used For |
| --- | --- | --- |
| DexClient | dex_sdk | Client initialization |
| DexSyncClient | dex_sdk | Sync client |
| ParseEngine | dex_sdk.types | OCR engine selection |
| ReductoParseJobParams | dex_sdk.types | Reducto configuration |
| IrisParseJobParams | dex_sdk.types | Iris configuration |
| ReductoChunkingMethod | dex_sdk.types | Chunking method enum |
| ReductoChunkingOptions | dex_sdk.types | Chunking config |
| ReductoParseEngineOptions | dex_sdk.types | Parser options |
| IrisParseEngineOptions | dex_sdk.types | Iris parser options |
| ExtractionParameters | dex_sdk.types | Extraction config |
| VectorStoreEngines | dex_sdk.types | Vector store engines |
| VectorStoreSearchResult | dex_sdk.types | Search results |
| ProjectConfiguration | dex_sdk.types | Project config |
| RetentionPolicy | dex_sdk.types | Data retention |
| PaginationParams | dex_sdk.types | Pagination config |
| FileListFilter | dex_sdk.types | File filtering |
| JobListFilter | dex_sdk.types | Job filtering |
| ParseResultListFilter | dex_sdk.types | Parse result filtering |
| ExtractionListFilter | dex_sdk.types | Extraction filtering |
| VectorStoreListFilter | dex_sdk.types | Vector store filtering |
| JobStatus | dex_sdk.types | Job status enum |
Response Types (accessed via .data, no import needed):
  • ProjectEntity, FileEntity, ParseResultEntity, ExtractionEntity, VectorStoreEntity
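
Taken together, the imports from the table above look like this:
from dex_sdk import DexClient, DexSyncClient
from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    IrisParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
    IrisParseEngineOptions,
    ExtractionParameters,
    VectorStoreEngines,
    VectorStoreSearchResult,
    ProjectConfiguration,
    RetentionPolicy,
    PaginationParams,
    FileListFilter,
    JobListFilter,
    ParseResultListFilter,
    ExtractionListFilter,
    VectorStoreListFilter,
    JobStatus,
)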

Next Steps