This guide covers advanced Dex features including vector stores, chunking strategies, data retention policies, batch processing, and optimization techniques.

Vector Stores

Vector stores enable semantic search and RAG-enhanced extraction for large documents or multi-file processing.

Creating a Vector Store

from dex_sdk.types import VectorStoreEngines

# Create a vector store for the project
vector_store = await project.create_vector_store(
    name="Financial Documents Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

Adding Documents to Vector Store

# Add parsed files to the vector store
await vector_store.add_parse_results([parse_result.id])
print("Parse results added to vector store")
# Perform semantic search across all documents
search_results = await vector_store.search(
    query="What is the total taxable income?",
    top_k=5,
)

# Search within a specific file
file_search_results = await vector_store.search_in_file(
    file_id=dex_file.id,
    query="What is the total taxable income?",
    top_k=5,
    filters=None,  # Optional filters like {"file_id": "file_123"}
)

# Access search results
for chunk in search_results.chunks:
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:100]}...")
    print(f"File: {chunk.file_id}, Page: {chunk.blocks[0].page_number if chunk.blocks else 'N/A'}")

RAG-Enhanced Extraction

Extract data using vector store context for improved accuracy on large documents.
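The example assumes a FinancialData Pydantic schema. The sketch below is hypothetical; the field names are chosen for illustration and are not part of the SDK:

from pydantic import BaseModel, Field

# Hypothetical schema for the RAG extraction example;
# adjust the fields to match your documents.
class FinancialData(BaseModel):
    quarterly_revenue: float = Field(description="Total revenue for the quarter, in USD")
    net_income: float = Field(description="Net income for the quarter, in USD")
    fiscal_quarter: str = Field(description="Fiscal quarter, e.g. 'Q3 2024'")

With the schema in place, run the extraction: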
# Extract from vector store with RAG context
extract_result = await vector_store.extract(
    extraction_schema=FinancialData,
    user_prompt="""Extract financial data from all documents in this store.
    Focus on quarterly income statements.""",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

# Access extracted data with citations
result = extract_result.result
for field_name, field in result.data.items():
    print(f"{field_name}: {field.value} (confidence: {field.confidence:.2f})")
    if field.citations:
        for cite in field.citations:
            print(f"  → From {cite.file_id}, page {cite.page}")

Chunking Strategies

Choose the right chunking method for your use case to optimize parsing and extraction.

Available Chunking Methods

Method         Description                        Best For                    Embedding Suitability
VARIABLE       Variable-size semantic chunks      General use, embeddings     ✅ Excellent
BLOCK          Low-level layout blocks            Precise location tracking   ❌ Too small
PAGE           Page-by-page chunks                Page-oriented documents     ⚠️ May be large
SECTION        Section-level (titles/subtitles)   Structured documents        ✅ Good
PAGE_SECTIONS  Sections within pages              Hybrid approach             ✅ Good
DISABLED       No chunking, single block          Special cases               ❌ Too large

Setting Chunking Strategy

from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Variable chunking (recommended for most cases)
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
                chunk_size=None,  # Auto-determined
            )
        ),
    )
)

# Block-level chunking for precise location tracking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.BLOCK,
            )
        ),
    )
)

# Page-level chunking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.PAGE,
            )
        ),
    )
)

Chunking Decision Tree

Use VARIABLE when:
  • Creating embeddings for semantic search
  • General document processing
  • You want optimal chunk sizes automatically
Use BLOCK when:
  • You need precise bounding box information
  • Building UI overlays on documents
  • Not using for embeddings
Use PAGE when:
  • Processing page-oriented documents
  • Need page-level analysis
  • Document structure aligns with pages
Use SECTION when (see the example after this list):
  • Documents have clear section structure
  • Want semantic grouping
  • Using embeddings
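
SECTION appears in the table above but has no code example yet. Assuming ReductoChunkingMethod follows the same naming pattern as the other modes, a section-level parse would look like this sketch:

# Section-level chunking for documents with clear headings
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.SECTION,
            )
        ),
    )
)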

Data Retention Policies

Configure automatic data lifecycle management for compliance and cost optimization.

Setting Retention Policies

from datetime import timedelta
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

# Create project with retention policy
project = await dex_client.create_project(
    name="Compliant Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),           # Files expire after 30 days
            result_artifacts=timedelta(days=7),  # Parse/extract results expire after 7 days
        )
    )
)

Updating Retention Policies

# Update existing project retention policy
await dex_client.update_project(
    project_id=project.id,
    updates={
        "configuration": ProjectConfiguration(
            retention=RetentionPolicy(
                files=timedelta(days=90),        # Extend to 90 days
                result_artifacts=timedelta(days=30),  # Extend to 30 days
            )
        )
    }
)

Retention Policy Use Cases

Compliance (GDPR, HIPAA):
# Auto-delete after regulatory period
retention=RetentionPolicy(
    files=timedelta(days=2555),  # 7 years
    result_artifacts=timedelta(days=365),  # 1 year
)
Cost Optimization:
# Keep only what you need
retention=RetentionPolicy(
    files=timedelta(days=30),    # Raw files for 1 month
    result_artifacts=timedelta(days=7),   # Results for 1 week
)
Security (Minimize Exposure):
# Short retention for sensitive data
retention=RetentionPolicy(
    files=timedelta(days=1),     # Delete files after 1 day
    result_artifacts=timedelta(hours=24),  # Delete results after 24 hours
)

Batch Processing

Process multiple documents efficiently with parallel operations.

Upload Multiple Files

import asyncio

# Upload multiple files in parallel
files_to_upload = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

upload_tasks = [project.upload_file(file_path) for file_path in files_to_upload]
dex_files = await asyncio.gather(*upload_tasks)

print(f"Uploaded {len(dex_files)} files")

Parse Multiple Documents

# Parse all files in parallel
parse_tasks = [
    dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    for dex_file in dex_files
]

parse_results = await asyncio.gather(*parse_tasks)
print(f"Parsed {len(parse_results)} documents")

Batch Extraction

# Extract from multiple documents
extraction_tasks = [
    parse_result.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice details.",
        model="openai/gpt-4o",
    )
    for parse_result in parse_results
]

extractions = await asyncio.gather(*extraction_tasks)

# Aggregate results
for i, extraction in enumerate(extractions):
    print(f"Document {i+1}: {extraction.result.data}")

Multi-Language Support

Dex supports 35+ languages with automatic language detection.

Parsing Non-English Documents

# Parse document in any supported language
# Language is automatically detected
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            ),
            extraction_mode="ocr",  # Use OCR mode for better language support
        ),
    )
)

Supported Languages

Germanic languages have excellent support, and 35+ additional languages are supported, including:
  • European: Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Polish, Russian, Ukrainian
  • Asian: Chinese, Japanese, Korean, Thai, Vietnamese, Khmer, Lao
  • Middle Eastern: Arabic, Hebrew, Persian, Turkish
  • Indian: Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Punjabi
  • And many more…
See the Introduction guide for the complete list.

Sync vs Async

Choose between the async (default) and sync clients based on your use case.

Async Client (Default)

import os
from dex_sdk import DexClient

# Use async for concurrent operations
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# All operations use await
project = await dex_client.create_project(name="My Project")

Sync Client

import os
from dex_sdk import DexSyncClient

# Use sync for simpler scripts or interactive use
dex_sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# No await needed
project = dex_sync_client.create_project(name="My Project")

When to use async:
  • Processing multiple documents in parallel
  • High-throughput applications
  • Modern async-first codebases
When to use sync:
  • Simple scripts
  • Interactive notebooks
  • Learning/prototyping
  • Sequential processing
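
If you are writing a plain script but still want the async client's concurrency, asyncio.run can bridge the two; a minimal sketch:

import asyncio
import os

from dex_sdk import DexClient

async def main():
    dex_client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )
    # Inside this coroutine, asyncio.gather is available for parallel work
    return await dex_client.create_project(name="My Project")

project = asyncio.run(main())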

Supported File Types

Dex supports a wide variety of document formats:

Images

PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC

Documents

PDF, DOCX, DOC, DOTX, WPD, TXT, RTF, PPTX, PPT

Spreadsheets

CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW

Note: For best results with spreadsheets, use XLSX format. CSV files are processed as-is without layout analysis.

Best Practices

Optimizing Extraction Accuracy

  1. Write Clear Prompts
    # ✅ Good: Specific and detailed
    user_prompt = """Extract the following fields:
    - Invoice number: Find in top-right corner or header
    - Total amount: The final amount including tax
    - Date: Invoice date, not payment date"""
    
    # ❌ Bad: Vague
    user_prompt = "Extract invoice info"
    
  2. Design Good Schemas
    # ✅ Good: Descriptive fields with clear types
    class InvoiceData(BaseModel):
        invoice_number: str = Field(description="The invoice number (format: INV-XXXXX)")
        total_amount: float = Field(description="Total amount in USD, including tax")
        invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format")
    
    # ❌ Bad: No descriptions
    class InvoiceData(BaseModel):
        number: str
        amount: float
        date: str
    
  3. Enable Citations and Confidence
    • Always set generate_citations=True for debugging
    • Use generate_confidence=True to filter low-confidence results (see the sketch after this list)
  4. Use Vector Stores for Large Documents
    • Documents > 50 pages benefit from RAG-enhanced extraction
    • Vector stores improve accuracy for cross-document queries
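
As a sketch of the confidence filter from point 3 (the 0.8 threshold is an arbitrary example, not an SDK default):

# Split extracted fields by confidence
CONFIDENCE_THRESHOLD = 0.8

reliable = {
    name: field.value
    for name, field in extract_result.result.data.items()
    if field.confidence is not None and field.confidence >= CONFIDENCE_THRESHOLD
}
needs_review = [
    name
    for name, field in extract_result.result.data.items()
    if field.confidence is None or field.confidence < CONFIDENCE_THRESHOLD
]
print(f"Reliable fields: {list(reliable)}; flagged for review: {needs_review}")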

Common Workflows

Single Document Processing (see the sketch after these workflows):
Upload → Parse → Extract → Review Results
Multi-Document Analysis:
Upload All → Parse All → Create Vector Store → Add to Store → Extract → Aggregate
Iterative Refinement:
Parse → Review → Adjust Chunking/Engine → Re-parse → Extract
Production Pipeline:
Upload → Parse → Store in Vector DB → Extract on Demand → Cache Results
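
As a compact sketch of the single-document workflow, reusing the imports, project, and InvoiceData schema from earlier sections:

# Upload → Parse → Extract → Review Results
dex_file = await project.upload_file("invoice.pdf")

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details.",
    model="openai/gpt-4o",
)

# Review results
for name, field in extract_result.result.data.items():
    print(f"{name}: {field.value}")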

Performance Optimization

Reduce Latency

  1. Use appropriate chunking: Smaller chunks = faster parsing
  2. Limit OCR scope: Only process pages you need
  3. Batch operations: Process multiple files in parallel
  4. Cache parse results: Reuse parsed documents for multiple extractions (see the sketch below)
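
A sketch of point 4: parse once, then run several extractions against the same parse result instead of re-parsing the file (LineItems is a hypothetical second schema):

# Parse once...
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

# ...then reuse it for multiple extractions
invoice = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details.",
    model="openai/gpt-4o",
)
line_items = await parse_result.extract(
    extraction_schema=LineItems,  # hypothetical second schema
    user_prompt="Extract every line item with quantity and unit price.",
    model="openai/gpt-4o",
)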

Reduce Costs

  1. Set retention policies: Auto-delete old data
  2. Choose right model: Use smaller models when possible
  3. Optimize prompts: Shorter prompts = lower token costs
  4. Filter before extraction: Use vector search to find relevant chunks first (see the sketch below)
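
A sketch of point 4, using vector search to confirm relevant content exists before paying for an extraction (the 0.5 score cutoff is an arbitrary example):

# Probe the store before running a full extraction
search_results = await vector_store.search(
    query="quarterly income statement",
    top_k=3,
)

if search_results.chunks and search_results.chunks[0].score >= 0.5:
    extract_result = await vector_store.extract(
        extraction_schema=FinancialData,
        user_prompt="Extract quarterly income statement figures.",
        model="openai/gpt-4o",
    )
else:
    print("No relevant chunks found; skipping extraction")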

Next Steps