This guide covers advanced Dex features including vector stores, chunking strategies, data retention policies, batch processing, and optimization techniques.

Vector Stores

Vector stores enable semantic search and RAG-enhanced extraction for large documents or multi-file processing.

Creating a Vector Store

from dex_sdk.types import VectorStoreEngines

# Create a vector store for the project
vector_store = await project.create_vector_store(
    name="Financial Documents Store",
    engine=VectorStoreEngines.SGP_KNOWLEDGE_BASE,
    embedding_model="openai/text-embedding-3-large",
)

Adding Documents to Vector Store

# Add parsed files to the vector store
await vector_store.add_parse_results([parse_result.id])
print("Parse results added to vector store")
# Perform semantic search across all documents
search_results = await vector_store.search(
    query="What is the total taxable income?",
    top_k=5,
)

# Search within a specific file
file_search_results = await vector_store.search_in_file(
    file_id=dex_file.id,
    query="What is the total taxable income?",
    top_k=5,
    filters=None,  # Optional filters like {"file_id": "file_123"}
)

# Access search results
for chunk in search_results.chunks:
    print(f"Score: {chunk.score:.2f}")
    print(f"Content: {chunk.content[:100]}...")
    print(f"File: {chunk.file_id}, Page: {chunk.blocks[0].page_number if chunk.blocks else 'N/A'}")

RAG-Enhanced Extraction

Extract data using vector store context for improved accuracy on large documents.
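The example assumes a FinancialData Pydantic schema. The sketch below is hypothetical; the field names are chosen for illustration and are not part of the SDK:

from pydantic import BaseModel, Field

# Hypothetical schema for the RAG extraction example;
# adjust the fields to match your documents.
class FinancialData(BaseModel):
    quarterly_revenue: float = Field(description="Total revenue for the quarter, in USD")
    net_income: float = Field(description="Net income for the quarter, in USD")
    fiscal_quarter: str = Field(description="Fiscal quarter, e.g. 'Q3 2024'")

With the schema in place, run the extraction: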
# Extract from vector store with RAG context
extract_result = await vector_store.extract(
    extraction_schema=FinancialData,
    user_prompt="""Extract financial data from all documents in this store.
    Focus on quarterly income statements.""",
    model="openai/gpt-4o",
    generate_citations=True,
    generate_confidence=True,
)

# Access extracted data with citations
result = extract_result.result
for field_name, field in result.data.items():
    print(f"{field_name}: {field.value} (confidence: {field.confidence:.2f})")
    if field.citations:
        for cite in field.citations:
            print(f"  → From {cite.file_id}, page {cite.page}")

Chunking Strategies

Choose the right chunking method for your use case to optimize parsing and extraction.

Available Chunking Methods

Method         Description                        Best For                    Embedding Suitability
VARIABLE       Variable-size semantic chunks      General use, embeddings     ✅ Excellent
BLOCK          Low-level layout blocks            Precise location tracking   ❌ Too small
PAGE           Page-by-page chunks                Page-oriented documents     ⚠️ May be large
SECTION        Section-level (titles/subtitles)   Structured documents        ✅ Good
PAGE_SECTIONS  Sections within pages              Hybrid approach             ✅ Good
DISABLED       No chunking, single block          Special cases               ❌ Too large

Setting Chunking Strategy

from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
)

# Variable chunking (recommended for most cases)
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
                chunk_size=None,  # Auto-determined
            )
        ),
    )
)

# Block-level chunking for precise location tracking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.BLOCK,
            )
        ),
    )
)

# Page-level chunking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.PAGE,
            )
        ),
    )
)

Chunking Decision Tree

Use VARIABLE when:
  • Creating embeddings for semantic search
  • General document processing
  • You want optimal chunk sizes automatically
Use BLOCK when:
  • You need precise bounding box information
  • Building UI overlays on documents
  • Not using for embeddings
Use PAGE when:
  • Processing page-oriented documents
  • Need page-level analysis
  • Document structure aligns with pages
Use SECTION when (see the example after this list):
  • Documents have clear section structure
  • Want semantic grouping
  • Using embeddings
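
SECTION appears in the table above but has no code example yet. Assuming ReductoChunkingMethod follows the same naming pattern as the other modes, a section-level parse would look like this sketch:

# Section-level chunking for documents with clear headings
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.SECTION,
            )
        ),
    )
)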

Data Retention Policies

Configure automatic data lifecycle management for compliance and cost optimization.

Setting Retention Policies

from datetime import timedelta
from dex_sdk.types import ProjectConfiguration, RetentionPolicy

# Create project with retention policy
project = await dex_client.create_project(
    name="Compliant Project",
    configuration=ProjectConfiguration(
        retention=RetentionPolicy(
            files=timedelta(days=30),           # Files expire after 30 days
            result_artifacts=timedelta(days=7),  # Parse/extract results expire after 7 days
        )
    )
)

Updating Retention Policies

# Update existing project retention policy
await dex_client.update_project(
    project_id=project.id,
    updates={
        "configuration": ProjectConfiguration(
            retention=RetentionPolicy(
                files=timedelta(days=90),        # Extend to 90 days
                result_artifacts=timedelta(days=30),  # Extend to 30 days
            )
        )
    }
)

Retention Policy Use Cases

Compliance (GDPR, HIPAA):
# Auto-delete after regulatory period
retention=RetentionPolicy(
    files=timedelta(days=2555),  # 7 years
    result_artifacts=timedelta(days=365),  # 1 year
)
Cost Optimization:
# Keep only what you need
retention=RetentionPolicy(
    files=timedelta(days=30),    # Raw files for 1 month
    result_artifacts=timedelta(days=7),   # Results for 1 week
)
Security (Minimize Exposure):
# Short retention for sensitive data
retention=RetentionPolicy(
    files=timedelta(days=1),     # Delete files after 1 day
    result_artifacts=timedelta(hours=24),  # Delete results after 24 hours
)

Batch Processing

Process multiple documents efficiently with parallel operations.

Upload Multiple Files

import asyncio

# Upload multiple files in parallel
files_to_upload = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]

upload_tasks = [project.upload_file(file_path) for file_path in files_to_upload]
dex_files = await asyncio.gather(*upload_tasks)

print(f"Uploaded {len(dex_files)} files")

Parse Multiple Documents

# Parse all files in parallel
parse_tasks = [
    dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    for dex_file in dex_files
]

parse_results = await asyncio.gather(*parse_tasks)
print(f"Parsed {len(parse_results)} documents")

Batch Extraction

# Extract from multiple documents
extraction_tasks = [
    parse_result.extract(
        extraction_schema=InvoiceData,
        user_prompt="Extract invoice details.",
        model="openai/gpt-4o",
    )
    for parse_result in parse_results
]

extractions = await asyncio.gather(*extraction_tasks)

# Aggregate results
for i, extraction in enumerate(extractions):
    print(f"Document {i+1}: {extraction.result.data}")

Multi-Language Support

Dex supports 35+ languages with automatic language detection.

Parsing Non-English Documents

# Parse document in any supported language
# Language is automatically detected
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            ),
            extraction_mode="ocr",  # Use OCR mode for better language support
        ),
    )
)

Supported Languages

Germanic languages have excellent support, and 35+ additional languages are supported, including:
  • European: Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Polish, Russian, Ukrainian
  • Asian: Chinese, Japanese, Korean, Thai, Vietnamese, Khmer, Lao
  • Middle Eastern: Arabic, Hebrew, Persian, Turkish
  • Indian: Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Punjabi
  • And many more…
See the Introduction guide for the complete list.

Sync vs Async

Choose between the async (default) and sync clients based on your use case.

Async Client (Default)

import os
from dex_sdk import DexClient

# Use async for concurrent operations
dex_client = DexClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# All operations use await
project = await dex_client.create_project(name="My Project")

Sync Client

import os
from dex_sdk import DexSyncClient

# Use sync for simpler scripts or interactive use
dex_sync_client = DexSyncClient(
    base_url="https://dex.sgp.scale.com",
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# No await needed
project = dex_sync_client.create_project(name="My Project")

When to use async:
  • Processing multiple documents in parallel
  • High-throughput applications
  • Modern async-first codebases
When to use sync:
  • Simple scripts
  • Interactive notebooks
  • Learning/prototyping
  • Sequential processing
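
If you are writing a plain script but still want the async client's concurrency, asyncio.run can bridge the two; a minimal sketch:

import asyncio
import os

from dex_sdk import DexClient

async def main():
    dex_client = DexClient(
        base_url="https://dex.sgp.scale.com",
        api_key=os.getenv("SGP_API_KEY"),
        account_id=os.getenv("SGP_ACCOUNT_ID"),
    )
    # Inside this coroutine, asyncio.gather is available for parallel work
    return await dex_client.create_project(name="My Project")

project = asyncio.run(main())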

Supported File Types

Dex supports a wide variety of document formats:

Images

PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC

Documents

PDF, DOCX, DOC, DOTX, WPD, TXT, RTF, PPTX, PPT

Spreadsheets

CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW

Note: For best results with spreadsheets, use XLSX format. CSV files are processed as-is without layout analysis.

Best Practices

Optimizing Extraction Accuracy

  1. Write Clear Prompts
    # ✅ Good: Specific and detailed
    user_prompt = """Extract the following fields:
    - Invoice number: Find in top-right corner or header
    - Total amount: The final amount including tax
    - Date: Invoice date, not payment date"""
    
    # ❌ Bad: Vague
    user_prompt = "Extract invoice info"
    
  2. Design Good Schemas
    # ✅ Good: Descriptive fields with clear types
    class InvoiceData(BaseModel):
        invoice_number: str = Field(description="The invoice number (format: INV-XXXXX)")
        total_amount: float = Field(description="Total amount in USD, including tax")
        invoice_date: str = Field(description="Invoice date in YYYY-MM-DD format")
    
    # ❌ Bad: No descriptions
    class InvoiceData(BaseModel):
        number: str
        amount: float
        date: str
    
  3. Enable Citations and Confidence
    • Always set generate_citations=True for debugging
    • Use generate_confidence=True to filter low-confidence results (see the sketch after this list)
  4. Use Vector Stores for Large Documents
    • Documents > 50 pages benefit from RAG-enhanced extraction
    • Vector stores improve accuracy for cross-document queries
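
As a sketch of the confidence filter from point 3 (the 0.8 threshold is an arbitrary example, not an SDK default):

# Split extracted fields by confidence
CONFIDENCE_THRESHOLD = 0.8

reliable = {
    name: field.value
    for name, field in extract_result.result.data.items()
    if field.confidence is not None and field.confidence >= CONFIDENCE_THRESHOLD
}
needs_review = [
    name
    for name, field in extract_result.result.data.items()
    if field.confidence is None or field.confidence < CONFIDENCE_THRESHOLD
]
print(f"Reliable fields: {list(reliable)}; flagged for review: {needs_review}")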

Common Workflows

Single Document Processing (see the sketch after these workflows):
Upload → Parse → Extract → Review Results
Multi-Document Analysis:
Upload All → Parse All → Create Vector Store → Add to Store → Extract → Aggregate
Iterative Refinement:
Parse → Review → Adjust Chunking/Engine → Re-parse → Extract
Production Pipeline:
Upload → Parse → Store in Vector DB → Extract on Demand → Cache Results
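
As a compact sketch of the single-document workflow, reusing the imports, project, and InvoiceData schema from earlier sections:

# Upload → Parse → Extract → Review Results
dex_file = await project.upload_file("invoice.pdf")

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

extract_result = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details.",
    model="openai/gpt-4o",
)

# Review results
for name, field in extract_result.result.data.items():
    print(f"{name}: {field.value}")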

Performance Optimization

Reduce Latency

  1. Use appropriate chunking: Smaller chunks = faster parsing
  2. Limit OCR scope: Only process pages you need
  3. Batch operations: Process multiple files in parallel
  4. Cache parse results: Reuse parsed documents for multiple extractions (see the sketch below)
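
A sketch of point 4: parse once, then run several extractions against the same parse result instead of re-parsing the file (LineItems is a hypothetical second schema):

# Parse once...
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

# ...then reuse it for multiple extractions
invoice = await parse_result.extract(
    extraction_schema=InvoiceData,
    user_prompt="Extract invoice details.",
    model="openai/gpt-4o",
)
line_items = await parse_result.extract(
    extraction_schema=LineItems,  # hypothetical second schema
    user_prompt="Extract every line item with quantity and unit price.",
    model="openai/gpt-4o",
)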

Reduce Costs

  1. Set retention policies: Auto-delete old data
  2. Choose right model: Use smaller models when possible
  3. Optimize prompts: Shorter prompts = lower token costs
  4. Filter before extraction: Use vector search to find relevant chunks first (see the sketch below)
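
A sketch of point 4, using vector search to confirm relevant content exists before paying for an extraction (the 0.5 score cutoff is an arbitrary example):

# Probe the store before running a full extraction
search_results = await vector_store.search(
    query="quarterly income statement",
    top_k=3,
)

if search_results.chunks and search_results.chunks[0].score >= 0.5:
    extract_result = await vector_store.extract(
        extraction_schema=FinancialData,
        user_prompt="Extract quarterly income statement figures.",
        model="openai/gpt-4o",
    )
else:
    print("No relevant chunks found; skipping extraction")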

Next Steps