Parsing converts documents into a structured format that can then be used for extraction. Dex supports multiple parse engines and asynchronous job monitoring.

Parse Document (Default)

# Parse with the Reducto engine and variable chunking
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

OCR Engine Comparison

| Engine | Best For | Status | Speed | Use When |
| --- | --- | --- | --- | --- |
| Reducto | Production workloads, batch processing | Production-ready | Somewhat high (batch-friendly) | Standard production use |
| Iris | Custom OCR needs, experimentation | Experimental | Configurable | Custom models, accuracy/latency optimization |
| Azure Vision | Multilingual documents | Production-ready | Medium | Multilingual production use |
Recommendation: Use Reducto for production. For custom OCR needs, see "When to choose Iris?".

Which OCR Engine?

Production use → Reducto
Multilingual → Azure Vision or Reducto
Custom OCR needs → Iris (experimental); see the "When to choose Iris?" guide
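
The decision list above can be written down as a small helper. This is only a sketch: `candidate_engines` is a hypothetical function, not part of the Dex SDK, and it returns the engine names as plain strings rather than `ParseEngine` members, because only `ParseEngine.REDUCTO` appears in this guide.

```python
def candidate_engines(custom_ocr: bool = False, multilingual: bool = False) -> list[str]:
    """Return the engines this guide suggests, in the order they are listed.

    Hypothetical helper: it mirrors the decision list above and nothing more.
    """
    if custom_ocr:
        # Custom OCR needs point to Iris, which the guide marks as experimental.
        return ["Iris"]
    if multilingual:
        # The guide lists both engines for multilingual documents.
        return ["Azure Vision", "Reducto"]
    # Default: standard production use.
    return ["Reducto"]
```

Checking custom OCR first is a design choice made for this sketch; the guide does not say which rule wins when a document is both multilingual and needs a custom model.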

Async Job Monitoring

New in v0.4.0: Use start_parse_job for better control over async operations and access to SGP traces.

Monitor Parse Job

import asyncio

# Start a parse job (returns immediately)
parse_job = await project.start_parse_job(
    dex_file=dex_file,
    parameters=ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    ),
)

# Monitor job progress
while parse_job.data.status not in [JobStatus.SUCCEEDED, JobStatus.FAILED]:
    await asyncio.sleep(1)
    await parse_job.refresh()
    print(f"Job status: {parse_job.data.status}")

# Get result
if parse_job.data.status == JobStatus.SUCCEEDED:
    parse_result = await parse_job.get_result()
    print("Parse completed successfully")
else:
    print(f"Parse failed: {parse_job.data.error_message}")
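
The loop above polls once per second and never gives up, so a job stuck in a non-terminal state would block the caller indefinitely. A minimal sketch of one way to bound the wait follows, using a deadline and a growing delay between polls. `FakeJob` and `wait_for_job` are hypothetical names used for illustration only, and statuses are plain strings here; with the SDK you would compare `parse_job.data.status` against `JobStatus.SUCCEEDED` and `JobStatus.FAILED` as shown above.

```python
import asyncio
import time


class FakeJob:
    """Hypothetical stand-in for a parse job; not part of the Dex SDK."""

    def __init__(self, polls_until_done: int) -> None:
        self.status = "PENDING"
        self._remaining = polls_until_done

    async def refresh(self) -> None:
        # Pretend the backend finishes after a fixed number of polls.
        self._remaining -= 1
        if self._remaining <= 0:
            self.status = "SUCCEEDED"


async def wait_for_job(job, timeout_s=60.0, initial_delay_s=1.0, max_delay_s=10.0):
    """Poll job.refresh() until a terminal status is seen or the deadline passes."""
    terminal = {"SUCCEEDED", "FAILED"}
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while job.status not in terminal:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"job still {job.status} after {timeout_s}s")
        # Never sleep past the deadline.
        await asyncio.sleep(min(delay, remaining))
        await job.refresh()
        # Double the delay after each poll, up to max_delay_s.
        delay = min(delay * 2, max_delay_s)
    return job.status
```

Calling `await wait_for_job(job, timeout_s=300)` returns the terminal status or raises `TimeoutError`, which lets the caller decide whether to retry or report the failure.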

Retrieving SGP Traces for Debugging

import os

from scale_gp_beta import SGPClient

sgp_client = SGPClient(
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# Search for job traces
spans = list(sgp_client.spans.search(
    sort_by="created_at",
    sort_order="desc",
    extra_metadata={"job_id": parse_job.data.id},
    parents_only=True,
))

if spans:
    trace_id = spans[0].trace_id
    all_spans = list(sgp_client.spans.search(trace_ids=[trace_id]))
    for span in all_spans:
        print(f"Span: {span.name}, Duration: {span.duration_ms}ms")
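
When a trace contains many spans, sorting by duration is a quick way to find the slow steps. The sketch below assumes only the two attributes printed above, `name` and `duration_ms`; `FakeSpan` and `slowest_spans` are hypothetical names, and the sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeSpan:
    """Hypothetical stand-in exposing the two span attributes used above."""

    name: str
    duration_ms: float


def slowest_spans(spans, top_n=3):
    """Return up to top_n spans, longest duration first."""
    return sorted(spans, key=lambda s: s.duration_ms, reverse=True)[:top_n]


# Made-up sample data, not real trace output.
sample = [
    FakeSpan("upload", 120.0),
    FakeSpan("ocr", 4300.0),
    FakeSpan("chunking", 610.0),
]
for span in slowest_spans(sample, top_n=2):
    print(f"{span.name}: {span.duration_ms}ms")
```

With real traces, the same function could be applied to the `all_spans` list built above, provided every span reports a numeric `duration_ms`.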

Process Multiple Files

import asyncio

# Parse all files in parallel
parse_tasks = [
    dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    for dex_file in dex_files
]

parse_results = await asyncio.gather(*parse_tasks)
print(f"Parsed {len(parse_results)} documents")
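
`asyncio.gather` as used above starts every parse at once and raises the first exception it sees, which can be awkward for large batches. A minimal sketch of one alternative follows: cap concurrency with `asyncio.Semaphore` and collect failures with `return_exceptions=True`. `parse_one` is a hypothetical stand-in for `dex_file.parse(...)`, and the file names are made up.

```python
import asyncio


async def parse_one(name: str) -> str:
    """Hypothetical stand-in for dex_file.parse(...); not a Dex SDK call."""
    await asyncio.sleep(0.01)
    if name.endswith(".bad"):
        raise ValueError(f"cannot parse {name}")
    return f"parsed:{name}"


async def parse_all(names, max_concurrency=4):
    """Parse all names with at most max_concurrency parses in flight."""
    names = list(names)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(name):
        # Each task waits for a free slot before it starts parsing.
        async with semaphore:
            return await parse_one(name)

    # return_exceptions=True keeps one failure from discarding the other results.
    results = await asyncio.gather(
        *(guarded(n) for n in names), return_exceptions=True
    )
    succeeded = {n: r for n, r in zip(names, results) if not isinstance(r, Exception)}
    failed = {n: r for n, r in zip(names, results) if isinstance(r, Exception)}
    return succeeded, failed
```

The right value for `max_concurrency` depends on service rate limits, which this guide does not state, so treat the default of 4 as a placeholder.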

Multi-Language Support

Dex supports 35+ languages with automatic language detection.

Parsing Non-English Documents

# Language is automatically detected
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            ),
            extraction_mode="ocr",  # Use OCR mode for better language support
        ),
    )
)

Supported Languages

Germanic languages have excellent support. More than 35 additional languages are supported, including:
  • European: Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Polish, Russian, Ukrainian
  • Asian: Chinese, Japanese, Korean, Thai, Vietnamese, Khmer, Lao
  • Middle Eastern: Arabic, Hebrew, Persian, Turkish
  • Indian: Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Punjabi
  • And many more…
See the Introduction guide for the complete list.

Appendix: Essential Imports

from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
    IrisParseJobParams,
    IrisParseEngineOptions,
    JobStatus,
)

Next Steps

  • Chunking: Choose chunking strategies for your documents
  • Extract: Extract structured data from parse results
  • Vector Stores: Use vector stores for large documents