Convert documents to a structured format for extraction. Dex supports multiple parse engines and async job monitoring.

Parse Document (Default)

parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    )
)

OCR Engine Comparison

| Engine | Best For | Status | Speed | Use When |
|---|---|---|---|---|
| Reducto | Production workloads, batch processing | Production-ready | Somewhat high (batch-friendly) | Standard production use |
| Iris | Custom OCR needs, experimentation | Experimental | Configurable | Custom models, accuracy/latency optimization |
| Azure Vision | Multilingual documents | Production-ready | Medium | Multilingual production use |

Recommendation: Use Reducto for production. See When to choose Iris? for custom OCR needs.

Which OCR Engine?

Production use → Reducto
Multilingual → Azure Vision or Reducto
Custom OCR needs → Iris (experimental) - see guide

Async Job Monitoring

New in v0.4.0: Use start_parse_job for better control over async operations and access to SGP traces.

Monitor Parse Job

import asyncio

# Start a parse job (returns immediately)
parse_job = await project.start_parse_job(
    dex_file=dex_file,
    parameters=ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            )
        ),
    ),
)

# Monitor job progress
while parse_job.data.status not in [JobStatus.SUCCEEDED, JobStatus.FAILED]:
    await asyncio.sleep(1)
    await parse_job.refresh()
    print(f"Job status: {parse_job.data.status}")

# Get result
if parse_job.data.status == JobStatus.SUCCEEDED:
    parse_result = await parse_job.get_result()
    print("Parse completed successfully")
else:
    print(f"Parse failed: {parse_job.data.error_message}")
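
The loop above polls once per second with no upper bound. A minimal sketch of the same polling pattern with a timeout and exponential backoff follows. The helper name `wait_for_job` and its parameters are hypothetical, not part of the Dex SDK; it assumes only what the example above already uses, namely an awaitable `refresh()` method and a `data.status` attribute on the job object.

```python
import asyncio
import time


async def wait_for_job(job, terminal_statuses, timeout_s=300.0,
                       initial_delay_s=1.0, max_delay_s=10.0):
    """Poll job.refresh() until job.data.status is terminal or timeout_s elapses.

    Hypothetical helper: `job` only needs an awaitable refresh() method and a
    data.status attribute, as in the monitoring example above.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while job.data.status not in terminal_statuses:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"job still {job.data.status} after {timeout_s} seconds"
            )
        # Never sleep past the deadline.
        await asyncio.sleep(min(delay, remaining))
        await job.refresh()
        # Back off so long-running jobs are polled less often.
        delay = min(delay * 2, max_delay_s)
    return job.data.status
```

With the job from the example above, this could be called as `await wait_for_job(parse_job, {JobStatus.SUCCEEDED, JobStatus.FAILED})`, after which the result is fetched as before.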

Retrieving SGP Traces for Debugging

import os

from scale_gp_beta import SGPClient

sgp_client = SGPClient(
    api_key=os.getenv("SGP_API_KEY"),
    account_id=os.getenv("SGP_ACCOUNT_ID"),
)

# Search for job traces
spans = list(sgp_client.spans.search(
    sort_by="created_at",
    sort_order="desc",
    extra_metadata={"job_id": parse_job.data.id},
    parents_only=True,
))

if spans:
    trace_id = spans[0].trace_id
    all_spans = list(sgp_client.spans.search(trace_ids=[trace_id]))
    for span in all_spans:
        print(f"Span: {span.name}, Duration: {span.duration_ms}ms")
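
When a trace contains many spans, sorting them by duration makes bottlenecks easier to spot. The sketch below is a hypothetical helper (`slowest_spans` is not part of any SDK) that relies only on the `name` and `duration_ms` attributes already used in the loop above, and skips spans whose duration is missing.

```python
def slowest_spans(spans, top_n=5):
    """Return up to top_n (name, duration_ms) pairs, slowest first.

    Hypothetical helper: each span only needs `name` and `duration_ms`
    attributes. Spans without a duration are ignored.
    """
    timed = [
        (span.name, span.duration_ms)
        for span in spans
        if getattr(span, "duration_ms", None) is not None
    ]
    return sorted(timed, key=lambda pair: pair[1], reverse=True)[:top_n]
```

For example, `for name, ms in slowest_spans(all_spans, top_n=3): print(name, ms)` prints the three longest spans of the trace.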

Process Multiple Files

import asyncio

# Parse all files in parallel
parse_tasks = [
    dex_file.parse(
        ReductoParseJobParams(
            engine=ParseEngine.REDUCTO,
            options=ReductoParseEngineOptions(
                chunking=ReductoChunkingOptions(
                    chunk_mode=ReductoChunkingMethod.VARIABLE,
                )
            ),
        )
    )
    for dex_file in dex_files
]

parse_results = await asyncio.gather(*parse_tasks)
print(f"Parsed {len(parse_results)} documents")
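
`asyncio.gather` as used above starts every parse at once and raises on the first failure. For larger batches, a bounded variant may be preferable. The following is a minimal sketch using only the standard library; `gather_bounded` and `max_concurrency` are hypothetical names, and the appropriate concurrency limit for Dex is not stated here, so the default of 5 is an arbitrary assumption.

```python
import asyncio


async def gather_bounded(items, worker, max_concurrency=5):
    """Run worker(item) for every item, at most max_concurrency at a time.

    Results come back in input order. A failed item yields its exception
    object in place of a result instead of aborting the whole batch.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item):
        # The semaphore caps how many workers are in flight at once.
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(
        *(run_one(item) for item in items),
        return_exceptions=True,
    )
```

Assuming `params` holds the `ReductoParseJobParams` shown above, a call could look like `results = await gather_bounded(dex_files, lambda f: f.parse(params))`. Entries that are instances of `Exception` mark the files that failed to parse.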

Multi-Language Support

Dex supports 35+ languages with automatic language detection.

Parsing Non-English Documents

# Language is automatically detected
parse_result = await dex_file.parse(
    ReductoParseJobParams(
        engine=ParseEngine.REDUCTO,
        options=ReductoParseEngineOptions(
            chunking=ReductoChunkingOptions(
                chunk_mode=ReductoChunkingMethod.VARIABLE,
            ),
            extraction_mode="ocr",  # Use OCR mode for better language support
        ),
    )
)

Supported Languages

Germanic languages have excellent support. The 35+ supported languages include:
  • European: Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Polish, Russian, Ukrainian
  • Asian: Chinese, Japanese, Korean, Thai, Vietnamese, Khmer, Lao
  • Middle Eastern: Arabic, Hebrew, Persian, Turkish
  • Indian: Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Punjabi
  • And many more…
See the Introduction guide for the complete list.

Appendix: Essential Imports

from dex_sdk.types import (
    ParseEngine,
    ReductoParseJobParams,
    ReductoChunkingMethod,
    ReductoChunkingOptions,
    ReductoParseEngineOptions,
    IrisParseJobParams,
    IrisParseEngineOptions,
    JobStatus,
)

Next Steps

  • Chunking: Choose chunking strategies for your documents
  • Extract: Extract structured data from parse results
  • Vector Stores: Use vector stores for large documents