Overview
Dex is Scale’s document understanding capability. It provides composable primitives for:
- File Management - Secure file upload, storage, and retrieval with fine-grained access control
- Document Parsing - Convert any document (PDFs, DOCX, images, etc.) into structured JSON format with multiple OCR engines
- Vector Stores - Index and search parsed documents with semantic embeddings
- Data Extraction - Extract specific information using custom schemas, prompts, and RAG-enhanced context
- Project Management - Organize and isolate data with proper credential management and authorization
Prerequisites
Before using this repository, ensure you have:
- ✅ A valid Scale account with SGP (Scale General Platform) access
- ✅ Your SGP account ID and API key set as environment variables (SGP_ACCOUNT_ID and SGP_API_KEY)
- ✅ VPN connection to Scale’s internal network via all-traffic (not eng-split-prod)
- ✅ Python 3.8+ installed
- ✅ Required Python packages (see Installation section)
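A quick way to confirm the credential prerequisite is to check the environment before running anything else. The sketch below is a minimal example, not part of the Dex SDK: the variable names come from the Troubleshooting section, and the helper name `missing_sgp_credentials` is hypothetical.

```python
import os

# Variable names as given in the Troubleshooting section of this README.
REQUIRED_VARS = ("SGP_ACCOUNT_ID", "SGP_API_KEY")


def missing_sgp_credentials(env=None):
    """Return the names of required SGP variables that are unset or blank.

    Hypothetical helper for illustration; it only inspects the environment
    and does not validate the credentials against SGP.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
```

Calling `missing_sgp_credentials()` with no arguments inspects the current process environment; an empty list means both variables are present.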
Installation
Install Dex SDK from CodeArtifact
With access to Scale CodeArtifact, install the Dex SDK:
Quick Start
1. Initialize Dex Client
2. Create a Project
Projects isolate your data and credentials for tracing, billing, and SGP model calls. Every operation is tied to a project. Key principle: Keep one project per use case or group of files for clean traceability.
3. Upload a Document
Dex manages your files with persistent storage. Upload files directly and always store the returned file object, because you’ll need it for parsing. Supported file types:
- Images: PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC
- PDFs: PDF (Portable Document Format)
- Spreadsheets: CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW
- Documents: PPTX, PPT, DOCX, DOC, DOTX, WPD, TXT, RTF
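Checking a file's extension locally before uploading can catch unsupported types early. The following is a rough sketch under two assumptions: that each format name in the list above corresponds to a file extension of the same name, and that an extension check is a useful pre-filter (Dex itself may validate files differently). The category labels and the `file_category` helper are hypothetical.

```python
from pathlib import Path

# Extensions derived from the supported file types listed above (lowercased).
# The category labels are illustrative, not Dex identifiers.
SUPPORTED_EXTENSIONS = {
    "image": {"png", "jpeg", "jpg", "gif", "bmp", "tiff", "pcx", "ppm",
              "apng", "psd", "cur", "dcx", "ftex", "pixar", "heic"},
    "pdf": {"pdf"},
    "spreadsheet": {"csv", "xlsx", "xlsm", "xls", "xltx", "xltm", "qpw"},
    "document": {"pptx", "ppt", "docx", "doc", "dotx", "wpd", "txt", "rtf"},
}


def file_category(path):
    """Return the category for a path's extension, or None if unsupported."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if ext in extensions:
            return category
    return None
```

For example, `file_category("report.PDF")` yields `"pdf"`, while a file with no extension or an unlisted one yields `None`.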
4. Parse the Document
Parsing normalizes any document into a unified JSON format with:
- Chunks – chunked content optimized for embedding
- Blocks – semantically separated contents with types like Title, Subtitle, Text, Figure, etc.
- Bounding boxes – page position info for UI overlays
OCR engine:
- Default: Reducto
- Swappable: you can bring your own OCR engine
Note: Parsing is an asynchronous process. The SDK handles the job monitoring for you.
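To make "job monitoring" concrete, the sketch below shows the general polling pattern that such monitoring typically follows. It is an illustration only and does not reflect how the Dex SDK is implemented: `wait_for_job`, the `get_status` callable, and the state names are all hypothetical.

```python
import time


def wait_for_job(get_status, poll_interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or the timeout passes.

    get_status is any zero-argument callable returning a status string.
    The state names below are hypothetical, not Dex's actual job states.
    """
    terminal = {"succeeded", "failed"}
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        sleep(poll_interval)
```

With the SDK you should not need a loop like this; it is shown only to clarify what the SDK does on your behalf.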
5. Extract Structured Data
Dex provides a highly customizable extraction pipeline for high-accuracy results.
Advanced Features
Working with Vector Stores
Vector stores enable semantic search and improved extraction for large documents or multi-file processing:
Chunking Strategies
Choose the right chunking strategy for your use case (BLOCK, VARIABLE):
Multi-language Support
Dex supports multiple languages out of the box:
Troubleshooting
Common Issues
1. VPN Connection Problems
- Missing or incorrect SGP_ACCOUNT_ID and SGP_API_KEY
- Insufficient permissions on Scale account
- Account doesn’t have SGP access
Ways to Interact with Dex
Python SDK (Recommended)
The Python SDK provides a high-level, developer-friendly interface:
REST API
Direct API access for custom integrations:
API Reference
For complete SDK documentation including all methods, parameters, and examples, see the Dex SDK API Reference. Quick reference:
- DexClient: Project management (create_project, list_projects)
- Project: File operations, vector store operations
- DexFile: Document parsing (parse)
- ParseResult: Data extraction (extract, extract_async)
- VectorStore: Indexing, semantic search, RAG-enhanced extraction
Best Practices
Choosing the Right Chunking Strategy
- Block-level: Use for fine-grained analysis when you need precise location information. Not recommended for embeddings due to small chunk size.
- Section-level: Best for most use cases. Provides semantic grouping based on document structure (titles, subtitles).
- Page-level: Ideal for page-by-page analysis and when document structure is page-oriented.
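The difference between the three strategies is easiest to see on a small example. The sketch below groups a handful of parsed blocks in each of the three ways. It is a simplified illustration, not Dex's chunking implementation: the block keys (`type`, `text`, `page`) and all function names are assumptions made for this example.

```python
from itertools import groupby

# Hypothetical parsed blocks; the keys "type", "text", and "page" are
# assumptions for this sketch, not the actual Dex output schema.
blocks = [
    {"type": "Title", "text": "Annual Report", "page": 1},
    {"type": "Text", "text": "Revenue grew.", "page": 1},
    {"type": "Subtitle", "text": "Outlook", "page": 1},
    {"type": "Text", "text": "We expect growth.", "page": 2},
]

HEADING_TYPES = {"Title", "Subtitle"}


def block_chunks(blocks):
    """Block-level: one chunk per block."""
    return [b["text"] for b in blocks]


def section_chunks(blocks):
    """Section-level: start a new chunk at each heading block."""
    chunks = []
    for block in blocks:
        if block["type"] in HEADING_TYPES or not chunks:
            chunks.append([])
        chunks[-1].append(block["text"])
    return ["\n".join(parts) for parts in chunks]


def page_chunks(blocks):
    """Page-level: group consecutive blocks that share a page number."""
    return [
        "\n".join(b["text"] for b in group)
        for _, group in groupby(blocks, key=lambda b: b["page"])
    ]
```

On this input, block-level chunking produces four short chunks, section-level produces two chunks that each begin with their heading, and page-level produces two chunks split at the page boundary, which here cuts the "Outlook" section in half.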
Optimizing Extraction Accuracy
- Use clear, specific prompts: Provide detailed instructions about what to extract and how to format it in the user_prompt.
- Design good schemas: Use descriptive field names and detailed Field() descriptions in your Pydantic models.
- Convert schema properly: Always use YourModel.model_json_schema() when passing to extraction_schema.
- Enable citations: Set generate_citations=True to help with debugging and provide auditability.
- Enable confidence scores: Set generate_confidence=True to filter low-confidence results for human review.
- Use vector stores for large documents: RAG-enhanced extraction improves accuracy for documents exceeding context windows.
- Let the SDK handle async: The SDK automatically handles job monitoring for both parsing and extraction.
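The confidence-score practice above can be turned into a simple review queue. The sketch below routes extracted fields to either an accepted set or a human-review set. It assumes a hypothetical result shape (a mapping from field name to a value and a confidence between 0 and 1) and an arbitrary threshold of 0.8; the actual shape of Dex extraction results may differ, so adapt the field access accordingly.

```python
def split_by_confidence(fields, threshold=0.8):
    """Separate extracted fields into accepted values and ones needing review.

    `fields` maps a field name to {"value": ..., "confidence": float}.
    This shape and the default threshold are assumptions for illustration.
    """
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


# Hypothetical extraction output; the real result shape may differ.
fields = {
    "invoice_number": {"value": "INV-001", "confidence": 0.97},
    "total_amount": {"value": "1,250.00", "confidence": 0.55},
}
accepted, needs_review = split_by_confidence(fields)
```

Here `invoice_number` is accepted and `total_amount` is queued for review. A sensible threshold depends on the document type and the cost of errors, so it is worth tuning against a labeled sample.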
Common Workflows
- Single Document Processing: Upload file → Parse → Extract → Get results
- Upload files → Parse all → Create vector store → Add to store → Extract from store
- Parse with default settings → Review results → Adjust chunking/OCR → Re-parse → Extract
- Define custom MCP tools → Parse documents → Extract with agent tools → Get enriched results
Contributing
When adding new test cases or examples:
- Follow the existing notebook structure
- Include clear documentation and comments
- Test with various document types
- Update this README with new features or examples
Next Steps
Now that you’re familiar with Dex basics, explore these advanced topics:
Learn More
- Introduction to Dex: Understand core concepts and architecture
- Dex SDK API Reference: Complete SDK documentation with examples
- REST API Reference: Complete REST API documentation with all endpoints
- Evaluation Guide: Measure and improve extraction accuracy
Advanced Topics
- Custom OCR Integration: Integrate your own OCR engine for specialized documents
- Agentic Extraction: Use custom MCP tools for complex extraction workflows
- Multi-lingual Processing: Process documents in multiple languages with language-specific OCR
- Fine-grained Access Control: Implement ReBAC for secure document processing
Support
For issues or questions:
- Documentation: Check the troubleshooting section above
- Slack Support: Contact the Dex team at #sgp-document-understanding-capability
- Feature Requests: Reach out to the Dex team with your use case
Useful Resources
- Dex Service Documentation: Internal API docs and architecture
- SGP Platform Docs: Scale General Platform documentation
- Example Notebooks: Sample code and workflows

