Vector Stores
Vector stores enable semantic search and RAG-enhanced extraction for large documents or multi-file processing.
Creating a Vector Store
Adding Documents to Vector Store
Semantic Search
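As a library-independent illustration of what a semantic search does conceptually, the sketch below ranks stored chunks by cosine similarity to a query vector. It does not use the Dex SDK: the function names, the in-memory `store` dict, and the toy 3-dimensional vectors are all assumptions made for the example, and real embeddings would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, store, top_k=2):
    """Return the top_k (chunk_id, score) pairs, best match first."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in store.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (hypothetical values, for illustration only)
store = {
    "invoice-total": [0.9, 0.1, 0.0],
    "shipping-address": [0.1, 0.9, 0.0],
    "payment-terms": [0.7, 0.2, 0.1],
}
results = semantic_search([1.0, 0.0, 0.0], store)
for chunk_id, score in results:
    print(f"{chunk_id}: {score:.3f}")
```

A hosted vector store typically replaces the linear scan with an approximate nearest-neighbor index, but the ranking idea is the same.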
RAG-Enhanced Extraction
Extract data using vector store context for improved accuracy on large documents.
Chunking Strategies
Choose the right chunking method for your use case to optimize parsing and extraction.
Available Chunking Methods
| Method | Description | Best For | Embedding Suitability |
|---|---|---|---|
| VARIABLE | Variable-size semantic chunks | General use, embeddings | ✅ Excellent |
| BLOCK | Low-level layout blocks | Precise location tracking | ❌ Too small |
| PAGE | Page-by-page chunks | Page-oriented documents | ⚠️ May be large |
| SECTION | Section-level (titles/subtitles) | Structured documents | ✅ Good |
| PAGE_SECTIONS | Sections within pages | Hybrid approach | ✅ Good |
| DISABLED | No chunking, single block | Special cases | ❌ Too large |
Setting Chunking Strategy
Chunking Decision Tree
Use VARIABLE when:
- Creating embeddings for semantic search
- General document processing
- You want optimal chunk sizes automatically
Use BLOCK when:
- You need precise bounding box information
- Building UI overlays on documents
- Not using the chunks for embeddings
Use PAGE when:
- Processing page-oriented documents
- You need page-level analysis
- Document structure aligns with pages
Use SECTION when:
- Documents have a clear section structure
- You want semantic grouping
- Using embeddings
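The criteria in this section can be condensed into a small helper. This is only a sketch: the function `choose_chunking`, its parameter names, and the order in which the conditions are checked are assumptions made for illustration; only the method names (VARIABLE, BLOCK, PAGE, SECTION) come from the table above.

```python
def choose_chunking(needs_bounding_boxes=False,
                    page_oriented=False,
                    has_clear_sections=False):
    """Suggest a chunking method name from a few yes/no questions.

    The priority order below is an assumption, not documented behavior.
    """
    if needs_bounding_boxes:
        # Layout blocks give precise locations but are too small for embeddings
        return "BLOCK"
    if page_oriented:
        return "PAGE"
    if has_clear_sections:
        return "SECTION"
    # Default for general use and embeddings
    return "VARIABLE"


print(choose_chunking())                         # VARIABLE
print(choose_chunking(has_clear_sections=True))  # SECTION
```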
Data Retention Policies
Configure automatic data lifecycle management for compliance and cost optimization.
Setting Retention Policies
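As a library-independent sketch of what a retention policy amounts to, the example below computes when a stored item becomes eligible for deletion. The `RetentionPolicy` class and its `retention_days` field are hypothetical names chosen for illustration; they are not taken from the Dex SDK.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetentionPolicy:
    """Hypothetical policy: data expires a fixed number of days after creation."""
    retention_days: int

    def expires_at(self, created_at: datetime) -> datetime:
        return created_at + timedelta(days=self.retention_days)

    def is_expired(self, created_at: datetime, now: datetime) -> bool:
        return now >= self.expires_at(created_at)


policy = RetentionPolicy(retention_days=30)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(policy.expires_at(created))
print(policy.is_expired(created, datetime(2024, 2, 1, tzinfo=timezone.utc)))
```

Using timezone-aware datetimes throughout avoids off-by-hours errors when the deletion job and the upload happen in different time zones.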
Updating Retention Policies
Retention Policy Use Cases
Compliance (GDPR, HIPAA):
Batch Processing
Process multiple documents efficiently with parallel operations.
Upload Multiple Files
Parse Multiple Documents
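One common way to parse several documents in parallel is to bound concurrency with a semaphore, sketched below with the standard library only. The coroutine `parse_document` is a hypothetical stand-in for a real SDK call, and `max_concurrency=4` is an arbitrary example value.

```python
import asyncio


async def parse_document(path: str) -> dict:
    """Hypothetical stand-in for an SDK parse call; swap in the real one."""
    await asyncio.sleep(0.01)  # simulates network and processing latency
    return {"path": path, "status": "parsed"}


async def parse_all(paths, max_concurrency=4):
    """Parse every path concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path):
        async with semaphore:
            return await parse_document(path)

    # return_exceptions=True returns failures as values in the result list
    # instead of raising on the first one; results keep the input order.
    return await asyncio.gather(
        *(bounded(p) for p in paths), return_exceptions=True
    )


results = asyncio.run(parse_all(["a.pdf", "b.pdf", "c.pdf"]))
for result in results:
    print(result)
```

Capping the number of in-flight requests keeps a large batch from overwhelming rate limits while still overlapping most of the waiting time.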
Batch Extraction
Multi-Language Support
Dex supports 35+ languages with automatic language detection.
Parsing Non-English Documents
Supported Languages
Germanic languages have excellent support. Additional 35+ languages include:
- European: Spanish, French, German, Italian, Portuguese, Dutch, Swedish, Norwegian, Danish, Polish, Russian, Ukrainian
- Asian: Chinese, Japanese, Korean, Thai, Vietnamese, Khmer, Lao
- Middle Eastern: Arabic, Hebrew, Persian, Turkish
- Indian: Hindi, Bengali, Tamil, Telugu, Malayalam, Kannada, Gujarati, Marathi, Punjabi
- And many more…
Sync vs Async
Choose between the async client (default) and the sync client based on your use case.
Async Client (Recommended)
Sync Client
Use the async client when:
- Processing multiple documents in parallel
- High-throughput applications
- Modern async-first codebases
Use the sync client when:
- Simple scripts
- Interactive notebooks
- Learning/prototyping
- Sequential processing
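The difference between the two calling styles can be sketched with plain `asyncio`. Here `extract_async` is a hypothetical coroutine standing in for an SDK call, and `extract_sync` is an illustrative blocking wrapper around it, not the SDK's sync client.

```python
import asyncio


async def extract_async(doc_id: str) -> str:
    """Hypothetical async operation standing in for an SDK call."""
    await asyncio.sleep(0)
    return f"extracted:{doc_id}"


def extract_sync(doc_id: str) -> str:
    """Blocking wrapper: simple to call, but handles one document at a time."""
    return asyncio.run(extract_async(doc_id))


# Sync style: sequential and easy to follow
sequential = [extract_sync(d) for d in ["doc-1", "doc-2"]]


# Async style: both calls are in flight together
async def main():
    return await asyncio.gather(
        extract_async("doc-1"), extract_async("doc-2")
    )


concurrent = asyncio.run(main())
print(sequential == concurrent)
```

Note that `asyncio.run` cannot be called from inside an already running event loop (as in Jupyter notebooks), so a wrapper like `extract_sync` is only suitable for plain scripts.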
Supported File Types
Dex supports a wide variety of document formats:
Images
PNG, JPEG/JPG, GIF, BMP, TIFF, PCX, PPM, APNG, PSD, CUR, DCX, FTEX, PIXAR, HEIC
Documents
PDF, DOCX, DOC, DOTX, WPD, TXT, RTF, PPTX, PPT
Spreadsheets
CSV, XLSX, XLSM, XLS, XLTX, XLTM, QPW
Note: For best results with spreadsheets, use XLSX format. CSV files are processed as-is without layout analysis.
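A pre-upload check against the formats listed above might look like the following sketch. The `classify` helper is hypothetical, and it assumes each format name doubles as the file extension, which may not hold for every entry (for example FTEX or PIXAR).

```python
from pathlib import Path

# Format names copied from the lists above, lowercased to act as extensions
SUPPORTED = {
    "image": {"png", "jpeg", "jpg", "gif", "bmp", "tiff", "pcx", "ppm",
              "apng", "psd", "cur", "dcx", "ftex", "pixar", "heic"},
    "document": {"pdf", "docx", "doc", "dotx", "wpd", "txt", "rtf",
                 "pptx", "ppt"},
    "spreadsheet": {"csv", "xlsx", "xlsm", "xls", "xltx", "xltm", "qpw"},
}


def classify(path):
    """Return 'image', 'document', 'spreadsheet', or None if unsupported."""
    ext = Path(path).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED.items():
        if ext in extensions:
            return category
    return None


for name in ["report.PDF", "scan.heic", "data.csv", "song.mp3"]:
    print(name, "->", classify(name))
```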
Best Practices
Optimizing Extraction Accuracy
- Write Clear Prompts
- Design Good Schemas
- Enable Citations and Confidence
  - Always set generate_citations=True for debugging
  - Use generate_confidence=True to filter low-confidence results
- Use Vector Stores for Large Documents
  - Documents > 50 pages benefit from RAG-enhanced extraction
  - Vector stores improve accuracy for cross-document queries
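Filtering on confidence, as recommended above, can be sketched as follows. The result shape (a dict of fields, each with `value` and `confidence` keys) and the 0.8 threshold are assumptions made for illustration; the actual structure returned when generate_confidence=True is set may differ.

```python
def split_by_confidence(fields, threshold=0.8):
    """Separate extracted fields into accepted values and ones needing review."""
    accepted, needs_review = {}, {}
    for name, field in fields.items():
        target = accepted if field["confidence"] >= threshold else needs_review
        target[name] = field["value"]
    return accepted, needs_review


# Hypothetical extraction output
fields = {
    "invoice_number": {"value": "INV-1042", "confidence": 0.97},
    "total": {"value": "1,250.00", "confidence": 0.62},
}
accepted, needs_review = split_by_confidence(fields)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Routing low-confidence fields to manual review rather than discarding them keeps the data while still protecting downstream systems from uncertain values.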
Common Workflows
Single Document Processing:
Performance Optimization
Reduce Latency
- Use appropriate chunking: Smaller chunks = faster parsing
- Limit OCR scope: Only process pages you need
- Batch operations: Process multiple files in parallel
- Cache parse results: Reuse parsed documents for multiple extractions
Reduce Costs
- Set retention policies: Auto-delete old data
- Choose right model: Use smaller models when possible
- Optimize prompts: Shorter prompts = lower token costs
- Filter before extraction: Use vector search to find relevant chunks first
Next Steps
- Quick Reference: Cheat sheet for common patterns
- API Reference: Complete SDK documentation
- Troubleshooting: Common issues and solutions

