Guide to document parsing in Compass

This guide walks you through parsing documents in Compass so that you can use unstructured data in your workflows (powered by our Dex SDK). You’ll learn how to connect to cloud file storage, filter and parse documents, store parsed data in the Knowledge Base, and use the results in your workflows.

Step 1: Connect to cloud file storage

Start by navigating to Workflows in SGP. Click on the Cloud Storage Browser card to connect to your cloud file storage.

Currently, Compass supports the following cloud storage providers:

Amazon S3
Azure Blob Storage
Google Cloud Storage

For this example, we’ll connect to an Amazon S3 bucket containing resume documents that we want to parse and analyze. Select your cloud storage provider and choose your registered connection from the dropdown. Browse or specify the path to the files you want to process.

To add an S3 bucket for use with the Cloud Storage Browser card, you must first create a bucket using Terraform and add the bucket to the existing policy within the Compass role. Please reach out to the Compass team for example PRs that illustrate how to take those steps.

Step 2: Filter files (optional)

Before parsing, you can use a Filter card to select only the files you want to process. This helps you narrow down your dataset and avoid parsing unnecessary documents.

In this example, we keep only resumes of a certain size. This step is optional but recommended for large file sets, because it reduces processing time and cost.
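The Filter card is configured in the Compass UI, so no code is required. Purely to illustrate the logic of a size filter, here is a minimal Python sketch. The FileEntry type, the field names, and the byte thresholds are hypothetical and are not part of Compass or the Dex SDK.

```python
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for one file listed by the Cloud Storage Browser card."""
    path: str
    size_bytes: int


def filter_by_size(files, min_bytes, max_bytes):
    """Keep only files whose size falls within [min_bytes, max_bytes]."""
    return [f for f in files if min_bytes <= f.size_bytes <= max_bytes]


# Made-up listing of resume files with their sizes.
files = [
    FileEntry("resumes/a.pdf", 120_000),
    FileEntry("resumes/b.pdf", 9_500_000),  # too large
    FileEntry("resumes/c.pdf", 800),        # too small, likely empty
]

kept = filter_by_size(files, min_bytes=10_000, max_bytes=5_000_000)
print([f.path for f in kept])
```

Filtering before parsing means that only the files you keep are sent to the parse engine.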

Step 3: Parse documents with the Dex Parse Card

This feature uses our Document Understanding capability, which is powered by the Dex SDK on the backend. Add a Dex Parse card to extract structured data from your documents. The card automatically creates a Dex project that you can reference later in the Dex UI.

Dex Project Configuration

Project Name: Give your Dex project a descriptive name (e.g., “Resume Parsing Project”)
Project ID (Optional):
- Enter an existing project ID to reuse a project
- Leave empty to create a new project each time the workflow runs

Parse Engine

Choose the OCR engine based on your document language and layout:

Engine: Select from available parsing engines (e.g., Reducto)
- Reducto: Best for English and Latin-script documents with tables, figures, and complex layouts

Engine Configuration

For Reducto, configure the following settings:

Chunking Method: Select how documents should be split into chunks
- Variable (recommended): Automatically determines optimal chunk boundaries
Chunk Size (Optional): Set to Auto to let the engine determine the best size

Output Options

Control the visibility of the output data:

Include parsed data in output: Toggle on to output parsed text content for each chunk
- When enabled: Outputs parsed text content with full details
- When disabled: Outputs one row per file with metadata (project ID, parsed result ID) only
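The difference between the two toggle states can be sketched in plain Python. This is an illustration under stated assumptions, not the card's implementation: the build_rows function, the input structure, and the sample IDs are hypothetical, and the real output may carry more columns than shown here.

```python
def build_rows(parsed_files, include_parsed_data):
    """Return output rows for a list of parsed files.

    Toggle on:  one row per chunk, including the parsed text.
    Toggle off: one row per file, metadata only.
    """
    rows = []
    for parsed in parsed_files:
        meta = {
            "projectId": parsed["projectId"],
            "parseResultId": parsed["parseResultId"],
        }
        if include_parsed_data:
            for chunk_text in parsed["chunks"]:
                rows.append({**meta, "parsedContent": chunk_text})
        else:
            rows.append(meta)
    return rows


# Made-up parse results for two files.
parsed_files = [
    {"projectId": "proj-demo", "parseResultId": "res-1",
     "chunks": ["Summary section", "Experience section"]},
    {"projectId": "proj-demo", "parseResultId": "res-2",
     "chunks": ["Education section"]},
]

with_content = build_rows(parsed_files, include_parsed_data=True)
metadata_only = build_rows(parsed_files, include_parsed_data=False)
print(len(with_content), len(metadata_only))
```

Leaving the toggle off keeps the output small when later cards only need the IDs to look results up elsewhere.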

Knowledge Base

Optionally push parsed results into a vector store for semantic search and RAG:

Push to Vector Store: Toggle on to enable Knowledge Base integration
Vector Store Name: Provide a name for your Knowledge Base (e.g., “Resume Vector Store”)
Embedding Model: Specify the embedding model (e.g., openai/text-embedding-3-large)

When enabled, parsed documents will be automatically indexed in the SGP Knowledge Base, accessible via the Knowledge Base UI for semantic search and retrieval.
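To give a rough intuition for what semantic search over a vector store does, here is a toy Python sketch that ranks chunks by cosine similarity. It does not call the Knowledge Base or any embedding model: the three-dimensional vectors and the chunk IDs are invented for illustration, whereas real embeddings are produced by the model you configure and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical index: chunk ID -> made-up embedding vector.
index = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.1, 0.8, 0.3],
    "chunk-3": [0.0, 0.2, 0.9],
}

# Made-up embedding of the search query.
query = [0.8, 0.2, 0.1]

# Rank chunk IDs from most to least similar to the query.
ranked = sorted(index, key=lambda cid: cosine_similarity(query, index[cid]),
                reverse=True)
print(ranked)
```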

Step 4: View parsed output

After the Dex Parse card processes your documents, the output is a dataframe with the following columns:

parseResultId: Unique identifier for the parsing result
projectId: The Dex project ID where results are stored
engine: The parsing engine used (e.g., “Reducto”)
pagesProcessed: Number of pages processed in the document
chunkCount: Number of chunks created from the document
status: Parsing status (e.g., “completed”, “failed”)
parsedContent: The extracted text content from the document
vectorStoreId: ID of the Knowledge Base vector store (if enabled)

You can use this dataframe in subsequent workflow cards to analyze, transform, or process the parsed data.
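As a rough sketch of that kind of downstream processing, the snippet below separates completed rows from failed ones and totals the pages processed. It uses plain Python dictionaries in place of the dataframe. The column names come from the list above; the row values and IDs are invented for illustration.

```python
# Made-up rows using the column names listed above.
rows = [
    {"parseResultId": "res-1", "projectId": "proj-demo", "engine": "Reducto",
     "pagesProcessed": 2, "chunkCount": 5, "status": "completed"},
    {"parseResultId": "res-2", "projectId": "proj-demo", "engine": "Reducto",
     "pagesProcessed": 0, "chunkCount": 0, "status": "failed"},
    {"parseResultId": "res-3", "projectId": "proj-demo", "engine": "Reducto",
     "pagesProcessed": 3, "chunkCount": 7, "status": "completed"},
]

# Keep only successfully parsed documents for downstream cards.
completed = [row for row in rows if row["status"] == "completed"]

# Collect the IDs of failed results so they can be retried or inspected.
failed_ids = [row["parseResultId"] for row in rows if row["status"] == "failed"]

# Summarize how much content was processed.
total_pages = sum(row["pagesProcessed"] for row in completed)
total_chunks = sum(row["chunkCount"] for row in completed)

print(len(completed), failed_ids, total_pages, total_chunks)
```

Checking the status column first prevents failed documents from reaching later cards with no parsed content.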

Step 5: Access Dex project and SGP Knowledge Base

The Dex Parse card automatically creates a Dex project that you can access later for review and management. If you enabled the “Push to Vector Store” option, navigate to the Knowledge Base section in SGP to access your vector store:

Search parsed documents semantically
Test retrieval quality
Use for RAG applications
Manage embeddings and vector store settings

Step 6: Use parsed data in your workflow

Now you can use the parsed content in subsequent workflow cards. For this example, we’ll call an agent to categorize resume content into suitable job roles.

Call Agent for Analysis

Add a Call Agent card to process the parsed content:

Select Agent: Choose your agent from the dropdown
Output Column Name: Specify where agent responses will be stored (e.g., job_categories)
Prompt Template: Write your prompt using {{parsedContent}} to reference the parsed text
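The placeholder substitution can be pictured with the following minimal Python sketch. It is an assumption about the general idea of filling a {{column}} placeholder from a row, not a description of how Compass renders templates. The render_prompt helper, the example prompt, and the sample row are hypothetical, and tolerating whitespace inside the braces is a choice made by this sketch only.

```python
import re

# Matches placeholders such as {{parsedContent}} or {{ parsedContent }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{column}} placeholder with the value from the row."""
    def substitute(match):
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Unknown column in template: {column}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Categorize this resume into suitable job roles.\n\n"
    "Resume:\n{{parsedContent}}"
)

# Made-up row standing in for one line of the Dex Parse output.
row = {"parsedContent": "Senior data engineer, 8 years of Spark and Airflow."}

prompt = render_prompt(template, row)
print(prompt)
```

Raising an error for an unknown column surfaces typos in the template early, so a literal placeholder is never sent to the agent.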

Additional Processing

You can add more cards to:

Filter or transform the categorized results
Join with job posting data
Run evaluations on categorization quality
Generate reports or summaries

Step 7: Save your results

Finally, save the workflow output for future reference and analysis. You can save the results as:

SGP Dataset: Store the categorized resumes as a dataset for future use in training, evaluation, or analysis
SGP Evaluation: Create an evaluation to track the quality of resume categorization over time

Both options allow you to version your results, track changes, and reference the data in other workflows or dashboards.
