This guide walks you through parsing documents in Compass so you can use unstructured data in your workflows (powered by our Dex SDK). You’ll learn how to connect to cloud file storage, filter and parse documents, store parsed data in a Knowledge Base, and use the results in your workflows.

Step 1: Connect to cloud file storage

Start by navigating to Workflows in SGP. Click the Cloud Storage Browser card to connect to your cloud file storage. Currently, Compass supports the following cloud storage providers:
  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage
For this example, we’ll connect to an Amazon S3 bucket containing resume documents that we want to parse and analyze. Select your cloud storage provider and choose your registered connection from the dropdown. Browse or specify the path to the files you want to process.
To add an S3 bucket for use with the Cloud Storage Browser card, you must first create a bucket using Terraform and add the bucket to the existing policy within the Compass role. Please reach out to the Compass team for example PRs that illustrate how to take those steps.

Step 2: Filter files (optional)

Before parsing, you can use a Filter card to select only the files you want to process. This helps you narrow down your dataset and avoid parsing unnecessary documents. In this example, we filter only for resumes of a certain size. This step is optional but recommended when working with large file sets, because it reduces processing time and cost.
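The Filter card is configured in the Compass UI, so no code is needed for this step. Purely to illustrate the kind of rule it applies, here is a minimal Python sketch of a size-and-extension filter. The FileEntry record, the filter_files helper, the paths, and the thresholds are all hypothetical and are not part of Compass or the Dex SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical file record for illustration; not the Compass schema."""
    path: str
    size_bytes: int


def filter_files(files, min_bytes=0, max_bytes=None, suffixes=(".pdf",)):
    """Keep files with an allowed suffix whose size is within the given bounds."""
    kept = []
    for f in files:
        # Skip files whose extension is not in the allowed set.
        if not f.path.lower().endswith(tuple(suffixes)):
            continue
        # Skip files that are too small (for example, empty or stub files).
        if f.size_bytes < min_bytes:
            continue
        # Skip files that are too large, if an upper bound was given.
        if max_bytes is not None and f.size_bytes > max_bytes:
            continue
        kept.append(f)
    return kept


# Made-up listing standing in for the contents of a bucket path.
files = [
    FileEntry("resumes/alice.pdf", 120_000),
    FileEntry("resumes/bob.pdf", 9_500_000),
    FileEntry("resumes/notes.txt", 2_000),
]

small_resumes = filter_files(files, min_bytes=10_000, max_bytes=5_000_000)
print([f.path for f in small_resumes])  # ['resumes/alice.pdf']
```

Filtering before parsing means the oversized PDF and the non-resume text file never reach the parse step.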

Step 3: Parse documents with the Dex Parse Card

This feature uses our Document Understanding capability, which runs on the Dex SDK in the backend. Add a Dex Parse card to extract structured data from your documents. The card automatically creates a Dex project that you can reference later in the Dex UI.

Dex Project Configuration

  • Project Name: Give your Dex project a descriptive name (e.g., “Resume Parsing Project”)
  • Project ID (Optional):
    • Enter an existing project ID to reuse a project
    • Leave empty to create a new project each time the workflow runs

Parse Engine

Choose the OCR engine based on your document language and layout:
  • Engine: Select from available parsing engines (e.g., Reducto)
    • Reducto: Best for English and Latin-script documents with tables, figures, and complex layouts

Engine Configuration

For Reducto, configure the following settings:
  • Chunking Method: Select how documents should be split into chunks
    • Variable (recommended): Automatically determines optimal chunk boundaries
  • Chunk Size (Optional): Set to Auto to let the engine determine the best size

Output Options

Control the visibility of the output data:
  • Include parsed data in output: Toggle on to output parsed text content for each chunk
    • When enabled: Outputs parsed text content with full details
    • When disabled: Outputs one row per file with metadata (project ID, parsed result ID) only

Knowledge Base

Optionally push parsed results into a vector store for semantic search and RAG:
  • Push to Vector Store: Toggle on to enable Knowledge Base integration
  • Vector Store Name: Provide a name for your Knowledge Base (e.g., “Resume Vector Store”)
  • Embedding Model: Specify the embedding model (e.g., openai/text-embedding-3-large)
When enabled, parsed documents will be automatically indexed in the SGP Knowledge Base, accessible via the Knowledge Base UI for semantic search and retrieval.
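All of the settings above are entered in the Dex Parse card UI. As a way to see them side by side, the following Python sketch mirrors the card fields in a dataclass and checks one dependency between them. The DexParseSettings class and its field names are hypothetical, and the rule that a vector store name and an embedding model are required when Push to Vector Store is on is an assumption, not documented Compass behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DexParseSettings:
    """Hypothetical mirror of the Dex Parse card fields; not a Compass or Dex API."""
    project_name: str
    project_id: Optional[str] = None       # empty: a new project is created on each run
    engine: str = "Reducto"
    chunking_method: str = "Variable"
    chunk_size: str = "Auto"
    include_parsed_data: bool = True
    push_to_vector_store: bool = False
    vector_store_name: Optional[str] = None
    embedding_model: Optional[str] = None

    def problems(self) -> List[str]:
        """Return human-readable issues; the rules here are assumptions."""
        issues = []
        if self.push_to_vector_store:
            if not self.vector_store_name:
                issues.append("Vector Store Name is required when Push to Vector Store is on")
            if not self.embedding_model:
                issues.append("Embedding Model is required when Push to Vector Store is on")
        return issues


settings = DexParseSettings(
    project_name="Resume Parsing Project",
    push_to_vector_store=True,
    vector_store_name="Resume Vector Store",
)
print(settings.problems())

# Supplying the embedding model clears the remaining issue.
settings.embedding_model = "openai/text-embedding-3-large"
print(settings.problems())
```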

Step 4: View parsed output

After the Dex Parse card processes your documents, the output is a dataframe with the following columns:
  • parseResultId: Unique identifier for the parsing result
  • projectId: The Dex project ID where results are stored
  • engine: The parsing engine used (e.g., “Reducto”)
  • pagesProcessed: Number of pages processed in the document
  • chunkCount: Number of chunks created from the document
  • status: Parsing status (e.g., “completed”, “failed”)
  • parsedContent: The extracted text content from the document
  • vectorStoreId: ID of the Knowledge Base vector store (if enabled)
You can use this dataframe in subsequent workflow cards to analyze, transform, or process the parsed data.
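As an illustration of downstream processing, the sketch below treats the output as a list of row dictionaries that use the column names listed above. It keeps only completed rows and joins the chunks for each parse result into one text. The row values are made up, and the one-row-per-chunk layout when parsed data is included is an assumption based on the Output Options description.

```python
from collections import defaultdict

# Made-up rows using the documented column names (a subset, for brevity).
rows = [
    {"parseResultId": "pr-1", "projectId": "proj-1", "engine": "Reducto",
     "status": "completed", "parsedContent": "Jane Doe. Data engineer."},
    {"parseResultId": "pr-1", "projectId": "proj-1", "engine": "Reducto",
     "status": "completed", "parsedContent": "Skills: Python, SQL."},
    {"parseResultId": "pr-2", "projectId": "proj-1", "engine": "Reducto",
     "status": "failed", "parsedContent": None},
]


def collect_completed_text(rows):
    """Group chunk text by parseResultId, skipping rows that did not complete."""
    by_result = defaultdict(list)
    for row in rows:
        if row["status"] != "completed":
            continue
        # Rows without text (for example, metadata-only output) are ignored.
        if row.get("parsedContent"):
            by_result[row["parseResultId"]].append(row["parsedContent"])
    # Join chunks in the order they appeared in the output.
    return {rid: "\n".join(chunks) for rid, chunks in by_result.items()}


documents = collect_completed_text(rows)
print(documents)
```

Checking the status column first keeps failed parses from reaching later cards with empty content.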

Step 5: Access Dex project and SGP Knowledge Base

The Dex Parse card automatically creates a Dex project that you can access later for review and management. If you enabled the “Push to Vector Store” option, navigate to the Knowledge Base section in SGP to access your vector store, where you can:
  • Search parsed documents semantically
  • Test retrieval quality
  • Use for RAG applications
  • Manage embeddings and vector store settings

Step 6: Use parsed data in your workflow

Now you can use the parsed content in subsequent workflow cards. For this example, we’ll call an agent to categorize resume content into suitable job roles.

Call Agent for Analysis

Add a Call Agent card to process the parsed content:
  • Select Agent: Choose your agent from the dropdown
  • Output Column Name: Specify where agent responses will be stored (e.g., job_categories)
  • Prompt Template: Write your prompt using {{parsedContent}} to reference the parsed text
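To make the {{parsedContent}} placeholder concrete, here is a minimal Python sketch of how a template like this could be filled in for each row. It is not how Compass renders templates internally; the render_prompt helper and the sample row are hypothetical, and the sketch assumes each placeholder names a column in the row.

```python
import re

# Matches {{columnName}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(template, row):
    """Replace each {{column}} placeholder with that column's value from the row."""
    def substitute(match):
        column = match.group(1)
        if column not in row:
            # Fail loudly rather than send a prompt with an unfilled placeholder.
            raise KeyError(f"Template references unknown column: {column}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "Categorize this resume into suitable job roles.\n\n"
    "Resume:\n{{parsedContent}}"
)
row = {"parsedContent": "Jane Doe. Data engineer. Skills: Python, SQL."}

prompt = render_prompt(template, row)
print(prompt)
```

Raising an error on an unknown column is a design choice in this sketch: a misspelled placeholder is caught before any agent call is made.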

Additional Processing

You can add more cards to:
  • Filter or transform the categorized results
  • Join with job posting data
  • Run evaluations on categorization quality
  • Generate reports or summaries

Step 7: Save your results

Finally, save the workflow output for future reference and analysis. You can save the results as:
  • SGP Dataset: Store the categorized resumes as a dataset for future use in training, evaluation, or analysis
  • SGP Evaluation: Create an evaluation to track the quality of resume categorization over time
Both options allow you to version your results, track changes, and reference the data in other workflows or dashboards.