This guide walks you through parsing documents and extracting structured data in Compass, powered by our Dex SDK, so that you can use unstructured data in your workflows. You’ll learn how to connect to cloud file storage, filter and parse documents, extract structured data, store parsed data in a Knowledge Base, and use the results in your workflows.

Step 1: Connect to cloud file storage

Start by navigating to Workflows in SGP. Click on the Cloud Storage Browser card to connect to your cloud file storage. Currently, Compass supports the following cloud storage providers:
  • Amazon S3
  • Azure Blob Storage
  • Google Cloud Storage
For this example, we’ll connect to an Amazon S3 bucket containing resume documents that we want to parse and analyze. Select your cloud storage provider and choose your registered connection from the dropdown. Browse or specify the path to the files you want to process.
To add an S3 bucket for use with the Cloud Storage Browser card, you must first create a bucket using Terraform and add the bucket to the existing policy within the Compass role. Please reach out to the Compass team for example PRs that illustrate how to take those steps.

Step 2: Filter files (optional)

Before parsing, you can use a Filter card to select only the files you want to process. This helps you narrow down your dataset and avoid parsing unnecessary documents. Purely as an example, here we keep only resumes of a certain size. This step is optional but recommended when working with large file sets to reduce processing time and costs.
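The Filter card itself is configured in the UI. To make the idea concrete, here is a minimal plain-Python sketch of the kind of rule such a filter expresses. The file listing, the `path` and `sizeBytes` keys, and the 10 MiB limit are all assumptions made for illustration, not Compass field names or defaults.

```python
from pathlib import PurePosixPath

# Hypothetical file listing; the keys are illustrative, not Compass's schema.
files = [
    {"path": "resumes/alice.pdf", "sizeBytes": 182_000},
    {"path": "resumes/bob.docx", "sizeBytes": 95_000},
    {"path": "resumes/scan_batch.pdf", "sizeBytes": 48_000_000},
    {"path": "resumes/notes.txt", "sizeBytes": 1_200},
]

MAX_SIZE_BYTES = 10 * 1024 * 1024  # assumed limit of 10 MiB
ALLOWED_SUFFIXES = {".pdf", ".docx"}  # assumed document types


def keep_file(entry: dict) -> bool:
    """Keep only allowed document types that are at or under the size limit."""
    suffix = PurePosixPath(entry["path"]).suffix.lower()
    return suffix in ALLOWED_SUFFIXES and entry["sizeBytes"] <= MAX_SIZE_BYTES


selected = [f for f in files if keep_file(f)]
print([f["path"] for f in selected])
```

Dropping oversized or irrelevant files before the parse step means they never consume parsing time.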

Step 3: Parse & Extract from documents with the Dex Card

This step uses our Document Understanding capability, backed by the Dex SDK. Add a Dex card to extract structured data from your documents. The card automatically creates a Dex project that you can reference later in the Dex UI.

Dex Project Configuration

  • Project Name: Give your Dex project a descriptive name (e.g., “Resume Parsing Project”)
  • Project ID (Optional):
    • Enter an existing project ID to reuse a project
    • Leave empty to create a new project each time the workflow runs

Parsing

This is the step that converts complex, unstructured files into LLM-ready JSON or markdown format. Choose the OCR engine based on your document language and layout:
  • Engine: Select from available parsing engines (e.g., Reducto)
    • Reducto: Best for English and Latin-script documents with tables, figures, and complex layouts
For Reducto, configure the following settings to select how documents should be split into chunks:
  • Variable (recommended): Automatically determines optimal chunk boundaries
  • Chunk Size (Optional): Set to Auto to let the engine determine the best size
You can also enable contextual retrieval, which uses an LLM to prepend document-level context to each chunk, improving retrieval accuracy.
Control the visibility of the output data:
  • Include parsed data in output: Toggle on to output parsed text content for each chunk
    • When enabled: Outputs parsed text content with full details
    • When disabled: Outputs one row per file with metadata (project ID, parsed result ID) only
Finally, you can configure parse retries and per-file parse timeouts to suit your workflow’s needs.
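The contextual retrieval option mentioned above can be pictured roughly as follows. This is only a conceptual sketch, not the Dex SDK's implementation: `add_context` and `fake_summarize` are hypothetical names, and the LLM call is replaced by a stub that returns a fixed string.

```python
from typing import Callable, List


def add_context(chunks: List[str], summarize: Callable[[str], str]) -> List[str]:
    """Prepend one document-level context string to every chunk."""
    context = summarize("\n".join(chunks))
    return [f"{context}\n\n{chunk}" for chunk in chunks]


def fake_summarize(document: str) -> str:
    # Stand-in for an LLM call; a real system would generate this text.
    return "Context: resume of a software engineer."


chunks = [
    "Experience: Acme Corp, 2019-2023.",
    "Education: BSc Computer Science.",
]
contextualized = add_context(chunks, fake_summarize)
for chunk in contextualized:
    print(chunk)
    print("---")
```

Because each chunk now carries a short description of the document it came from, a chunk that is ambiguous on its own (for example, a bare list of dates) is easier to match against a query.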

Structured Extraction

Optionally extract specific fields from parsed documents using LLM-based extraction. This feature pulls targeted data from unstructured documents into structured metadata columns.
  • Run extraction after parse: Toggle on to enable structured extraction. This uses the Dex SDK on each parse result (per file). Note: Large documents may hit LLM context limits
When enabled, configure the following:
Schema Definition
Define the fields you want to extract. Add all the fields you need:
  • Column name: Name for the extracted field (e.g., “latestPosition”)
  • Data type: Select the data type (String, Number, Boolean, etc.)
  • Description: Describe what should be extracted (e.g., “What’s the latest position they worked at”)
  • Click Add field to define additional extraction fields
Extraction Options
  • Generate citations: Toggle on to return bounding boxes for extracted values. Enabling this increases the size of each row payload
  • Generate confidence: Toggle on to return confidence scores for extracted values. Enabling this increases the size of each row payload
Extraction Model Configuration
  • Extraction model: Specify the LLM to use for extraction (e.g., “openai/gpt-4o”, “vertex_ai/gemini-2.5-pro”). This uses LiteLLM-style model IDs
  • Extraction retries: Number of retries for structured extraction (0-10). Default: 3 retries
Provide custom instructions to guide the extraction process. For example:
  From the provided text, extract the following:
  **Patient Info**
  - patient name in the document
  - what is the patient symptom?
Extracted field keys in the output will use the extract_ prefix to distinguish them from standard parse output columns.
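As a rough illustration of the schema and the extract_ prefix described above, the sketch below models field definitions as plain data and shows how prefixed keys stay separate from standard parse columns. The `ExtractionField` class, the `prefix_extracted` helper, the `yearsOfExperience` field, and all values are hypothetical; the real schema is defined in the card UI.

```python
from dataclasses import dataclass


@dataclass
class ExtractionField:
    """One schema entry: column name, data type, and extraction description."""
    column_name: str
    data_type: str
    description: str


schema = [
    ExtractionField("latestPosition", "String", "What's the latest position they worked at"),
    ExtractionField("yearsOfExperience", "Number", "Total years of professional experience"),
]


def prefix_extracted(values: dict) -> dict:
    """Add the extract_ prefix so extracted keys cannot clash with parse columns."""
    return {f"extract_{name}": value for name, value in values.items()}


# Hypothetical extraction result for one file, keyed by schema column name.
extracted = {"latestPosition": "Staff Engineer", "yearsOfExperience": 8}
parse_row = {"parseResultId": "pr_123", "status": "completed"}

row = {**parse_row, **prefix_extracted(extracted)}
print(row)
```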

Knowledge Base

Optionally push parsed results into a vector store for semantic search and RAG:
  • Push to Vector Store: Toggle on to enable Knowledge Base integration
  • Vector Store Name: Provide a name for your Knowledge Base (e.g., “Resume Vector Store”)
  • Embedding Model: Specify the embedding model (e.g., openai/text-embedding-3-large)
When enabled, parsed documents will be automatically indexed in the SGP Knowledge Base, accessible via the Knowledge Base UI for semantic search and retrieval.

Step 4: View parsed & extracted output

After the Dex card processes your documents, the output is a dataframe with the following columns:
Standard Parse Columns
  • parseResultId: Unique identifier for the parsing result
  • projectId: The Dex project ID where results are stored
  • engine: The parsing engine used (e.g., “Reducto”)
  • pagesProcessed: Number of pages processed in the document
  • chunkCount: Number of chunks created from the document
  • status: Parsing status (e.g., “completed”, “failed”)
  • parsedContent: The extracted text content from the document
  • vectorStoreId: ID of the Knowledge Base vector store (if enabled)
Structured Extraction Columns (if enabled)
When structured extraction is enabled, additional columns are added with the extract_ prefix:
  • extract_<fieldName>: The extracted value for each defined field (e.g., extract_latestPosition)
  • extract_<fieldName>_citations: Bounding box coordinates for the extracted value (if citations enabled)
  • extract_<fieldName>_confidence: Confidence score for the extraction (if confidence enabled)
You can use this dataframe in subsequent workflow cards to analyze, transform, or process the parsed data.
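For instance, a downstream step might keep only rows that parsed successfully and whose extraction confidence clears a threshold. The sketch below uses plain Python dictionaries in place of the dataframe. The row values are made up, and the 0-to-1 confidence scale and the 0.8 threshold are assumptions, since the scale is not specified here.

```python
# Made-up rows using the column names listed above.
rows = [
    {"parseResultId": "pr_1", "status": "completed",
     "extract_latestPosition": "Data Analyst",
     "extract_latestPosition_confidence": 0.93},
    {"parseResultId": "pr_2", "status": "failed",
     "extract_latestPosition": None,
     "extract_latestPosition_confidence": None},
    {"parseResultId": "pr_3", "status": "completed",
     "extract_latestPosition": "Engineer",
     "extract_latestPosition_confidence": 0.41},
]

MIN_CONFIDENCE = 0.8  # assumed threshold on an assumed 0-to-1 scale


def is_reliable(row: dict) -> bool:
    """True when parsing completed and the extraction confidence is high enough."""
    confidence = row.get("extract_latestPosition_confidence")
    return (
        row["status"] == "completed"
        and confidence is not None
        and confidence >= MIN_CONFIDENCE
    )


reliable = [r for r in rows if is_reliable(r)]
needs_review = [r["parseResultId"] for r in rows if not is_reliable(r)]
print(len(reliable), needs_review)
```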

Step 5: Access Dex project and SGP Knowledge Base

The Dex card automatically creates a Dex project that you can access later in the Dex UI for review and management. If you enabled the “Push to Vector Store” option, navigate to the Knowledge Base section in SGP to access your vector store. There you can:
  • Search parsed documents semantically
  • Test retrieval quality
  • Use for RAG applications
  • Manage embeddings and vector store settings

Step 6: Use parsed / extracted data in your workflow

Now you can use the parsed content in subsequent workflow cards. For this example, we’ll call an agent to categorize resume content into suitable job roles.

Call Agent for Analysis

Add a Call Agent card to process the parsed content:
  • Select Agent: Choose your agent from the dropdown
  • Output Column Name: Specify where agent responses will be stored (e.g., job_categories)
  • Prompt Template: Write your prompt using {{parsedContent}} to reference the parsed text
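To illustrate how a {{parsedContent}} placeholder maps a dataframe column into the prompt, here is a minimal sketch of template substitution. It is not the Call Agent card's actual rendering logic: the `render` function, the regular expression, and the choice to raise on unknown columns are all assumptions for illustration.

```python
import re

# Matches {{columnName}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict) -> str:
    """Replace each {{column}} placeholder with that column's value from the row."""
    def replace(match: re.Match) -> str:
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Template references unknown column: {column}")
        return str(row[column])

    return PLACEHOLDER.sub(replace, template)


template = "Categorize this resume into suitable job roles:\n\n{{parsedContent}}"
row = {"parsedContent": "Senior backend engineer, 8 years of Go and Python."}

prompt = render(template, row)
print(prompt)
```

Failing loudly on an unknown column name surfaces typos in the template early, instead of sending the agent a prompt with an unfilled placeholder.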

Additional Processing

You can add more cards to:
  • Filter or transform the categorized results
  • Join with job posting data
  • Run evaluations on categorization quality
  • Generate reports or summaries

Step 7: Save your results

Finally, save the workflow output for future reference and analysis. You can save the results as:
  • SGP Dataset: Store the categorized resumes as a dataset for future use in training, evaluation, or analysis
  • SGP Evaluation: Create an evaluation to track the quality of resume categorization over time
Both options allow you to version your results, track changes, and reference the data in other workflows or dashboards.