This guide walks you through building an automated evaluation workflow in Compass. You’ll learn how to query data, call your agent, join ground truth datasets, run LLM-as-Judge evaluations, and schedule everything to run automatically.

Step 1: Connect to the data source

Start by navigating to Workflows in SGP and choose from the available data connection options: Traces, Data Sources, or CSV. For this example, we'll connect to a Snowflake data source containing a list of supermarket items, each with a product name and description, that our agent must categorize. Select Data Sources, choose your registered Snowflake connection from the dropdown, and write a SQL query to fetch the data you want to evaluate. You can set a Row Limit of up to 1,000,000.
Only read-only SQL queries are allowed. INSERT, UPDATE, DELETE, DROP, and other modifying statements are not permitted.
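For illustration, assuming the Snowflake table is named SUPERMARKET_ITEMS with PRODUCT_ID, PRODUCT_NAME, and PRODUCT_DESCRIPTION columns (hypothetical names, not part of Compass itself), the query might look like:

    -- Fetch the items to categorize; keep a stable key for later joins
    SELECT
        PRODUCT_ID,
        PRODUCT_NAME,
        PRODUCT_DESCRIPTION
    FROM SUPERMARKET_ITEMS
    LIMIT 1000;  -- stays well under the 1,000,000 Row Limit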

Step 2: Call your agent for completions

Use a Call Agent card to generate outputs from your agent for each row in your dataset. The Agentex agent you select must be live on your SGP deployment. In this example, our prompt instructs the agent to assign a category to each supermarket item. The configuration options are:
  • Select Agent: Choose the agent you want to evaluate from the dropdown
  • Output Column Name: Specify the column name where agent responses will be stored (e.g., agent_response)
  • Include Traces: Toggle on if you want to fetch execution traces for debugging
  • Prompt Template: Write your prompt using {{variableName}} syntax to reference columns from your data (see the example template after this list). Click the variable buttons to insert column references automatically.
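As a minimal sketch, assuming the query from Step 1 returned product_name and product_description columns (hypothetical names), the prompt template could look like:

    You are a product categorization assistant for a supermarket.
    Assign exactly one category (for example: Produce, Dairy, Bakery)
    to the item below, and respond with the category name only.

    Product name: {{product_name}}
    Description: {{product_description}}

Each row in your dataset produces one agent call, with the placeholders filled from that row's columns.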
Under Settings, you can configure:
  • Timeout Duration: Maximum time in seconds to wait for agent responses (default: 300)
  • Number of retries: How many times to retry failed calls
  • Response Caching: Enable to cache responses and avoid redundant API calls

Step 3: Join data from another workflow (e.g., Ground Truth)

Use the Join Workflow card to bring in data from another Compass workflow, such as a ground truth dataset for comparison. Here we will compare our agent's categorization against the ground truth categorization for each product.
  1. Select Workflow: Choose the workflow containing your ground truth or reference data
  2. Join Type: Select how you want to combine the datasets (see the SQL sketch after this list):
    • Left Join: Keep all rows from the current workflow
    • Right Join: Keep all rows from the target workflow
    • Inner Join: Keep only matching rows
    • Outer Join: Keep all rows from both workflows
  3. Join Keys: Select the matching columns from each workflow (e.g., trace_id)
  4. Execute target workflow before joining: Enable this to get fresh data from the target workflow; otherwise, the most recent execution result will be used
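If it helps to think in SQL terms, the join types correspond to standard SQL joins. A rough equivalent of an Inner Join on trace_id, with hypothetical table names standing in for the two workflows:

    -- Keep only rows whose trace_id appears in both workflows
    SELECT cur.*, gt.ground_truth_category
    FROM current_workflow AS cur
    INNER JOIN ground_truth_workflow AS gt
        ON cur.trace_id = gt.trace_id;

A Left Join would instead keep every row from current_workflow, leaving the ground truth columns empty where no match exists.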

Step 4: Use the LLM as Judge card to run evals

Add an LLM as Judge card to evaluate your agent's outputs against the ground truth.
  1. Select Judge: Choose to create a new judge or use an existing one
  2. Judge Type: Select from the available options:
    • Legacy LLM as Judge: Traditional LLM judge with custom configuration
    • Base Evals Agent as Judge: General-purpose evaluation agent with flexible capabilities
  3. Model: Choose the LLM model to use for judging (e.g., openai/gpt-4o)
  4. Temperature: Set to 0 for deterministic outputs
  5. LLM Judge Name: Give your judge a name for future usage
  6. Rubric: Define evaluation criteria. Use {{columnName}} to reference data columns (see the example rubric after this list)
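As a sketch for this categorization example, assuming the joined data contains agent_response and ground_truth_category columns (hypothetical names from the earlier steps), the rubric might read:

    Compare the agent's category to the ground truth category.

    Agent category: {{agent_response}}
    Ground truth category: {{ground_truth_category}}

    Score 1 if the two refer to the same supermarket category
    (an exact match or an unambiguous synonym); otherwise score 0.
    Respond with the score followed by a one-sentence justification.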
After configuring the judge, scroll down to the Evaluation Details section to export your results as an evaluation you can access anytime later, which is especially helpful when running evaluations on a schedule.
  • Evaluation Name: A descriptive name for your evaluation
  • Description: Explain what this evaluation measures
  • Tags: Add tags to filter and organize evaluations by groups later
  • Save judge for future reuse: Check to save this judge configuration so it can be selected from the existing judge dropdown in other workflows
  • Create new evaluation on re-run: Check to create a new evaluation each time the workflow runs
Every time you run this LLM as Judge card, you'll see a success message with a direct link to view your evaluation results. Depending on the configuration you selected, either a new evaluation is created or the previous one is replaced. Evaluations can be accessed and referenced anytime from the Evaluations and Dashboards tabs.

Step 5: Open the evals in the Evaluations or Dashboards tab

Navigate to the Evaluations tab in SGP to view your evaluation results. You can search, filter by date range, and filter by tags to find specific evaluations. Click on an evaluation to see:
  • Overview: Visual summary of evaluation metrics and response distributions
  • Data: Detailed row-by-row results with all columns
  • Metadata: Information about when and how the evaluation was created
From here, you can also open the evaluation in the Dashboards tab for deeper analysis and custom visualizations.

Step 6: Automate the evals to run on a schedule

To automate your evaluation workflow, click Edit Workflow and configure a schedule:
  1. Set a Schedule: Toggle on to enable scheduling
  2. Days: Choose Daily to run every day, or select specific days of the week
  3. Time: Set the frequency (e.g., Hourly) or a specific time
  4. End Date: Schedules can run for up to 30 days into the future
  5. Enable Notifications: Toggle on to receive Slack alerts for failed runs
Click Save to activate your scheduled workflow. You can monitor all scheduled runs from the Execution logs in the side menu, and view Workflow history to track changes across versions.