This guide walks you through building an automated evaluation workflow in Workflows. You’ll learn how to query data, call your agent, join ground truth datasets, run LLM-as-Judge evaluations, and schedule everything to run automatically.Documentation Index
Fetch the complete documentation index at: https://docs.gp.scale.com/llms.txt
Use this file to discover all available pages before exploring further.
Step 1: Connect to the data source
Start by navigating toWorkflows in SGP. Choose from the available data connection options: Traces, Data Sources, or CSV. We add more options based on user demand.


Data Sources, then choose your registered Snowflake connection from the dropdown. Write a SQL query to fetch the data you want to evaluate. You can set a Row Limit up to 1,000,000.
Only read-only SQL queries are allowed. INSERT, UPDATE, DELETE, DROP, and other modifying statements are not permitted.
Step 2: Call your agent for completions
Use aCall Agent card to generate outputs from your agent for each row in your dataset. The Agentex agent you select has to be deployed on your SGP instance to be available in the dropdown.

Select Agent: Choose the agent you want to evaluate from the dropdownOutput Column Name: Specify the column name where agent responses will be stored (e.g.,agent_response)Include Traces: Toggle on if you want to fetch execution traces for debuggingPrompt Template: Write your prompt using{{variableName}}syntax to reference columns from your data. Click the variable buttons to insert column references automatically.
Settings, you can configure:
Timeout Duration: Maximum time in seconds to wait for agent responses (default: 300)Number of retries: How many times to retry failed callsResponse Caching: Enable to cache responses and avoid redundant API calls
Advanced settings, you can configure the completion criteria that determines when to stop polling for agent messages and the batch size that determines how many rows to process concurrently. This card generates the agent task IDs immediately as output.

Call agent card, you need to use the Get agent messages card right after to get the agent’s responses into the dataframe. The Get agent messages card takes the agent task IDs created in the previous card as input, along with configuration options such as agent message limits and matching criteria, to receive exactly what you need from the agent response (1st message, last message, tool calls, etc.). The 2-part implementation provides flexibility where you can get the agent response in the 2nd card without having to wait for all the tasks in the 1st card to finish.
Step 3: Join data from another workflow (ex: Ground Truth)
Use theJoin Workflow card to bring in data from another Workflows workflow—such as a ground truth dataset for comparison. Here we will compare our agent’s categorization against the ground truth categorization for each product.

Select Workflow: Choose the workflow containing your ground truth or reference dataJoin Type: Select how you want to combine the datasets:Left Join: Keep all rows from the current workflowRight Join: Keep all rows from the target workflowInner Join: Keep only matching rowsOuter Join: Keep all rows from both workflows
Join Keys: Select the matching columns from each workflow (e.g.,trace_id)Execute target workflow before joining: Enable this to get fresh data from the target workflow; otherwise, the most recent execution result will be used
Step 4: Use the LLM as Judge card to do evals
Add anLLM as Judge card to evaluate your agent’s outputs against the ground truth.

Select Judge: Choose to create a new judge or use an existing oneJudge Type: Select from available options likeLegacy LLM as Judge (recommended): Traditional LLM judge with custom configuration that covers most use casesBase Evals Agent as Judge: General-purpose evaluation agent with flexible capabilities
Model: Choose the LLM model to use for judging (e.g.,openai/gpt-4o)Temperature: Set to 0 for deterministic outputsLLM Judge Name: Give your judge a name for future usageRubric: Define evaluation criteria. Use{{columnName}}to reference data columns
Evaluation Details section to export your results as an evaluation that can be accessed later anytime, which is helpful if you are running the evaluations on a schedule.

Evaluation Name: A descriptive name for your evaluationDescription: Explain what this evaluation measuresTags: Add tags to filter and organize evaluations by groups laterSave judge for future reuse: Check to save this judge configuration and use it from the existing judge dropdown in other workflowsCreate new evaluation on re-run: Check to create a new evaluation each time the workflow runs
Evaluations and Dashboards tabs anytime respectively.
Step 5: Open the evals in Evaluations tab or Dashboard tab
Navigate to theEvaluations tab in SGP to view your evaluation results. You can search, filter by date range, and filter by tags to find specific evaluations.

Overview: Visual summary of evaluation metrics and response distributionsData: Detailed row-by-row results with all columnsMetadata: Information about when and how the evaluation was created
Dashboards tab for deeper analysis and custom visualizations.
Step 6: Automate the evals to run on a schedule
To automate your evaluation workflow, click the calendar icon beside the title to set a schedule.
Set a Schedule: Toggle on to enable schedulingDays: ChooseDailyto run every day, or select specific days of the weekTime: Set the frequency (e.g.,Hourly) or a specific timeEnd Date: Schedules can run for up to 30 days into the futureEnable Notifications: Toggle on to receive Slack alerts for failed runs
Save to activate your scheduled workflow. You can monitor all scheduled runs from the Execution logs in the side menu, and view Workflow history to track changes across versions.
Step 7: Create SGP Dashboards to show trends over time
You may export a workflow’s output as an evaluation anytime utilizing theExport as Evaluation card. You can run that export on a schedule. That workflow can automatically create a new Evaluation in the SGP Evaluations tab. You can click on the View Evaluation button at the bottom of the card to visit that tab anytime.

Evaluation to an Evaluation Group. To create an Evaluation Group you need 1+ Evaluations. Go to the Evaluations tab to create an Evaluation Group.

Dashboards tab and create a dashboard using the evaluation group you just created. For your use case, you will want to create a Timeseries Widget with the metrics across your evaluations inside the same group. Learn more about Evaluation Group Dashboards →

Export as Evaluation card to write every new evaluation run to the Evaluation Group you created for this dashboard. Doing so will automatically update the associated dashboard every time you run this workflow. You can visit the updated evaluation group dashboard by clicking on the Visit Evaluation Group button at the bottom of the card and then clicking on the Open in Dashboards button at the top right in the new window. Congrats! Now, you have an automated evaluation pipeline that updates metrics and visualizations over time.

Export as Evaluation step, enable the Create Dashboard toggle, and pick your old dashboard as the template such that all the necessary widgets get carried over. You can also create a new empty dashboard with this card directly without any templates.
