Full Guide To Flexible Evaluation
While our standard evaluations require a simple, single-step, single-turn input-to-output schema, flexible evaluations use a fully flexible schema that lets users test GenAI use cases at any level of complexity.
This guide outlines step by step how to use flexible evaluations:
- Create a Flexible Evaluation Dataset with test cases that have multiple inputs and expected outputs of various types
- Generate Flexible Outputs from your application that can contain multiple outputs of complex types
- Attach Traces to Outputs to record the intermediate steps your application took to arrive at the final output
- Attach Metrics to Outputs to record numerical values associated with the output, such as custom automatic evaluations
- Customize the Annotation UI to allow human annotators to see the data that is relevant for them to annotate
📘 Before you dive into the details:
You may want to look at the Flexible Evaluation Recipe or the Simple Flexible Evaluation Guide to get a feel for how flexible evaluation can be used. To understand when to use flexible evaluation, see Why Use Flexible Evaluation.
Flexible Evaluation Datasets
To get started with flexible evaluations, you need a new evaluation dataset with schema_type="FLEXIBLE".
Evaluation datasets are wrappers for test cases, so we need to add test cases next.
The test cases in standard (schema_type="GENERATION") datasets can only have strings as input and expected_output. Flexible evaluation datasets allow input and expected_output to be a dictionary where each key is a string and each value is one of the following:
- String
- Number (i.e., integer or float)
- Messages (list of objects with “role” and “content” fields)
  - "role": "user", "assistant", or "system"
  - "content": string
- Chunks (list of objects with “text” and optionally a “metadata” field)
  - "text": string
  - "metadata": dictionary of strings to any JSON value
- List of any JSON value
- JSON object
  - Example: {"key": "value"}
  - Example: {"key": [{"nested": {"hello": "world"}}]}
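The value types listed above can be illustrated with plain Python literals. This is only a sketch and does not call the SGP SDK; the dictionary keys (`query`, `max_results`, `conversation`, and so on) are hypothetical names chosen for illustration, not names that SGP requires.

```python
import json

# One illustrative value per supported type. The keys are hypothetical
# names picked for this sketch; only the value shapes follow the list above.
flexible_values = {
    # String
    "query": "What is the refund policy?",
    # Number (integer or float)
    "max_results": 3,
    # Messages: list of objects with "role" and "content"
    "conversation": [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "Hi, I need help with an order."},
    ],
    # Chunks: list of objects with "text" and optional "metadata"
    "retrieved_chunks": [
        {
            "text": "Refunds are issued within 30 days.",
            "metadata": {"source": "policy.md", "page": 2},
        },
        {"text": "Gift cards are non-refundable."},  # metadata omitted
    ],
    # List of any JSON value
    "tags": ["billing", 42, {"priority": "high"}],
    # JSON object
    "settings": {"key": [{"nested": {"hello": "world"}}]},
}

# Every value should survive a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(flexible_values))
```

Checking that the structure round-trips through JSON is a cheap way to catch values (such as sets or custom objects) that would not fit any of the types above.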
Here’s an example of creating a flexible test case:
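As a rough sketch of the shape of such a test case, the snippet below builds one as a plain Python dictionary and checks it locally before upload. It does not call the SGP SDK: `validate_flexible_field` and `make_flexible_test_case` are hypothetical helpers, and the keys inside `input` and `expected_output` are made up for illustration. Only the field names `input` and `expected_output` come from this guide.

```python
import json
from typing import Any, Dict


def validate_flexible_field(name: str, value: Any) -> Dict[str, Any]:
    """Check that a field is a dict with string top-level keys and JSON-serializable values."""
    if not isinstance(value, dict):
        raise TypeError(f"{name} must be a dictionary, got {type(value).__name__}")
    for key in value:
        if not isinstance(key, str):
            raise TypeError(f"{name} keys must be strings, got {key!r}")
    json.dumps(value)  # raises TypeError if any value is not JSON-serializable
    return value


def make_flexible_test_case(
    test_input: Dict[str, Any], expected_output: Dict[str, Any]
) -> Dict[str, Any]:
    """Assemble a test case dictionary (hypothetical helper, not an SDK call)."""
    return {
        "input": validate_flexible_field("input", test_input),
        "expected_output": validate_flexible_field("expected_output", expected_output),
    }


test_case = make_flexible_test_case(
    test_input={
        "question": "Which plan includes priority support?",
        "conversation": [
            {"role": "user", "content": "I am comparing your plans."},
            {"role": "assistant", "content": "Happy to help. What matters most to you?"},
        ],
        "top_k": 2,
    },
    expected_output={
        "answer": "The Enterprise plan includes priority support.",
        "supporting_chunks": [
            {"text": "Enterprise: priority support included.", "metadata": {"doc": "pricing"}}
        ],
    },
)
```

Validating locally is a design choice for this sketch: it surfaces malformed test cases before any upload is attempted.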
After publishing the flexible evaluation dataset, we can view it in the UI.
Flexible Outputs
After you create a flexible evaluation dataset, you can create a test case output for each test case. Each output represents the result of running an application on that test case. Before you do this, you’ll need to create an external application so you can tie your test case outputs to the application.
Test case outputs generated from flexible evaluation datasets can also be a dictionary where each key is a string and each value is one of the following types, just as in flexible test cases:
- String
- Number
- Messages
- Chunks
- List of any JSON value
- JSON object
Here’s an example of uploading test case outputs:
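The sketch below shows one way to assemble flexible outputs before uploading them. It does not call the SGP SDK: `answer_question` is a hypothetical stand-in for your application, and the record fields `id`, `test_case_id`, and `output` are assumed names for illustration. The actual upload call depends on the SDK and is omitted.

```python
from typing import Any, Callable, Dict, List


def answer_question(test_input: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical application: returns a flexible output with several typed values."""
    question = test_input["question"]
    return {
        "answer": f"Echo: {question}",  # String
        "confidence": 0.5,              # Number
        "sources": [                    # Chunks
            {"text": "placeholder chunk", "metadata": {"doc_id": "d1"}}
        ],
    }


def build_outputs(
    test_cases: List[Dict[str, Any]],
    app: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Run the application on every test case and pair each output with its test case ID."""
    return [
        {"test_case_id": case["id"], "output": app(case["input"])}
        for case in test_cases
    ]


# Hypothetical test cases; in practice these come from the flexible dataset.
test_cases = [
    {"id": "tc-1", "input": {"question": "What is the refund policy?"}},
    {"id": "tc-2", "input": {"question": "How do I reset my password?"}},
]

outputs = build_outputs(test_cases, answer_question)
```

Keeping the test case ID next to each output makes it straightforward to tie the records back to the dataset when they are uploaded.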
Attaching Traces to Outputs
While having multiple inputs and outputs helps, many complex or agentic AI applications have multiple intermediate steps (e.g. reasoning, retrieval, tool use) that are crucial to evaluate so we can understand what’s happening inside our application. Attaching traces to test case outputs allows us to record all of these intermediate steps.
A trace keeps a record of the inputs and outputs of every step as your application executes. Its operation input and operation output must each be a dictionary of string keys to values of type string, number, messages, etc., just like the input to flexible test cases.
This is how you can attach the trace to a test case output:
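A minimal sketch of the idea, without calling the SGP SDK: each step is modeled as a span holding an operation input and an operation output, and the spans are stored alongside the output record. The `TraceSpan` class, the `node_id` field, the `trace_spans` key, and the `attach_trace` helper are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class TraceSpan:
    """One intermediate step of the application (hypothetical structure)."""

    node_id: str                      # assumed field: name of the step
    operation_input: Dict[str, Any]   # string keys mapped to flexible values
    operation_output: Dict[str, Any]  # string keys mapped to flexible values


def attach_trace(output_record: Dict[str, Any], spans: List[TraceSpan]) -> Dict[str, Any]:
    """Return a copy of the output record with the spans stored under 'trace_spans'."""
    for span in spans:
        for field_name in ("operation_input", "operation_output"):
            payload = getattr(span, field_name)
            if not isinstance(payload, dict) or not all(isinstance(k, str) for k in payload):
                raise TypeError(f"{span.node_id}.{field_name} must be a dict with string keys")
    return {**output_record, "trace_spans": [asdict(span) for span in spans]}


chunks = [{"text": "Refunds are issued within 30 days."}]
spans = [
    TraceSpan("retrieval", {"query": "refund policy"}, {"chunks": chunks}),
    TraceSpan(
        "completion",
        {"query": "refund policy", "chunks": chunks},
        {"answer": "Refunds are issued within 30 days."},
    ),
]

output_record = {
    "test_case_id": "tc-1",
    "output": {"answer": "Refunds are issued within 30 days."},
}
traced_output = attach_trace(output_record, spans)
```

Returning a new dictionary instead of mutating `output_record` keeps the original output reusable, for example when comparing runs with and without traces.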
Typically you would want to generate traces automatically in your external application. Here’s what that would look like building on the External Applications guide:
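One possible pattern for automatic trace generation is a decorator that records the inputs and outputs of each step as it runs. The sketch below is not taken from the External Applications guide and does not use the SGP SDK: `TraceRecorder`, the step functions, and the `node_id` and `trace_spans` names are all hypothetical.

```python
import functools
from typing import Any, Callable, Dict, List


class TraceRecorder:
    """Collects one span per decorated step call (hypothetical helper)."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    def step(self, node_id: str) -> Callable:
        """Decorator factory: record keyword inputs and dict outputs of a step."""

        def decorator(func: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
            @functools.wraps(func)
            def wrapper(**kwargs: Any) -> Dict[str, Any]:
                result = func(**kwargs)
                self.spans.append(
                    {
                        "node_id": node_id,
                        "operation_input": dict(kwargs),
                        "operation_output": result,
                    }
                )
                return result

            return wrapper

        return decorator


recorder = TraceRecorder()


@recorder.step("retrieval")
def retrieve(query: str) -> Dict[str, Any]:
    # Stand-in for a real retriever.
    return {"chunks": [{"text": f"Document about {query}"}]}


@recorder.step("completion")
def generate(query: str, chunks: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Stand-in for a real model call.
    return {"answer": f"Answer to '{query}' using {len(chunks)} chunk(s)"}


def run_application(query: str) -> Dict[str, Any]:
    """Run both steps and return the final output together with the recorded spans."""
    recorder.spans.clear()
    retrieved = retrieve(query=query)
    completion = generate(query=query, chunks=retrieved["chunks"])
    return {"output": completion, "trace_spans": list(recorder.spans)}


result = run_application("refund policy")
```

Requiring keyword arguments in the wrapper means every recorded input is already a dictionary with string keys, which matches the shape described for operation inputs.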
Attaching Custom Metrics to Outputs
📘 Note that custom metrics can be used for any external app; you don’t need a flexible evaluation dataset or traces.
You can also attach custom metrics to outputs. Metrics are numerical values that can record, for example, how many tokens it took to generate an output, or computed evaluation scores such as F1 or BLEU.
Metrics can be passed as a dictionary mapping a string key to a numeric value:
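As an illustration, the sketch below computes two metrics locally and collects them in such a dictionary. It does not call the SGP SDK. The metric names `f1` and `output_tokens` are arbitrary, the token count uses whitespace splitting as a rough approximation, and `token_f1` is a simple token-overlap F1 written for this example.

```python
from collections import Counter
from typing import Dict, Union


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; one empty counts as no match.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


prediction = "The capital is Paris"
reference = "The capital of France is Paris"

# String keys mapped to numeric values, as described above.
metrics: Dict[str, Union[int, float]] = {
    "f1": token_f1(prediction, reference),
    "output_tokens": len(prediction.split()),
}
```

Here all four predicted tokens appear in the reference, so precision is 1 and recall is 4/6, which gives an F1 of 0.8.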
You can create an evaluation to see metrics on the Metrics tab or the Table tab:
Note that you can configure how metrics are aggregated on the Metrics tab and filter by metric values on the Table tab.
Create a custom Annotations UI
By default, the annotation UI that annotators see in SGP shows the test case input, expected output, and output.
However, for complex evaluations you may want to:
- display data from the trace
- select which parts of test case inputs and test case outputs to display
- modify the layout of the annotation UI
To customize the configuration of the annotation UI, see Create a custom Annotations UI.
Flexible Evaluation Limitations
Flexible evaluations are currently only available for human evaluations. Auto-evaluations cannot be used with flexible datasets or annotation configurations. Flexible evaluations are currently only supported for external applications and can only be triggered via the SDK. We will add support for native SGP applications in flexible evaluations soon.
What’s Next
To get started building with flexible evaluations, take a look at the Flexible Evaluation recipe.