While our standard evaluations require a simple, single-step, single-turn input/output schema, flexible evaluations use a fully flexible schema that lets users test GenAI use cases at varying levels of complexity.

This guide outlines step by step how to use flexible evaluations:

  1. Create a Flexible Evaluation Dataset with test cases that have multiple inputs and expected outputs of various types
  2. Generate Flexible Outputs from your application that can contain multiple outputs of complex types
  3. Attach Traces to Outputs to record the intermediate steps your application took to arrive at the final output
  4. Attach Metrics to Outputs to record numerical values associated with the output, such as custom automatic evaluations
  5. Customize the annotation UI so that human annotators see the data that is relevant for them to annotate

📘 Before you dive into the details:

You may want to look at the Flexible Evaluation Recipe or the Simple Flexible Evaluation Guide to get a feel for how flexible evaluation can be used. To understand when to use flexible evaluation, see Why Use Flexible Evaluation.

Flexible Evaluation Datasets

To get started with flexible evaluations, you need a new evaluation dataset with schema_type="FLEXIBLE":

from scale_gp import SGPClient

ACCOUNT_ID = ... # your account id here
API_KEY = ... # your API key here

sgp_client = SGPClient(api_key=API_KEY, account_id=ACCOUNT_ID)
flexible_evaluation_dataset = sgp_client.evaluation_datasets.create(
    name="Example flexible evaluation dataset", # specify all the usual evaluation dataset fields
    account_id=ACCOUNT_ID,
    schema_type="FLEXIBLE" # This is the most important part
)

Evaluation datasets are wrappers for test cases, so we need to add test cases next.

The test cases in standard (schema_type="GENERATION") datasets can only have strings as input and expected_output. Flexible evaluation datasets allow for input and expected_output to be a dictionary where each key is a string and each value is one of the following:

  • String
  • Number (i.e., integer or float)
  • Messages (list of objects with “role” and “content” fields)
    • "role": "user", "assistant", or "system"
    • "content": string
    • Example:
    [
      {
        "role": "user",
        "content": "What is Albert Einstein known for?"
      },
      {
        "role": "assistant",
        "content": "Albert Einstein is known for his contributions to theoretical physics..."
      }
    ]
    
  • Chunks (list of objects with “text” and optionally a “metadata” field)
    • "text": string
    • "metadata": dictionary of strings to any JSON value
    • Example:
    [
      {"text": "The quick brown fox jumps over the lazy dog"}, 
      {"text": "Lorem ipsum dolor...", 
      "metadata": {
        "language": "latin", "page_number": 16
        }
      }
    ]
    
  • List of any JSON value
    • Example:
    [
      1, 
      {
        "key": "value", 
        "nested": [{}]
      }, 
      null
    ]
    
  • JSON object
    • Example: {"key": "value"}
    • Example: {"key": [{"nested": {"hello": "world"}}]}
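
To make the type list above concrete, here is a minimal local sketch that labels each value of a flexible `input` or `expected_output` dictionary with one of the six types. The helper names (`classify_flexible_value`, `describe_flexible_dict`) are hypothetical and not part of the SDK, and the rules are inferred only from the list above, so the platform's own validation may differ.

```python
from typing import Any, Dict

# Roles listed in this guide for message objects
VALID_ROLES = ("user", "assistant", "system")


def _is_message(item: Any) -> bool:
    """True for objects with a valid "role" and a string "content"."""
    return (
        isinstance(item, dict)
        and item.get("role") in VALID_ROLES
        and isinstance(item.get("content"), str)
    )


def _is_chunk(item: Any) -> bool:
    """True for objects with a string "text" and an optional "metadata" dict."""
    return (
        isinstance(item, dict)
        and isinstance(item.get("text"), str)
        and isinstance(item.get("metadata", {}), dict)
    )


def classify_flexible_value(value: Any) -> str:
    """Return the type name for a top-level value, or raise ValueError."""
    if isinstance(value, str):
        return "string"
    # bool is a subclass of int in Python, so it is excluded explicitly
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    if isinstance(value, dict):
        return "json_object"
    if isinstance(value, list):
        if value and all(_is_message(v) for v in value):
            return "messages"
        if value and all(_is_chunk(v) for v in value):
            return "chunks"
        return "list"  # any other list, including an empty one
    raise ValueError(f"unsupported top-level value: {value!r}")


def describe_flexible_dict(data: Dict[str, Any]) -> Dict[str, str]:
    """Map every key of a flexible dictionary to the type of its value."""
    described = {}
    for key, value in data.items():
        if not isinstance(key, str):
            raise ValueError(f"key {key!r} is not a string")
        described[key] = classify_flexible_value(value)
    return described


example_input = {
    "string_input": "string",
    "number_input": 100.101,
    "messages_input": [{"role": "user", "content": "..."}],
    "chunks_input": [{"text": "...", "metadata": {"meta_key": "..."}}],
    "list_input": [1, "hello"],
    "json_object_input": {"hello": "world"},
}
print(describe_flexible_dict(example_input))
```

Running a check like this before uploading can catch a malformed messages or chunks list early, since such a list would be labeled as a plain list instead.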

Here’s an example of creating a flexible test case:

flexible_test_case = sgp_client.evaluation_datasets.test_cases.create(
    evaluation_dataset_id=flexible_evaluation_dataset.id, # specify all the usual test case fields
    test_case_data=dict(
        input={
            "string_input": "string",
            "number_input": 100.101,
            "messages_input": [{"role": "user", "content": "..."}],
            "chunks_input": [{"text": "...", "metadata": {"meta_key": "..."}}],
            "list_input": [1, "hello"],
            "json_object_input": {"hello": "world"}
        },
        expected_output={
            "string_expected": "string",
            "number_expected": 100.101,
            "messages_expected": [{"role": "user", "content": "..."}],
            "chunks_expected": [{"text": "...", "metadata": {"meta_key": "..."}}],
            "list_expected": [1, "hello"],
            "json_object_expected": {"hello": "world"}
        }
    ),
)

sgp_client.evaluation_datasets.publish(evaluation_dataset_id=flexible_evaluation_dataset.id)

After publishing the flexible evaluation dataset, we can view it in the UI.

Flexible Outputs

After you create a flexible evaluation dataset, you can create a test case output for each test case. Each output represents the result of running an application on that test case. Before you do this, you’ll need to create an external application so that you can tie your test case outputs to it.

Test case outputs generated from flexible evaluation datasets can also accept a dictionary where each key is a string and each value is one of the following, just as in flexible test cases:

  • String
  • Number
  • Messages
  • Chunks
  • List of any JSON value
  • JSON object

Here’s an example of uploading test case outputs:

# you need to create an external application
# Follow the linked guide to create one
application_variant_id = ... 

test_case_outputs = sgp_client.application_test_case_outputs.batch(
    items=[
        dict(
            test_case_id=flexible_test_case.id, # specify all the usual test case output fields
            evaluation_dataset_version_num=1,
            application_variant_id=application_variant_id,
            output=dict( # The output here is a dictionary, so you can have multiple outputs. Here we have 6!
                generation_output={
                    "string_output": "string",
                    "number_output": 100.101,
                    "messages_output": [{"role": "user", "content": "..."}],
                    "chunks_output": [{"text": "...", "metadata": {"meta_key": "..."}}],
                    "list_output": [1, "hello"],
                    "json_object_output": {"hello": "world"}
                }
            )
        )
    ]
)
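
When there are many test cases, the `items` list can be assembled in a loop. The sketch below is only an illustration: `build_output_items` is a hypothetical helper, and the IDs in the usage example are placeholders. It uses only the field names shown in the example above and makes no SDK calls.

```python
from typing import Any, Dict, List


def build_output_items(
    outputs_by_test_case_id: Dict[str, Dict[str, Any]],
    application_variant_id: str,
    dataset_version_num: int = 1,
) -> List[Dict[str, Any]]:
    """Build one batch item per test case from a mapping of ID to flexible output."""
    items = []
    for test_case_id, generation_output in outputs_by_test_case_id.items():
        # Flexible outputs are dictionaries of named values, not bare strings
        if not isinstance(generation_output, dict):
            raise TypeError(f"output for {test_case_id!r} must be a dictionary")
        items.append(
            {
                "test_case_id": test_case_id,
                "evaluation_dataset_version_num": dataset_version_num,
                "application_variant_id": application_variant_id,
                "output": {"generation_output": generation_output},
            }
        )
    return items


# Placeholder IDs, for illustration only
items = build_output_items(
    {
        "tc_1": {"string_output": "first answer", "number_output": 1},
        "tc_2": {"string_output": "second answer", "number_output": 2},
    },
    application_variant_id="variant_123",
)
print(len(items))
```

The resulting list could then be passed as `items=items` to the batch call shown above.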

Attaching Traces to Outputs

While having multiple inputs and outputs helps, many complex or agentic AI applications have multiple intermediate steps (e.g. reasoning, retrieval, tool use) that are crucial to evaluate so we can understand what’s happening inside our application. Attaching traces to test case outputs allows us to record all of these intermediate steps.

A trace keeps a record of the inputs and outputs of every step as your application executes. Each step’s operation input and operation output must be a dictionary of string keys to values of type string, number, messages, etc., just like the input to flexible test cases.

from datetime import datetime, timedelta
start = datetime.now()
trace = [
  {
    "node_id": "tool_call", # an ID that identifies what this step is
    "operation_input": {"string_input": flexible_test_case.test_case_data["input"]["string_input"]},
    "operation_output": {
      "initial_plan": "I need to analyze how the corporate debt cycle in the U.S. interacts with the overall business cycle by reviewing fundamental economic theory and models.\n\neconomic_textbook_search_tool(query='How does the corporate debt cycle relate to the business cycle?')",
    },
    "start_timestamp": start.isoformat(),
    "duration_ms": 1000
  },
  {
    "node_id": "completion",
    "operation_type": "COMPLETION", # tag that this is a completion operation. Note that this is optional; if not specified, then will default to "CUSTOM"
    "operation_input": {
        "initial_plan": "I need to consult foundational economic theory to understand the relationship between corporate debt cycles and business cycles.\n\neconomic_textbook_search_tool(query='How does the corporate debt cycle relate to the business cycle?')",
    },
    "operation_output": {
        "string_output": "string",
        "number_output": 100.101,
        "messages_output": [{"role": "user", "content": "..."}],
        "chunks_output": [{"text": "...", "metadata": {"metadata": "..."}}],
        "list_output": [1, "hello"],
        "json_object_output": {"hello": "world"}
    },
    "start_timestamp": (start + timedelta(milliseconds=1000)).isoformat(),
    "duration_ms": 1000
  },
]

This is how you can attach the trace to a test case output:

test_case_outputs = sgp_client.application_test_case_outputs.batch(
    items=[
        dict(
            test_case_id=flexible_test_case.id,
            evaluation_dataset_version_num=1,
            application_variant_id=application_variant_id,
            output=dict(  # the flexible output, same shape as before
              generation_output={
                "string_output": "string",
                "number_output": 100.101,
                "messages_output": [{"role": "user", "content": "..."}],
                "chunks_output": [{"text": "...", "metadata": {...}}],
                "list_output": [1, "hello"],
                "json_object_output": {"hello": "world"}
              }
            ),
            trace_spans=trace,  # attach the trace defined above
        )
    ]
)

Typically you would want to generate traces automatically in your external application. Here’s what that would look like building on the External Applications guide:

from typing import Any, Dict
from datetime import datetime
from scale_gp.lib.external_applications import ExternalApplicationOutputCompletion, ExternalApplication

def my_app(input: Dict[str, Any]):
    trace = []

    start = datetime.now()
    input_string = input["string_input"]
    output_tool_call = ... # do something here
    trace.append({
        "node_id": "tool_call",
        "operation_input": {
            "string_input": input_string,
        },
        "operation_output": {
            "initial_plan": output_tool_call,
        },
        "operation_expected": { # optionally provide an expected result for this operation
            # the "initial_plan" key is an assumption, chosen to mirror operation_output
            "initial_plan": "economic_textbook_search_tool(query='How does the corporate debt cycle relate to the business cycle?')"
        },
        "start_timestamp": start.isoformat(),
        # duration_ms matches the field used in the hand-written trace earlier in this guide
        "duration_ms": int((datetime.now() - start).total_seconds() * 1000)
    })

    start = datetime.now()
    output_completion = ... # do something here
    trace.append({
        "node_id": "completion",
        "operation_type": "COMPLETION",
        "operation_input": {
            "initial_plan": output_tool_call
        },
        "operation_output": output_completion,
        "start_timestamp": start.isoformat(),
        "duration_ms": int((datetime.now() - start).total_seconds() * 1000)
    })

    return ExternalApplicationOutputCompletion(
        generation_output=output_completion,
        trace_spans=trace,
    )

external_app_helper = ExternalApplication(sgp_client).initialize(application_variant_id=application_variant_id, application=my_app)

# This will run the application, automatically generate traces and upload the results to SGP
external_app_helper.generate_outputs(evaluation_dataset_id=flexible_evaluation_dataset.id, evaluation_dataset_version=1)
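
The timing and bookkeeping in the function above repeats for every step, so it can be factored into a small helper. The following is a sketch under assumptions: `TraceRecorder` is a hypothetical class, not part of `scale_gp`, and it emits the same span fields as the hand-written trace earlier in this guide (`node_id`, `operation_type`, `operation_input`, `operation_output`, `start_timestamp`, `duration_ms`).

```python
import time
from contextlib import contextmanager
from datetime import datetime
from typing import Any, Dict, Iterator, List


class TraceRecorder:
    """Hypothetical helper that collects trace spans as plain dictionaries."""

    def __init__(self) -> None:
        self.spans: List[Dict[str, Any]] = []

    @contextmanager
    def span(
        self,
        node_id: str,
        operation_input: Dict[str, Any],
        operation_type: str = "CUSTOM",
    ) -> Iterator[Dict[str, Any]]:
        """Time the enclosed block and record it as one span."""
        started = time.perf_counter()
        span: Dict[str, Any] = {
            "node_id": node_id,
            "operation_type": operation_type,
            "operation_input": operation_input,
            "operation_output": {},
            "start_timestamp": datetime.now().isoformat(),
        }
        try:
            yield span  # the caller fills in span["operation_output"]
        finally:
            # The span is recorded even if the enclosed block raises
            span["duration_ms"] = int((time.perf_counter() - started) * 1000)
            self.spans.append(span)


recorder = TraceRecorder()

with recorder.span("tool_call", {"string_input": "question"}) as step:
    step["operation_output"] = {"initial_plan": "search the textbook"}

with recorder.span("completion", {"initial_plan": "search the textbook"}, "COMPLETION") as step:
    step["operation_output"] = {"string_output": "answer"}

trace = recorder.spans
print([s["node_id"] for s in trace])
```

Inside an application function, `recorder.spans` would take the place of the manually built `trace` list.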

Attaching Custom Metrics to Outputs

📘 Note that custom metrics can be used for any external app — you don’t need a flexible evaluation dataset or traces

You can also attach custom metrics to outputs. Metrics are numerical values that can be used to record, e.g., how many tokens it took to generate an output or calculated evaluation metrics like F1 or BLEU scores.

Metrics can be passed as a dictionary mapping a string key to a numeric value:

test_case_outputs = sgp_client.application_test_case_outputs.batch(
  items=[
    dict(
      test_case_id=flexible_test_case.id,
      # ... specify all the usual test case output fields
      metrics={
        "tokens_used": 1234,
        "ROUGE": 0.8,
      }
    ),
    dict(
      test_case_id=other_test_case.id, # another test case from the same dataset
      # ... specify all the usual test case output fields
      metrics={
        "tokens_used": 1950,
        "ROUGE": 0.94,
      }
    )
  ]
)
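
As an illustration of a calculated metric, the sketch below computes a whitespace-token F1 score between an output and an expected output and packs it into a metrics dictionary. The function names and metric keys (`token_f1`, `build_metrics`, `output_words`) are made up for this example; any string key with a numeric value follows the same pattern.

```python
from collections import Counter
from typing import Dict


def token_f1(prediction: str, reference: str) -> float:
    """F1 over lowercased whitespace tokens, counting repeated tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings match perfectly; one empty string matches nothing
        return float(pred_tokens == ref_tokens)
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def build_metrics(prediction: str, reference: str) -> Dict[str, float]:
    """Return a metrics dictionary mapping string keys to numeric values."""
    return {
        "token_f1": token_f1(prediction, reference),
        "output_words": len(prediction.split()),
    }


metrics = build_metrics("the cat", "the cat sat")
print(metrics)
```

The returned dictionary could be passed as the `metrics` field of a batch item, as in the example above.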

You can create an evaluation to see metrics on the Metrics tab or the Table tab.

Note that you can configure how metrics are aggregated on the Metrics tab and filter by metric values on the Table tab.

Create a custom Annotations UI

By default, the annotation UI which annotators see in SGP shows the test case input, expected output, and output.

However, for complex evaluations you may want to:

  • display data from the trace
  • select which parts of test case inputs and test case outputs to display
  • modify the layout of the annotation UI

To customize the configuration of the annotation UI, see Create a custom Annotations UI.

Flexible Evaluation Limitations

Flexible evaluations are currently only available for human evaluations. Auto-evaluations cannot be used with flexible datasets or annotation configurations. Flexible evaluations are currently only supported for external applications and can only be triggered via the SDK. Support for native SGP applications in flexible evaluations will be added soon.

What’s Next

To get started building with flexible evaluations, take a look at the Flexible Evaluation recipe.