📘 Before you dive into the details: You may want to look at the Flexible Evaluation Recipe or the Simple Flexible Evaluation Guide to get a feel for how flexible evaluation can be used. To understand when to use flexible evaluation, see Why Use Flexible Evaluation.
schema_type="FLEXIBLE"
:
schema_type="GENERATION"
) datasets can only have strings as input
and expected_output
. Flexible evaluation datasets allow for input
and expected_output
to be a dictionary where each key is a string and each value is one of the following:
"role"
: "user"
, "assistant"
, or "system"
"content"
: string"text"
: string"metadata"
: dictionary of strings to any JSON value{"key": "value"}
{"key": [{"nested": {"hello": "world"}}]}
📘 Note that custom metrics can be used for any external app — you don’t need a flexible evaluation dataset or tracesYou can also attach custom metrics to outputs. Metrics are numerical values that can be used to record, e.g., how many tokens it took to generate an output or calculated evaluation metrics like F1 or BLEU scores. Metrics can be passed as a dictionary mapping a string key to a numeric value: