By default, the annotation UI that annotators see in SGP shows the test case input, expected output, and output. However, for complex evaluations you may want to:

  • display data from the trace
  • select which parts of test case inputs and test case outputs to display
  • modify the layout of the annotation UI

The Annotation Configuration allows you to do all three.

Here’s what an example annotation configuration looks like:

from scale_gp.lib.types.data_locator import data_locator # this is a helper for producing data_locs

annotation_configuration = dict(
  annotation_config_type="flexible", # this is the default, so we could have omitted annotation_config_type entirely
  direction="row", # or "col"
  components=[ # 2D array representing how things will be laid out in the UI
    [
      dict(data_loc=["test_case_output", "output", "string_output"], label="string output"),
      dict(data_loc=["test_case_data", "expected_output", "string_expected"]),
    ],
    [
      dict(data_loc=data_locator.test_case_output.output["messages_output"]), # the data_locator is an easier way of producing data_locs
    ],
    [
      dict(data_loc=data_locator.trace["tool_call"].input["string_input"]), # reference the "tool_call" node from the trace earlier
    ],
  ],
)

evaluation = sgp_client.evaluations.create(
    account_id=ACCOUNT_ID,
    name="example flexible evaluation",
    description="This is a test evaluation",
    type="builder",
    evaluation_config_id=evaluation_config.id, # you need to create an evaluation config, evaluation dataset, and application spec/variant first
    evaluation_dataset_id=flexible_evaluation_dataset.id,
    application_variant_id=application_variant.id,
    application_spec_id=application.id,
    annotation_config=annotation_configuration,
)

When a contributor annotates this evaluation in the UI, they will see an annotation UI that looks something like this:

Let’s break down how a custom annotation config is set up:

  • annotation_config_type: "flexible" by default. The other types are "summarization" and "multiturn", which make it easier to work with those specific use cases.
  • components: a 2D list of annotation items. Each annotation item points to a location in the test case data, test case output, or trace. When the annotator grades the test case output, they will see data pulled from each location.
    • Each annotation item has a "data_loc" field and an optional "label" field. The "data_loc" is an array that points to where annotation data should be pulled from. The "label" is a name displayed to the annotator for that "data_loc".

      ⚠️ if a "data_loc" points somewhere that doesn't exist for one or more test cases, you will not be able to create the evaluation. A pre-flight check like the sketch after this list can catch this early.

  • direction: "row" by default. Decides whether components are laid out as rows or as columns.
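
To catch bad data_locs before creating the evaluation, you can resolve each one against your test cases locally. The sketch below is hypothetical: validate_annotation_config is not part of the SDK, and it assumes each test case is a nested dict whose keys mirror the data_loc path segments (adjust to your actual data model).

def resolve_data_loc(test_case: dict, data_loc):
    """Walk a data_loc path through nested dicts; raises KeyError if a segment is missing."""
    node = test_case
    for segment in data_loc:
        node = node[segment]  # a KeyError here means this data_loc does not exist for this test case
    return node

def validate_annotation_config(annotation_config: dict, test_cases: list) -> list:
    """Return a list of error strings, one per unresolvable (test case, data_loc) pair."""
    errors = []
    for row in annotation_config["components"]:
        for item in row:
            path = list(item["data_loc"])  # assumes data_locator values iterate as path segments
            for i, test_case in enumerate(test_cases):
                try:
                    resolve_data_loc(test_case, path)
                except (KeyError, TypeError):
                    errors.append(f"test case {i}: cannot resolve {path}")
    return errors

# Usage: run this before sgp_client.evaluations.create and fix any reported paths.
# errors = validate_annotation_config(annotation_configuration, my_test_cases)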

Here are some examples of how different arrangements of components produce different UIs:
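
For a rough sense of what to expect, here are two sketches (the data_locs are placeholders): per the description above, direction controls whether each inner list of components is laid out as a row or as a column.

# One inner list containing two items: with direction="row", the inner list is a
# single row and its two items render side by side.
two_up = dict(
    annotation_config_type="flexible",
    direction="row",
    components=[
        [
            dict(data_loc=["test_case_data", "input"], label="input"),
            dict(data_loc=["test_case_output", "output"], label="output"),
        ],
    ],
)

# The same two items in separate inner lists with direction="col": each inner
# list becomes its own column, so the annotator sees two columns instead.
two_columns = dict(
    annotation_config_type="flexible",
    direction="col",
    components=[
        [dict(data_loc=["test_case_data", "input"], label="input")],
        [dict(data_loc=["test_case_output", "output"], label="output")],
    ],
)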

data_locs can take any of these shapes:

| data_locator Helper | data_loc array | Meaning |
| --- | --- | --- |
| data_locator.test_case_data.input | ["test_case_data", "input"] | Display the entire input from the test case |
| data_locator.test_case_data.input["<input key>"] | ["test_case_data", "input", "<input key>"] | Display a single key from the input |
| data_locator.test_case_data.expected | ["test_case_data", "expected_output"] | Display the entire expected output from the test case |
| data_locator.test_case_data.expected["<expected output key>"] | ["test_case_data", "expected_output", "<expected output key>"] | Display a single key from the expected output |
| data_locator.test_case_output | ["test_case_output", "output"] | Display the entire output from the test case output |
| data_locator.test_case_output["<output key>"] | ["test_case_output", "output", "<output key>"] | Display a single key from the output |
| data_locator.trace["<node id from the trace>"].input | ["trace", "<node id from the trace>", "input"] | Display the entire input from a single part of the trace |
| data_locator.trace["<node id from the trace>"].input["<input key>"] | ["trace", "<node id from the trace>", "input", "<input key>"] | Display a single key from the input of a part of the trace |
| data_locator.trace["<node id from the trace>"].output | ["trace", "<node id from the trace>", "output"] | Display the entire output from a single part of the trace |
| data_locator.trace["<node id from the trace>"].output["<output key>"] | ["trace", "<node id from the trace>", "output", "<output key>"] | Display a single key from the output of a part of the trace |
| data_locator.trace["<node id from the trace>"].expected | ["trace", "<node id from the trace>", "expected"] | Display the entire expected output from a single part of the trace |
| data_locator.trace["<node id from the trace>"].expected["<expected key>"] | ["trace", "<node id from the trace>", "expected", "<expected key>"] | Display a single key from the expected output of a part of the trace |

It is highly recommended that you use the data_locator helper instead of manually creating the data_loc array.
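
As a quick illustration, the two styles below point at the same location, assuming the helper serializes to the same path arrays shown in the table ("question" is a hypothetical input key):

from scale_gp.lib.types.data_locator import data_locator

# The same location expressed both ways. The helper catches typos in the fixed
# segments ("test_case_data", "input", ...) at authoring time, which is why it
# is preferred over hand-written arrays.
via_helper = data_locator.test_case_data.input["question"]  # "question" is a hypothetical input key
via_array = ["test_case_data", "input", "question"]

# Both are valid values for an annotation item's data_loc field:
item_a = dict(data_loc=via_helper, label="question")
item_b = dict(data_loc=via_array, label="question")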

Customizing the Annotation UI per question

Sometimes, certain questions in an evaluation rubric are relevant only to a specific part of the test case, test case output, or trace. For instance, you might ask a question specifically about the “completion” or “reranking” step in the trace.

In that case, you can create a question_id_to_annotation_config mapping that lets you override the annotation config for a specific question ID:

question_id_to_annotation_config = {
    questions[1].id: dict(
        components=[
            [
                dict(
                    data_loc=data_locator.trace["completions"].input,
                    label="string output",
                ),
                dict(
                    data_loc=data_locator.trace["completions"].output
                )
            ],
            [
                dict(
                    data_loc=data_locator.trace["completions"].expected
                ),
            ],
        ],
    )
}

evaluation = sgp_client.evaluations.create(
    ..., # specify all the usual evaluation fields
    annotation_config=annotation_configuration,
    question_id_to_annotation_configuration=question_id_to_annotation_config, # where specified, this overrides the annotation_config
)

In the annotation UI, the rendered information will now change for each evaluation question as mapped above:

Dev Mode

SGP also supports “Dev Mode”, which allows an annotator to view all the inputs, outputs, and the full trace at once. You can toggle Dev Mode by clicking in the top right of the annotation UI: