Overview

Evaluations have been overhauled in the upcoming iteration of SGP to decrease overhead while increasing capabilities for both UI and SDK users. The API is currently available in the latest version of SGP, and access to the new UI is available behind a feature flag, which can be enabled for an account and/or deployment on request.

At the most basic level, an evaluation now consists of:

  1. Data: Unstructured, user-defined data, optionally provided as a reusable dataset.
  2. Tasks: A set of instructions that produce results from that data, for example a chat completion task based on an input field within the user-defined data, or a human annotation task to produce the expected output. Tasks can be interdependent and composed to build an evaluation as simple or as complex as your use case requires.

Creation

Examples in this guide use the scale-gp-beta package, which runs exclusively on the V5 API.

The following example illustrates how to create a simple evaluation through the SDK:

from scale_gp_beta import SGPClient

client = SGPClient(account_id=..., api_key=...)

my_evaluation = client.evaluations.create(
  name="My Example Evaluation",
  data=[
    { "input": "How does NextGen eval on SGP work?" },
    { "input": "How do I define tasks?" },
  ],
  tasks=[
    {
      "task_type": "chat_completion",
      "alias": "output",
      "configuration": {
        "model": "openai/gpt-4o",
        "messages": [
          {
            "role": "user",
            "content": "item.input"
          }
        ]
      }
    }
  ]
)

In this case, our data is made up of two items, each with an input field. A single task is specified to generate a chat completion from each data item.

Data

Each object within the data field can be thought of as a row in a table. We refer to these rows as items, and they can be retrieved like so:

client.evaluation_items.list(
  evaluation_id=my_evaluation.id
)
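
Each returned item carries the user-defined data it was created from, along with any task results produced so far. The following is a minimal sketch of iterating over those items; the data attribute and field access shown here are assumptions about the response shape, so check the SDK reference for the exact attribute names:

items = client.evaluation_items.list(
  evaluation_id=my_evaluation.id
)

# Inspect each item's original data.
# NOTE: the `data` attribute is an assumed field name, not confirmed by this guide.
for item in items:
  print(item.data["input"])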

Optionally, a dataset can be created for reusability and comparison across different evaluations:

my_dataset = client.datasets.create(
  name="My Example Dataset",
  data=[
    { "input": "How does NextGen eval on SGP work?" },
    { "input": "How do I define tasks?" }
  ]
)

my_evaluation = client.evaluations.create(
  name="My Example Evaluation",
  dataset_id=my_dataset.id,
  tasks=[
    ...
  ]
)
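
Because a dataset exists independently of any single evaluation, the same dataset_id can be reused across evaluations, for example to compare two models on identical data. A minimal sketch, reusing the task configuration from the earlier example (the gpt-4o-mini model name is purely illustrative):

# Run the same dataset against two different models for comparison.
baseline_eval = client.evaluations.create(
  name="Baseline (gpt-4o)",
  dataset_id=my_dataset.id,
  tasks=[
    {
      "task_type": "chat_completion",
      "alias": "output",
      "configuration": {
        "model": "openai/gpt-4o",
        "messages": [{ "role": "user", "content": "item.input" }]
      }
    }
  ]
)

candidate_eval = client.evaluations.create(
  name="Candidate (gpt-4o-mini)",
  dataset_id=my_dataset.id,  # same dataset, different model
  tasks=[
    {
      "task_type": "chat_completion",
      "alias": "output",
      "configuration": {
        "model": "openai/gpt-4o-mini",
        "messages": [{ "role": "user", "content": "item.input" }]
      }
    }
  ]
)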

Tasks

Tasks can be thought of as instructions that produce results for each item within the evaluation; they run asynchronously after the evaluation is created. Each task has three parts:

task_type
enum
required

The type of task to run (a list of supported types can be found here)

configuration
object
required

Specifies task_type-specific parameters, such as the messages field for a chat_completion task.

alias
string
default:"<task_type>"

Aliases are an optional way to specify the field name that will contain the task’s result for each evaluation item. By default, this will be the task_type.
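
To make the alias behavior concrete, the sketch below defines two tasks: the first relies on the default alias (the task_type, here chat_completion), while the second sets an explicit alias so the two results are stored under different field names on each item. Model names are illustrative only.

tasks=[
  # No alias: the result for each item is stored under the task_type,
  # i.e. in a "chat_completion" field.
  {
    "task_type": "chat_completion",
    "configuration": {
      "model": "openai/gpt-4o",
      "messages": [{ "role": "user", "content": "item.input" }]
    }
  },
  # Explicit alias: this task's result is stored under "mini_output",
  # so the two results do not collide.
  {
    "task_type": "chat_completion",
    "alias": "mini_output",
    "configuration": {
      "model": "openai/gpt-4o-mini",
      "messages": [{ "role": "user", "content": "item.input" }]
    }
  }
]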