# Introduction to Next Gen Evals

> A simplified and extendable update to SGP's evaluation framework

## Overview

Evaluations have been overhauled in the upcoming iteration of SGP to decrease overhead while increasing capabilities for both UI and SDK users. The API is currently available in the latest version of SGP, and access to the new UI is available behind a feature flag, which can be enabled for an account and/or deployment on request.

At the most basic level, an evaluation now consists of:

1. **Data**: Unstructured, user-defined data, optionally in the form of a reusable dataset.
2. **Tasks**: A set of instructions to produce results based on that data. For example, a chat completion task based on an `input` field within the user-defined data, or a human annotation task to produce the expected output. Tasks can be interdependent and can be composed to make an evaluation as simple or as complex as your use case requires.

## Creation

<Info>Examples in this guide use the [`scale-gp-beta`](https://pypi.org/project/scale-gp-beta/) package, which runs exclusively on the V5 API.</Info>

### Simple evaluation creation via SDK

The following example illustrates how to create a simple evaluation through the SDK:

```python {8-9, 12-24} theme={null}
from scale_gp_beta import SGPClient

client = SGPClient(account_id=..., api_key=...)

my_evaluation = client.evaluations.create(
  name="My Example Evaluation",
  data=[
    { "input": "How does NextGen eval on SGP work?" },
    { "input": "How do I define tasks?" },
  ],
  tasks=[
    {
      "task_type": "chat_completion",
      "alias": "output",
      "configuration": {
        "model": "openai/gpt-4o",
        "messages": [
          {
            "role": "user",
            "content": "item.input"
          }
        ]
      }
    }
  ]
)
```

In this case, our `data` is made up of two items, each with an `input` field. A single task is specified to generate a chat completion from each data item.
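When several evaluations share the same task shape, the task dictionary can be built by a small helper. The following is a minimal sketch, not part of the SDK: `chat_completion_task` is a hypothetical name, and it only reproduces the dictionary structure shown in the example above.

```python
from typing import Any, Dict


def chat_completion_task(model: str, content: str, alias: str = "output") -> Dict[str, Any]:
    """Build a task dict shaped like the chat_completion task shown above.

    Hypothetical helper for illustration; it is not provided by scale-gp-beta.
    """
    return {
        "task_type": "chat_completion",
        "alias": alias,
        "configuration": {
            "model": model,
            # A single user message; `content` may be an item reference
            # such as "item.input".
            "messages": [{"role": "user", "content": content}],
        },
    }


task = chat_completion_task("openai/gpt-4o", "item.input")
print(task["configuration"]["messages"][0]["content"])  # prints: item.input
```

The resulting dictionary can be passed as an element of the `tasks` list in `client.evaluations.create`.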

### Evaluation with files creation via SDK

Files can be associated with evaluations for use in multimodal tasks or when your evaluation items need to reference uploaded content. First upload your files, then include them in the evaluation:

<Warning>
  Evaluations and datasets that contain files must be created via the API or SDK. This functionality is not currently supported in the UI.
</Warning>

```python theme={null}
from scale_gp_beta import SGPClient

client = SGPClient(account_id=..., api_key=...)

# Create evaluation with files
my_multimodal_evaluation = client.evaluations.create(
  name="My Multimodal Evaluation",
  data=[
    { "input_column": "This is the first row of the evaluation" },
    { "input_column": "This is the second row of the evaluation" },
  ],
  files=[
    {"file_column1": "sample_file_id_1", "file_column2": "sample_file_id_2"},
    {"file_column1": "sample_file_id_3", "file_column2": "sample_file_id_4"},
  ],
  tasks=[
    {
      "task_type":"contributor_evaluation.question",
      "alias":"Mispronunciation timestamps_61142007",
      "configuration":{
        "question_id":"61142007-bd18-48a4-b034-a0670a4a23f5",
        "queue_id":"default",
        "layout":{
            "direction":"column",
            "children":[
              {
                  "data":"item.input_column"
              },
              {
                  "data":"item.files.file_column1"
              }
            ]
        },
        "required":false
      }
    }
  ]
)
```

In this example, files that have already been uploaded are referenced in the `files` array, which maps one-to-one onto the `data` array. In the resulting evaluation, the first item from `data` and the first item from `files` together form the first row, the second items form the second row, and so on.
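The positional pairing of `data` and `files` can be pictured with plain Python. This is an illustrative sketch only: `pair_rows` is a hypothetical helper, the row shape it returns is an assumption made for the illustration, and the length check is a local sanity check rather than documented API behavior.

```python
from typing import Any, Dict, List


def pair_rows(data: List[Dict[str, Any]], files: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Pair each data entry with the files entry at the same index.

    Hypothetical helper; the {"data": ..., "files": ...} row shape is an
    assumption used here only to visualize the mapping.
    """
    if len(data) != len(files):
        raise ValueError("data and files must have the same number of entries")
    return [{"data": d, "files": f} for d, f in zip(data, files)]


rows = pair_rows(
    data=[
        {"input_column": "This is the first row of the evaluation"},
        {"input_column": "This is the second row of the evaluation"},
    ],
    files=[
        {"file_column1": "sample_file_id_1", "file_column2": "sample_file_id_2"},
        {"file_column1": "sample_file_id_3", "file_column2": "sample_file_id_4"},
    ],
)
print(rows[0]["files"]["file_column1"])  # prints: sample_file_id_1
```

Running a check like this before calling `client.evaluations.create` can catch a mismatch between the two arrays early.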

## Data

Each object within the `data` field can be thought of as a row in a table. We refer to these rows as items, and they can be retrieved like so:

```python theme={null}
client.evaluation_items.list(
  evaluation_id=my_evaluation.id
)
```

<Expandable title="example output" defaultOpen={true}>
  ```python {5,11} theme={null}
  [
    {
      "id": "0dc5611a-...",
      "object": "evaluation.item",
      "data": { "input": "How does NextGen eval on SGP work?" },
      ...
    },
    {
      "id": "1549a01e-...",
      "object": "evaluation.item",
      "data": { "input": "How do I define tasks?" },
      ...
    }
  ]
  ```
</Expandable>

Optionally, a dataset can be created for reusability and comparison across different evaluations:

```python {1-7, 11} theme={null}
my_dataset = client.datasets.create(
  name="My Example Dataset",
  data=[
    { "input": "How does NextGen eval on SGP work?" },
    { "input": "How do I define tasks?" }
  ]
)

my_evaluation = client.evaluations.create(
  name="My Example Evaluation",
  dataset_id=my_dataset.id,
  tasks=[
    ...
  ]
)
```
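One way to use a shared dataset for comparison is to create one evaluation per model over the same `dataset_id`. The sketch below assumes only the `client.evaluations.create` arguments already shown in this guide (`name`, `dataset_id`, `tasks`). The function name `create_evaluation_per_model` is hypothetical, and any model identifier other than `openai/gpt-4o` is an assumption about what your deployment supports.

```python
from typing import Any, List


def create_evaluation_per_model(client: Any, dataset_id: str, models: List[str]) -> List[Any]:
    """Create one evaluation per model, all backed by the same dataset.

    Hypothetical helper; `client` is expected to be an SGPClient instance.
    """
    evaluations = []
    for model in models:
        evaluation = client.evaluations.create(
            name=f"My Example Evaluation ({model})",
            dataset_id=dataset_id,
            tasks=[
                {
                    "task_type": "chat_completion",
                    "alias": "output",
                    "configuration": {
                        "model": model,
                        "messages": [{"role": "user", "content": "item.input"}],
                    },
                }
            ],
        )
        evaluations.append(evaluation)
    return evaluations


# Example call (requires a configured client and an existing dataset):
# create_evaluation_per_model(client, my_dataset.id, ["openai/gpt-4o"])
```

Because every evaluation reads from the same dataset, their results can be compared against the same set of items.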

## Tasks

Tasks can be thought of as instructions to produce "results" for each item within the evaluation. They run asynchronously after the evaluation is created. Each task has three parts:

<ResponseField name="task_type" type="enum" required>
  The type of task to run. A list of supported types can be found [here](/reference/v5/evaluations/create-evaluation#body-tasks).
</ResponseField>

<ResponseField name="configuration" type="object" required>
  Used to specify `task_type` specific parameters, such as the `messages` field for a `chat_completion` task.

  <Accordion title="Referencing item data">
    A string following the format `"item.<field>"`, hereafter referred to as an `ItemLocator`, can be used within a configuration to indicate that the value should be pulled from the evaluation item at task execution time.

    For example, in the first code snippet we defined a `chat_completion` task that used an `ItemLocator` as the `content` field for a user message:

    ```python {6} theme={null}
    {
      "model": "openai/gpt-4o",
      "messages": [
        {
          "role": "user",
          "content": "item.input"
        }
      ]
    }
    ```

    When the task is ultimately executed, a chat completion request will be made for each evaluation item with the `content` field populated based on the item's `input` field.

    Data can also be referenced within a string by wrapping an `ItemLocator` in double curly braces:

    ```python {6} theme={null}
    {
      "model": "openai/gpt-4o",
      "messages": [
        {
          "role": "user",
          "content": "Answer the following question: {{item.input}}"
        }
      ]
    }
    ```
  </Accordion>
</ResponseField>

<ResponseField name="alias" type="string" default="<task_type>">
  Aliases are an optional way to specify the field name that will contain the task's result for each evaluation item. By default, this will be the `task_type`.
</ResponseField>
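To make the `ItemLocator` and `alias` behavior described above more concrete, here is a local sketch of how a configuration might be resolved against a single item. It is not SGP's implementation: the function names, the regular expressions, and the support for dotted paths are all assumptions made for illustration.

```python
import re
from typing import Any, Dict

# Assumed locator syntax: "item." followed by a field name or dotted path.
_LOCATOR = re.compile(r"item\.[A-Za-z0-9_.]+")
_TEMPLATE = re.compile(r"\{\{\s*(item\.[A-Za-z0-9_.]+)\s*\}\}")


def lookup(locator: str, item: Dict[str, Any]) -> Any:
    """Follow an 'item.<field>' path into the item dict."""
    value: Any = item
    for key in locator.split(".")[1:]:  # skip the leading "item"
        value = value[key]
    return value


def resolve(value: Any, item: Dict[str, Any]) -> Any:
    """Recursively substitute ItemLocators inside a configuration value."""
    if isinstance(value, dict):
        return {k: resolve(v, item) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve(v, item) for v in value]
    if isinstance(value, str):
        if _LOCATOR.fullmatch(value):
            # The whole string is a locator: return the raw item value.
            return lookup(value, item)
        # Otherwise fill in any {{item.<field>}} placeholders in the string.
        return _TEMPLATE.sub(lambda m: str(lookup(m.group(1), item)), value)
    return value


def result_field(task: Dict[str, Any]) -> str:
    """Name of the result field: the alias if given, else the task_type."""
    return task.get("alias", task["task_type"])


item = {"input": "How do I define tasks?"}
config = {
    "model": "openai/gpt-4o",
    "messages": [
        {"role": "user", "content": "Answer the following question: {{item.input}}"}
    ],
}
resolved = resolve(config, item)
print(resolved["messages"][0]["content"])
# prints: Answer the following question: How do I define tasks?
print(result_field({"task_type": "chat_completion"}))  # prints: chat_completion
```

A bare locator returns the item value unchanged, while a templated string always produces a string. This distinction is a choice made in the sketch and may differ from how the service behaves.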
