Create a Multi-Stage (Flexible) Evaluation
Create a multi-stage evaluation for an offline application
https://pypi.org/project/scale-gp/
# Prerequisite: pip install -U scale-gp
import os
import time
from typing import Any, Dict, List
from datetime import datetime
from scale_gp import SGPClient
from scale_gp.lib.external_applications import ExternalApplication, ExternalApplicationOutputCompletion
from scale_gp.types.evaluation_datasets.test_case import TestCase
from scale_gp.types.application_test_case_output_batch_params import ItemTraceSpan
from scale_gp.types.evaluation_datasets.test_case_batch_params import Item
Fetch your API Key from: https://gp.scale.com/admin/api-key
Fetch your Account ID from: https://gp.scale.com/admin/accounts
All resources you interact with using this client will belong to this account.
Note: If you are using your own VPC-deployed version of Scale GP, you will have a different endpoint_url. For users of our multi-tenant platform, use https://gp.scale.com
api_key = os.environ["SGP_API_KEY"]  # illustrative env var; fetch the key from https://gp.scale.com/admin/api-key
account_id = os.environ["SGP_ACCOUNT_ID"]  # illustrative env var; fetch the ID from https://gp.scale.com/admin/accounts

client = SGPClient(api_key=api_key)
An evaluation dataset is a set of test cases used to evaluate or benchmark the performance of an application.
Here, we create an evaluation dataset with schema_type FLEXIBLE. This indicates that the dataset is capable of receiving a variety of data types.

The supported types for test case input and output are as follows:

- Boolean
- Messages: array of dictionaries with "role" and "content" keys
- Dictionary
- Array (entries can contain Booleans, Floats, Integers, Dictionaries)
- Chunks: array of dictionaries with "text" and "metadata" keys
- Document: String
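For illustration, test case data for the Messages and Chunks types could look something like this (a minimal sketch; the values are made up):

# Hypothetical test case data shapes for the FLEXIBLE schema
messages_test_case = {
    "input": [{"role": "user", "content": "Who is the first president of the United States?"}],
    "expected_output": "George Washington",
}
chunks_test_case = {
    "input": "Who is the first president of the United States?",
    "expected_output": [
        {"text": "George Washington was the first U.S. president.", "metadata": {"source": "notes.txt"}}
    ],
}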
We then proceed to batch insert a few example test cases into the dataset and create a snapshot.
flexible_evaluation_dataset = client.evaluation_datasets.create(
    account_id=account_id,
    name="flexible_evaluation_dataset",
    schema_type="FLEXIBLE",
    type="manual",
)
TEST_CASES = [
    {
        "input": "Who is the first president of the United States?",
        "expected_output": "George Washington",
    },
    {
        "input": {"What is the name for this many dollars?": 1000000},
        "expected_output": "The number of dollars in a megadollar",
    },
    {
        "input": {
            "question_type": "who",
            "question": "Who is the second president of the United States?",
        },
        "expected_output": {
            "perceived_mode": "declaratory",
            "answer": "John Adams",
        },
    },
]
items: List[Item] = [
    {"account_id": account_id, "test_case_data": test_case} for test_case in TEST_CASES
]
uploaded_test_cases = client.evaluation_datasets.test_cases.batch(
    evaluation_dataset_id=flexible_evaluation_dataset.id,
    items=items,
)
test_case_ids = [test_case.id for test_case in uploaded_test_cases]
# Publish the dataset to create a snapshot (version 1)
client.evaluation_datasets.publish(
    evaluation_dataset_id=flexible_evaluation_dataset.id,
)
Create a question set. These questions will be used by either humans or an LLM to evaluate the results of the test cases. The question set must contain at least one question to be valid.
questions = [
    {
        "type": "categorical",
        "title": "Test Question",
        "prompt": "Test Prompt",
        "choices": [{"label": "No", "value": 0}, {"label": "Yes", "value": 1}],
        "account_id": account_id,
    },
    {
        "type": "free_text",
        "title": "Question only about the document",
        "prompt": "What is the document about?",
        "account_id": account_id,
    },
]
question_ids = []
for question in questions:
    created_question = client.questions.create(**question)
    question_ids.append(created_question.id)
# create a question set using the desired questions
question_set = client.question_sets.create(
    account_id=account_id,
    name="US Presidents Question Set",
    question_ids=question_ids,
)
An evaluation configuration determines which question set and evaluation method (human or LLM) will be used.
evaluation_config = client.evaluation_configs.create(
    account_id=account_id,
    evaluation_type="human",
    question_set_id=question_set.id,
)
If you haven’t yet, create an external application and variant for the application you want to evaluate. If you already have one, you can reuse the IDs you grabbed from your development environment during setup.
You will add the dataset and outputs you want to evaluate for this variant in a later step.
application_spec = client.application_specs.create(
    account_id=account_id,
    description="Test application spec",
    name="test-application-spec" + str(time.time()),
)
external_application_spec = application_spec.id
application_variant = client.application_variants.create(
    account_id=account_id,
    application_spec_id=external_application_spec,
    name="test offline application variant",
    version="OFFLINE",
    description="Test application variant",
    configuration={},
)
external_application_variant = application_variant.id
Next, use the ExternalApplication class and its generator function to add outputs to the application variant you have created. This works almost the same as a normal external application, but it also lets you attach metrics and trace spans. To do this, first define an application function that generates and returns the output, metrics, and trace spans for each prompt and test case in your dataset. Refer to the sample code for an example of what each of these should look like.

Then initialize an external application from the application function and the ID of the selected variant.

Calling “generate_outputs” on this external application instance will add test case outputs to your variant using the “application” function defined earlier.
def application(prompt: Dict[str, Any], test_case: TestCase) -> ExternalApplicationOutputCompletion:
    # Call your application with the test case input here (test_case.test_case_data.input)
    response = "application response from external application\n"
    metrics = {"accuracy": 0.9}  # whatever metrics you want to track
    # Whatever traces you want to track, such as the output of each node inside the application
    trace_spans: List[ItemTraceSpan] = [
        {
            "node_id": "completion",
            "start_timestamp": datetime.now().replace(microsecond=5000),
            "operation_input": {"input": "test input"},
            "operation_output": {"completion_output": "test output"},
            "duration_ms": 1000,
        },
        {
            "node_id": "retrieval",
            "start_timestamp": datetime.now().replace(microsecond=5000),
            "operation_input": {"input": "test input"},
            "operation_output": {"output": "test output"},
            "duration_ms": 1000,
        },
    ]
    return ExternalApplicationOutputCompletion(
        generation_output=response, metrics=metrics, trace_spans=trace_spans
    )
external_application = ExternalApplication(client).initialize(
    application_variant_id=external_application_variant,
    application=application,
)
external_application.generate_outputs(
    evaluation_dataset_id=flexible_evaluation_dataset.id,
    evaluation_dataset_version=1,
)
generated_test_case_outputs = client.application_test_case_outputs.list(
    account_id=account_id,
    application_variant_id=external_application_variant,
    evaluation_dataset_id=flexible_evaluation_dataset.id,
    evaluation_dataset_version_num=1,
)
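To sanity-check that outputs were attached, you can iterate over the listed results (a minimal sketch; the exact attributes available on each output may vary by SDK version):

# Quick sanity check; attribute names beyond `id` are assumptions
for output in generated_test_case_outputs:
    print(output.id)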
After everything is configured, use the “evaluations.create” function to create an evaluation for the application variant, dataset, and configuration created in the previous steps. The evaluation will only be created if each of these resources was configured correctly.
evaluation = client.evaluations.create(
    type="builder",
    account_id=account_id,
    application_spec_id=external_application_spec,
    application_variant_id=external_application_variant,
    description="description",
    evaluation_dataset_id=flexible_evaluation_dataset.id,
    name="Flexible eval",
    evaluation_config_id=evaluation_config.id,
)
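If you want to check on the evaluation programmatically afterwards, a retrieve call along these lines should work (a sketch only: it assumes the SDK exposes evaluations.retrieve and a status field, so verify both against your SDK version):

# Hypothetical status check; confirm the method and field names in your SDK version
retrieved = client.evaluations.retrieve(evaluation.id)
print(retrieved.status)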