Multiturn Evaluation
How to create and evaluate a multiturn application
Many GenAI use cases follow a multiturn pattern: users interact with the application in a conversational format, which calls for conversational evaluations. The input to a multiturn evaluation can be a single message or a full conversation.
These patterns are natively supported with Flexible Evaluation runs. In this guide, we will walk through the creation of a multiturn evaluation step by step.
Initialize the SGP client
Follow the instructions in the Quickstart Guide to set up the SGP client. After installing the client, you can import and initialize it as follows:
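The snippet below is a minimal sketch assuming the Python SDK; the environment variable names are placeholders, and the constructor arguments may differ in your SDK version.

```python
import os

from scale_gp import SGPClient

# Read credentials from the environment (placeholder variable names).
client = SGPClient(
    api_key=os.environ["SGP_API_KEY"],
    account_id=os.environ["SGP_ACCOUNT_ID"],
)
```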
Upload Multiturn Dataset
To evaluate your multiturn application, you will need an evaluation dataset with test cases. Each test case should include a list of messages as the conversation input.
Once you have all your test cases ready, you can upload your data via the DatasetBuilder:
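The sketch below assumes a DatasetBuilder helper in the SDK's lib module and a flexible test-case schema with input/expected_output keys; both are assumptions that may differ between SDK versions, so treat the names as placeholders.

```python
from scale_gp.lib.dataset_builder import DatasetBuilder  # assumed import path

# Each test case carries the conversation as a list of role/content messages.
test_cases = [
    {
        "input": {
            "messages": [
                {"role": "user", "content": "What is your return policy?"},
                {"role": "assistant", "content": "You can return most items within 30 days."},
                {"role": "user", "content": "Does that include sale items?"},
            ]
        },
        "expected_output": "Sale items are final sale and cannot be returned.",
    },
]

# Build and upload the dataset in one step, reusing the client from above.
dataset = DatasetBuilder(client).initialize(
    account_id=os.environ["SGP_ACCOUNT_ID"],
    name="multiturn-demo-dataset",
    test_cases=test_cases,
)
print(dataset.id)
```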
Application Setup
Evaluations are tied to an external application variant. You will first need to initialize an external application before you can evaluate your multiturn application.
To create the variant, navigate to the “Applications” page on the SGP dashboard, click Create a new Application, and select External AI as the application template.
You can find the application_variant_id in the top right of the variant page.
Generate responses
You will need to create a function to run inference for your app.
You can now connect your application to your local inference via the code snippet below. The generate_outputs function will run your application on the dataset and upload the responses to SGP.
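This sketch assumes the SDK exposes ExternalApplication helpers for wiring a local function to a variant; the import path, the my_app signature, and the output wrapper are assumptions, so check the SDK reference for the exact names.

```python
from scale_gp.lib.external_applications import (  # assumed import path
    ExternalApplication,
    ExternalApplicationOutputCompletion,
)

def my_app(prompt, test_case):
    # `prompt` carries the conversation input for one test case.
    # Replace this stub with a call to your own model or service.
    response = "stubbed model response"
    return ExternalApplicationOutputCompletion(generation_output=response)

# Bind the local inference function to the variant created earlier, then
# run it over every test case and upload the outputs to SGP.
app = ExternalApplication(client).initialize(
    application_variant_id="<application_variant_id>",  # copied from the dashboard
    application=my_app,
)
app.generate_outputs(
    evaluation_dataset_id=dataset.id,
    evaluation_dataset_version=1,
)
```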
You can verify these outputs by viewing the variant and clicking on the dataset.
Create evaluation
Once your data is uploaded, you are ready to start an evaluation. You will need an evaluation config; visit the recipe for complete code snippets, or see the sketch below.
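As an illustration only, this sketch assumes questions, question_sets, and evaluation_configs resources on the client; the parameter names are assumptions and may not match your SDK version.

```python
# Define a question annotators will answer for each conversation.
question = client.questions.create(
    account_id=os.environ["SGP_ACCOUNT_ID"],
    type="categorical",
    title="Response quality",
    prompt="Was the assistant's final response helpful and accurate?",
    choices=[
        {"label": "Yes", "value": "yes"},
        {"label": "No", "value": "no"},
    ],
)

# Group questions into a question set, then wrap it in an evaluation config.
question_set = client.question_sets.create(
    account_id=os.environ["SGP_ACCOUNT_ID"],
    name="multiturn-quality",
    question_ids=[question.id],
)
evaluation_config = client.evaluation_configs.create(
    account_id=os.environ["SGP_ACCOUNT_ID"],
    question_set_id=question_set.id,
    evaluation_type="human",
)
```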
To present annotators with the full conversation, you can create a custom annotation configuration. SGP has a pre-configured multiturn view that can be pointed at the conversation wherever it lives in your output.
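The structure below, including the annotation_config_type and data_loc fields, is an assumed shape for that configuration; consult the SDK reference for the exact schema.

```python
# Point the built-in multiturn view at the messages in the test case input.
# "data_loc" is assumed to be a path into the test case / output payload.
annotation_config = {
    "annotation_config_type": "multiturn",
    "components": [
        {
            "label": "Conversation",
            "data_loc": ["test_case_data", "input", "messages"],
        }
    ],
}
```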
This configuration produces an annotation view that renders the full conversation for each task.
Lastly, create the evaluation:
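Again a minimal sketch with assumed resource and parameter names:

```python
evaluation = client.evaluations.create(
    account_id=os.environ["SGP_ACCOUNT_ID"],
    name="multiturn-demo-evaluation",
    description="Human evaluation of multiturn conversations",
    application_variant_id="<application_variant_id>",
    evaluation_dataset_id=dataset.id,
    evaluation_config_id=evaluation_config.id,
    annotation_config=annotation_config,  # the custom multiturn view from above
)
print(evaluation.id)
```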
Annotate Tasks
Once you have created your evaluation, you will find a new task queue in the annotation tab.
Annotators will then be presented with the multiturn annotation view, showing the full conversation alongside the configured questions.