Evaluations
Evaluating large language models is intrinsically difficult because of the subjective nature of responses to open-ended requests.
Why Are Evaluations Important?
LLMs used as evaluators are known to suffer from bias, verbosity preference, self-enhancement (favoring answers they generated themselves), and other issues. Many prompts require domain expertise to evaluate accurately. Finally, evaluations are only as good as their evaluation sets: diversity and comprehensiveness are key.
However, getting evaluations right is essential for enterprises. Companies need a way to:
- Reliably test their custom models for regressions and compare different model versions
- Identify new weak spots / blind spots in their models
- Test and verify that a model is ready before it is deployed
Running Evaluations using the SGP Python SDK
To perform an evaluation, the user begins by setting up the foundational elements: the dataset, the Studio project, and the application spec registration. Subsequently, the evaluation tasks, including data generation, annotation, monitoring, analysis, and visualization, are carried out iteratively for each evaluation.
One-Time Setup
- Dataset Creation: To initiate the evaluation process, the user creates a dataset containing various test cases. These test cases serve as input data for the generative AI project and provide diverse scenarios for evaluation.
- Studio Project Setup: In preparation for human annotation, the user establishes a Studio project. This project acts as the platform where human annotators assess the AI-generated responses. Setting up the Studio project is a crucial one-time step to enable evaluations involving human input.
- Application Spec Registration: The user registers some metadata about their generative AI project, providing necessary details such as its name, description, and version. This registration associates the project with the evaluation process, ensuring that its performance can be systematically assessed.
Recurring Evaluation Tasks
- Data Generation and Annotation: For each test case within the dataset, the generative AI project generates responses. Simultaneously, metadata and additional information chunks are collected. This process is repeated for each test case, ensuring a comprehensive evaluation of the project’s capabilities across various inputs.
- Evaluation Monitoring: The system continuously monitors the progress of ongoing evaluations. It checks the status of each evaluation, ensuring that the assessments are being conducted as expected. This monitoring step is repeated for every evaluation initiated by the user.
- Analysis and Visualization: Post-evaluation, the user retrieves the results and conducts an in-depth analysis. Performance metrics are processed and organized, allowing for comparisons across different evaluations or specific test cases. This analysis provides valuable insights into the project’s strengths and areas for improvement, contributing to informed decision-making. The SGP Python SDK provides a simple yet powerful set of tools to create these evaluations; a compact sketch of the full loop follows this list.
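Taken together, the one-time setup and the recurring tasks compose into a single loop. The sketch below shows only that overall shape; every import path, client attribute, method name, parameter, and status value in it is an assumption about the SDK surface, and each step is covered individually in the workflow sections that follow.

```python
import os
import time

from scale_gp import SGPClient  # assumed import path and client class name

client = SGPClient(api_key=os.environ["SGP_API_KEY"])  # assumed constructor signature

# --- One-time setup (all method and parameter names are assumptions) ---
evaluation_dataset = client.evaluation_datasets.create_from_file(
    name="support-bot-regression-set",
    schema_type="GENERATION",
    file_path="test_cases.csv",
)
studio_project = client.studio_projects.create(name="support-bot-response-quality")
application_spec = client.application_specs.create(name="support-bot", version="v2")


def my_application(prompt: str) -> str:
    """Stand-in for the customer application under evaluation."""
    return "generated answer for: " + prompt


# --- Recurring evaluation tasks ---
evaluation = client.evaluations.create(
    application_spec_id=application_spec.id,
    evaluation_dataset_id=evaluation_dataset.id,
    annotation_config={"studio_project_id": studio_project.id},
)
for test_case in client.evaluation_datasets.test_cases.list(evaluation_dataset.id):
    client.evaluations.test_case_results.create(
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        output=my_application(test_case.input),
    )

# Monitor progress, then retrieve results for analysis and visualization.
while client.evaluations.get(evaluation.id).status != "COMPLETED":
    time.sleep(60)
results = client.evaluations.test_case_results.list(evaluation_id=evaluation.id)
```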
End-to-End Evaluation Example Workflow
Basic Setup
First, however, we need to run a small amount of boilerplate:
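A minimal sketch of that boilerplate is shown below. The package name, client class, constructor signature, and environment variable are assumptions; check them against the version of the SGP SDK you have installed.

```python
import os

# Assumed import path and client class name for the SGP Python SDK.
from scale_gp import SGPClient

# Read the API key from an environment variable rather than hard-coding it.
# The constructor signature shown here is an assumption.
client = SGPClient(api_key=os.environ["SGP_API_KEY"])
```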
Evaluation Dataset
An evaluation dataset is a list of test cases that users want to benchmark their project’s performance on. A user creates an evaluation dataset by uploading a CSV file and specifying the schema of the dataset; the schema is specified at creation time.

At first, the only schema we will support is a CSV file with a required input column and optional expected_output and expected_extra_info columns, where input refers to the end-user prompt, expected_output refers to the expected output of the AI project, and expected_extra_info refers to any additional information that the AI project should have used to generate the output. This schema definition is designated as a GENERATION schema. Because we don’t flatten the columns in our Postgres database, this design allows for flexible schema definitions, and additional schema types can be supported in the future as needed.

Defining this schema also lets users understand how to read the dataset and what to expect in each row, which makes it easy to pull datasets they did not create and use them in standardized processing scripts.
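For example, creating a GENERATION-schema dataset from a CSV file might look like the sketch below; the `create_from_file` method, its parameter names, and the returned fields are assumptions rather than the definitive SDK API.

```python
# test_cases.csv contains the columns described above:
#   input (required), expected_output (optional), expected_extra_info (optional)
# The method name, parameters, and schema_type value are assumptions.
evaluation_dataset = client.evaluation_datasets.create_from_file(
    name="support-bot-regression-set",
    schema_type="GENERATION",
    file_path="test_cases.csv",
)
print(f"Created dataset {evaluation_dataset.id}")
```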
Application
This is simply a metadata entry that describes an end-user application. It is useful as a grouping mechanism, allowing users to relate multiple evaluations to a single user application. It also allows the user to specify the detailed state and components of the current application, e.g. which retrieval components, if any, were used in conjunction with the LLM.
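Registering the application spec could look like the following sketch; the method and field names are assumed for illustration.

```python
# Metadata describing the end-user application.
# Method and field names are assumptions.
application_spec = client.application_specs.create(
    name="support-bot",
    description="RAG chatbot over the internal knowledge base",
    version="v2",
)
```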
Annotation Project
Registers an annotation project using Scale Studio. This is only used if Scale Studio is the platform used for annotations. Learn more about Scale Studio.
Sometimes evaluations require that external resources be created. Because these platforms do not need to share any properties, it makes the most sense to regard them as entirely separate components. For example, since the primary annotation platform we will use is Studio, we allow users to create Studio projects using SGP APIs and store references to these projects in our data tables. This way, once an admin user creates a Studio project and adds taskers to it in the Studio UI, developers can create evaluations that send evaluation tasks to centralized projects.
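Creating a Studio project via the SDK and keeping a reference to it might look like this sketch; `studio_projects.create` and its parameters are assumed names.

```python
# Sketch of creating a Studio annotation project -- method and parameter names
# are assumptions. Taskers are added to the project separately in the Studio UI.
studio_project = client.studio_projects.create(
    name="support-bot-response-quality",
    description="Human review of support-bot answers",
)
```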
Evaluation
This refers to the action of sending off a batch of tasks to evaluate a specific iteration of a user application. It contains references to the application’s current state, the id of the application this evaluation is for, the status of the evaluation, and any configuration needed for the annotation mechanism (e.g. Studio task project id and questions, Auto LLM model name and questions, client-side function name, etc.). Each evaluation references a dataset: the customer application iterates through each row in the dataset, reads its input value, and generates an output for it.
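Putting this together, launching an evaluation and submitting one generated output per dataset row might look like the sketch below; the nested resource paths, the `annotation_config` shape, and the `tags` field are all assumptions chosen to mirror the description above, and `my_application` is a hypothetical stand-in for your own code.

```python
def my_application(prompt: str) -> str:
    """Stand-in for the customer application under evaluation."""
    return "generated answer for: " + prompt


# Create the evaluation. All method, parameter, and config field names are assumptions.
evaluation = client.evaluations.create(
    application_spec_id=application_spec.id,
    evaluation_dataset_id=evaluation_dataset.id,
    # Configuration for the annotation mechanism: here a Studio project plus the
    # questions annotators should answer for each generated output.
    annotation_config={
        "studio_project_id": studio_project.id,
        "questions": ["Is the answer factually correct?", "Is the answer complete?"],
    },
    # Snapshot of the application state that produced these outputs.
    tags={"retrieval": "hybrid-bm25+embeddings", "model": "support-bot-v2"},
)

# Iterate through every row of the dataset, generate an output for its input
# value, and attach the result to the evaluation.
for test_case in client.evaluation_datasets.test_cases.list(evaluation_dataset.id):
    client.evaluations.test_case_results.create(
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        output=my_application(test_case.input),
    )
```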
Get Test Case Results
Each evaluation consists of test case results where the results of all test case evaluations are stored. By looking at a single test case result, you can see which test case the result is for, the dataset that test case belonged to, and which evaluation the result was a part of. Because the application state can be defined in the tags of the evaluation definition, the user can also refer to the state of the application that was used to generate the outputs for this result. Because the test-case information is stored on a per-result basis, the history of all annotation results for a given dataset row item can be constructed.
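A sketch of waiting for the evaluation to finish and then pulling its test case results for analysis is shown below; the "COMPLETED" status value, the method names, and the result fields are assumptions.

```python
import time

# Poll until annotation is finished -- the status string and method names are assumptions.
while client.evaluations.get(evaluation.id).status != "COMPLETED":
    time.sleep(60)

# Each result links back to its test case, dataset, and evaluation, so the
# annotation history for any dataset row can be reconstructed over time.
results = client.evaluations.test_case_results.list(evaluation_id=evaluation.id)
for result in results:
    print(result.test_case_id, result.evaluation_dataset_id, result.annotation_results)
```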