Evaluating large language models is intrinsically difficult because of the subjective nature of responses to open-ended requests.
LLMs used as evaluators are known to suffer from verbosity bias, self-enhancement bias (favoring answers they generated themselves), and other biases. Many prompts require domain expertise to evaluate accurately. Finally, evaluations are only as good as their evaluation sets: diversity and comprehensiveness are key.
However, getting evaluations right is essential for enterprises. Companies need a reliable way to evaluate each iteration of their AI applications.
To perform an evaluation, the user begins by setting up the foundational elements – the dataset, Studio project, and project registration. Subsequently, the evaluation tasks, including data generation, annotation, monitoring, analysis, and visualization, are carried out iteratively for each evaluation.
First, however, we need to run a small amount of boilerplate:
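The exact setup depends on the client you use. As a rough Python sketch, assuming an API key exported in the environment and a hypothetical REST base URL (the variable names, endpoints, and payload shapes here and in the later snippets are illustrative assumptions, not the documented SGP surface):

```python
# Minimal setup sketch (hypothetical): a shared HTTP session with an API key.
import os

import requests

API_BASE = os.environ.get("SGP_API_BASE", "https://api.example.com/v1")  # placeholder URL
API_KEY = os.environ["SGP_API_KEY"]  # assumes the key is exported in the environment

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {API_KEY}"})
```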
An evaluation dataset is a list of test cases that users want to benchmark their project’s performance on. A user creates an evaluation dataset by uploading a CSV file and specifying the schema of the dataset at creation time. Initially, the only supported schema is a CSV file with a required input column and optional expected_output and expected_extra_info columns, where input is the end-user prompt, expected_output is the expected output of the AI project, and expected_extra_info is any additional information that the AI project should have used to generate the output. This schema definition is designated as the GENERATION schema. Because we don’t flatten the columns in our Postgres database, this design allows for flexible schema definitions, and additional schema types can be supported in the future as needed. Defining a schema also tells users how to read the dataset and what to expect in each row, which makes it easy to pull datasets they did not create and use them in standardized processing scripts.
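Continuing the setup sketch above, creating a dataset from a CSV that follows the GENERATION schema might look roughly like this; the endpoint path, form fields, and response shape are assumptions for illustration:

```python
# Hypothetical sketch: upload a CSV with a required "input" column and optional
# "expected_output" / "expected_extra_info" columns as a GENERATION-schema dataset.
with open("eval_dataset.csv", "rb") as f:
    resp = session.post(
        f"{API_BASE}/evaluation-datasets",
        data={"name": "support-bot-benchmark", "schema_type": "GENERATION"},
        files={"file": ("eval_dataset.csv", f, "text/csv")},
    )
resp.raise_for_status()
dataset_id = resp.json()["id"]  # keep the id for later evaluations
```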
This is simply a metadata entry that describes an end-user application. It is useful as a grouping mechanism that lets users relate multiple evaluations to a single user application. It also allows the user to specify the detailed state and components of the current application, e.g. which kind of retrieval components were used in conjunction with the LLM, if any.
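A hypothetical sketch of registering an application spec, with tags describing the application's current components (the field names and tag values are illustrative):

```python
# Hypothetical sketch: register an application spec as a grouping entity for evaluations.
resp = session.post(
    f"{API_BASE}/applications",
    json={
        "name": "support-bot",
        "description": "Customer support assistant",
        # Free-form tags describing the current state, e.g. model and retrieval setup.
        "tags": {"llm": "gpt-4", "retrieval": "hybrid-bm25-dense"},
    },
)
resp.raise_for_status()
application_id = resp.json()["id"]
```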
Registers an annotation project using Scale Studio. This is only used if Scale Studio is the platform used for annotations. Learn more about Scale Studio.
Sometimes evaluations require that external resources be created. Because these platforms do not need to share any properties, it makes the most sense to regard them as entirely separate components. For example, since the primary annotation platform we will use is Studio, we will allow users to create Studio projects using SGP APIs and store references to these projects in our data tables. This way, once an admin user creates a Studio project and adds taskers to it in the Studio UI, developers can create evaluations that send evaluation tasks to these centralized projects.
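As a sketch, creating a Studio project reference through a hypothetical SGP endpoint and storing its id for later evaluations could look like this:

```python
# Hypothetical sketch: create a Studio annotation project and keep a reference to it.
resp = session.post(
    f"{API_BASE}/studio-projects",
    json={"name": "support-bot-annotations"},
)
resp.raise_for_status()
studio_project_id = resp.json()["id"]
# An admin then adds taskers to this project in the Studio UI before
# evaluations start sending tasks to it.
```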
This refers to the action of sending off a batch of tasks to evaluate a specific iteration of a user application. It contains references to the application’s current state, the id of the application the evaluation is for, the status of the evaluation, and any configuration needed for the annotation mechanism (e.g. Studio task project id and questions, Auto LLM model name and questions, client-side function name, etc.). Each evaluation references a dataset; the customer application iterates through each row of the dataset, reads the row’s input value, and generates an output for it.
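Putting the pieces together, an evaluation run might look roughly like the sketch below, where generate_answer stands in for the customer application and the endpoints, payloads, and question format are again assumptions:

```python
# Hypothetical sketch: create an evaluation, then generate and submit an output
# for every row of the referenced dataset.
def generate_answer(prompt: str) -> str:
    # Placeholder for the customer application's generation logic.
    return "stub answer for: " + prompt

resp = session.post(
    f"{API_BASE}/evaluations",
    json={
        "application_id": application_id,
        "evaluation_dataset_id": dataset_id,
        # Annotation configuration: here, a Studio project and its questions.
        "annotation_config": {
            "studio_project_id": studio_project_id,
            "questions": [{"type": "categorical", "title": "Is the answer correct?"}],
        },
        # Tags capturing the application state this evaluation covers.
        "tags": {"llm": "gpt-4", "retrieval": "hybrid-bm25-dense"},
    },
)
resp.raise_for_status()
evaluation_id = resp.json()["id"]

# Iterate through the dataset rows, generate an output per input, and submit it.
rows = session.get(f"{API_BASE}/evaluation-datasets/{dataset_id}/test-cases").json()
for row in rows:
    output = generate_answer(row["input"])
    session.post(
        f"{API_BASE}/evaluations/{evaluation_id}/test-case-results",
        json={"test_case_id": row["id"], "output": output},
    )
```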
Each evaluation consists of test case results, which store the outcome of every test case evaluated. By looking at a single test case result, you can see which test case the result is for, the dataset that test case belonged to, and which evaluation the result was part of. Because the application state can be defined in the tags of the evaluation definition, the user can also refer to the state of the application that was used to generate the outputs for this result. And because the test case information is stored on a per-result basis, the history of all annotation results for a given dataset row can be reconstructed.
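Once annotations are complete, results could be read back per test case with a sketch like the following (the endpoint and response fields are assumptions):

```python
# Hypothetical sketch: list test case results for an evaluation and inspect the
# linkage back to test cases, datasets, and annotations.
results = session.get(f"{API_BASE}/evaluations/{evaluation_id}/test-case-results").json()
for r in results:
    print(r["test_case_id"], r["evaluation_dataset_id"], r.get("annotation"))
```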