For flexible evaluation datasets that conform to the summarization use case schema, users can choose to run an auto-evaluation on the dataset.
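
As a rough illustration only, a dataset row for the summarization use case pairs a source document with the summary you expect for it. The field names below mirror the prompt template variables described later on this page and are shown only to convey the shape of the data; refer to the dataset creation instructions for the exact schema.

  {
    "document": "Full source text to be summarized ...",
    "expected_summary": "Reference summary that the application's output is compared against ..."
  }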

Configuring Auto-evaluations

  1. Create Application and Dataset - To kick off an auto-evaluation for the summarization use case, you first need to create an application variant and a dataset configured with the summarization use case. Instructions can be found here.

  2. Navigate to the Application View - Once you have created the application and dataset, navigate to the application view and click “Evaluate Variant”.

  3. Configure Evaluation - Configure the evaluation by selecting the dataset and question set. To run an auto-evaluation, the selected dataset must have the summarization configuration.

  4. Auto-evaluation Configuration - The auto-evaluation has a few settings that can be changed.

    Model - The LLM used to score the evaluation. All installed models that are compatible with auto-evaluations will be available in the dropdown.

    Prompt - The prompt that will be fed to the evaluation LLM. A default prompt is provided for summarization use cases, or you can edit the prompt template by selecting Customize Prompt. See below for instructions.

  5. Prompt Editing - The prompt can be customized in the platform by clicking the Customize Prompt button.

    Variables - Clicking the Variables button shows the variables that are available for use in the prompt template, along with a description of each. For the summarization use case, the available variables are:

    • document: full source text of what is being summarized
    • expected_summary: the expected (reference) summary for the document
    • summary: the actual summary returned by the application
    • evaluation_question: the prompt that tells the LLM to answer the evaluation question from the rubric. It is currently non-configurable and must be included in the prompt template. See below for more details.

    Evaluation Question - This is a prompt that tells the LLM to answer the evaluation question from the rubric. It is currently non-configurable, and its contents depend on the type of question configured in the rubric. For example, the LLM will be instructed differently on how to answer the question depending on whether it is a multiple-choice or free-text question. This variable must be included in the overall prompt, and its contents will be injected into the overall prompt at the location of the variable. An example template is sketched below.
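
    As a rough sketch only, a customized prompt template might combine the variables above along the following lines. The double-brace variable syntax and the surrounding wording are illustrative assumptions, not the platform's default prompt; the only hard requirement described above is that the evaluation_question variable appears somewhere in the template.

      You are evaluating the quality of a summary.

      Source document:
      {{document}}

      Expected summary:
      {{expected_summary}}

      Summary produced by the application:
      {{summary}}

      {{evaluation_question}}

    Because the evaluation question's contents are injected at the variable's location, placing {{evaluation_question}} last keeps the rubric instructions immediately before where the model writes its answer.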

Viewing Results of Auto-evaluations

Results for an auto-evaluation can be viewed the same way as other evaluations: navigate to the application page and click on the evaluation that was run. For auto-evaluations, users will also be able to see the prompt that was given to the LLM and the reasoning for why the evaluation was scored the way it was.