For generation datasets, you can have an LLM score the evaluation. This is called an auto-evaluation: the application variant is run against the evaluation dataset, and an LLM then annotates the results according to the rubric.
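The two stages described above (run the variant, then have an LLM annotate each result against the rubric) can be pictured with a small, self-contained sketch. It is only an illustration and not the product's implementation: `auto_evaluate`, `Annotation`, `echo_variant`, and `length_judge` are hypothetical names, and the "judge" here is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Annotation:
    """One annotated result: the input, the variant's output, and the judge's verdict."""
    prompt: str
    output: str
    score: int
    rationale: str


def auto_evaluate(
    dataset: List[str],
    variant: Callable[[str], str],
    judge: Callable[[str, str, str], Tuple[int, str]],
    rubric: str,
) -> List[Annotation]:
    """Run the variant on every dataset item, then let the judge annotate each output."""
    annotations = []
    for prompt in dataset:
        output = variant(prompt)                         # stage 1: run the application variant
        score, rationale = judge(prompt, output, rubric)  # stage 2: annotate against the rubric
        annotations.append(Annotation(prompt, output, score, rationale))
    return annotations


# Hypothetical stand-ins so the sketch runs without any external service.
def echo_variant(prompt: str) -> str:
    return prompt.upper()


def length_judge(prompt: str, output: str, rubric: str) -> Tuple[int, str]:
    # A real judge would send the rubric and the output to an LLM; this one
    # only checks output length so the example stays deterministic.
    score = 5 if len(output) <= 20 else 1
    return score, f"Scored against rubric: {rubric}"


results = auto_evaluate(
    ["hello", "summarize this long document please"],
    echo_variant,
    length_judge,
    "Responses should be concise.",
)
for annotation in results:
    print(annotation.score, annotation.output)
```

In a real run, the judge would be an LLM prompted with the rubric, and the annotations would be stored with the evaluation run rather than printed.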

How to run an auto-evaluation

For generation datasets, auto-evaluation is available when you configure the evaluation run on the variant. Select either Auto-Evaluation or Hybrid to enable it. A Hybrid run performs an auto-evaluation and also lets humans annotate the results.

The auto-evaluation runs as an asynchronous job in the background. When it completes, the updated status appears on the application page.
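Because the job is asynchronous, anything that depends on the result has to wait for a terminal status. This page only describes checking the status on the application page. If you also have some programmatic way to read the status, a generic polling loop might look like the sketch below. Everything in it is an assumption for illustration: `wait_for_evaluation` is a hypothetical helper, `get_status` is whatever callable you supply, and the status strings are placeholders rather than documented values.

```python
import time
from typing import Callable

# Placeholder status names; the real values are not given in this document.
TERMINAL_STATUSES = ("COMPLETED", "FAILED")


def wait_for_evaluation(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> str:
    """Poll get_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("evaluation did not finish before the timeout")
        time.sleep(poll_seconds)


# Usage with a fake status source that finishes on the third check.
fake_statuses = iter(["PENDING", "RUNNING", "COMPLETED"])
final_status = wait_for_evaluation(
    lambda: next(fake_statuses), poll_seconds=0.0, timeout_seconds=5.0
)
print(final_status)
```

Polling with a timeout keeps the caller from blocking indefinitely if the background job stalls.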