In addition to human annotators, the SGP platform supports evaluations performed by LLMs. This is called an auto evaluation: when you choose this option, the application variant is run against the evaluation dataset, and an LLM then annotates the results based on the rubric.
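Conceptually, this is an LLM-as-judge loop: each test case in the dataset is run through the application variant, and the output is sent to a judge LLM along with the rubric for annotation. The Python sketch below illustrates that flow only; `run_variant` and `ask_judge_llm` are hypothetical placeholders, not SGP SDK calls.

```python
# Minimal sketch of the auto-evaluation flow described above.
# `run_variant` and `ask_judge_llm` are hypothetical placeholders,
# not SGP SDK functions; substitute your application and judge model.
from dataclasses import dataclass

RUBRIC = "Score 1-5 for factual accuracy, and explain your reasoning."

@dataclass
class Annotation:
    test_case_id: str
    output: str
    score: int       # judge's rubric score
    rationale: str   # judge's explanation

def run_variant(prompt: str) -> str:
    """Hypothetical: invoke the application variant on one test case."""
    raise NotImplementedError

def ask_judge_llm(prompt: str) -> tuple[int, str]:
    """Hypothetical: call a judge LLM and parse (score, rationale)."""
    raise NotImplementedError

def auto_evaluate(dataset: list[dict]) -> list[Annotation]:
    annotations = []
    for case in dataset:
        output = run_variant(case["input"])      # 1. run the variant
        score, rationale = ask_judge_llm(        # 2. LLM annotates per rubric
            f"Rubric: {RUBRIC}\n"
            f"Input: {case['input']}\n"
            f"Output: {output}\n"
            "Return a score and a one-sentence rationale."
        )
        annotations.append(Annotation(case["id"], output, score, rationale))
    return annotations
```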

Evaluation types that support auto evaluations through the UI

Currently, three evaluation types support auto evaluations through the UI:

For other Flexible Dataset configurations, auto evaluations are not yet supported through the UI.