Standard SGP evaluations assume that your AI application has one input and one output, optionally with some retrieval chunks attached.

This kind of evaluation fits chatbots and retrieval-augmented generation (RAG) applications well. However, as our engineers evaluated agents, we found that complex AI applications often have more than one input or output, or have multiple intermediate steps that need to be evaluated independently. We built Flexible Evaluations based on what we learned, to provide a powerful and customizable way of evaluating agents and other complex AI applications.

Flexible evaluations enable you to:

  1. Evaluate applications with multiple inputs and outputs.
  2. Evaluate applications that have multiple steps, assessing each step independently.
  3. Surface exactly the right data to human evaluators, so you can increase evaluation velocity.
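
To make the three capabilities above concrete, here is a minimal, illustrative sketch in plain Python. It is not the SGP SDK: the names `Step`, `FlexibleCase`, and `annotator_view`, and the dotted-path convention, are hypothetical and chosen only to show the shape of the data. It models a test case with several named inputs and outputs, a list of intermediate steps, and a helper that selects only the fields a human evaluator needs to see.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Step:
    """One intermediate step of the application (hypothetical structure)."""
    name: str
    inputs: dict[str, Any]
    outputs: dict[str, Any]


@dataclass
class FlexibleCase:
    """A test case with any number of named inputs, outputs, and steps."""
    inputs: dict[str, Any]
    outputs: dict[str, Any]
    steps: list[Step] = field(default_factory=list)


def annotator_view(case: FlexibleCase, paths: list[str]) -> dict[str, Any]:
    """Return only the fields named by dotted paths.

    Assumed path forms (illustrative convention, not an SGP format):
      "inputs.<key>" or "outputs.<key>" for top-level fields,
      "steps.<step name>.inputs.<key>" or "steps.<step name>.outputs.<key>".
    """
    steps_by_name = {step.name: step for step in case.steps}
    view: dict[str, Any] = {}
    for path in paths:
        parts = path.split(".")
        if parts[0] == "steps":
            _, step_name, group, key = parts
            view[path] = getattr(steps_by_name[step_name], group)[key]
        else:
            group, key = parts
            view[path] = getattr(case, group)[key]
    return view


# Example: an agent run with two inputs, two outputs, and two steps.
case = FlexibleCase(
    inputs={"question": "What is our refund policy?", "user_tier": "premium"},
    outputs={"answer": "Refunds are available within 30 days.", "confidence": 0.9},
    steps=[
        Step("retrieve", {"query": "refund policy"}, {"documents": ["policy.md"]}),
        Step("draft", {"documents": ["policy.md"]}, {"text": "Refunds are available within 30 days."}),
    ],
)

# Show the evaluator only the question, the retrieved documents, and the answer.
view = annotator_view(
    case,
    ["inputs.question", "steps.retrieve.outputs.documents", "outputs.answer"],
)
```

Keeping each step as its own record is what lets a step be evaluated independently, and limiting the view to a few paths keeps irrelevant fields away from the evaluator.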

In the following sections, we’ll explain how flexible evaluations can be used to evaluate complex AI applications.