Evaluation
Dynamically evaluate agents per workflow or per node
Overview
Agent Service provides a powerful mechanism to evaluate the performance of individual nodes within a workflow. Evaluations help measure the quality of retrieval, reranking, and LLM-generated responses, ensuring that each component of a workflow meets accuracy and relevance standards.
This guide explains how to configure node evaluations using an example YAML configuration.
How Node Evaluation Works
Evaluations in Agent Service allow you to:
- Assess Retrieval and Reranking Quality:
  - Compare retrieved document chunks against a ground truth source.
  - Use metrics like precision@k to quantify relevance.
- Automatically Evaluate LLM Outputs:
  - Generate evaluation prompts to assess LLM-generated responses.
  - Use another LLM to assign a numerical evaluation score.
- Parse and Score Results:
  - Compare evaluation scores to a reference threshold.
  - Aggregate scores using statistical metrics (e.g., mean).
- Visualize Evaluations in the EGP UI:
  - Store evaluations in datasets for visualization in the EGP UI.
  - Compare performance across different datasets and iterations.
Example YAML Configuration
The following YAML defines a retrieval-augmented generation (RAG) workflow with evaluations for retrieval, reranking, and LLM output quality.
Workflow Configuration
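A minimal sketch of what such a workflow definition could look like is shown below. The node names retrieve, rerank, and llm match the references used in the evaluation sections (retrieve.output, llm.output); the surrounding field names (nodes, type, inputs, config) are illustrative assumptions, not the exact Agent Service schema.

```yaml
# Illustrative sketch of a RAG workflow -- field names such as `type`,
# `inputs`, and `config` are assumptions, not the actual Agent Service schema.
nodes:
  - name: retrieve                  # fetch candidate chunks from a knowledge base
    type: retriever
    config:
      knowledge_base_id: "<your-knowledge-base-id>"
      top_k: 100
  - name: rerank                    # re-order retrieved chunks by relevance
    type: reranker
    inputs:
      chunks: retrieve.output
    config:
      top_k: 10
  - name: llm                       # generate the final answer from the top chunks
    type: completion
    inputs:
      context: rerank.output
      query: workflow.input         # hypothetical reference to the workflow input
    config:
      model: "<model-name>"
```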
Evaluation Configuration
1. Evaluating Retrieval Quality
- Purpose: Evaluates whether the retrieved chunks contain relevant information.
- Method: Compares retrieved chunks (retrieve.output) against ground truth sources.
- Metric: Precision at different k values (e.g., 10, 25, 50, 100).
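A hypothetical evaluation entry for this step might look like the following. The evaluation name retrieval_eval and the node name retrieve come from this configuration; the type name chunk_precision and the ground_truth and metrics fields are assumptions.

```yaml
evaluations:
  retrieval_eval:                   # name referenced later in the metrics section
    node: retrieve
    type: chunk_precision           # assumed evaluation type
    ground_truth: dataset.expected_sources
    metrics:
      - precision@10
      - precision@25
      - precision@50
      - precision@100
```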
2. Evaluating Reranking Quality
- Purpose: Evaluates how well the reranker improves retrieval.
- Metric: Precision@10 (only top 10 ranked results are considered).
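A corresponding sketch for the reranker, using the same hypothetical fields:

```yaml
evaluations:
  rerank_eval:
    node: rerank
    type: chunk_precision           # assumed evaluation type
    ground_truth: dataset.expected_sources
    metrics:
      - precision@10                # only the top 10 reranked chunks are scored
```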
3. Evaluating LLM Response Quality
Generating an Auto-Evaluation Prompt
- Purpose: Generates an evaluation prompt for the LLM to assess the quality of its own response.
- Method: Uses a Jinja template to format a prompt comparing the LLM-generated response (llm.output) with the ground truth answer.
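A sketch of how such a prompt step could be declared. The jinja_template type and the dataset.expected_answer reference are assumptions; llm.output is the generated response named in this configuration.

```yaml
evaluations:
  llm_eval_prompt:
    type: jinja_template            # assumed step type
    template: |
      Rate how well the generated answer matches the expected answer
      on a scale of 1 to 5. Respond with a single digit.
      Expected answer: {{ dataset.expected_answer }}
      Generated answer: {{ llm.output }}
```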
Generating a Numerical Score for Evaluation
- Purpose: Uses an LLM to assign a numeric score to its own response.
- Method: The output is a single token representing a score (e.g., 1-5).
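A hypothetical completion step that feeds the generated prompt to an evaluator model and constrains it to a single-token score; all field names here are assumptions.

```yaml
evaluations:
  llm_eval_score:
    type: completion
    inputs:
      prompt: llm_eval_prompt.output
    config:
      model: "<evaluator-model-name>"
      max_tokens: 1                 # force a single-token score such as "4"
```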
Parsing and Evaluating the Score
- Purpose: Evaluates whether the generated score meets a quality threshold.
- Method: Checks if the evaluation score is greater than 3.
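One way this check could be expressed, again with assumed type and field names:

```yaml
evaluations:
  llm_eval:
    type: score_threshold           # assumed parser/comparison type
    inputs:
      score: llm_eval_score.output
    config:
      threshold: 3                  # passes when the parsed score is greater than 3
```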
Metrics Configuration
Aggregates evaluation results using mean scores for:
- Retrieval (retrieval_eval).
- Reranking (rerank_eval).
- LLM response evaluation (llm_eval).
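An illustrative metrics block aggregating the three evaluations by mean; the metrics and aggregation keys are assumptions, while the evaluation names match those defined above.

```yaml
metrics:
  retrieval_eval:
    aggregation: mean
  rerank_eval:
    aggregation: mean
  llm_eval:
    aggregation: mean
```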
EGP UI Integration
- Stores evaluation results in the EGP UI for visualization and analysis.
- Links evaluations to datasets for historical comparison.
- Auto-evaluates the final LLM response quality.
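A minimal sketch of how results might be linked to an EGP dataset for visualization; both field names are assumptions.

```yaml
egp:
  dataset_id: "<your-egp-dataset-id>"   # dataset used to group runs for comparison
  upload_results: true
```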
Why Use Node Evaluations?
- Ensure High-Quality Responses: Evaluate LLM output against known answers.
- Optimize Retrieval & Reranking: Identify how well retrieved documents align with ground truth.
- Automate Quality Assurance: Use auto-evaluations to continuously measure AI performance.
- Track Model Performance Over Time: Store results in EGP UI for monitoring and improvement.
This configuration provides a fully automated evaluation pipeline for assessing retrieval, reranking, and response quality in Agent Service workflows.