Overview

Agent Service provides a powerful mechanism to evaluate the performance of individual nodes within a workflow. Evaluations help measure the quality of retrieval, reranking, and LLM-generated responses, ensuring that each component of a workflow meets accuracy and relevance standards.

This guide explains how to configure node evaluations using an example YAML configuration.

How Node Evaluation Works

Evaluations in Agent Service allow you to:

  1. Assess Retrieval and Reranking Quality:
    • Compare retrieved document chunks against a ground truth source.
    • Use metrics like precision@k to quantify relevance.
  2. Automatically Evaluate LLM Outputs:
    • Generate evaluation prompts to assess LLM-generated responses.
    • Use another LLM to assign a numerical evaluation score.
  3. Parse and Score Results:
    • Compare evaluation scores to a reference threshold.
    • Aggregate scores using statistical metrics (e.g., mean).
  4. Visualize Evaluations in EGP UI:
    • Store evaluations in datasets for visualization in the EGP UI.
    • Compare performance across different datasets and iterations.
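
Conceptually, an evaluation run walks every example in a dataset through the workflow, scores the relevant node outputs, and aggregates the per-example scores. The plain-Python sketch below illustrates that loop; run_workflow, judge, and the example fields (question, answer, sources) are hypothetical stand-ins, not Agent Service APIs.

# Plain-Python illustration of the evaluation loop described above;
# run_workflow and judge are hypothetical stand-ins, not Agent Service APIs.
from statistics import mean

def evaluate(examples, run_workflow, judge, k=10, threshold=3):
    retrieval_scores, answer_scores = [], []
    for ex in examples:
        out = run_workflow(ex["question"])            # retrieve -> rerank -> prompt -> llm

        # Step 1: fraction of the top-k retrieved chunks that come from ground-truth sources
        top_k = out["retrieved_sources"][:k]
        retrieval_scores.append(sum(s in ex["sources"] for s in top_k) / max(len(top_k), 1))

        # Steps 2-3: a judge LLM grades the generated answer (e.g. "4"); compare to a threshold
        grade = int(judge(ex["question"], ex["answer"], out["answer"]))
        answer_scores.append(grade > threshold)

    # Step 4: aggregate per-example results into dataset-level metrics
    return {"retrieval_eval": mean(retrieval_scores), "llm_eval": mean(answer_scores)}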

Example YAML Configuration

The following YAML defines a retrieval-augmented generation (RAG) workflow with evaluations for retrieval, reranking, and LLM output quality.

Workflow Configuration

log_to_egp_ui: true

workflow:
  - name: "retrieve"
    type: "retriever"
    config:
      knowledge_base_name: "egp_services_retrieval_demo"
      knowledge_base_id: "31650637-6812-4f3f-8da7-734a5f9be336"
      num_to_return: 100
    inputs:
      query: "question"

  - name: "rerank"
    type: "reranker"
    config:
      num_to_return: 10
      scorers:
        - name: "cross-encoder"
          model: "cross-encoder/ms-marco-MiniLM-L-12-v2"
    inputs:
      query: "question"
      chunks: "retrieve.output"

  - name: "prompt"
    type: "jinja"
    config:
      data_transformations:
        context_chunks:
          jinja_template_str: "{% for chunk in value %}{{ chunk.text }}\n{% endfor %}"
      output_template:
        jinja_template_path: "egp_services/prompt_templates/default_prompt_template_query.jinja"
    inputs:
      context_chunks: "rerank.output"
      question: "question"

  - name: "llm"
    type: "generation"
    config:
      llm_model: "gpt-3.5-turbo"
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: "prompt.output"
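
Note how a node's inputs refer either to a field of the workflow input (e.g. question) or to an upstream node's result (e.g. retrieve.output, rerank.output). As a mental model only, assuming each node's result is stored under its name in a shared context (an assumption for illustration, not Agent Service internals), such references resolve roughly like this:

# Assumed resolution mechanism, shown for illustration; not Agent Service internals.
def resolve_input(ref: str, context: dict):
    node, _, field = ref.partition(".")
    return context[node][field] if field else context[node]

context = {
    "question": "What regions does the service cover?",
    "retrieve": {"output": ["chunk A", "chunk B"]},   # result of the "retrieve" node
}
resolve_input("question", context)          # -> the workflow input itself
resolve_input("retrieve.output", context)   # -> ["chunk A", "chunk B"], fed to "rerank"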

Evaluation Configuration

1. Evaluating Retrieval Quality

  - name: "retrieval_eval"
    type: "chunk_eval"
    config:
      top_k_thresholds: [10, 25, 50, 100]
    inputs:
      chunks: "retrieve.output"
      sources: "sources"
  • Purpose: Evaluates whether the retrieved chunks contain relevant information.
  • Method: Compares retrieved chunks (retrieve.output) against ground truth sources.
  • Metric: Precision at different k values (e.g., 10, 25, 50, 100).
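
Precision@k here is the fraction of the top k retrieved chunks whose source appears in the ground truth. A minimal sketch of that computation (illustrative only; the chunk and source shapes are assumptions, not the chunk_eval node's actual implementation):

# Illustrative precision@k computation; not the chunk_eval node's actual implementation.
def precision_at_k(retrieved_sources, ground_truth_sources, k):
    top_k = retrieved_sources[:k]
    if not top_k:
        return 0.0
    return sum(source in ground_truth_sources for source in top_k) / len(top_k)

retrieved = ["doc_7", "doc_3", "doc_9", "doc_1"]       # sources of retrieved chunks, best first
ground_truth = {"doc_3", "doc_1", "doc_9"}

print(precision_at_k(retrieved, ground_truth, k=2))    # 0.5  (1 of the top 2 is relevant)
print(precision_at_k(retrieved, ground_truth, k=4))    # 0.75 (3 of the top 4 are relevant)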

2. Evaluating Reranking Quality

  - name: "rerank_eval"
    type: "chunk_eval"
    config:
      top_k_thresholds: [10]
    inputs:
      chunks: "rerank.output"
      sources: "sources"
  • Purpose: Evaluates whether reranking improves the relevance of the top-ranked chunks.
  • Metric: Precision@10 (only top 10 ranked results are considered).

3. Evaluating LLM Response Quality

Generating an Auto-Evaluation Prompt

  - name: "autoeval_prompt"
    type: "jinja"
    config:
      output_template:
        jinja_template_path: "egp_services/prompt_templates/basic_numeric_autoeval.jinja"
    inputs:
      question: "question"
      correct_response: "answer"
      generated_response: "llm.output"
  • Purpose: Generates an evaluation prompt that asks an LLM judge to grade the generated response against the ground truth answer.
  • Method: Uses a Jinja template to format a prompt comparing the LLM-generated response (llm.output) with the ground truth answer.
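
The exact wording lives in basic_numeric_autoeval.jinja; purely as an illustration, a rendered prompt of this kind might look roughly like the following (hypothetical wording and example values):

# Hypothetical wording -- the actual basic_numeric_autoeval.jinja template may differ.
AUTOEVAL_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Reference answer: {correct_response}\n"
    "Candidate answer: {generated_response}\n"
    "On a scale of 1 to 5, how closely does the candidate answer match the reference answer? "
    "Reply with a single digit and nothing else."
)

prompt = AUTOEVAL_PROMPT.format(
    question="How often is the knowledge base refreshed?",
    correct_response="It is refreshed nightly.",
    generated_response="The knowledge base is updated every night.",
)

Asking for a single digit is what makes the max_tokens: 1 setting in the next step sufficient.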

Generating a Numerical Score for Evaluation

  - name: "autoeval_llm"
    type: "generation"
    config:
      llm_model: "gpt-3.5-turbo"
      max_tokens: 1
      temperature: 0.01
    inputs:
      input_prompt: "autoeval_prompt.output"
  • Purpose: Uses an LLM judge to assign a numeric score to the generated response.
  • Method: The output is a single token representing a score (e.g., 1-5).

Parsing and Evaluating the Score

  - name: "llm_eval"
    type: "response_parser"
    config:
      action: "greater_than"
      reference_value: 3
    inputs:
      response: "autoeval_llm.output"
  • Purpose: Evaluates whether the generated score meets a quality threshold.
  • Method: Checks if the evaluation score is greater than 3.
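
In effect, this turns the judge's numeric token into a per-example pass/fail signal. A rough Python equivalent of the greater_than action (an assumption about the parser's behavior, not its actual code):

# Rough equivalent of the "greater_than" action; not the response_parser's actual code.
def greater_than(response: str, reference_value: float = 3) -> bool:
    try:
        score = float(response.strip())
    except ValueError:
        return False                  # here, unparseable judge output counts as a failure
    return score > reference_value

assert greater_than("4")              # scores of 4 or 5 pass
assert not greater_than("3")          # the threshold itself does not pass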

Metrics Configuration

metrics:
  retrieval_eval: "mean"
  rerank_eval: "mean"
  llm_eval: "mean"

Aggregates evaluation results using mean scores for:

  • Retrieval (retrieval_eval).
  • Reranking (rerank_eval).
  • LLM response evaluation (llm_eval).
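
With illustrative numbers for three evaluated questions, the aggregation amounts to:

# Illustrative numbers only: "mean" aggregation over three evaluated questions.
from statistics import mean

per_question = {
    "retrieval_eval": [0.8, 0.6, 1.0],      # precision@k per question
    "rerank_eval":    [0.9, 0.7, 1.0],      # precision@10 per question
    "llm_eval":       [True, False, True],  # did the judge's score clear the threshold?
}
metrics = {name: mean(scores) for name, scores in per_question.items()}
# metrics is roughly {'retrieval_eval': 0.8, 'rerank_eval': 0.87, 'llm_eval': 0.67}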

EGP UI Integration

egp_ui_evaluation:
  use_egp_ui_ds_only: false
  datasets:
    - dataset_id: "99982167-4418-4ce8-a0e1-7b3f7dff9314"

  application_spec_id: "7b177d5a-9705-4a97-b3f9-bf5305e5fc06"
  question_set_id: "338235cb-bc70-428b-9591-ee2c2349bef0"

  final_output_node: "llm"
  extra_info_keys:
    - "rerank"
  auto_eval_nodes:
    Answer quality: autoeval_llm
  • Stores evaluation results in the EGP UI for visualization and analysis.
  • Links evaluations to datasets for historical comparison.
  • Auto-evaluates the final LLM response quality.

Why Use Node Evaluations?

  • Ensure High-Quality Responses: Evaluate LLM output against known answers.
  • Optimize Retrieval & Reranking: Identify how well retrieved documents align with ground truth.
  • Automate Quality Assurance: Use auto-evaluations to continuously measure AI performance.
  • Track Model Performance Over Time: Store results in EGP UI for monitoring and improvement.

This configuration provides a fully automated evaluation pipeline for assessing retrieval, reranking, and response quality in Agent Service workflows.