Overview

Agent Service provides a powerful mechanism to evaluate the performance of individual nodes within a workflow. Evaluations help measure the quality of retrieval, reranking, and LLM-generated responses, ensuring that each component of a workflow meets accuracy and relevance standards.

This guide explains how to configure node evaluations using an example YAML configuration.

How Node Evaluation Works

Evaluations in Agent Service allow you to:

  1. Assess Retrieval and Reranking Quality:
    • Compare retrieved document chunks against a ground truth source.
    • Use metrics like precision@k to quantify relevance.
  2. Automatically Evaluate LLM Outputs:
    • Generate evaluation prompts to assess LLM-generated responses.
    • Use another LLM to assign a numerical evaluation score.
  3. Parse and Score Results:
    • Compare evaluation scores to a reference threshold.
    • Aggregate scores using statistical metrics (e.g., mean).
  4. Visualize Evaluations in EGP UI:
    • Store evaluations in datasets for visualization in the EGP UI.
    • Compare performance across different datasets and iterations.
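
Conceptually, an evaluation run walks every example in a dataset through the workflow, scores the relevant node outputs, and aggregates the per-example scores. The plain-Python sketch below illustrates that loop; run_workflow, judge, and the example fields (question, answer, sources) are hypothetical stand-ins, not Agent Service APIs.

# Plain-Python illustration of the evaluation loop described above;
# run_workflow and judge are hypothetical stand-ins, not Agent Service APIs.
from statistics import mean

def evaluate(examples, run_workflow, judge, k=10, threshold=3):
    retrieval_scores, answer_scores = [], []
    for ex in examples:
        out = run_workflow(ex["question"])            # retrieve -> rerank -> prompt -> llm

        # Step 1: fraction of the top-k retrieved chunks that come from ground-truth sources
        top_k = out["retrieved_sources"][:k]
        retrieval_scores.append(sum(s in ex["sources"] for s in top_k) / max(len(top_k), 1))

        # Steps 2-3: a judge LLM grades the generated answer (e.g. "4"); compare to a threshold
        grade = int(judge(ex["question"], ex["answer"], out["answer"]))
        answer_scores.append(grade > threshold)

    # Step 4: aggregate per-example results into dataset-level metrics
    return {"retrieval_eval": mean(retrieval_scores), "llm_eval": mean(answer_scores)}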

Example YAML Configuration

The following YAML defines a retrieval-augmented generation (RAG) workflow with evaluations for retrieval, reranking, and LLM output quality.

Workflow Configuration

log_to_egp_ui: true

workflow:
  - name: "retrieve"
    type: "retriever"
    config:
      knowledge_base_name: "egp_services_retrieval_demo"
      knowledge_base_id: "31650637-6812-4f3f-8da7-734a5f9be336"
      num_to_return: 100
    inputs:
      query: "question"

  - name: "rerank"
    type: "reranker"
    config:
      num_to_return: 10
      scorers:
        - name: "cross-encoder"
          model: "cross-encoder/ms-marco-MiniLM-L-12-v2"
    inputs:
      query: "question"
      chunks: "retrieve.output"

  - name: "prompt"
    type: "jinja"
    config:
      data_transformations:
        context_chunks:
          jinja_template_str: "{% for chunk in value %}{{ chunk.text }}\n{% endfor %}"
      output_template:
        jinja_template_path: "egp_services/prompt_templates/default_prompt_template_query.jinja"
    inputs:
      context_chunks: "rerank.output"
      question: "question"

  - name: "llm"
    type: "generation"
    config:
      llm_model: "gpt-3.5-turbo"
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: "prompt.output"
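
Note how a node's inputs refer either to a field of the workflow input (e.g. question) or to an upstream node's result (e.g. retrieve.output, rerank.output). As a mental model only, assuming each node's result is stored under its name in a shared context (an assumption for illustration, not Agent Service internals), such references resolve roughly like this:

# Assumed resolution mechanism, shown for illustration; not Agent Service internals.
def resolve_input(ref: str, context: dict):
    node, _, field = ref.partition(".")
    return context[node][field] if field else context[node]

context = {
    "question": "What regions does the service cover?",
    "retrieve": {"output": ["chunk A", "chunk B"]},   # result of the "retrieve" node
}
resolve_input("question", context)          # -> the workflow input itself
resolve_input("retrieve.output", context)   # -> ["chunk A", "chunk B"], fed to "rerank"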

Evaluation Configuration

1. Evaluating Retrieval Quality

  - name: "retrieval_eval"
    type: "chunk_eval"
    config:
      top_k_thresholds: [10, 25, 50, 100]
    inputs:
      chunks: "retrieve.output"
      sources: "sources"
  • Purpose: Evaluates whether the retrieved chunks contain relevant information.
  • Method: Compares retrieved chunks (retrieve.output) against ground truth sources.
  • Metric: Precision at different k values (e.g., 10, 25, 50, 100).
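
Precision@k here is the fraction of the top k retrieved chunks whose source appears in the ground truth. A minimal sketch of that computation (illustrative only; the chunk and source shapes are assumptions, not the chunk_eval node's actual implementation):

# Illustrative precision@k computation; not the chunk_eval node's actual implementation.
def precision_at_k(retrieved_sources, ground_truth_sources, k):
    top_k = retrieved_sources[:k]
    if not top_k:
        return 0.0
    return sum(source in ground_truth_sources for source in top_k) / len(top_k)

retrieved = ["doc_7", "doc_3", "doc_9", "doc_1"]       # sources of retrieved chunks, best first
ground_truth = {"doc_3", "doc_1", "doc_9"}

print(precision_at_k(retrieved, ground_truth, k=2))    # 0.5  (1 of the top 2 is relevant)
print(precision_at_k(retrieved, ground_truth, k=4))    # 0.75 (3 of the top 4 are relevant)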

2. Evaluating Reranking Quality

  - name: "rerank_eval"
    type: "chunk_eval"
    config:
      top_k_thresholds: [10]
    inputs:
      chunks: "rerank.output"
      sources: "sources"
  • Purpose: Evaluates whether reranking improves the relevance of the top-ranked chunks.
  • Metric: Precision@10 (only top 10 ranked results are considered).

3. Evaluating LLM Response Quality

Generating an Auto-Evaluation Prompt

  - name: "autoeval_prompt"
    type: "jinja"
    config:
      output_template:
        jinja_template_path: "egp_services/prompt_templates/basic_numeric_autoeval.jinja"
    inputs:
      question: "question"
      correct_response: "answer"
      generated_response: "llm.output"
  • Purpose: Generates an evaluation prompt that asks an LLM judge to grade the generated response against the ground truth answer.
  • Method: Uses a Jinja template to format a prompt comparing the LLM-generated response (llm.output) with the ground truth answer.
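
The exact wording lives in basic_numeric_autoeval.jinja; purely as an illustration, a rendered prompt of this kind might look roughly like the following (hypothetical wording and example values):

# Hypothetical wording -- the actual basic_numeric_autoeval.jinja template may differ.
AUTOEVAL_PROMPT = (
    "You are grading an answer to a question.\n"
    "Question: {question}\n"
    "Reference answer: {correct_response}\n"
    "Candidate answer: {generated_response}\n"
    "On a scale of 1 to 5, how closely does the candidate answer match the reference answer? "
    "Reply with a single digit and nothing else."
)

prompt = AUTOEVAL_PROMPT.format(
    question="How often is the knowledge base refreshed?",
    correct_response="It is refreshed nightly.",
    generated_response="The knowledge base is updated every night.",
)

Asking for a single digit is what makes the max_tokens: 1 setting in the next step sufficient.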

Generating a Numerical Score for Evaluation

  - name: "autoeval_llm"
    type: "generation"
    config:
      llm_model: "gpt-3.5-turbo"
      max_tokens: 1
      temperature: 0.01
    inputs:
      input_prompt: "autoeval_prompt.output"
  • Purpose: Uses an LLM judge to assign a numeric score to the generated response.
  • Method: The output is a single token representing a score (e.g., 1-5).

Parsing and Evaluating the Score

  - name: "llm_eval"
    type: "response_parser"
    config:
      action: "greater_than"
      reference_value: 3
    inputs:
      response: "autoeval_llm.output"
  • Purpose: Evaluates whether the generated score meets a quality threshold.
  • Method: Checks if the evaluation score is greater than 3.
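
In effect, this turns the judge's numeric token into a per-example pass/fail signal. A rough Python equivalent of the greater_than action (an assumption about the parser's behavior, not its actual code):

# Rough equivalent of the "greater_than" action; not the response_parser's actual code.
def greater_than(response: str, reference_value: float = 3) -> bool:
    try:
        score = float(response.strip())
    except ValueError:
        return False                  # here, unparseable judge output counts as a failure
    return score > reference_value

assert greater_than("4")              # scores of 4 or 5 pass
assert not greater_than("3")          # the threshold itself does not pass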

Metrics Configuration

metrics:
  retrieval_eval: "mean"
  rerank_eval: "mean"
  llm_eval: "mean"

Aggregates evaluation results using mean scores for:

  • Retrieval (retrieval_eval).
  • Reranking (rerank_eval).
  • LLM response evaluation (llm_eval).
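
With illustrative numbers for three evaluated questions, the aggregation amounts to:

# Illustrative numbers only: "mean" aggregation over three evaluated questions.
from statistics import mean

per_question = {
    "retrieval_eval": [0.8, 0.6, 1.0],      # precision@k per question
    "rerank_eval":    [0.9, 0.7, 1.0],      # precision@10 per question
    "llm_eval":       [True, False, True],  # did the judge's score clear the threshold?
}
metrics = {name: mean(scores) for name, scores in per_question.items()}
# metrics is roughly {'retrieval_eval': 0.8, 'rerank_eval': 0.87, 'llm_eval': 0.67}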

EGP UI Integration

egp_ui_evaluation:
  use_egp_ui_ds_only: false
  datasets:
    - dataset_id: "99982167-4418-4ce8-a0e1-7b3f7dff9314"

  application_spec_id: "7b177d5a-9705-4a97-b3f9-bf5305e5fc06"
  question_set_id: "338235cb-bc70-428b-9591-ee2c2349bef0"

  final_output_node: "llm"
  extra_info_keys:
    - "rerank"
  auto_eval_nodes:
    Answer quality: autoeval_llm
  • Stores evaluation results in the EGP UI for visualization and analysis.
  • Links evaluations to datasets for historical comparison.
  • Auto-evaluates the final LLM response quality.

Why Use Node Evaluations?

  • Ensure High-Quality Responses: Evaluate LLM output against known answers.
  • Optimize Retrieval & Reranking: Identify how well retrieved documents align with ground truth.
  • Automate Quality Assurance: Use auto-evaluations to continuously measure AI performance.
  • Track Model Performance Over Time: Store results in EGP UI for monitoring and improvement.

This configuration provides a fully automated evaluation pipeline for assessing retrieval, reranking, and response quality in Agent Service workflows.