As a next step, we’ll implement Retrieval-Augmented Generation (RAG) using Agent Service. The following YAML defines a workflow that retrieves relevant knowledge from a specified knowledge base and uses that information to generate an LLM-powered response.

Overview

This RAG workflow enables an AI agent to:

  1. Extract the latest user message.
  2. Retrieve relevant context from a knowledge base.
  3. Format the retrieved context into a structured prompt.
  4. Generate a response using a large language model (LLM).

By leveraging a retriever node, the agent grounds its responses in factual, up-to-date information, reducing the risk of hallucinated answers.

YAML Workflow

user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages
  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids
  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: retrieve.output
      question: get_last_message.output
  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output

Workflow Breakdown

Step | Node Name        | Type        | Purpose
-----|------------------|-------------|------------------------------------------------------
1    | get_last_message | get_message | Extracts the most recent user message
2    | retrieve         | retriever   | Searches knowledge bases for relevant information
3    | prompt           | jinja       | Formats retrieved context into a structured LLM prompt
4    | llm              | generation  | Generates a response based on the retrieved knowledge

1. Extract Latest User Message

Node: get_last_message

  • Type: get_message
  • Function: Retrieves the last user message from the conversation history.
  • Configuration:
    • index: -1 selects the last message in the list (negative indices count from the end).
    • Input: messages (provided by user_input).
    • Output: The extracted last user message (see the example below).
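
To make this concrete, here is a hypothetical messages payload and the value get_last_message.output would carry. The role/content message schema shown is an assumption for illustration, not taken from Agent Service's documentation.

# Hypothetical input payload; the role/content schema is assumed
messages:
  - role: user
    content: "Hi there"
  - role: assistant
    content: "Hello! How can I help?"
  - role: user
    content: "What is our refund policy?"

# With index: -1, get_last_message.output is the final message:
# "What is our refund policy?"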

2. Retrieve Relevant Context

Node: retrieve

  • Type: retriever
  • Function: Searches the specified knowledge bases to find up to 10 relevant pieces of information.
  • Configuration:
    • num_to_return: 10 specifies the number of results to retrieve.
    • Input:
      • query: The last user message (get_last_message.output).
      • knowledge_base_ids: The set of knowledge bases to search (user_input.knowledge_base_ids).
    • Output: A list of retrieved context chunks (a hypothetical sketch follows).
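
For illustration, retrieve.output might look like the sketch below. Only the text field is certain here, since the Jinja template in the next step reads chunk.text; the score field is an assumed detail.

# Hypothetical shape of retrieve.output (two of up to ten chunks shown)
- text: "Refunds are available within 30 days of purchase."
  score: 0.87   # assumed field; only `text` is confirmed by the template
- text: "Pro subscriptions renew annually at the then-current price."
  score: 0.61   # assumed field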

3. Format the Prompt

Node: prompt

  • Type: jinja
  • Function: Constructs a structured prompt for the LLM using retrieved knowledge.
  • Configuration:
    • Uses a Jinja template to format the retrieved information into a structured message.
    • Template Logic:
      • Every retrieved chunk is inserted into the Context section of the prompt.
      • The LLM is instructed to answer only from that context, and to tell the user when the provided context cannot answer the question.
    • Input:
      • context_chunks: The retrieved information (retrieve.output).
      • question: The original user message (get_last_message.output).
    • Output: A fully rendered prompt string (see the example below).
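
Given the hypothetical message and chunks above, the rendered prompt would read roughly as follows (YAML's > folded style joins single line breaks with spaces and keeps blank lines as paragraph breaks):

Use the following pieces of context to answer the user's question. If there is no relevant context, don't answer the question and let the user know that the context provided cannot answer the question.

Context: Refunds are available within 30 days of purchase. Pro subscriptions renew annually at the then-current price.

Question: What is our refund policy?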

4. Generate an AI Response

Node: llm

  • Type: generation
  • Function: Uses an LLM to generate a response based on the structured prompt.
  • Configuration:
    • Model: gpt-4o-mini
    • Max Tokens: 512
    • Temperature: 0.2 (low variability for more deterministic responses).
    • Input: input_prompt from prompt.output.
    • Output: The final AI-generated response.
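
If you need different generation behavior, only the node's config changes. A minimal sketch, assuming the same generation node type and parameter names used above:

# Hypothetical variant: longer answers with fully deterministic decoding
- name: llm
  type: generation
  config:
    llm_model: gpt-4o-mini
    max_tokens: 1024   # allow longer answers
    temperature: 0.0   # remove sampling variability
  inputs:
    input_prompt: prompt.output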

Adding a Reranker to Improve Performance

A reranker improves RAG by refining retrieved chunks before they reach the LLM, increasing the relevance and accuracy of the context. While the retriever fetches multiple chunks based on similarity, not all of them are equally useful.

The reranker re-scores and reorders these chunks using a more precise model (e.g., a cross-encoder) to prioritize the most relevant ones. This reduces noise, prevents misleading answers, and optimizes token usage, leading to clearer, more factually grounded responses. It’s especially useful when retrieving from large or noisy knowledge bases where relevance is critical.

Agent Service supports a native reranker node, which calls a reranking model identified by the unique name of its encoder model deployment.

The YAML from above, with the reranker added, looks like this:

user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages

  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids

  - name: doc_reranker
    type: reranker
    config:
      num_to_return: 3
      scorers:
        - name: cross-encoder
          model: cross-encoder/ms-marco-MiniLM-L-12-v2
    inputs:
      query: get_last_message.output
      chunks: retrieve.output

  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: doc_reranker.output
      question: get_last_message.output

  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output
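
The doc_reranker node sits between retrieve and prompt: it takes the ten chunks returned by retrieve, re-scores each one against the user's query with the cross-encoder model, and passes only the top three (num_to_return: 3) onward. Note that the prompt node's context_chunks input now reads doc_reranker.output instead of retrieve.output.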

Next Steps

To further improve this workflow, you can:

  • Customize the Retrieval Strategy: Adjust num_to_return to fine-tune the number of retrieved documents.
  • Experiment with Different LLMs: Swap gpt-4o-mini for a larger model when answer quality matters more than latency and cost.
  • Enhance Context Handling: Modify the Jinja template to optimize how retrieved knowledge is incorporated.
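
As a concrete starting point, the sketch below combines the first two suggestions. The larger model name is illustrative only; use whatever model is deployed in your environment.

# Hypothetical variant nodes incorporating the tweaks above
- name: retrieve
  type: retriever
  config:
    num_to_return: 20   # cast a wider net; let the reranker filter to the best few
  inputs:
    query: get_last_message.output
    knowledge_base_ids: knowledge_base_ids

- name: llm
  type: generation
  config:
    llm_model: gpt-4o   # illustrative; any larger deployed model works
    max_tokens: 512
    temperature: 0.2
  inputs:
    input_prompt: prompt.output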