As a next step, we’ll implement Retrieval-Augmented Generation (RAG) using Agent Service. The following YAML defines a workflow that retrieves relevant knowledge from a specified knowledge base and uses that information to generate an LLM-powered response.

Overview

This RAG workflow enables an AI agent to:

  1. Extract the latest user message.
  2. Retrieve relevant context from a knowledge base.
  3. Format the retrieved context into a structured prompt.
  4. Generate a response using a large language model (LLM).

By leveraging a retriever node, the agent grounds its responses in factual, up-to-date information, reducing the risk of hallucinated answers.

YAML Workflow

user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages
  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids
  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: retrieve.output
      question: get_last_message.output
  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output

Workflow Breakdown

Step | Node Name        | Type        | Purpose
-----|------------------|-------------|------------------------------------------------------
1    | get_last_message | get_message | Extracts the most recent user message
2    | retrieve         | retriever   | Searches knowledge bases for relevant information
3    | prompt           | jinja       | Formats retrieved context into a structured LLM prompt
4    | llm              | generation  | Generates a response based on the retrieved knowledge

1. Extract Latest User Message

Node: get_last_message

  • Type: get_message
  • Function: Retrieves the last user message from the conversation history.
  • Configuration:
    • index: -1 selects the last message in the list (negative indices count from the end).
    • Input: messages (provided by user_input).
    • Output: The extracted last user message (see the example below).
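
To make this concrete, here is a hypothetical messages payload and the value get_last_message.output would carry. The role/content message schema shown is an assumption for illustration, not taken from Agent Service's documentation.

# Hypothetical input payload; the role/content schema is assumed
messages:
  - role: user
    content: "Hi there"
  - role: assistant
    content: "Hello! How can I help?"
  - role: user
    content: "What is our refund policy?"

# With index: -1, get_last_message.output is the final message:
# "What is our refund policy?"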

2. Retrieve Relevant Context

Node: retrieve

  • Type: retriever
  • Function: Searches the specified knowledge bases to find up to 10 relevant pieces of information.
  • Configuration:
    • num_to_return: 10 specifies the number of results to retrieve.
    • Input:
      • query: The last user message (get_last_message.output).
      • knowledge_base_ids: The set of knowledge bases to search (user_input.knowledge_base_ids).
    • Output: A list of retrieved context chunks (a hypothetical sketch follows).
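
For illustration, retrieve.output might look like the sketch below. Only the text field is certain here, since the Jinja template in the next step reads chunk.text; the score field is an assumed detail.

# Hypothetical shape of retrieve.output (two of up to ten chunks shown)
- text: "Refunds are available within 30 days of purchase."
  score: 0.87   # assumed field; only `text` is confirmed by the template
- text: "Pro subscriptions renew annually at the then-current price."
  score: 0.61   # assumed field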

3. Format the Prompt

Node: prompt

  • Type: jinja
  • Function: Constructs a structured prompt for the LLM using retrieved knowledge.
  • Configuration:
    • Uses a Jinja template to format the retrieved information into a structured message.
    • Template Logic:
      • Every retrieved chunk is inserted into the Context section of the prompt.
      • The LLM is instructed to answer only from that context, and to tell the user when the provided context cannot answer the question.
    • Input:
      • context_chunks: The retrieved information (retrieve.output).
      • question: The original user message (get_last_message.output).
    • Output: A fully rendered prompt string (see the example below).
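
Given the hypothetical message and chunks above, the rendered prompt would read roughly as follows (YAML's > folded style joins single line breaks with spaces and keeps blank lines as paragraph breaks):

Use the following pieces of context to answer the user's question. If there is no relevant context, don't answer the question and let the user know that the context provided cannot answer the question.

Context: Refunds are available within 30 days of purchase. Pro subscriptions renew annually at the then-current price.

Question: What is our refund policy?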

4. Generate an AI Response

Node: llm

  • Type: generation
  • Function: Uses an LLM to generate a response based on the structured prompt.
  • Configuration:
    • Model: gpt-4o-mini
    • Max Tokens: 512
    • Temperature: 0.2 (low variability for more deterministic responses).
    • Input: input_prompt from prompt.output.
    • Output: The final AI-generated response.
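
If you need different generation behavior, only the node's config changes. A minimal sketch, assuming the same generation node type and parameter names used above:

# Hypothetical variant: longer answers with fully deterministic decoding
- name: llm
  type: generation
  config:
    llm_model: gpt-4o-mini
    max_tokens: 1024   # allow longer answers
    temperature: 0.0   # remove sampling variability
  inputs:
    input_prompt: prompt.output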

Adding a Reranker to Improve Performance

A reranker improves RAG by refining retrieved chunks before they reach the LLM, increasing the relevance and accuracy of the context. While the retriever fetches multiple chunks based on similarity, not all of them are equally useful.

The reranker re-scores and reorders these chunks using a more precise model (e.g., a cross-encoder) to prioritize the most relevant ones. This reduces noise, prevents misleading answers, and optimizes token usage, leading to clearer, more factually grounded responses. It’s especially useful when retrieving from large or noisy knowledge bases where relevance is critical.

Agent Service supports a native reranker node, which calls a reranking model identified by the unique name of its encoder model deployment.

The YAML from above, with the reranker added, looks like this:

user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages

  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids

  - name: doc_reranker
    type: reranker
    config:
      num_to_return: 3
      scorers:
        - name: cross-encoder
          model: cross-encoder/ms-marco-MiniLM-L-12-v2
    inputs:
      query: get_last_message.output
      chunks: retrieve.output

  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: doc_reranker.output
      question: get_last_message.output

  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output
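
The doc_reranker node sits between retrieve and prompt: it takes the ten chunks returned by retrieve, re-scores each one against the user's query with the cross-encoder model, and passes only the top three (num_to_return: 3) onward. Note that the prompt node's context_chunks input now reads doc_reranker.output instead of retrieve.output.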

Next Steps

To further improve this workflow, you can:

  • Customize the Retrieval Strategy: Adjust num_to_return to fine-tune the number of retrieved documents.
  • Experiment with Different LLMs: Swap gpt-4o-mini for a larger model when answer quality matters more than latency and cost.
  • Enhance Context Handling: Modify the Jinja template to optimize how retrieved knowledge is incorporated.
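
As a concrete starting point, the sketch below combines the first two suggestions. The larger model name is illustrative only; use whatever model is deployed in your environment.

# Hypothetical variant nodes incorporating the tweaks above
- name: retrieve
  type: retriever
  config:
    num_to_return: 20   # cast a wider net; let the reranker filter to the best few
  inputs:
    query: get_last_message.output
    knowledge_base_ids: knowledge_base_ids

- name: llm
  type: generation
  config:
    llm_model: gpt-4o   # illustrative; any larger deployed model works
    max_tokens: 512
    temperature: 0.2
  inputs:
    input_prompt: prompt.output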