# RAG

> Build a chatbot that uses retrieval-augmented generation.

As a next step, we'll implement **Retrieval-Augmented Generation (RAG)** using Agent Service. The following YAML defines a workflow that retrieves relevant knowledge from a specified knowledge base and uses that information to generate an LLM-powered response.

## Overview

This RAG workflow enables an AI agent to:

1. Extract the latest user message.
2. Retrieve relevant context from a knowledge base.
3. Format the retrieved context into a structured prompt.
4. Generate a response using a large language model (LLM).

By leveraging a **retriever** node, the agent can ground its responses in factual, up-to-date information from the knowledge base, which reduces the risk of hallucinations.

## YAML Workflow

```yaml
user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages
  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids
  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: retrieve.output
      question: get_last_message.output
  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output
```

## Workflow Breakdown

| Step | Node Name          | Type          | Purpose                                                |
| ---- | ------------------ | ------------- | ------------------------------------------------------ |
| 1    | `get_last_message` | `get_message` | Extracts the most recent user message                  |
| 2    | `retrieve`         | `retriever`   | Searches knowledge bases for relevant information      |
| 3    | `prompt`           | `jinja`       | Formats retrieved context into a structured LLM prompt |
| 4    | `llm`              | `generation`  | Generates a response based on the retrieved knowledge  |

### 1. Extract Latest User Message

#### Node: `get_last_message`

* **Type:** `get_message`
* **Function:** Retrieves the last user message from the conversation history.
* **Configuration:**
  * `index: -1` ensures that the last message is extracted.
  * **Input:** `messages` (provided by `user_input`).
  * **Output:** The extracted last user message.

### 2. Retrieve Relevant Context

#### Node: `retrieve`

* **Type:** `retriever`
* **Function:** Searches the specified knowledge bases to find up to 10 relevant pieces of information.
* **Configuration:**
  * `num_to_return: 10` specifies the number of results to retrieve.
  * **Input:**
    * `query`: The last user message (`get_last_message.output`).
    * `knowledge_base_ids`: The set of knowledge bases to search (`knowledge_base_ids`, provided by `user_input`).
  * **Output:** A list of retrieved context chunks.

### 3. Format the Prompt

#### Node: `prompt`

* **Type:** `jinja`
* **Function:** Constructs a structured prompt for the LLM using retrieved knowledge.
* **Configuration:**
  * Uses a Jinja template to format the retrieved information into a structured message.
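  * **Template Logic:**
    * The `text` of every retrieved chunk is inserted into the prompt after `Context:`. The template itself contains no conditional.
    * The fixed instruction text tells the LLM **not** to answer when the context is not relevant, so that decision is left to the model.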
  * **Input:**
    * `context_chunks`: The retrieved information (`retrieve.output`).
    * `question`: The original user message (`get_last_message.output`).
  * **Output:** A well-structured prompt.
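
The template in the workflow above is written as a folded block scalar (`>`) with the `for` loop on a single line, so the chunk texts are joined by spaces. The variant below is a minimal sketch of a `prompt` node that puts each chunk on its own numbered line. It assumes the `jinja` node accepts standard Jinja2 features such as `loop.index` and whitespace control (`{%-`), which this page does not confirm.

```yaml
  - name: prompt
    type: jinja
    config:
      output_template:
        # "|" is a literal block scalar: line breaks are kept as written,
        # unlike ">" which folds single line breaks into spaces.
        jinja_template_str: |
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context:
          {%- for chunk in context_chunks %}
          [{{ loop.index }}] {{ chunk.text }}
          {%- endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: retrieve.output
      question: get_last_message.output
```

In standard Jinja2, the `{%-` markers strip the line break before each tag, so each chunk should render as one `[n] text` line directly under `Context:`. Numbered chunks also make it easier to ask the model to cite which chunk it used.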

### 4. Generate an AI Response

#### Node: `llm`

* **Type:** `generation`
* **Function:** Uses an LLM to generate a response based on the structured prompt.
* **Configuration:**
  * **Model:** `gpt-4o-mini`
  * **Max Tokens:** `512`
  * **Temperature:** `0.2` (low variability for more deterministic responses).
  * **Input:** `input_prompt` from `prompt.output`.
  * **Output:** The final AI-generated response.

## Adding a Reranker to Improve Performance

A reranker improves RAG by refining the retrieved chunks before they reach the LLM, so the prompt contains more relevant context. While the retriever fetches multiple chunks based on similarity, not all are equally useful.

The reranker re-scores and reorders these chunks using a more precise model (e.g., a cross-encoder) to prioritize the most relevant ones. This reduces noise, lowers the risk of misleading answers, and saves prompt tokens, leading to clearer, more factually grounded responses. It's especially useful when retrieving from large or noisy knowledge bases where relevance is critical.

Agent Service **supports a native `reranker` node**, which calls a reranking model identified by its unique name or by an encoder model deployment ID.

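With the reranker added, the YAML from above looks like this. The retriever still requests 10 chunks, and the `doc_reranker` node is configured to return 3 of them to the prompt: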

```yaml
user_input:
  messages:
    type: Messages
  knowledge_base_ids:
    type: KnowledgeBaseIds

workflow:
  - name: get_last_message
    type: get_message
    config:
      index: -1
    inputs:
      messages: messages

  - name: retrieve
    type: retriever
    config:
      num_to_return: 10
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids

  - name: doc_reranker
    type: reranker
    config:
      num_to_return: 3
      scorers:
        - name: cross-encoder
          model: cross-encoder/ms-marco-MiniLM-L-12-v2
    inputs:
      query: get_last_message.output
      chunks: retrieve.output

  - name: prompt
    type: jinja
    config:
      output_template:
        jinja_template_str: >
          Use the following pieces of context to answer the user's question. If
          there is no relevant context, don't answer the question and let the
          user know that the context provided cannot answer the question.

          Context: {% for chunk in context_chunks %} {{ chunk.text }} {% endfor %}

          Question: {{ question }}
    inputs:
      context_chunks: doc_reranker.output
      question: get_last_message.output

  - name: llm
    type: generation
    config:
      llm_model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output
```
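
Compared with the first workflow, only two things change: a `doc_reranker` node is added after `retrieve`, and the `prompt` node reads `doc_reranker.output` instead of `retrieve.output`. The fragment below isolates that difference. It uses only keys that already appear in the full YAML above and is not a complete workflow on its own.

```yaml
  # New node: re-scores the chunks from `retrieve` and returns 3 of them
  # (num_to_return), using the cross-encoder scorer named below.
  - name: doc_reranker
    type: reranker
    config:
      num_to_return: 3
      scorers:
        - name: cross-encoder
          model: cross-encoder/ms-marco-MiniLM-L-12-v2
    inputs:
      query: get_last_message.output
      chunks: retrieve.output

  # Existing node: only the `context_chunks` input changes.
  # The `config` block is unchanged and omitted here for brevity.
  - name: prompt
    type: jinja
    inputs:
      context_chunks: doc_reranker.output   # was: retrieve.output
      question: get_last_message.output
```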

## Next Steps

To improve this workflow further, you can:

* **Customize the Retrieval Strategy:** Adjust `num_to_return` to fine-tune the number of retrieved documents.
* **Experiment with Different LLMs:** Swap `gpt-4o-mini` for a larger model, which may improve answer quality at the cost of higher latency and price.
* **Enhance Context Handling:** Modify the Jinja template to optimize how retrieved knowledge is incorporated.
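
As a starting point for the first two suggestions, the fragment below shows the nodes that would change in the reranker workflow. The specific values (`num_to_return: 20` and `5`, and the model name `gpt-4o`) are illustrative assumptions rather than recommendations from this page. Check which models are available in your deployment before using them.

```yaml
  # Retrieve a wider candidate set (assumed value: 20 instead of 10).
  - name: retrieve
    type: retriever
    config:
      num_to_return: 20
    inputs:
      query: get_last_message.output
      knowledge_base_ids: knowledge_base_ids

  # Let the reranker narrow the candidates (assumed value: 5 instead of 3).
  - name: doc_reranker
    type: reranker
    config:
      num_to_return: 5
      scorers:
        - name: cross-encoder
          model: cross-encoder/ms-marco-MiniLM-L-12-v2
    inputs:
      query: get_last_message.output
      chunks: retrieve.output

  # Swap in a larger model (assumed name; availability depends on your setup).
  - name: llm
    type: generation
    config:
      llm_model: gpt-4o
      max_tokens: 512
      temperature: 0.2
    inputs:
      input_prompt: prompt.output
```

Retrieving more candidates and reranking down to a smaller set trades extra retrieval and scoring work for a better chance that the most relevant chunks reach the prompt.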
