RAG
Build a chatbot for retrieval-augmented generation.
As a next step, we’ll implement Retrieval-Augmented Generation (RAG) using Agent Service. The following YAML defines a workflow that retrieves relevant knowledge from a specified knowledge base and uses that information to generate an LLM-powered response.
Overview
This RAG workflow enables an AI agent to:
- Extract the latest user message.
- Retrieve relevant context from a knowledge base.
- Format the retrieved context into a structured prompt.
- Generate a response using a large language model (LLM).
By leveraging a retriever node, the agent can enhance its responses with factual, up-to-date information while reducing the risk of hallucinations.
YAML Workflow
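A minimal sketch of the workflow definition is shown below. The node names, types, and configuration values are taken from the breakdown that follows; the surrounding schema (the `nodes` list and the `config` and `inputs` keys) is an assumption about how Agent Service declares workflows, not a confirmed format.

```yaml
# Sketch only: field names such as `nodes`, `config`, and `inputs` are
# assumptions; node names, types, and values match the breakdown below.
nodes:
  - name: get_last_message
    type: get_message
    config:
      index: -1                    # -1 selects the most recent message
    inputs:
      messages: user_input.messages

  - name: retrieve
    type: retriever
    config:
      num_to_return: 10            # number of context chunks to fetch
    inputs:
      query: get_last_message.output
      knowledge_base_ids: user_input.knowledge_base_ids

  - name: prompt
    type: jinja
    config:
      template: "..."              # full Jinja template shown under Step 3
    inputs:
      context_chunks: retrieve.output
      question: get_last_message.output

  - name: llm
    type: generation
    config:
      model: gpt-4o-mini
      max_tokens: 512
      temperature: 0.2             # low temperature for more deterministic output
    inputs:
      input_prompt: prompt.output
```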
Workflow Breakdown
| Step | Node Name | Type | Purpose |
|---|---|---|---|
| 1 | `get_last_message` | `get_message` | Extracts the most recent user message |
| 2 | `retrieve` | `retriever` | Searches knowledge bases for relevant information |
| 3 | `prompt` | `jinja` | Formats retrieved context into a structured LLM prompt |
| 4 | `llm` | `generation` | Generates a response based on the retrieved knowledge |
1. Extract Latest User Message
Node: get_last_message
- Type: `get_message`
- Function: Retrieves the last user message from the conversation history.
- Configuration: `index: -1` ensures that the last message is extracted.
- Input: `messages` (provided by `user_input`).
- Output: The extracted last user message.
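In the assumed schema from the sketch above, this node would look like:

```yaml
- name: get_last_message
  type: get_message
  config:
    index: -1                      # -1 selects the last element of the list
  inputs:
    messages: user_input.messages  # conversation history from the user input
```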
2. Retrieve Relevant Context
Node: retrieve
- Type: `retriever`
- Function: Searches the specified knowledge bases to find up to 10 relevant pieces of information.
- Configuration: `num_to_return: 10` specifies the number of results to retrieve.
- Input:
  - `query`: The last user message (`get_last_message.output`).
  - `knowledge_base_ids`: The set of knowledge bases to search (`user_input.knowledge_base_ids`).
- Output: A list of retrieved context chunks.
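The corresponding node in the sketch (field names assumed):

```yaml
- name: retrieve
  type: retriever
  config:
    num_to_return: 10              # upper bound on retrieved chunks
  inputs:
    query: get_last_message.output
    knowledge_base_ids: user_input.knowledge_base_ids
```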
3. Format the Prompt
Node: prompt
- Type: `jinja`
- Function: Constructs a structured prompt for the LLM using retrieved knowledge.
- Configuration: Uses a Jinja template to format the retrieved information into a structured message.
- Template Logic:
  - If relevant context exists, it is included in the prompt.
  - If no relevant context is found, the LLM is instructed not to answer.
- Input:
  - `context_chunks`: The retrieved information (`retrieve.output`).
  - `question`: The original user message (`get_last_message.output`).
- Output: A well-structured prompt.
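A possible template for this node is sketched below; the logic follows the description above, but the exact instruction wording is illustrative rather than the original:

```yaml
- name: prompt
  type: jinja
  config:
    template: |
      {% if context_chunks %}
      Answer the question using only the context below.

      Context:
      {% for chunk in context_chunks %}
      - {{ chunk }}
      {% endfor %}

      Question: {{ question }}
      {% else %}
      No relevant context was found. Do not answer the question.
      Tell the user you have no information about: {{ question }}
      {% endif %}
  inputs:
    context_chunks: retrieve.output
    question: get_last_message.output
```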
4. Generate an AI Response
Node: llm
- Type: `generation`
- Function: Uses an LLM to generate a response based on the structured prompt.
- Configuration:
  - Model: `gpt-4o-mini`
  - Max Tokens: `512`
  - Temperature: `0.2` (low variability for more deterministic responses)
- Input: `input_prompt` from `prompt.output`.
- Output: The final AI-generated response.
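And the final node from the sketch:

```yaml
- name: llm
  type: generation
  config:
    model: gpt-4o-mini
    max_tokens: 512
    temperature: 0.2               # low variability, more deterministic output
  inputs:
    input_prompt: prompt.output
```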
Adding a Reranker to Improve Performance
A reranker improves RAG by refining retrieved documents before they reach the LLM, ensuring higher relevance and accuracy. While the retriever fetches multiple chunks based on similarity, not all are equally useful.
The reranker re-scores and reorders these chunks using a more precise model (e.g., a cross-encoder) to prioritize the most relevant ones. This reduces noise, prevents misleading answers, and optimizes token usage, leading to clearer, more factually grounded responses. It’s especially useful when retrieving from large or noisy knowledge bases where relevance is critical.
Agent Service supports a native `reranker` node, which calls a reranking model by its unique name (the encoder model deployment ID).
The YAML from above, with the reranker added, looks like this:
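Only the affected portion is sketched here; the `reranker` node type, its `encoder_model_deployment_id` field, and the reduced `num_to_return` cutoff are assumptions rather than confirmed schema. The remaining nodes are unchanged, except that `prompt` now reads from the reranker instead of the retriever:

```yaml
  - name: rerank
    type: reranker
    config:
      encoder_model_deployment_id: "<reranker-deployment-id>"  # placeholder
      num_to_return: 5             # keep only the top-scoring chunks
    inputs:
      query: get_last_message.output
      documents: retrieve.output

  - name: prompt
    type: jinja
    config:
      template: "..."              # same Jinja template as before
    inputs:
      context_chunks: rerank.output   # was retrieve.output
      question: get_last_message.output
```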
Next Steps
Further improvements to this workflow might include:
- Customize the Retrieval Strategy: Adjust `num_to_return` to fine-tune the number of retrieved documents.
- Experiment with Different LLMs: Swap `gpt-4o-mini` for a larger model for improved performance.
- Enhance Context Handling: Modify the Jinja template to optimize how retrieved knowledge is incorporated.