Retrieval
Key Takeaway
The key takeaway of RAG is that, if set up correctly, customers can customize LLMs for their applications without modifying any of their existing data storage methods. By building RAG pipelines, which are analogous to traditional ETL pipelines, data can simply be maintained at the source, and LLM applications will inherit this knowledge automatically.
How to do retrieval in SGP
In this section we dive right into things. For more information about what retrieval is and how it works, scroll to the FAQ below.
SGP Knowledge Bases
The first step of retrieval is to load custom data into a vector database. Vector databases allow users to search unstructured data using natural language queries. In SGP, our Knowledge Base API manages the entire lifecycle of data ingestion into vector databases on behalf of users.
This means that a user only has to manage data at the source. Simply create a knowledge base to reflect the source and periodically upload data to it.
SGP will take care of all of the underlying challenges, so users don’t have to:
- Vector Index Creation and Management
- Optimize shard density and size for performance
- Automatically create new indexes when optimal index sizes are exceeded
- Multiple Data Source Integrations
- Supports Google Drive, S3, Sharepoint, Direct JSON Upload, and more.
- Smart File-Diff Uploads
- Delete artifacts deleted from source
- Re-index artifacts modified at source
- Do nothing for artifacts unchanged at source
- Worker parallelization
- Scale ingestion horizontally to maximize throughput
- SGP can ingest documents at ~500MB/hour using fewer than 100 worker nodes. This throughput can easily be increased by adding nodes for both the ingestion workers and the embedding model. Throughput can also be improved by optimizing the hardware the embedding model is hosted on.
- Autoscaling
- Autoscale ingestion workers to lower costs during dormancy and burst for spiky workloads
- Text extraction
- Automatically extract text from non-text documents, e.g. DOCX, PPTX, PDF
- Chunking
- Select from a list of SGP supported chunking strategies to automatically split data into chunks during ingestion
- Easily swap out chunking strategies just by varying a small API request payload
- Embedding
- Automatically embed each chunk of text into a vector for storage in the vector DB
- Create knowledge bases with different embedding models to test how different embedding models affect retrieval performance
To set up retrieval with SGP, first create a knowledge base. How best to organize data into separate knowledge bases depends on the use case and is still an area of active research. For demo purposes, we will simply create a single knowledge base and upload the contents of an S3 bucket to it.
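The snippet below is a minimal sketch of these two calls using Python's `requests` library. The endpoint paths, request fields (`embedding_model`, `data_source`, `chunking_strategy`), and the `SGP_BASE_URL` / `SGP_API_KEY` environment variables are illustrative assumptions rather than the exact SGP API surface; consult the API reference for the precise schema.

```python
import os
import requests

# Illustrative setup: base URL and API key for your SGP deployment.
BASE_URL = os.environ["SGP_BASE_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['SGP_API_KEY']}"}

# Create a knowledge base backed by a chosen embedding model.
kb = requests.post(
    f"{BASE_URL}/knowledge-bases",
    headers=HEADERS,
    json={"name": "demo-knowledge-base", "embedding_model": "openai/text-embedding-ada-002"},
).json()

# Start an upload job that ingests an S3 prefix using a chosen chunking strategy.
upload = requests.post(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/uploads",
    headers=HEADERS,
    json={
        "data_source": {"type": "s3", "bucket": "my-demo-bucket", "prefix": "docs/"},
        "chunking_strategy": {"type": "character", "chunk_size": 1000, "chunk_overlap": 200},
    },
).json()
```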
That’s it! Data is now being ingested into your knowledge base. To check the status of the upload, simply poll the upload status:
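(The polling loop below continues from the snippet above; the status values and response fields are assumed for illustration.)

```python
import time

# Poll the upload job until it reaches a terminal state.
while True:
    job = requests.get(
        f"{BASE_URL}/knowledge-bases/{kb['id']}/uploads/{upload['id']}",
        headers=HEADERS,
    ).json()
    if job["status"] in ("Completed", "Failed"):  # assumed status values
        break
    time.sleep(10)

print(job["status"])
```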
To analyze the contents of your knowledge base, you can list its artifacts.
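A sketch, continuing from the snippets above; the response shape is assumed for illustration.

```python
# List the artifacts (source files) tracked by the knowledge base.
artifacts = requests.get(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/artifacts",
    headers=HEADERS,
).json()["artifacts"]

for artifact in artifacts:
    print(artifact["artifact_id"], artifact["artifact_name"], artifact["status"])
```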
You can also analyze the text chunks that were extracted for a specific artifact to see if the text extraction and chunking worked properly.
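For example (continuing from above, with an assumed response shape):

```python
# Inspect the chunks extracted from a single artifact to spot-check
# text extraction and chunking quality.
artifact_id = artifacts[0]["artifact_id"]
detail = requests.get(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/artifacts/{artifact_id}",
    headers=HEADERS,
).json()

for chunk in detail["chunks"]:
    print(chunk["text"][:200])
    print("---")
```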
Note: It is recommended that users create a knowledge base with a small sample of data and investigate its contents before ingesting large amounts of data.
Querying your knowledge base
To query a knowledge base, simply submit a natural language query. Behind the scenes, the query is embedded using the same embedding model that was used to ingest data into the knowledge base. The query then performs a similarity search between the embedded query and the embedding vectors in the knowledge base. This API returns the top_k chunks of text most semantically similar to the query.
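A minimal query sketch, continuing from the setup above (the query endpoint and chunk fields are assumptions):

```python
# Submit a natural-language query; SGP embeds it with the same embedding
# model used at ingestion time and runs a similarity search.
results = requests.post(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/query",
    headers=HEADERS,
    json={"query": "What is our parental leave policy?", "top_k": 10},
).json()

for chunk in results["chunks"]:
    print(round(chunk["score"], 3), chunk["text"][:120])
```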
Optimizing Retrieval Accuracy
At Scale, we have observed that relying on the naive query results of a vector database alone does not yield good retrieval accuracy. One of the most powerful ways to improve retrieval accuracy is to re-rank the chunks returned from an initial query.
The easiest way to improve performance is to perform a high-recall query on a knowledge base (set top_k to a large number) and then use a cross-encoder model to re-rank the chunks before choosing the top-scoring chunks to append to the user query for the final LLM prompt.
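The sketch below illustrates this pattern, continuing from the setup above; the reranking endpoint and payload are hypothetical stand-ins for whichever cross-encoder reranker your deployment exposes.

```python
user_query = "What is our parental leave policy?"

# Step 1: high-recall query -- cast a wide net with a large top_k.
candidates = requests.post(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/query",
    headers=HEADERS,
    json={"query": user_query, "top_k": 100},
).json()["chunks"]

# Step 2: re-rank the candidates with a cross-encoder and keep the best few.
reranked = requests.post(
    f"{BASE_URL}/models/cross-encoder-reranker/rerank",  # hypothetical endpoint
    headers=HEADERS,
    json={"query": user_query, "chunks": [c["text"] for c in candidates], "top_k": 5},
).json()["chunks"]
```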
Users can see how the reranking step improved retrieval accuracy by comparing the initial top chunks with the reranked top chunks:
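(Continuing from the reranking sketch above; the chunk fields are assumed.)

```python
# Print the naive top chunks next to the reranked top chunks.
for naive, better in zip(candidates[:5], reranked[:5]):
    print("initial :", naive["text"][:80])
    print("reranked:", better["text"][:80])
    print("---")
```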
Re-ranking can affect the latency of the end application, so there are a variety of techniques to bring this latency down. Please talk to your Scale representative for recommended techniques on how to maximize accuracy and reduce latency.
Prompt Engineering and RAG
Now that users have high-quality chunks, it is time to augment the initial user query with the retrieved information. This step is more experimental, and prompts can vary depending on the use case, so it is important for users to try multiple prompts.
Users can also consult their Scale representative for advice on how to craft an effective prompt.
A standard prompt engineering function may look like the following:
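(The template below is an illustrative example only; the instructions and formatting are assumptions, not a prescribed SGP prompt.)

```python
def build_rag_prompt(user_query: str, chunks: list[str]) -> str:
    """Combine the retrieved chunks with the user query into a single LLM prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n"
        "Answer:"
    )
```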
Lastly, we feed the augmented prompt to an LLM using our completions API. Here we are using GPT-3.5 Turbo, but we can swap this model out for any other LLM that SGP supports.
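A sketch of that final call, continuing from the snippets above (the completions endpoint path, model name string, and response shape are assumptions):

```python
# Build the augmented prompt from the reranked chunks and send it to the LLM.
prompt = build_rag_prompt(user_query, [c["text"] for c in reranked])

completion = requests.post(
    f"{BASE_URL}/completions",
    headers=HEADERS,
    json={"model": "gpt-3.5-turbo", "prompt": prompt},
).json()

print(completion["completion"]["text"])  # assumed response shape
```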
Improve Response Accuracy
As you can see, there are many steps in the retrieval process. Let’s review them:
- Choose an embedding model*
- Create a knowledge base using that embedding model
- Choose a chunking strategy*
- Ingest data into the knowledge base directly or from a supported data source using the chosen embedding model and chunking strategy
- Choose a reranking model*
- Rerank chunks queried from the knowledge base using the chosen reranking model
- Choose an LLM*
- Engineer a retrieval-augmented prompt*
- Generate a response using the chosen LLM
After setting up this pipeline, the next obvious step is to improve the AI's response accuracy. Here, SGP demonstrates one of its most powerful features: every step denoted with an asterisk can be swapped out for quick experimentation without modifying the rest of the RAG pipeline.
Below, we discuss several scenarios and how to optimize some of these steps.
Note: Currently, if model finetuning is needed, customers are encouraged to engage with their Scale representatives to decide what kind of data to collect. Scale representatives will finetune the models on this data, then deploy them to the customer’s SGP platform. To use the finetuned model, the user simply has to swap the model name for the finetuned model name. In the near future, finetuning will be supported as a self-service feature via the SGP API.
Modify the Chunking Strategy
Static chunking is currently the simplest way to split up unstructured data. However, more intelligent chunking may be needed for specific use cases and data types. SGP allows users to easily swap out chunking strategies for other supported strategies. Swapping out the chunking strategy in an upload job to an existing knowledge base will re-index all artifacts from the data source using the new chunking strategy.
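For example, re-running the upload with a different `chunking_strategy` payload (a sketch continuing from the earlier setup; the field names are illustrative):

```python
# Same knowledge base and data source; only the chunking strategy changes.
upload = requests.post(
    f"{BASE_URL}/knowledge-bases/{kb['id']}/uploads",
    headers=HEADERS,
    json={
        "data_source": {"type": "s3", "bucket": "my-demo-bucket", "prefix": "docs/"},
        "chunking_strategy": {"type": "recursive", "chunk_size": 500, "chunk_overlap": 100},
    },
).json()
```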
SGP will continue to add supported chunking strategies based on demand.
Finetune or Swap the Reranking Model
Scale has found internally that reranking chunks is one of the most effective ways to improve retrieval accuracy. By simply replacing the base re-ranking model with a finetuned or alternative version of the model, users can get easy gains in retrieval accuracy.
Here are some signs that indicate that the reranking model should be finetuned:
- The query can be easily matched by a human to chunks that are available in the knowledge base
- The correct chunks to retrieve are available in the knowledge base, but the reranker scores other chunks higher.
SGP will continue to add supported ranking strategies and models based on demand.
Finetune or Swap the Embedding Model
The initial embedding model chosen may be insufficient to understand the semantics of use-case-specific queries. For example, if a lawyer asks the question “What is carry?” of a 100-page LPA document, it is extremely unlikely that any off-the-shelf embedding model encodes each chunk of the document with enough information for the user query to semantically match the correct chunks.
Here are some signs that indicate that the embedding model should be finetuned:
- The correct chunks exist in the knowledge base, but the query cannot be easily matched by a human to these chunks
- The correct chunks do not consistently appear in the initial high-recall knowledge base query, so the reranker does not consistently see the correct chunks.
- The user wants to lower query latency for the application, so the recall size on the initial knowledge base query needs to be lowered, but the correct chunks still need to consistently appear in this smaller recall.
Finetune or Swap the Large Language Model
There are situations where the synthesis of a final AI response is still not accurate even when the prompts contain enough data to respond to the user query. These problems can be fixed by finetuning the LLM that synthesizes the final response.
Here are some indications that the LLM should be finetuned:
- Even when a prompt contains sufficient information to respond to the query, the LLM says it cannot generate a proper response.
- The LLM generates a nonsensical answer instead of saying it cannot respond to the user query.
- Security and responsible AI vulnerabilities have been uncovered by red-teaming and the LLM needs to be finetuned to not respond to malicious user queries.
Iterative Prompt Engineering
Prompt engineering happens entirely client side, so users have maximum flexibility to modify and test various prompts as needed.
FAQ
Appendix
End to End Code Snippet
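The consolidated sketch below strings the steps in this section together. As with the snippets above, all endpoint paths, field names, and response shapes are illustrative assumptions rather than the exact SGP API; treat it as a template to adapt against the API reference.

```python
"""End-to-end RAG sketch against an SGP-style HTTP API (illustrative only)."""
import os
import time
import requests

BASE_URL = os.environ["SGP_BASE_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['SGP_API_KEY']}"}


def create_knowledge_base(name: str, embedding_model: str) -> str:
    """Create a knowledge base backed by the chosen embedding model."""
    resp = requests.post(
        f"{BASE_URL}/knowledge-bases",
        headers=HEADERS,
        json={"name": name, "embedding_model": embedding_model},
    )
    return resp.json()["id"]


def ingest_s3(kb_id: str, bucket: str, prefix: str) -> str:
    """Start an upload job from an S3 prefix with a chosen chunking strategy."""
    resp = requests.post(
        f"{BASE_URL}/knowledge-bases/{kb_id}/uploads",
        headers=HEADERS,
        json={
            "data_source": {"type": "s3", "bucket": bucket, "prefix": prefix},
            "chunking_strategy": {"type": "character", "chunk_size": 1000, "chunk_overlap": 200},
        },
    )
    return resp.json()["id"]


def wait_for_upload(kb_id: str, upload_id: str) -> None:
    """Poll the upload job until it reaches a terminal state."""
    while True:
        job = requests.get(
            f"{BASE_URL}/knowledge-bases/{kb_id}/uploads/{upload_id}", headers=HEADERS
        ).json()
        if job["status"] in ("Completed", "Failed"):  # assumed status values
            return
        time.sleep(10)


def retrieve(kb_id: str, query: str, recall_k: int = 100, final_k: int = 5) -> list[str]:
    """High-recall similarity search followed by cross-encoder reranking."""
    candidates = requests.post(
        f"{BASE_URL}/knowledge-bases/{kb_id}/query",
        headers=HEADERS,
        json={"query": query, "top_k": recall_k},
    ).json()["chunks"]
    reranked = requests.post(
        f"{BASE_URL}/models/cross-encoder-reranker/rerank",  # hypothetical endpoint
        headers=HEADERS,
        json={"query": query, "chunks": [c["text"] for c in candidates], "top_k": final_k},
    ).json()["chunks"]
    return [c["text"] for c in reranked]


def build_rag_prompt(query: str, chunks: list[str]) -> str:
    """Augment the user query with the retrieved chunks."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say that you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def complete(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Generate the final response with the chosen LLM via the completions API."""
    resp = requests.post(
        f"{BASE_URL}/completions", headers=HEADERS, json={"model": model, "prompt": prompt}
    )
    return resp.json()["completion"]["text"]  # assumed response shape


if __name__ == "__main__":
    kb_id = create_knowledge_base("demo-knowledge-base", "openai/text-embedding-ada-002")
    wait_for_upload(kb_id, ingest_s3(kb_id, "my-demo-bucket", "docs/"))
    question = "What is our parental leave policy?"
    answer = complete(build_rag_prompt(question, retrieve(kb_id, question)))
    print(answer)
```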