Key Takeaway
The key takeaway of RAG is that, if set up correctly, customers can customize LLMs for their applications without modifying any of their traditional data storage methods. By simply building RAG pipelines, which are analogous to traditional ETL pipelines, data can be maintained at the source, and LLM applications will inherit this knowledge automatically.
How to do retrieval in SGP
In this section we dive right into things. For more information about what retrieval is and how it works, scroll to the FAQ below.
SGP Knowledge Bases
The first step of retrieval is to load custom data into a vector database. Vector databases allow users to search unstructured data using natural language queries. In SGP, our Knowledge Base API manages the entire lifecycle of data ingestion into vector databases on behalf of users. This means that a user only has to manage data at the source. Simply create a knowledge base to reflect the source and periodically upload data to it. SGP will take care of all of the underlying challenges, so users don't have to:
- Vector Index Creation and Management
- Optimize shard density and size for performance
- Automatically create new indexes when optimal index sizes are exceeded
- Multiple Data Source Integrations
- Supports Google Drive, S3, SharePoint, Direct JSON Upload, and more.
- Smart File-Diff Uploads
- Delete artifacts deleted from source
- Re-index artifacts modified at source
- Do nothing for artifacts unchanged at source
- Worker parallelization
- Scale ingestion horizontally to maximize throughput
- SGP can ingest documents at ~500MB/hour using fewer than 100 worker nodes. This throughput can easily be increased by raising the number of nodes for both the ingestion workers and the embedding model. Throughput can also be improved by optimizing the hardware the embedding model is hosted on.
- Autoscaling
- Autoscale ingestion workers to lower costs during dormancy and burst for spiky workloads
- Text extraction
- Automatically extract text from non-text documents, e.g., DOCX, PPTX, and PDF
- Chunking
- Select from a list of SGP supported chunking strategies to automatically split data into chunks during ingestion
- Easily swap out chunking strategies just by varying a small API request payload
- Embedding
- Automatically embed each chunk of text into a vector for storage in the vector DB
- Create knowledge bases with different embedding models to test how different embedding models affect retrieval performance
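To make the chunking and embedding steps above concrete, here is a minimal conceptual sketch of what an ingestion pipeline does under the hood. It does not use SGP's Knowledge Base API: the open-source sentence-transformers library stands in for the configured embedding model, and the chunk size, overlap, and model name are arbitrary assumptions.

```python
# Conceptual illustration only (not SGP code): the chunk -> embed steps that the
# Knowledge Base API performs automatically during ingestion.
from sentence_transformers import SentenceTransformer

EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # assumed stand-in for the configured embedding model

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """A simple fixed-size character chunking strategy with overlap (assumed strategy)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

model = SentenceTransformer(EMBEDDING_MODEL)
documents = ["<contents of artifact 1>", "<contents of artifact 2>"]

# Chunk every artifact, then embed each chunk into a vector for storage in a vector DB.
all_chunks = [c for doc in documents for c in chunk_text(doc)]
vectors = model.encode(all_chunks, normalize_embeddings=True)  # shape: (num_chunks, dim)
```

In practice, the Knowledge Base API performs these steps, plus index management, file-diffing, and scaling, automatically during ingestion.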
Querying a knowledge base returns the top_k chunks of text most semantically similar to the query.
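As a rough illustration of that query step (reusing model, all_chunks, and vectors from the ingestion sketch above, not SGP's API), a semantic top_k search can be approximated with cosine similarity:

```python
# Conceptual illustration only: approximates a knowledge base query by reusing
# `model`, `all_chunks`, and `vectors` from the ingestion sketch above.
import numpy as np

def query_knowledge_base(query: str, top_k: int = 5) -> list[str]:
    """Return the top_k chunks most semantically similar to the query."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ query_vec          # cosine similarity (vectors are normalized)
    top_idx = np.argsort(-scores)[:top_k]
    return [all_chunks[i] for i in top_idx]

chunks = query_knowledge_base("How do we think Company X will perform this quarter?", top_k=5)
```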
A common pattern is to first query the knowledge base with high recall (by setting top_k to a large number) and then use a cross-encoder model to re-rank the chunks before choosing the top-scoring chunks to append to the user query for the final LLM prompt.
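Continuing the sketch above, here is a hedged illustration of that high-recall-then-rerank pattern using an open-source cross-encoder; the model name and top_k values are assumptions, not SGP defaults.

```python
# Conceptual illustration only: high-recall query followed by cross-encoder re-ranking.
from sentence_transformers import CrossEncoder

query = "How do we think Company X will perform this quarter?"
candidates = query_knowledge_base(query, top_k=100)   # high-recall first pass

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranker choice
scores = reranker.predict([(query, chunk) for chunk in candidates])

# Keep only the top-scoring chunks to append to the user query in the final LLM prompt.
top_chunks = [chunk for _, chunk in sorted(zip(scores, candidates), reverse=True)[:5]]
```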
- Choose an embedding model*
- Create a knowledge base using that embedding model
- Choose a chunking strategy*
- Ingest data into the knowledge base directly or from a supported data source using the chosen embedding model and chunking strategy
- Choose a reranking model*
- Rerank chunks queried from the knowledge base using the chosen reranking model
- Choose an LLM*
- Engineer a retrieval-augmented prompt*
- Generate a response using the chosen LLM
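As a sketch of the last two steps above, here is one way to engineer a retrieval-augmented prompt and generate a response. The OpenAI client is used only as a stand-in for whichever LLM is chosen; the prompt template, model name, and placeholder chunks are assumptions.

```python
# Conceptual illustration only: build a retrieval-augmented prompt and call an LLM.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_rag_prompt(user_query: str, chunks: list[str]) -> str:
    """Append retrieved chunks as additional context below the user query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Instructions: Please answer the following user query using the "
        "additional context appended below.\n"
        f"User Query: {user_query}\n"
        f"Additional Context Retrieved from internal sources:\n{context}"
    )

# top_chunks would come from the knowledge base query + reranking steps above.
user_query = "How do we think Company X will perform this quarter?"
top_chunks = ["Expected 2023 stock growth: +10%, Actual 2023 stock growth: +20%"]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; substitute the chosen LLM
    messages=[{"role": "user", "content": build_rag_prompt(user_query, top_chunks)}],
)
print(response.choices[0].message.content)
```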
- The query can be easily matched by a human to chunks that are available in the knowledge base
- The correct chunks to retrieve are available in the knowledge base, but the reranker scores other chunks higher.
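In scenarios like these, one option is to fine-tune the reranking model on labeled (query, chunk) pairs. Below is a minimal sketch using the open-source sentence-transformers CrossEncoder as a stand-in for the chosen reranker; the model name, labels, and training settings are assumptions.

```python
# Conceptual illustration only: fine-tune an open-source cross-encoder reranker
# on labeled (query, chunk) pairs.
from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

train_samples = [
    InputExample(texts=["How will Company X perform?", "Expected 2023 stock growth: +10% ..."], label=1.0),
    InputExample(texts=["How will Company X perform?", "Unrelated HR policy document ..."], label=0.0),
]
loader = DataLoader(train_samples, shuffle=True, batch_size=16)

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
reranker.fit(train_dataloader=loader, epochs=1, warmup_steps=10)
reranker.save("finetuned-reranker")
```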
- The correct chunks exist in the knowledge base, but the query cannot be easily matched by a human to these chunks
- The correct chunks do not consistently appear in the initial high-recall knowledge base query, so the reranker does not consistently see the correct chunks.
- The user wants to lower query latency for the application, so the recall size on the initial knowledge base query needs to be lowered, but the correct chunks still need to consistently appear in this smaller recall.
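In scenarios like these, fine-tuning the embedding model on (query, relevant chunk) pairs can help the correct chunks appear in a smaller initial recall. A minimal sketch using a sentence-transformers bi-encoder as a stand-in; the model name and training settings are assumptions.

```python
# Conceptual illustration only: fine-tune an open-source bi-encoder embedding model
# on (query, relevant chunk) pairs using MultipleNegativesRankingLoss.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

train_samples = [
    InputExample(texts=["How will Company X perform?", "Expected 2023 stock growth: +10% ..."]),
    InputExample(texts=["What restricts Product Y exports?", "Restrictions on international exports of Product Y ..."]),
]
loader = DataLoader(train_samples, shuffle=True, batch_size=16)

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed base embedding model
loss = losses.MultipleNegativesRankingLoss(embedder)
embedder.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
embedder.save("finetuned-embedder")
```

A knowledge base created with the fine-tuned embedding model can then be compared against the original to measure the effect on retrieval performance.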
- Even when a prompt contains sufficient information to respond to the query, the LLM says it cannot generate a proper response.
- The LLM generates a nonsensical answer instead of saying it cannot respond to the user query.
- Security and responsible AI vulnerabilities have been uncovered by red-teaming and the LLM needs to be finetuned to not respond to malicious user queries.
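In scenarios like these, the LLM itself can be fine-tuned on retrieval-augmented prompts paired with the desired behavior (grounded answers, honest refusals, and refusals of malicious queries). A minimal sketch using the OpenAI fine-tuning API as a stand-in for whichever finetuning workflow is used; the file format, model name, and examples are assumptions.

```python
# Conceptual illustration only: submit a supervised finetuning job from a JSONL
# file of prompt/response pairs.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Each example pairs a retrieval-augmented prompt with the desired behavior,
# e.g. answering when the context is sufficient and refusing malicious queries.
examples = [
    {"messages": [
        {"role": "user", "content": "<retrieval-augmented prompt>"},
        {"role": "assistant", "content": "<desired grounded answer or refusal>"},
    ]},
]
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

upload = client.files.create(file=open("finetune_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=upload.id, model="gpt-3.5-turbo")
print(job.id)
```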
FAQ
What is retrieval and why is it important for enterprise?
Large Language Models have inspired an exciting new wave of AI applications. However, incorporating LLM capabilities into enterprise applications is more challenging. It takes millions of dollars and very specific expertise to build a foundation model. So, how can companies get ChatGPT-like functionality running on their enterprise data?
Currently, one of the most effective ways to do this is through Retrieval Augmented Generation (RAG). RAG is a technique where users retrieve custom information from a data source and append this information to their LLM prompt. For example, if a user asks:
How do we think Company X will perform this quarter?
a RAG application would search various data sources (articles, databases, etc.) for information about Company X and create a prompt such as the following:
Instructions: Please answer the following user query using the additional context appended below.
User Query: How do we think Company X will perform this quarter?
Additional Context Retrieved from internal sources:
Expected 2023 stock growth: +10%, Actual 2023 stock growth: +20% Source: internal_database
Restrictions on international exports of Product Y are expected to slow down Company X’s growth. Source: external_news/international_exports_restricted_on_product_y.pdf Publication Timestamp: 1 week ago
Author Z recommends that investors hold current assets, but temper expectations and slow down stock purchases. Source: internal_article_source/recommendations/author_z.pdf Publication Timestamp: 3 minutes ago
For the original user prompt, an off-the-shelf LLM would produce the following response:
I’m sorry, I could not find any information about how Company X will perform this quarter. As an AI model I am unable to make future predictions.
For the Retrieval Augmented prompt, the same off-the-shelf LLM would now be able to generate a reasonable response:
According to data retrieved from internal sources, Company X has doubled its expected stock growth so far in 2023; however, growth is expected to slow down due to restrictions imposed last week on exports of Product Y. Author Z recommends that investors hold current assets and not execute additional stock purchases.
LLMs have a unique capability to do what is called “in-context learning”. This means that even if an LLM wasn’t trained on a specific piece of data, if it is given that data live in the prompt, it is generally capable of interpreting that data and using it to answer a user query.
Appendix
End to End Code Snippet
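A minimal, self-contained sketch of the full pipeline described above, using open-source stand-ins (sentence-transformers for embedding and reranking, the OpenAI client for generation) rather than SGP's Knowledge Base APIs; the model names, chunk sizes, and top_k values are assumptions.

```python
# Illustrative end-to-end RAG sketch using open-source stand-ins, not SGP's APIs.
from sentence_transformers import SentenceTransformer, CrossEncoder
from openai import OpenAI
import numpy as np

# 1. Choose an embedding model and "ingest" documents: chunk, embed, index in memory.
embedder = SentenceTransformer("all-MiniLM-L6-v2")               # assumed embedding model
documents = ["<artifact 1 text>", "<artifact 2 text>"]
chunks = [doc[i:i + 500] for doc in documents for i in range(0, len(doc), 450)]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# 2. Query the "knowledge base" with high recall.
user_query = "How do we think Company X will perform this quarter?"
query_vector = embedder.encode([user_query], normalize_embeddings=True)[0]
recall_idx = np.argsort(-(chunk_vectors @ query_vector))[:50]
candidates = [chunks[i] for i in recall_idx]

# 3. Re-rank candidates with a cross-encoder and keep the top-scoring chunks.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranker
scores = reranker.predict([(user_query, c) for c in candidates])
top_chunks = [c for _, c in sorted(zip(scores, candidates), reverse=True)[:5]]

# 4. Engineer a retrieval-augmented prompt and generate a response with an LLM.
prompt = (
    "Instructions: Please answer the following user query using the additional "
    "context appended below.\n"
    f"User Query: {user_query}\n"
    "Additional Context Retrieved from internal sources:\n"
    + "\n".join(f"- {c}" for c in top_chunks)
)
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",                                         # assumed LLM choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```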