A retrieval-augmented generation (RAG) pipeline is a multi-step architecture that processes, stores, and retrieves a specific dataset, which is then used with large language models (LLMs) to make them more accurate and to include custom data in responses. Many companies find RAG pipelines useful for keeping the data in their generative AI applications current and proprietary while also improving the quality of responses. Common applications that leverage RAG pipelines include chatbots, customer Q&A tools, and applications that rely on customer-specific information for success.
LLMs are revolutionary because they use massive amounts of data to create their models and can respond to many different queries. However, an LLM application can only work with the training data used to create the model. If you need to add new or custom data, you must retrain the model or adjust it through a process called fine-tuning. RAG applications let you keep your already-trained LLM and, as the name says, augment its response generation with retrieved data.
Why Are RAG Pipelines a Popular Technology?
Although LLMs are quickly proving themselves to be a disruptive technology across industries, they are not perfect. LLM applications work by looking at a large collection of text and predicting the next most likely word in a text string or line of software code. Sometimes, the algorithm predicts wrong. Those wrong guesses, euphemistically referred to as hallucinations, occur between 2.5% and 8.5% of the time on popular LLMs.
Another problem with LLMs is that training them is time-consuming. When new information comes along, it doesn't make sense to retrain or even fine-tune until you have a large amount of new data. Also, if you include proprietary data, say, customer information or your own internal tech support data, in your model, everyone with access to the model now has a view into your data. The same is true for domain-specific data.
On the other hand, if you train an AI system on a small set of new, domain-specific, or proprietary data, you will miss out on the massive benefits of an LLM, including natural language processing (NLP), fluent grammar, and broad context. RAG systems allow you to include new data sources for accuracy and contextualization while keeping the advantages of an LLM. In a way, a RAG pipeline automates prompt engineering.
Another appealing aspect of RAG pipelines is how data is ingested into the system. You don't need to summarize or translate the source data first. RAG pipelines read documents, SQL databases, and websites and convert the data into a useful format. With a few lines of Python, an application can read and process data in a multitude of source formats.
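For example, here is a minimal sketch of reading mixed text and PDF files from a directory. The directory name is a placeholder, and pypdf is just one of many PDF libraries you could use:

```python
from pathlib import Path

from pypdf import PdfReader  # pip install pypdf


def load_documents(source_dir: str) -> list[str]:
    """Read every .txt and .pdf file in a directory into plain text."""
    texts = []
    for path in Path(source_dir).iterdir():
        if path.suffix == ".txt":
            texts.append(path.read_text(encoding="utf-8"))
        elif path.suffix == ".pdf":
            reader = PdfReader(str(path))
            texts.append("\n".join(page.extract_text() or "" for page in reader.pages))
    return texts


docs = load_documents("./source_docs")  # placeholder directory
```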
In summary, RAG pipelines are popular for the following reasons:
- Information sources can be easily ingested into the knowledge base
- New information can be added without retraining the whole LLM
- Confidential or proprietary information can be included without baking it into the model itself
- Accuracy improves because hallucinations are less likely
How Does a RAG Pipeline Work?
The basic structure of a RAG pipeline consists of steps to pull in and store data, followed by steps that turn a user query into optimized and contextualized input for the LLM. Every RAG system uses different steps and tools, but most fall into the following five stages:
1. Loading
This is the data ingestion step, where a variety of software tools pull data from source documents, databases, websites, or other programs through APIs. There are hundreds of existing open source tools for loading. Common ones include routines to load PDFs, scrape web pages, or use SQL to query data from relational databases.
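As a rough illustration, here is what loading from a web page and a relational database might look like. The URL, database path, and table schema are placeholders, and the requests and beautifulsoup4 packages are one common choice among many:

```python
import sqlite3

import requests  # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4


def load_web_page(url: str) -> str:
    """Scrape a page and strip the markup down to readable text."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)


def load_support_tickets(db_path: str) -> list[str]:
    """Pull rows from a relational source with plain SQL."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute("SELECT subject, body FROM tickets").fetchall()
    return [f"{subject}\n{body}" for subject, body in rows]


page_text = load_web_page("https://example.com/docs")  # placeholder URL
tickets = load_support_tickets("support.db")           # placeholder schema
```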
2. Indexing
Once data is loaded, it is indexed for later retrieval. In most cases, this means using vector embeddings to convert each piece of data into a numerical representation, stored along with metadata, making it easy to search for contextually relevant data.
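To make this concrete, here is a hedged sketch of chunking and embedding with the open source sentence-transformers library. The chunk size, overlap, and model name are arbitrary choices for illustration:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size, slightly overlapping pieces."""
    step = size - overlap
    return [text[i : i + size] for i in range(0, len(text), step)]


model = SentenceTransformer("all-MiniLM-L6-v2")  # one common open model
document_text = "..."  # plain text produced by the loading stage
chunks = chunk(document_text)
embeddings = model.encode(chunks)  # one vector per chunk
```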
3. Storing
Once indexed, the vector values and metadata are stored in a vector database. For small datasets, you could re-index in real time with each query, but it is much more efficient to store the indexing results once.
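Continuing the indexing sketch above, persisting the chunks, vectors, and metadata in Chroma, one popular open source vector database, might look like this (the collection name and metadata are placeholders):

```python
import chromadb  # pip install chromadb

client = chromadb.PersistentClient(path="./rag_store")  # on-disk vector store
collection = client.get_or_create_collection("knowledge_base")

# Persist each chunk alongside its vector and some searchable metadata.
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=[vector.tolist() for vector in embeddings],
    metadatas=[{"source": "faq.pdf"} for _ in chunks],  # placeholder metadata
)
```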
4. Querying
Instead of sending a prompt directly to the LLM, a RAG pipeline takes the query, runs a semantic vector search against the vector database, and returns relevant information. This is called retrieval. The RAG application then builds an augmented prompt that combines the user's original query with the relevant data retrieved from the vector database and sends it to the LLM. This step is called generation. The LLM's output is then packaged into a response and returned to the user.
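Continuing the same running example, here is a sketch of the retrieval and generation steps. The call_llm function is a hypothetical stand-in for whichever LLM client you use:

```python
question = "How do I reset my password?"  # example user query

# Retrieval: embed the question and find the closest stored chunks.
query_vector = model.encode([question])[0].tolist()
results = collection.query(query_embeddings=[query_vector], n_results=3)
context = "\n\n".join(results["documents"][0])

# Generation: combine the retrieved context with the original question
# into an augmented prompt for the LLM.
augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
answer = call_llm(augmented_prompt)  # hypothetical stand-in for your LLM client
```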
5. Evaluation
Since the quality of the response returned to the user is driven by the quality of the retriever and generator in the querying step, a final step called evaluation assesses the output quality of both. You use evaluation to calculate metrics that show how relevant, accurate, and fast your RAG implementation is.
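Dedicated evaluation frameworks exist, but even simple retrieval metrics like hit rate and mean reciprocal rank (MRR) can be computed by hand. A sketch, continuing the running example, with a hypothetical labeled eval_set:

```python
def hit_rate_and_mrr(eval_set: list[dict], k: int = 3) -> tuple[float, float]:
    """Score the retriever: how often the known-relevant chunk appears in the
    top-k results, and how high it ranks on average."""
    hits, reciprocal_ranks = 0, []
    for case in eval_set:
        vector = model.encode([case["question"]])[0].tolist()
        results = collection.query(query_embeddings=[vector], n_results=k)
        retrieved_ids = results["ids"][0]
        if case["expected_id"] in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1 / (retrieved_ids.index(case["expected_id"]) + 1))
        else:
            reciprocal_ranks.append(0.0)
    return hits / len(eval_set), sum(reciprocal_ranks) / len(eval_set)


# Each case pairs a question with the chunk id that should answer it.
eval_set = [{"question": "How do I reset my password?", "expected_id": "chunk-0"}]
hit_rate, mrr = hit_rate_and_mrr(eval_set)
```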
RAG Pipeline Case Studies
The best way to understand where RAG pipelines fit with an LLM application is to look at a few examples. The advantages of RAG pipelines drive their adoption in situations where you need contextual data that is not in your LLM. Most applications revolve around a specific user, usually an internal or external customer, who needs data unique to them or to some aspect of your business. A question-answering workflow helps that user quickly retrieve information with minimal hallucination.
Here are a few general examples of some typical applications:
Medical Diagnostics
One area where hallucinations are dangerous and where specific data is available is medical diagnostics. If you prompt a general-purpose LLM like GPT-4, it will pull its response from various places, including websites with outdated data or data that has not gone through peer review. To overcome this, developers are creating RAG pipelines grounded in vetted clinical data.
In one example from Singapore General Hospital, researchers built a RAG system that provided preoperative guidelines to surgeons. They used Python frameworks, including LangChain and LlamaIndex, to ingest and chunk the clinical data, which was then indexed in a vector store. Multiple LLMs were added to the pipeline, including GPT-3.5, GPT-4, and Llama 2.
During testing against human responses to the same queries, the system generated results in 15-20 seconds, compared to 10 minutes for humans. LLM-only accuracy was 80.1%; adding a RAG pipeline bumped accuracy to 91.4%, while humans were accurate 86.3% of the time. The study concluded that accuracy could improve further as additional clinical data is fed into the knowledge base.
FAQ-Powered Chatbot
Chatbots are the most common application for RAG systems because they combine the power of a public LLM with customer- and company-specific historical data. A great example is how data scientists at Flipkart, an e-commerce company in India, explored the use of RAG pipelines to convert their frequently asked questions (FAQ) capability for credit card applications into a chatbot.
Their domain-specific data consisted of 72 FAQs, and they used a version of GPT-3.5 for their LLM. After some experimentation, the new system's accuracy was 100%. In addition, they used the RAG pipeline to detect out-of-context prompts and skip the LLM call entirely. Because they use a commercial LLM, this saved on token costs while delivering more accurate responses: users got direct answers instead of being pointed to an FAQ, and the team reduced its spending with its LLM provider.
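Flipkart's exact implementation isn't public, but an out-of-context check like the one they describe can be approximated with a similarity threshold: if no FAQ is close enough to the user's question, return a fallback answer and skip the paid LLM call. A sketch, with an arbitrary threshold, placeholder FAQs, and the same hypothetical call_llm stand-in as above:

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
faq_texts = [  # placeholder FAQ entries
    "How do I apply for the credit card?",
    "What is the annual fee?",
]
faq_vectors = model.encode(faq_texts, normalize_embeddings=True)


def answer_or_fallback(question: str, threshold: float = 0.5) -> str:
    """Skip the paid LLM call entirely when the question is off-topic."""
    query = model.encode([question], normalize_embeddings=True)[0]
    scores = faq_vectors @ query  # cosine similarity, since vectors are normalized
    if scores.max() < threshold:
        return "Sorry, I can only answer credit card questions."
    best_faq = faq_texts[int(scores.argmax())]
    return call_llm(f"Answer using this FAQ entry:\n{best_faq}\n\nQuestion: {question}")
```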
E-Commerce Site Search
Search on an e-commerce site is another common use for RAG pipelines. A RAG pipeline can ingest customer-specific data in real time. Standard search engines rank results using keywords and the popularity of a given piece of data; LLMs go further to find the most probable answer. Adding a RAG pipeline optimizes the search process, giving personalized and relevant results that increase click-through rates and conversions to sales.
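One way to implement this kind of personalization, sketched below with placeholder products and an assumed category signal, is to combine semantic search with a metadata filter in the vector database:

```python
import chromadb  # pip install chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.Client()
products = client.get_or_create_collection("products")

descriptions = ["waterproof trail running shoes", "leather office loafers"]
products.add(
    ids=["sku-1", "sku-2"],
    documents=descriptions,
    embeddings=[v.tolist() for v in model.encode(descriptions)],
    metadatas=[{"category": "outdoor"}, {"category": "formal"}],
)

# Semantic search restricted by a metadata filter, e.g. a category the
# shopper has been browsing.
query_vector = model.encode(["shoes for rainy hikes"])[0].tolist()
hits = products.query(
    query_embeddings=[query_vector],
    n_results=1,
    where={"category": "outdoor"},  # placeholder personalization signal
)
```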
RAG Pipeline Implementation Challenges
Although RAG pipeline implementation is fairly simple, it does have its challenges. Here are the most common obstacles people run into:
- Picking the right algorithms and strategies: RAG pipelines are an architecture, not a fixed collection of tools. At every step in the pipeline, you will need to pick the appropriate algorithms, and the sheer number of choices can be overwhelming.
- Managing a multitude of AI components: Once you have chosen the algorithms and strategies you want to use, you need to then select AI components and make sure they work together. It is easy for things to get complicated and for connections to fail, and finding bugs can be difficult.
- Getting data extraction right: The online step-by-step tutorials often show a Python script that points to a directory and reads a pile of PDF files. You may get lucky and your extraction could be that simple, but odds are you will have multiple data sources, APIs to other programs, and quality issues with the data. Figuring out extraction can take time if you are not working with a team that has done it before.
- Managing data: The heart of any RAG pipeline is the data, and any implementation will face challenges around skipping out-of-context data, cleaning out old information, and adding new data.
- Scaling software and hardware: Every step in a RAG pipeline uses computational and storage resources, both during ingestion and when responding to prompts. If your system doesn’t scale, users may experience unacceptable delays.
Best Practices for Managing and Optimizing Your RAG Pipelines
At Focused Labs, we have helped many customers implement generative AI technology across industries and application platforms, including many projects that took advantage of RAG pipelines. Here are a few of the lessons we have learned:
- Make regular updates to data and technology: Your responses are only as good as your data. Put processes in place to evaluate and update that data. The technology in the pipeline also evolves, so keep each module up to date and replace modules, when appropriate, with newer, better technology. We recently did this with an in-house RAG application, upgrading to the LangChain Expression Language (LCEL), and saw great results (see the sketch after this list).
- Diversify data sources: Just as with an LLM, the quality of the responses retrieved from your vector database increases with the diversity of data sources. Adding new collections can help weight responses as well as increase accuracy.
- Experiment and evaluate: The first pass at a RAG pipeline is just the start. Build evaluation and experimentation into your development and maintenance processes. Play with parameters, data sources, and component technology to optimize the speed and quality of your pipeline.
- Work with experts: This area of AI is a bit of a specialty. There are a multitude of online training resources and an overwhelming number of open source tools. But none of those replace spending time with a team that has implemented the technology before, knows the terminology, and, most importantly, understands the theory behind not just RAG pipelines but also LLMs, API implementation, and fundamental software development methods. Working with a partner like Focused Labs can be a game changer.
- Use good DevOps practices: Although unique in many ways, a RAG pipeline’s development and deployment are still software projects, and good DevOps practices can make an implementation go much smoother.
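As an illustration of the LCEL upgrade mentioned above, here is a minimal chain, not our production code, with a placeholder prompt template and model name:

```python
# pip install langchain-core langchain-openai
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only this context:\n{context}\n\nQuestion: {question}"
)

# LCEL composes the stages with the pipe operator, so each module can be
# swapped out independently as better components appear.
chain = prompt | ChatOpenAI(model="gpt-3.5-turbo") | StrOutputParser()

answer = chain.invoke(
    {"context": "...retrieved chunks...", "question": "How do I reset my password?"}
)
```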
Getting Started With Implementing a RAG Pipeline
For the right application, we have found RAG pipelines to be an efficient and flexible solution to improve accuracy and easily incorporate specific data into generative AI applications. Once you have identified an application that will benefit from this technology, we recommend the following steps:
- Pick your LLM
- Identify your data sources
- Choose your pipeline components
- Build an MVP
- Test, debug, and improve
- Build out the full application
- Deploy
- Establish a maintenance and updating process
You have the contextual data, you have the application, and now you just need to build it. And our team at Focused Labs is eager to help. We are a collection of experts who love their craft, know how to listen and collaborate, and are obsessed with asking questions and learning continuously. We don’t just show up, drop off code, and leave. We design and build solutions as a partner with our customers and also train them as we go so when we are finished, you can continue the work. Reach out and let’s build something together!