How to Implement Naive RAG, Advanced RAG, and Modular RAG

Introduction

Any RAG framework addresses the following questions:

“What to retrieve”
“When to retrieve”
“How to use the retrieved information”

Over the last few years, there has been tremendous innovation in the RAG space. RAG systems can be divided into 3 categories:

Naive RAG
Advanced RAG
Modular RAG

In this blog, we will explain what they mean and how they compare against one another.

Getting Started

Let’s get started with E2E Networks, our GPU cloud provider of choice. To start, log into your E2E account. Set up your SSH key by visiting Settings.

After creating the SSH key, visit Compute to create a node instance.

Open Visual Studio Code and install the Remote Explorer and Remote - SSH extensions. Open a new terminal and log in to your node with the following command:

ssh root@<your-ip-address>

With this, you’ll be logged in to your node.

An Overview of Naive RAG

Naive RAG is a paradigm that combines information retrieval with natural language generation to produce responses to queries or prompts. The core idea is to leverage retrieved information to enhance the context of the LLM without sophisticated strategies or techniques. The process typically involves 3 main steps in the Naive RAG framework: indexing, retrieval, and generation.

Indexing: Indexing is the initial step of processing and organizing the available data sources to facilitate efficient retrieval. In Naive RAG, various types of data sources can be indexed, including structured data like knowledge graphs or unstructured data like text corpora. Indexing involves creating a searchable representation of the data, often using techniques like tokenization, vectorization, or graph-based representations, depending on the nature of the data.
Retrieval: Once the data is indexed, the retrieval step involves selecting relevant information from the indexed data sources in response to a given query or prompt. In Naive RAG, retrieval is typically performed using retrieval models that rank the indexed data based on its relevance to the input query. Retrieval models can range from simple keyword matching or TF-IDF to more complex neural network-based approaches like dense retrieval or passage ranking. The retrieved information serves as the context or background for the subsequent generation step.
Generation: After retrieving the relevant information, the generation step synthesizes a response based on both the input query and the retrieved context. In Naive RAG, large language models (LLMs) such as Mistral 7B and Llama 2 are commonly used for this step. These models generate text conditioned on the input query and the retrieved context, aiming to produce coherent and contextually relevant responses. Generation is typically autoregressive: a transformer-based model predicts the next token from the preceding context, which includes the query, the retrieved passages, and the tokens generated so far.
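
To make the three steps concrete, here is a minimal, dependency-free sketch in Python. It is an illustration only, not the implementation used later in this blog: it assumes a toy TF-IDF index in place of a vector database, and it stops at building the prompt instead of calling an LLM. The function names (build_index, retrieve, build_prompt) and the sample chunks are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def build_index(chunks):
    """Indexing: compute a TF-IDF vector (stored as a dict) for every chunk."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n = len(chunks)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return {"chunks": chunks, "idf": idf, "vectors": vectors}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(index, query, top_k=2):
    """Retrieval: rank the indexed chunks by similarity to the query."""
    query_tf = Counter(tokenize(query))
    query_vec = {t: tf * index["idf"].get(t, 0.0) for t, tf in query_tf.items()}
    scored = [
        (cosine(query_vec, vec), chunk)
        for vec, chunk in zip(index["vectors"], index["chunks"])
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_prompt(query, retrieved):
    """Generation input: place the retrieved chunks in front of the question."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical sample data, for illustration only.
chunks = [
    "Qdrant is a vector database used to store embeddings.",
    "Mistral 7B is a large language model with seven billion parameters.",
    "Chunking splits long documents into smaller passages.",
]
index = build_index(chunks)
question = "Which vector database stores embeddings?"
results = retrieve(index, question, top_k=1)
prompt = build_prompt(question, results)
print(prompt)
```

In a real system, the prompt built in the last step would be sent to the LLM, and the TF-IDF vectors would usually be replaced by dense embeddings.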

Drawbacks of Naive RAG

While Naive RAG offers a promising approach to combining retrieval and generation for natural language processing tasks, it also comes with several drawbacks:

Limited Context Understanding: Naive RAG models may struggle to fully comprehend the context provided by the retrieved information. Although they can access relevant documents or passages, they may not effectively integrate this information into the generated responses, which leads to responses that lack coherence or relevance.
Semantic Drift: In some cases, the generated responses may deviate from the intended meaning due to the model's reliance on large-scale pre-training data. This phenomenon, known as semantic drift, can result in inaccurate or misleading responses that do not align with the retrieved context.
Difficulty Handling Complex Queries: Naive RAG models may struggle to handle complex queries or prompts that require nuanced understanding or multi-step reasoning. Retrieval-based approaches may retrieve relevant information, but the generation model may not be able to synthesize accurate responses, especially for queries that involve inferencing or domain-specific knowledge.
Over-Reliance on Retrieved Information: Naive RAG models may overly rely on the retrieved information, which leads to responses that are repetitive or redundant. This dependency on retrieved context can limit the model's ability to generate novel or diverse responses, particularly in scenarios with limited training data or narrow domains.

Implementation of Naive RAG

To get started with Naive RAG, we use a technical report on Stable Diffusion as the data source, Qdrant as the vector database, and Mistral 7B as the language model.

# Create directory and download file
!mkdir data
!wget https://arxiv.org/pdf/2403.03206.pdf -P data

# Import necessary modules and classes
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
import qdrant_client
import torch
from typing import Optional

# Load data from documents and split into smaller chunks
documents = SimpleDirectoryReader('./data').load_data()
Splitter = SentenceSplitter(chunk_size=512)
text_chunks = []
doc_idxs = []  # maintain relationship with source doc index
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = Splitter.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))

# Create TextNode instances for each chunk and associate metadata
from llama_index.core.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)

# Embed each text chunk using Hugging Face model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding

# Initialize LLM
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    model_name="mistralai/Mistral-7B-v0.1",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)
Settings.llm = llm
Settings.chunk_size = 512
Settings.embed_model = embed_model

# Initialize vector store and index documents
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
storage_context = StorageContext.from_defaults(vector_store=vector_store)
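# Add the manually embedded nodes to the vector store.
# (Calling VectorStoreIndex.from_documents here as well would chunk, embed,
# and store the same documents a second time, producing duplicate results.)
vector_store.add(nodes)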

# Define query string
query_str = "What is stable diffusion?"
query_embedding = embed_model.get_query_embedding(query_str)

# Perform similarity search on vector store based on the query
from llama_index.core.vector_stores import VectorStoreQuery
query_mode = "default"
vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())

# Define custom retriever class for querying vector store
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from typing import List, Any

class VectorDBRetriever(BaseRetriever):
    """Retriever over a Qdrant vector store."""
    def __init__(
        self,
        vector_store: QdrantVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Initialize parameters."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Retrieve."""
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)

        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))

        return nodes_with_scores

# Initialize query engine with custom retriever
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)

# Perform query using query engine
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
query_str = "What is Stable Diffusion?"
response = query_engine.query(query_str)
print(response)

Output:

Stable Diffusion is a generative model that can be used to generate images from text descriptions.
It is a type of diffusion model that uses a transformer architecture to generate images. The model is trained on a large dataset of images and text descriptions, and it learns to generate images that are similar to the ones in the dataset.
The model is able to generate images that are realistic and high-quality, and it can be used for a variety of applications, such as image generation, image editing, and image retrieval.

Understanding Advanced RAG

The Advanced Retrieval-Augmented Generation technique is built upon the foundation of Naive RAG by introducing various enhancements and optimizations throughout the retrieval and generation pipeline. These enhancements aim to improve the relevance, coherence, efficiency, and scalability of RAG systems. Let's delve into each component of Advanced RAG:

Pre-Retrieval Process: Advanced RAG systems may employ sophisticated techniques for indexing the data corpus to enhance retrieval efficiency and effectiveness. This can involve optimizing data structures, leveraging distributed indexing systems, or using specialized indexing algorithms tailored to the characteristics of the data. Techniques like inverted indexing, forward indexing, or hybrid indexing may be employed to efficiently map queries to relevant documents in large-scale datasets.
Embedding: Fine-tuning embedding refers to adapting pre-trained embeddings to better suit the specific task or domain of interest. Advanced RAG models may fine-tune embeddings to capture task-specific semantics or domain knowledge, thereby improving the quality of retrieved information and generated responses. Dynamic embedding techniques enable RAG models to adaptively adjust embeddings during inference based on the context of the query or retrieved information. This dynamic adjustment helps the model better capture the nuances of the input and generate more contextually relevant responses.
Post-Retrieval Process: After retrieving relevant documents or passages, advanced RAG systems may apply post-retrieval processing techniques to refine the retrieved information further. Re-ranking involves re-evaluating the relevance of retrieved documents based on additional criteria or features. This can help prioritize more informative or contextually relevant documents for the subsequent generation step. Prompt compression techniques aim to condense the retrieved information into a more concise and structured format suitable for input to the generation model. This compression process helps reduce noise and irrelevant content, improving the efficiency and quality of the generation process.
RAG Pipeline Optimization:
1. Hybrid Search: Advanced RAG systems may utilize a combination of different search strategies, such as keyword-based search, semantic search, and neural search, to improve retrieval accuracy and coverage. This hybrid approach leverages the strengths of each search method to enhance overall retrieval performance.
2. Recursive Retrieval and Query Engine: Recursive retrieval techniques involve iteratively refining queries based on the intermediate results obtained during retrieval. This iterative process helps retrieve more relevant information, especially in scenarios with complex or ambiguous queries.
3. Step-Back Prompt: Step-back prompting asks the model to first step back from the specific question and formulate a more general, higher-level question or principle. Retrieval and reasoning are then grounded in the answer to that broader question, which often helps with detail-heavy or multi-step queries.
4. Subqueries: Subquery techniques involve breaking down complex queries into smaller, more manageable subqueries, which are then processed independently. This decomposition helps improve retrieval accuracy and efficiency by focusing on specific aspects of the query.
5. HyDE (Hypothetical Document Embeddings): HyDE first asks the LLM to write a hypothetical answer to the query, then embeds that hypothetical document and uses its embedding, instead of or alongside the embedding of the raw query, to search the vector store. Because a hypothetical answer resembles the stored documents more closely than a short question does, retrieval tends to surface more relevant passages, even when the hypothetical answer itself contains factual errors.
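
As a rough illustration of hybrid search, the sketch below fuses a keyword ranking and a semantic ranking with reciprocal rank fusion (RRF). RRF is one common fusion method and is an assumption here, not something this blog's pipeline uses. The document ids, the two input rankings, and the constant k=60 are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) in every list where it appears,
    and the per-list scores are summed.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically by id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of a keyword retriever and a semantic retriever.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (doc_b and doc_a) rise to the top, which is the behavior a hybrid search is meant to provide.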

Implementation of Advanced RAG

As discussed, there are many techniques for building an advanced RAG application. As an example, we use the HyDE query transform here, together with the Mistral 7B LLM and a text dataset about singers.

# Import logging module for logging messages
import logging
import sys

# Configure logging to display INFO level messages on stdout
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# Import necessary modules and classes
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from IPython.display import Markdown, display

# Load data from documents
documents = SimpleDirectoryReader("./data").load_data()

# Initialize HuggingFaceEmbedding model for text embedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Initialize HuggingFaceLLM for language model
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
import torch

llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    model_name="mistralai/Mistral-7B-v0.1",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16}
)

# Set up settings for llama_index
Settings.llm = llm
Settings.chunk_size = 512
Settings.embed_model = embed_model

# Create VectorStoreIndex from documents
index = VectorStoreIndex.from_documents(documents)

# Define query string
query_str = "Who is Eminem?"

# Create query engine using VectorStoreIndex
query_engine = index.as_query_engine()

# Perform query and display response as bold text
response = query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))

Response:

Eminem is an American rapper. He is credited with popularizing hip hop in Middle America and is often regarded as one of the greatest rappers of all time.
Eminem's global success and acclaimed works are widely regarded as having broken racial barriers for the acceptance of white rappers in popular music. While much of his transgressive work during the late 1990s and early 2000s made him a controversial figure, he came to be a representation of popular angst of the American underclass and has been cited as an influence by and upon many artists working in various genres.
Eminem is also known for collaborations with fellow Detroit-based rapper Royce da 5'9". He is also known for starring in the 2002 musical drama film 8 Mile, playing a dramatized version of himself. Eminem has developed other ventures, including Shady Records, a joint venture with manager Paul Rosenberg, which helped launch the careers of artists such as 50 Cent, D12, and Obie Trice, among others.
Eminem has also established his own channel, Shade 45, on Sirius XM Radio. Eminem is among the best-selling music artists of all time, with estimated worldwide sales of over 220 million records.

The above query was run without a HyDE transformation. Let’s apply the transformation and compare the response.

hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))

Response:

Given the context information and not prior knowledge, answer the query. Query: Who is Eminem? Answer: Eminem is an American rapper.

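The response is much shorter and more direct, although it also echoes part of the prompt template, which can happen with a base (non-instruction-tuned) model such as Mistral-7B-v0.1. Let’s look at the hypothetical document that HyDE generates and embeds for this query.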

query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]

hyde_doc

Response:

'Eminem is an American rapper, songwriter, and record producer. He was born in Detroit, Michigan, and began his career in the early 1990s. Eminem is known for his rapid-fire delivery, dark humor, and controversial lyrics. He has won numerous awards, including 11 Grammy Awards, and has sold over 200 million records worldwide.
Eminem has also been involved in several high-profile legal battles, including a lawsuit over the use of his name and likeness in a video game. Despite his success, Eminem has faced criticism for his use of offensive language and his treatment of women.

\n"""\n\nQuestion:\nWho is Eminem?\n\n\n\nWe can use the property of transitivity to infer that Eminem is a rapper, songwriter, and record producer.\n\nWe can use inductive logic to infer that Eminem is known for his rapid-fire delivery, dark humor, and controversial lyrics.

\n\nWe can use deductive logic to infer that Eminem has won numerous awards, including 11 Grammy Awards, and has sold over 200 million records worldwide.

\n\nWe can use proof by exhaustion to eliminate other possibilities and conclude that Eminem is an American rapper, songwriter, and record producer who is known for his rapid-fire delivery, dark humor, and'

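The hypothetical document is cut off because generation is capped at max_new_tokens=256 (the chunk size of 512 only controls how the source documents are split), and the model drifts into unrelated text after the first two paragraphs. Even so, the opening paragraphs read like a plausible answer, which is all HyDE needs: the embedding of this hypothetical document is what drives retrieval. This is one way Advanced RAG can lead to more direct answers to our queries.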
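
The mechanism behind HyDE can be sketched without any library. The example below is a toy illustration under loud assumptions: hypothetical_answer is a hard-coded stand-in for an LLM call, embed is a bag-of-words count instead of a neural embedding model, and the two-document corpus is invented. It is not how HyDEQueryTransform is implemented internally.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (a real system uses a neural encoder)."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hypothetical_answer(query):
    """Hypothetical stand-in for an LLM that drafts a plausible answer."""
    canned = {
        "Who is Eminem?": (
            "Eminem is an American rapper, songwriter, and record producer "
            "from Detroit."
        )
    }
    return canned.get(query, query)


def rank(query_text, corpus):
    """Rank corpus documents by similarity to the given text."""
    query_vec = embed(query_text)
    return sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)


# Invented corpus: the relevant document never mentions the name "Eminem".
corpus = [
    "Marshall Mathers is a rapper and record producer raised in Detroit.",
    "Who is the best guitarist? Opinions differ widely among fans.",
]
query = "Who is Eminem?"

plain_top = rank(query, corpus)[0]                       # embed the raw query
hyde_top = rank(hypothetical_answer(query), corpus)[0]   # embed the hypothetical doc

print("Plain retrieval:", plain_top)
print("HyDE retrieval: ", hyde_top)
```

With the raw query, the guitarist document wins because it shares the words "who" and "is". With the hypothetical answer, the document about the rapper wins because the two texts share content words such as "rapper", "producer", and "Detroit".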

Exploring Modular RAG

Modular RAG refers to an approach where retrieval-augmented generation systems are designed and implemented in a modular fashion, which allows the incorporation of various modules to enhance performance, flexibility, and adaptability. These modules introduce new functionalities and patterns that contribute to the overall effectiveness of the RAG system. Here's an explanation of each new module and pattern:

New Modules:
1. Search Module: The Search Module is responsible for retrieving relevant information from a database or corpus in response to a given query or prompt. It may utilize advanced search algorithms, such as keyword matching, semantic similarity, or machine learning-based retrieval models, to retrieve the most relevant documents or passages.
2. Memory Module: The Memory Module acts as a storage mechanism for storing relevant information retrieved during the search process. It enables the RAG system to access previously retrieved context or information during the generation phase, which facilitates coherence and consistency in the generated responses.
3. Extra Generation Module: The Extra Generation Module supplements the primary generation process by providing additional generation capabilities, such as paraphrasing, summarization, or context expansion. It enriches the generated responses with diverse and complementary content, enhancing the overall quality and relevance of the output.
4. Task Adaptable Module: The Task Adaptable Module allows the RAG system to dynamically adapt to different tasks or domains by adjusting its retrieval, generation, or processing strategies. It enhances the system's versatility and applicability across a wide range of natural language processing tasks, including question-answering, summarization, and dialogue generation.
5. Alignment Module: The Alignment Module ensures coherence and consistency between the retrieved context and the generated responses. It aligns the generated content with the retrieved information, ensuring that the responses accurately reflect the context provided by the retrieval module.
6. Validation Module: The Validation Module evaluates the quality and relevance of the generated responses by comparing them against predefined criteria or benchmarks. It helps identify errors, inconsistencies, or biases in the generated output, enabling iterative improvements to the RAG system.
New Patterns:
1. Adding or Replacing Modules: Modular RAG allows for the addition or replacement of modules based on specific requirements or objectives. For example, researchers may experiment with different retrieval algorithms or generation models to improve performance or adaptability.
2. Adjusting the Flow Between Modules: Modular RAG enables flexible control over the flow of information between modules. Researchers can optimize the interaction between the search, retrieval, and generation modules to achieve desired outcomes, such as minimizing latency, maximizing relevance, or enhancing coherence.
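
The two patterns above can be sketched as a pipeline of named stages that can be added, replaced, or reordered. This is a minimal illustration, not the design of any particular framework: the ModularPipeline class, its method names, and the toy modules (search, keep_first, generate, validate) are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

State = Dict[str, Any]
Module = Callable[[State], State]


class ModularPipeline:
    """An ordered list of named modules, each mapping a state dict to a new one."""

    def __init__(self) -> None:
        self._stages: List[Tuple[str, Module]] = []

    def stage_names(self) -> List[str]:
        return [name for name, _ in self._stages]

    def add(self, name: str, module: Module, before: Optional[str] = None) -> None:
        """Append a module, or insert it in front of an existing stage."""
        if before is None:
            self._stages.append((name, module))
        else:
            position = self.stage_names().index(before)
            self._stages.insert(position, (name, module))

    def replace(self, name: str, module: Module) -> None:
        """Swap the implementation of an existing stage."""
        position = self.stage_names().index(name)
        self._stages[position] = (name, module)

    def run(self, state: State) -> State:
        for _, module in self._stages:
            state = module(state)
        return state


# Toy modules, for illustration only.
def search(state: State) -> State:
    docs = ["RAG combines retrieval with generation.", "Qdrant stores vectors."]
    words = state["query"].lower().split()
    hits = [doc for doc in docs if any(word in doc.lower() for word in words)]
    return {**state, "context": hits}


def keep_first(state: State) -> State:
    """Stand-in for a re-ranker: keep only the first retrieved passage."""
    return {**state, "context": state["context"][:1]}


def generate(state: State) -> State:
    """Stand-in for an LLM call: echo the context as the answer."""
    return {**state, "answer": " ".join(state["context"])}


def generate_upper(state: State) -> State:
    """Alternative generator, used to demonstrate module replacement."""
    return {**state, "answer": " ".join(state["context"]).upper()}


def validate(state: State) -> State:
    return {**state, "valid": bool(state["answer"].strip())}


pipeline = ModularPipeline()
pipeline.add("search", search)
pipeline.add("generate", generate)
pipeline.add("validate", validate)
pipeline.add("rerank", keep_first, before="generate")  # adjust the flow
result = pipeline.run({"query": "qdrant rag"})

pipeline.replace("generate", generate_upper)  # replace a module
second = pipeline.run({"query": "qdrant rag"})
print(pipeline.stage_names(), result["answer"], second["answer"])
```

Because every module shares the same state-in, state-out signature, inserting a re-ranker or swapping the generator does not require changes to any other stage.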

Implementation of Modular RAG

It helps a lot when you are in control of how your RAG application is customized. This is what Modular RAG offers. You are free to create your own modules and patterns, customize them according to your needs, and voila! Your Modular RAG application is ready.

For example, Verba, an open-source modular RAG application, is fully customizable and adaptable. Verba's modular architecture allows users to customize the RAG pipeline according to their specific needs.

For a RAG application, we generally need a Document reader, Chunker, Embedding generator, Retriever, and Generator. Let’s break them down.

Document Reader: To start with the ingestion process, it is necessary to handle all types of data such as plain text, JSON, CSV, PDF documents, and more. This is the most important part where your data engineering skills will come in handy. The main objective is to convert the different data types into a unified structure for further processing.
Chunker: Once the data is in a unified structure, the next step is to chunk it. Why chunk the documents? When we query the RAG application, we need a specific piece of information rather than the whole document. If we pass whole documents without chunking, the token count is high, the LLM has to sift through everything, and the accuracy of the results suffers. To avoid this, we divide each document into smaller sections. This reduces the number of tokens and increases the chance of retrieving the exact answer.
Embedding Generator: After chunking the documents, we need to convert the chunks into embeddings and store them in a vector database. In our Modular RAG, this calls for two modules: one that converts chunks into embeddings, and another that stores them in the vector database of your choice. There are many embedding models, so the module should support swapping one embedding model for another. If you are tailoring the application to your business, you can pick the embedding model and vector database that suit you.
Retriever: As the data is now fully organized and vector databases are created, the next process is to initiate the retriever. You can make this module powered with advanced search techniques or use the vector database module to initiate the retriever.
Generator: This is the final step of your RAG application, where the LLM generates the response. This module should take the query together with the context returned by the retriever and pass both to the LLM, which produces the final answer.

This is how you build a Modular RAG.
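
As a rough sketch of how the five components can fit together, the example below wires a reader, chunker, embedding generator, retriever, and generator into one small application. Everything in it is an assumption made for illustration: the ModularRAG and InMemoryVectorStore classes, the bag-of-words embedder, the template generator that stands in for an LLM, and the sample text. It is not Verba's architecture.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Vector = Dict[str, float]


def read_documents(raw: Dict[str, str]) -> List[str]:
    """Document reader: turn every source into plain, non-empty text."""
    return [text.strip() for text in raw.values() if text.strip()]


def chunk_by_sentences(text: str, max_sentences: int = 1) -> List[str]:
    """Chunker: split on full stops and group sentences into chunks."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        ". ".join(sentences[i:i + max_sentences]) + "."
        for i in range(0, len(sentences), max_sentences)
    ]


def bag_of_words(text: str) -> Vector:
    """Embedding generator stand-in: word counts instead of a neural model."""
    words = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return {word: float(count) for word, count in Counter(words).items()}


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class InMemoryVectorStore:
    """Vector store and retriever: keeps (vector, chunk) pairs in a list."""
    items: List[Tuple[Vector, str]] = field(default_factory=list)

    def add(self, vector: Vector, chunk: str) -> None:
        self.items.append((vector, chunk))

    def search(self, query_vector: Vector, top_k: int) -> List[str]:
        ranked = sorted(
            self.items, key=lambda item: cosine(query_vector, item[0]), reverse=True
        )
        return [chunk for _, chunk in ranked[:top_k]]


def template_generator(query: str, context: List[str]) -> str:
    """Generator stand-in: a real module would call an LLM here."""
    return f"Q: {query}\nContext: {' '.join(context)}"


@dataclass
class ModularRAG:
    """Each field is a swappable module."""
    chunker: Callable[[str], List[str]]
    embedder: Callable[[str], Vector]
    store: InMemoryVectorStore
    generator: Callable[[str, List[str]], str]

    def ingest(self, documents: List[str]) -> None:
        for document in documents:
            for chunk in self.chunker(document):
                self.store.add(self.embedder(chunk), chunk)

    def query(self, question: str, top_k: int = 1) -> str:
        context = self.store.search(self.embedder(question), top_k)
        return self.generator(question, context)


# Invented sample input.
raw = {"notes.txt": "Verba is a modular RAG application. Chunking splits documents into passages."}
app = ModularRAG(chunk_by_sentences, bag_of_words, InMemoryVectorStore(), template_generator)
app.ingest(read_documents(raw))
answer = app.query("What does chunking do?")
print(answer)
```

Swapping in a different embedding model, vector database, or LLM then means passing a different function or object to ModularRAG, with no change to the rest of the application.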

Conclusion

We saw how different RAG approaches affect the answers and the knowledge we retrieve from the application. We used an E2E Networks V100 GPU to work with the Mistral 7B LLM, the Qdrant vector database, and different RAG techniques. Notably, Advanced RAG with HyDE gave a short, direct answer. Thanks for reading!