GenAI
Feb 23, 2024

Steps to Build a RAG Pipeline Using Gemma 2b LLM

Learn how to use Google's Gemma 2b open model to build a RAG pipeline

Introduction

Gemma, Google's latest LLM offering in the AI world, isn't just another set of large language models. It's a family of "open models," meaning their core capabilities are freely accessible. It is quite remarkable that Google released Gemma as an open model so soon after the recent Gemini upgrade.

Gemma is available in two sizes - 2 billion and 7 billion parameters - referred to as Gemma 2b and Gemma 7b. Gemma packs a punch, achieving state-of-the-art performance for its size compared to other open models like Llama2. The smaller model can even run on laptops and desktops, making it ideal for developers to experiment with.

Notably, Google has implemented rigorous standards to ensure safe and unbiased outputs, addressing concerns about the potential misuse of large language models. Additionally, it is offering free credits for research and development on platforms like Kaggle and Google Cloud.

In this article, we will explore Gemma 2b and build a RAG application with it using LangChain and Qdrant, a vector database we really enjoy using.

But before that, let's look at why we need a RAG pipeline in the first place, along with a technical brief on the Gemma models.

Technical Brief on Gemma Models

Google has done several things to ease developer adoption.

Architecture and Framework Support

  • Multi-Framework Tools: Gemma offers reference implementations for inference and fine-tuning across various frameworks, including Keras 3.0, PyTorch, JAX, and Hugging Face Transformers (see the sketch after this list).
  • Cross-Device Compatibility: Gemma runs across different devices, such as laptops, desktops, Internet of Things (IoT) devices, mobiles, and cloud environments.
  • Hardware Integration: Gemma leverages cutting-edge hardware platforms, particularly NVIDIA GPUs, to ensure optimal performance and integration with advanced technologies.
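
To give a flavor of the multi-framework support mentioned above, here is a minimal sketch of running Gemma 2b through KerasNLP. This is not part of the pipeline we build in this article (we use Hugging Face Transformers later); it assumes you have keras-nlp installed, access to the Gemma weights configured (KerasNLP pulls presets from Kaggle), and that the "gemma_2b_en" preset name matches your KerasNLP version.


import keras_nlp

# Load the pretrained Gemma 2b checkpoint via a KerasNLP preset
# (preset name assumed; check the KerasNLP docs for your version)
gemma_lm = keras_nlp.models.GemmaCausalLM.from_preset("gemma_2b_en")

# Generate a short completion from a plain text prompt
print(gemma_lm.generate("The capital of Italy is", max_length=32))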

Performance and Safety Features

  • Model Variants: Gemma is offered in two sizes—Gemma 2B and Gemma 7B, each with pre-trained and instruction-tuned variants.
  • Safe and Responsible Generation: Gemma is pre-trained to filter out sensitive or personal information and uses reinforcement learning from human feedback (RLHF) to minimize the likelihood of generating harmful content.
  • Manual Testing: Manual red-teaming and adversarial testing are conducted to identify potential risks associated with Gemma.

Accessibility and Collaboration

  • Open Source: Gemma is freely available to the global community of developers and researchers.
  • Toolchain Support: Gemma provides toolchains for inference and supervised fine-tuning (SFT) across all major frameworks.
  • Ready-to-Use Resources: Gemma comes with ready-to-use Colab and Kaggle notebooks, along with integration with popular tools such as Hugging Face, MaxText, NVIDIA NeMo, and TensorRT-LLM.
  • Free Credits: First-time Google Cloud users receive $300 in credits, and researchers can apply for additional Google Cloud credits of up to $500,000.

As of now, Gemma focuses solely on text-to-text tasks, unlike Gemini, which is multimodal.

Despite this, Gemma shows great performance relative to its size, surpassing significantly larger models on key benchmarks while maintaining strict standards for safety and responsibility.

Understanding RAG Pipeline

(You can skip this section if you are already familiar with RAG and why it is significant.)

While LLMs excel at processing language and generating creative text, they have limitations when it comes to grounding their responses in real-world facts and knowledge.

This is where RAG pipelines come in, playing a crucial role in "grounding" an LLM by addressing these limitations. They combine the strengths of two powerful AI components: Large Language Models (LLMs) and Vector Databases. Here's a breakdown of how they work:

Step 1. Knowledge Preparation

Document Collection: Relevant documents, like articles, reports, or code, are gathered and stored in a database.

Vectorization: Each document is converted into a numerical representation called a "vector" using an embedding model. Imagine it as capturing the document's essence in a multi-dimensional space.

Vector Database: All the document vectors are stored efficiently in a specialized database designed for fast similarity searches.
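
As a minimal sketch of what vectorization looks like in practice, each chunk of text becomes a fixed-length list of numbers. This example uses the sentence-transformers library and the same embedding model we use later in this article; the two "documents" are made up for illustration.


from sentence_transformers import SentenceTransformer

# Embed two short example "documents" with an off-the-shelf embedding model
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
docs = [
    "Senior Python developer with 5 years of Django experience.",
    "Marketing manager skilled in SEO and content strategy.",
]
vectors = model.encode(docs)
print(vectors.shape)  # (2, 768) - one 768-dimensional vector per document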

Step 2. User Interaction

User Query: You ask a question or provide instructions to the LLM.

Query Vectorization: Your query is also converted into a vector using the same embedding model.

Step 3. Information Retrieval

Similarity Search: The query vector is compared against all document vectors in the database. Documents whose vectors are closest to the query vector are retrieved, since they are the most likely to contain relevant information.
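
Under the hood, "closest" usually means highest cosine similarity between the query vector and each document vector. Here is an illustrative-only sketch with toy vectors; a real vector database does this at scale with approximate nearest-neighbor indexes.


import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for real embeddings
query_vec = np.array([0.9, 0.1, 0.0])
doc_vecs = {
    "resume_a.pdf": np.array([0.8, 0.2, 0.1]),
    "resume_b.pdf": np.array([0.0, 0.3, 0.9]),
}

# Rank documents by similarity to the query
scores = {name: cosine_similarity(query_vec, v) for name, v in doc_vecs.items()}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))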

Filtering & Reranking (Optional): Additional models can refine the retrieved documents based on relevance, context, or other criteria.

Step 4. Knowledge Integration

Contextual Enrichment: The retrieved documents are fed to the LLM as additional information. This helps the LLM understand the context of your query and access specific details from the relevant sources.

Step 5. Response Generation

LLM at Work: With the enriched context, the LLM generates a response. This could be an answer to your question, a creative text piece based on your instructions, or any other output the LLM is trained for.
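
Putting steps 4 and 5 together, the simplest form of knowledge integration is to "stuff" the retrieved chunks into the prompt, which is the idea behind the chain_type="stuff" option we use later with RetrievalQA. The sketch below is illustrative only; the function and prompt wording are ours, not LangChain's internals.


def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Concatenate the retrieved chunks into a single context block
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# The enriched prompt is then sent to the LLM for response generation, e.g.:
# prompt = build_rag_prompt(user_question, top_k_chunks)
# answer = llm(prompt)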

Problem Statement: Building a RAG Pipeline to Query Resumes

In the following steps, we will use the CVs of a large number of candidates as our dataset. This is a great use case because HR executives often have to go through thousands of candidate resumes to find the right person for a role.

With LLMs and RAG applications, it is possible to dramatically increase the productivity of an HR executive and give them a simple tool to query resumes. We have actually deployed such a solution before with the Mistral 7B LLM.

In this article, we will try the same with Gemma 2b. The size of the LLM is tiny in comparison, but hopefully the results will be just as powerful.

Step 1: Installation

Use any notebook environment that you are comfortable with. In our case, we tend to spin up a node on Google Cloud, which has graciously given us free startup credits; here, we will use a 4xT4 node that we already had running. To connect to the remote node, we will use the VS Code Remote Explorer extension and create an SSH connection.

Even though the model could potentially run on my laptop, I want to showcase the realistic use case of building on a GPU with sizeable memory and a large dataset.

Once you have installed the extension, create a new SSH connection; the steps are straightforward, and it will drop you into the remote machine. There, I always like to create a folder called ‘workspace’ with a project folder such as ‘gemma_rag’ inside it.


mkdir -p ~/workspace/gemma_rag

Create a notebook, say ‘gemma_test.ipynb’, in that folder. That gives you a powerful GPU-backed notebook environment. Next, select a kernel: VS Code lets you create a virtual environment on the remote machine when you select the kernel on the right, and this will create a .venv folder in the same directory.
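
Optionally, you can run a quick sanity check in the notebook to confirm the kernel can see the GPUs. This assumes PyTorch is already available in the environment (it usually is on GPU images, and we import it later in this guide anyway).


import torch

# Confirm that CUDA is available and list the visible GPUs
print(torch.cuda.is_available())
print(torch.cuda.device_count())
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])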

Now let’s install some libraries:


!pip install tiktoken
!pip install qdrant-client langchain pypdf

Follow that with import statements.


import os
import getpass
from operator import itemgetter
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Qdrant
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import format_document
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA

I have created a folder called ‘data’ using VS Code itself. After that, I simply dragged about a thousand CVs into that folder. Good enough for a test.

Now, let’s load these up:


loader = PyPDFDirectoryLoader("./data")
docs = loader.load()
print(len(docs))

This should print the total number of docs that have been loaded from the directory. Next, we will use HuggingFaceEmbeddings to convert these documents into vector embeddings and save them into a collection called ‘resumes’. Here, we are simply using Qdrant in in-memory mode; ideally, you should spin up Qdrant through Docker or use their cloud offering.

First, though, make sure you have sentence-transformers installed.


!pip install sentence-transformers

Now proceed with converting the documents into vector embeddings.


from langchain.text_splitter import RecursiveCharacterTextSplitter

# initialise embeddings used to convert text to vectors
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

# Split documents into chunks so they fit in the model's context window
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(docs)

# create a qdrant collection - a vector based index of all resumes
qdrant_collection = Qdrant.from_documents(
    all_splits,
    embeddings,
    location=":memory:", # Local mode with in-memory storage only
    collection_name="resumes",
)

# construct a retriever on top of the vector store
qdrant_retriever = qdrant_collection.as_retriever()
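
As mentioned earlier, in-memory mode is only for quick experiments. If you instead run Qdrant as a server (for example via Docker with the qdrant/qdrant image exposing port 6333), the same call can point at it by URL. A minimal sketch of that variant, assuming Qdrant is listening on localhost:6333:


# Variant: persist the collection in a running Qdrant server instead of memory
qdrant_collection = Qdrant.from_documents(
    all_splits,
    embeddings,
    url="http://localhost:6333",  # assumes a local Qdrant instance is running
    collection_name="resumes",
)
qdrant_retriever = qdrant_collection.as_retriever()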

Next, we will try out the Gemma 2b model before we bring the RAG pipeline together. To do that, we first have to ensure we are using an up-to-date transformers library, since Gemma support was added in version 4.38.0.


!pip install -U "transformers==4.38.0"

Next, we will test out Gemma 2b. Before you can run the next block of code, you need to accept the license terms and request access to the model. Head to the following URL and request access:

https://huggingface.co/google/gemma-2b-it

For the 7b model, use the following URL:

https://huggingface.co/google/gemma-7b-it


Once you have been granted access, you can either authenticate with your Hugging Face token or download the model locally. We are going to do the former.
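
If you prefer not to pass the token around in code, you can authenticate the environment once with the huggingface_hub helper instead (a minimal sketch; you can also simply export the HF_TOKEN environment variable).


from huggingface_hub import login

# Authenticates this environment so gated models like Gemma can be downloaded
login(token="hf......")  # replace with your actual access token

With access sorted, let's load the model and run a quick test.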


from transformers import AutoTokenizer, pipeline
import torch

hf_access_token = 'hf......'
model = "google/gemma-2b-it"

# Code below is to first test out the model

tokenizer = AutoTokenizer.from_pretrained(model, token=hf_access_token)
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,          # reuse the tokenizer loaded above
    token=hf_access_token,        # needed to download the gated model weights
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Where is Milan?"},
]
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(
    prompt,
    max_new_tokens=256,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)
print(outputs[0]["generated_text"][len(prompt):])

If you have done everything right so far, you should see output similar to the following (sampling is enabled, so your exact wording may differ).


Milan is located in the Lombardy region of Italy. It is the largest city in the region and the second-largest city in Italy after Rome.

Next, let’s stitch together the RAG pipeline using Qdrant and our resume dataset.


gemma_llm = HuggingFacePipeline(
    pipeline=pipeline,
    model_kwargs={"temperature": 0.7},
)

qa = RetrievalQA.from_chain_type(
    llm=gemma_llm,
    chain_type="stuff",
    retriever=qdrant_retriever
)

query = "Which resumes have Python experience?"
qa.invoke(query)
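
For a resume-screening use case, you will usually also want to know which files an answer came from. RetrievalQA supports this through its return_source_documents option; a small sketch of that variant:


# Variant: also return the resume chunks the answer was based on
qa_with_sources = RetrievalQA.from_chain_type(
    llm=gemma_llm,
    chain_type="stuff",
    retriever=qdrant_retriever,
    return_source_documents=True,
)

result = qa_with_sources.invoke({"query": "Which resumes have Python experience?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.metadata.get("source"), doc.metadata.get("page"))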

Either way, you should already be getting good results, with the LLM's answers grounded in the retrieved resume data.

Now let’s put a Gradio UI on top.


!pip install --upgrade gradio

Now we will put it all together into a simple chat UI over our resume dataset.


import gradio as gr

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot])

    def respond(message, chat_history):
        # RetrievalQA returns a dict; the generated answer is under the "result" key
        bot_message = qa.invoke(message)["result"]
        chat_history.append((message, bot_message))
        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch(share=True)

Results

One of the most promising things about the Gemma models is that they perform extremely well despite their small size and modest resource requirements. This is powerful because we see LLMs becoming the natural language engine in all sorts of scenarios, ranging from enterprise-grade chatbots to local LLMs that run on your mobile phones and laptops. The former can help streamline enterprise workflows, while the latter will help you in your day-to-day life.

With Gemma, Google has demonstrated that it is possible to achieve benchmark-beating quality with smaller models, and we expect it to take the crown from our two other favorite models - Mistral 7B and Llama2 7B.

Hope you enjoyed reading this. We will be shortly sharing another guide on how to fine-tune Gemma 2b and Gemma 7b for your specific needs. Stay tuned, and feel free to reach out to us via founders@superteams.ai if you have any questions or want to understand how LLMs can help your business.