This project builds a local AI assistant using Qdrant Edge and RAG, enabling fast, private, and accurate responses by combining vector search with LLMs without heavy cloud use.

Modern AI systems are powerful, but most of them rely heavily on cloud infrastructure. Every query is sent over the internet, processed remotely, and returned back to the user. While this works fine, the process introduces latency, privacy risks, and dependency on external services. In sensitive domains like legal or personal data, this becomes a serious limitation.
This project demonstrates how to build a fully functional AI assistant that runs locally, using Qdrant Edge for vector search and a lightweight embedding model. The system retrieves relevant information from local documents and uses an LLM only for generating responses, ensuring both efficiency and control.
Vector databases are the backbone of any RAG system. Popular options include Pinecone, Weaviate, FAISS, and Qdrant. However, most of them either:
Qdrant Edge stands out because:
In simple terms:
Qdrant Edge gives you the power of a production-grade vector database, but small enough to run on your laptop.


project/
│
├── models/ # Embedding model cache
├── qdrant_data/ # Local vector DB storage
├── data.txt # Legal dataset
├── ingest.py # Data → Embeddings → Qdrant
├── main.py # API + RAG pipeline
├── requirements.txt
└── README.md
This structure separates concerns clearly:

This is a classic RAG pipeline, where retrieval improves the accuracy of generation.
Data Ingestion (ingest.py)
The ingestion pipeline is responsible for converting raw text into a searchable vector format. This is one of the most critical parts of the system because the quality of retrieval directly depends on how well the data is prepared.
text_model = TextEmbedding(
model_name="BAAI/bge-small-en",
cache_dir="./models"
)
Embeddings are the foundation of semantic search. Instead of matching keywords, the system converts text into dense numerical vectors in a high-dimensional space.
The chosen model, BGE-small, is particularly effective because:
When a sentence is passed through this model, it is transformed into a vector such that similar meanings lie closer in vector space. This enables semantic retrieval instead of simple keyword matching.
def chunk_text(text: str):
sections = re.split(r"\n\d+\.\s", text)
chunks = []
for sec in sections:
sec = sec.strip()
if sec:
chunks.append(sec)
return chunks
Chunking is essential because LLMs and embedding models perform better on smaller, meaningful pieces of text rather than entire documents.
Here, the document is split based on numbered sections (e.g., "1.", "2."). This ensures:
Poor chunking can lead to irrelevant or incomplete answers, so this step is crucial.
EdgeShard.create(
SHARD_PATH,
EdgeConfig(
vectors={
VECTOR_NAME: EdgeVectorParams(size=384, distance=Distance.Cosine)
}
),
)
This initializes the local vector database.
Key concepts:
Qdrant internally uses advanced indexing (like HNSW), which allows fast nearest-neighbor search even with large datasets.

def insert_data():
edge_shard = create_fresh_shard()
text = load_data()
chunks = chunk_text(text)
points = []
for index, chunk in enumerate(chunks):
stable_id = hashlib.md5(
f"{COLLECTION}:{index}:{chunk}".encode("utf-8")
).hexdigest()
point = Point(
id=stable_id,
vector={VECTOR_NAME: embed(chunk)},
payload={
"content": chunk,
"chunk_index": index,
"source": DATA_FILE
}
)
points.append(point)
edge_shard.update(UpdateOperation.upsert_points(points))
print(f"Inserted {len(points)} structured sections")
Each chunk is stored as a Point, which contains:
The payload is important because after retrieval, we need to show meaningful text, not just vectors.

This file orchestrates the entire system: query processing, retrieval, and response generation.
MODEL_NAME = "BAAI/bge-small-en"
VECTOR_NAME = "text"
COLLECTION = "legal_docs"
TOP_K = 10
OPENROUTER_MODEL = "meta-llama/llama-3-8b-instruct"
BASE_PATH = "./qdrant_data"
SHARD_PATH = os.path.join(BASE_PATH, COLLECTION)
logging.basicConfig(level="INFO")
text_model = TextEmbedding(model_name=MODEL_NAME, cache_dir="./models")
def embed(text: str):
return list(text_model.embed([text]))[0].tolist()
edge_shard: Optional[EdgeShard] = None
def get_edge_shard():
global edge_shard
if edge_shard:
return edge_shard
if not os.path.exists(SHARD_PATH):
raise RuntimeError("Run ingest.py first.")
edge_shard = EdgeShard.load(SHARD_PATH)
return edge_shard
llm_client: Optional[OpenAI] = None
def get_llm_client():
global llm_client
if llm_client:
return llm_client
api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")
if not api_key:
raise RuntimeError("Missing API key")
llm_client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=api_key,
)
return llm_client
def generate_answer(messages):
client = get_llm_client()
response = client.chat.completions.create(
model=OPENROUTER_MODEL,
messages=messages
)
return response.choices[0].message.content
This file acts as the central orchestrator of the system, managing how a user query is processed, how relevant information is retrieved, and how the final response is generated. It connects all components of the RAG pipeline to ensure smooth and accurate functioning.
The embedding model used is BAAI/bge-small-en, which converts both user queries and stored document chunks into dense vector representations. This allows the system to understand semantic meaning rather than relying on simple keyword matching. By ensuring that both queries and documents exist in the same vector space, the system can accurately measure similarity and retrieve the most relevant information.
The system uses the following LLM for response generation:
OPENROUTER_MODEL = "meta-llama/llama-3-8b-instruct"
This language model is responsible for generating the final answer in a human-readable format. Unlike traditional chatbots, it does not rely on its internal knowledge alone but strictly uses the retrieved context from the database. This approach minimizes hallucination and ensures that responses remain accurate and grounded in the provided data.

results = shard.query(
QueryRequest(
query=Query.Nearest(vector, using=VECTOR_NAME),
limit=TOP_K,
with_payload=True
)
)
This step forms the core of the retrieval system, where the user query is matched against stored document embeddings to find the most relevant information. Instead of relying on keyword matching, the system uses semantic similarity, which allows it to understand the meaning behind the query.
This step forms the core of the retrieval system, where the user query is matched against stored document embeddings to find the most relevant information. Instead of relying on keyword matching, the system uses semantic similarity, which allows it to understand the meaning behind the query.
When a user enters a query, it is first converted into a vector using the same embedding model that was used during ingestion. Qdrant then performs a nearest-neighbor search in the vector space using cosine similarity, identifying the closest vectors that represent semantically similar document chunks. The TOP_K parameter ensures that only the most relevant results are retrieved, while with_payload=True allows access to the original text associated with each vector.
This process transforms the system from a basic chatbot into a context-aware retrieval system capable of understanding intent rather than just matching words.
Prompt Engineering
PROMPT_TEMPLATE = """You are a legal assistant.
Use ONLY the provided context.
"""
Prompt engineering plays a crucial role in controlling the behavior of the language model. Since LLMs are inherently generative and may introduce information beyond the provided data, a well-designed prompt ensures that responses remain accurate and grounded.
In this system, the prompt explicitly instructs the model to act as a legal assistant and restrict its responses to the retrieved context only. This prevents hallucination and ensures that the generated output is strictly based on the underlying documents. Additional instructions in the full prompt template guide the model on how to handle summaries, greetings, and unrelated queries, making the assistant more structured and predictable.
This step is essential for maintaining trustworthiness and reliability in the system.
response = client.chat.completions.create(
model=OPENROUTER_MODEL,
messages=messages
)
Once the relevant context is retrieved and structured into a prompt, the language model generates the final response. The LLM processes both the user query and the retrieved context together, allowing it to produce a coherent and meaningful answer.
Unlike traditional AI systems that rely purely on pre-trained knowledge, this approach ensures that the model’s output is grounded in real data. The retrieved context acts as a source of truth, while the LLM focuses on transforming that information into a clear and user-friendly response. This combination of retrieval and generation significantly improves accuracy and reduces the risk of incorrect or fabricated answers.
@app.get("/ask", response_model=AskResponse)
def ask(q: str):
return AskResponse(answer=rag(q))
The API layer serves as the interface between the backend system and external users or applications. By exposing the RAG pipeline through a simple REST endpoint, the system becomes easily accessible and integrable.
When a request is sent to the /ask endpoint, the query is passed through the entire pipeline: embedding, retrieval, prompt construction, and answer generation. The final response is then returned in a structured format. This abstraction allows the system to be seamlessly connected to web interfaces, mobile apps, or other services without modifying the core logic.
This design makes the system modular and extensible, allowing it to evolve into a full-scale production application.
The following examples demonstrate how the system responds to different types of user queries through the built-in interface. The assistant is able to understand the intent of the query, retrieve relevant context from the local database, and generate accurate responses grounded in the document.




This project demonstrates how a complete retrieval-augmented AI assistant can be built without heavy cloud dependency by combining efficient local vector search with controlled language model generation. The core idea is straightforward yet powerful: convert documents into embeddings, retrieve the most relevant context using similarity search, and generate accurate responses grounded strictly in that context. This approach ensures faster performance, improved privacy, and greater reliability, especially for sensitive domains like legal or enterprise data. It also opens up opportunities to extend this system to various use cases such as internal knowledge bases, document assistants, or offline AI tools.
The complete code, along with setup instructions and implementation details, is available on:
https://github.com/AbhinayaPinreddy/qdrant_Edge_RAG
At Superteams.ai, we focus on building practical and scalable AI solutions tailored to real-world business needs. This project reflects that direction by showcasing how edge AI, vector search, and retrieval-based generation can be combined to create efficient, secure, and production-ready AI assistants.
Qdrant Docs: https://qdrant.tech/documentation/
Qdrant Edge GitHub: https://github.com/qdrant/qdrant