Build an On-Device AI Assistant with RAG and Qdrant Edge

Modern AI systems are powerful, but most of them rely heavily on cloud infrastructure. Every query is sent over the internet, processed remotely, and returned back to the user. While this works fine, the process introduces latency, privacy risks, and dependency on external services. In sensitive domains like legal or personal data, this becomes a serious limitation.

This project demonstrates how to build a fully functional AI assistant that runs locally, using Qdrant Edge for vector search and a lightweight embedding model. The system retrieves relevant information from local documents and uses an LLM only for generating responses, ensuring both efficiency and control.

Why Qdrant Edge?

Vector databases are the backbone of any RAG system. Popular options include Pinecone, Weaviate, FAISS, and Qdrant. However, most of them either:

Require a cloud setup (Pinecone)
Need heavy configuration (Weaviate)
Lack structured payload support (FAISS)

Qdrant Edge stands out because:

It is designed specifically for on-device usage, unlike cloud-first solutions
It provides persistent storage, unlike FAISS which is mostly in-memory
It supports payload filtering + metadata, making it ideal for structured documents
It is lightweight yet production-ready, unlike experimental local solutions
It uses efficient indexing (HNSW) for fast similarity search

In simple terms: Qdrant Edge gives you the power of a production-grade vector database, but small enough to run on your laptop.

Project Structure

project/
│
├── models/           # Embedding model cache
├── qdrant_data/      # Local vector DB storage
├── data.txt          # Legal dataset
├── ingest.py         # Data → Embeddings → Qdrant
├── main.py           # API + RAG pipeline
├── requirements.txt
└── README.md

This structure separates concerns clearly:

Data layer → data.txt
Processing layer → ingest.py
Serving layer → main.py

System Flow

This is a classic RAG pipeline, where retrieval improves the accuracy of generation.

Data Ingestion (`ingest.py`)

The ingestion pipeline is responsible for converting raw text into a searchable vector format. This is one of the most critical parts of the system because the quality of retrieval directly depends on how well the data is prepared.

Embedding Model

text_model = TextEmbedding(
    model_name="BAAI/bge-small-en",
    cache_dir="./models"
)

Embeddings are the foundation of semantic search. Instead of matching keywords, the system converts text into dense numerical vectors in a high-dimensional space.

The chosen model, BGE-small, is particularly effective because:

It balances accuracy and speed
Produces 384-dimensional vectors, which are compact yet expressive
Works efficiently on CPU, making it ideal for edge devices

When a sentence is passed through this model, it is transformed into a vector such that similar meanings lie closer in vector space. This enables semantic retrieval instead of simple keyword matching.

Chunking Strategy

def chunk_text(text: str):
    sections = re.split(r"\n\d+\.\s", text)

    chunks = []
    for sec in sections:
        sec = sec.strip()
        if sec:
            chunks.append(sec)

    return chunks

Chunking is essential because LLMs and embedding models perform better on smaller, meaningful pieces of text rather than entire documents.

Here, the document is split based on numbered sections (e.g., “1.”, “2.”). This ensures:

Each chunk represents a logical legal section
Context remains structured and meaningful
Retrieval becomes more precise and relevant

Poor chunking can lead to irrelevant or incomplete answers, so this step is crucial.

Creating the Qdrant Edge Shard

EdgeShard.create(
    SHARD_PATH,
    EdgeConfig(
        vectors={
            VECTOR_NAME: EdgeVectorParams(size=384, distance=Distance.Cosine)
        }
    ),
)

This initializes the local vector database.

Key concepts:

Vector size = 384 → matches embedding dimension
Cosine similarity → measures angle between vectors (better for semantic similarity)
Qdrant internally uses advanced indexing (like HNSW), which allows fast nearest-neighbor search even with large datasets.

Data Insertion

def insert_data():
    edge_shard = create_fresh_shard()
    text = load_data()
    chunks = chunk_text(text)

    points = []

    for index, chunk in enumerate(chunks):
        stable_id = hashlib.md5(
            f"{COLLECTION}:{index}:{chunk}".encode("utf-8")
        ).hexdigest()

        point = Point(
            id=stable_id,
            vector={VECTOR_NAME: embed(chunk)},
            payload={
                "content": chunk,
                "chunk_index": index,
                "source": DATA_FILE
            }
        )
        points.append(point)

    edge_shard.update(UpdateOperation.upsert_points(points))

    print(f"Inserted {len(points)} structured sections")

Each chunk is stored as a Point, which contains:

ID → unique identifier (ensures no duplicates)
Vector → used for similarity search
Payload → actual text + metadata

The payload is important because after retrieval, we need to show meaningful text, not just vectors.

RAG Pipeline (`main.py`)

This file orchestrates the entire system: query processing, retrieval, and response generation.

MODEL_NAME = "BAAI/bge-small-en"
VECTOR_NAME = "text"
COLLECTION = "legal_docs"

TOP_K = 10
OPENROUTER_MODEL = "meta-llama/llama-3-8b-instruct"

BASE_PATH = "./qdrant_data"
SHARD_PATH = os.path.join(BASE_PATH, COLLECTION)

logging.basicConfig(level="INFO")
text_model = TextEmbedding(model_name=MODEL_NAME, cache_dir="./models")

def embed(text: str):
    return list(text_model.embed([text]))[0].tolist()

edge_shard: Optional[EdgeShard] = None

def get_edge_shard():
    global edge_shard
    if edge_shard:
        return edge_shard
    if not os.path.exists(SHARD_PATH):
        raise RuntimeError("Run ingest.py first.")

    edge_shard = EdgeShard.load(SHARD_PATH)
    return edge_shard

llm_client: Optional[OpenAI] = None

def get_llm_client():
    global llm_client
    if llm_client:
        return llm_client

    api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Missing API key")

    llm_client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=api_key,
    )
    return llm_client

def generate_answer(messages):
    client = get_llm_client()
    response = client.chat.completions.create(
        model=OPENROUTER_MODEL,
        messages=messages
    )
    return response.choices[0].message.content

Embedding Model

The embedding model BAAI/bge-small-en converts both user queries and stored document chunks into dense vector representations. This allows the system to understand semantic meaning rather than relying on simple keyword matching. By ensuring that both queries and documents exist in the same vector space, the system can accurately measure similarity and retrieve the most relevant information.

The system uses meta-llama/llama-3-8b-instruct via OpenRouter for response generation. Unlike traditional chatbots, it does not rely on its internal knowledge alone but strictly uses the retrieved context from the database. This approach minimizes hallucination and ensures that responses remain accurate and grounded in the provided data.

Vector Search

results = shard.query(
    QueryRequest(
        query=Query.Nearest(vector, using=VECTOR_NAME),
        limit=TOP_K,
        with_payload=True
    )
)

When a user enters a query, it is first converted into a vector using the same embedding model used during ingestion. Qdrant then performs a nearest-neighbor search in the vector space using cosine similarity, identifying the closest vectors that represent semantically similar document chunks. The TOP_K parameter ensures that only the most relevant results are retrieved, while with_payload=True allows access to the original text associated with each vector.

This transforms the system from a basic chatbot into a context-aware retrieval system capable of understanding intent rather than just matching words.

Prompt Engineering

PROMPT_TEMPLATE = """You are a legal assistant.
Use ONLY the provided context.
"""

The prompt explicitly instructs the model to act as a legal assistant and restrict its responses to the retrieved context only. This prevents hallucination and ensures that the generated output is strictly based on the underlying documents. Additional instructions in the full prompt template guide the model on how to handle summaries, greetings, and unrelated queries, making the assistant more structured and predictable.

Answer Generation

response = client.chat.completions.create(
    model=OPENROUTER_MODEL,
    messages=messages
)

Once the relevant context is retrieved and structured into a prompt, the language model generates the final response. The LLM processes both the user query and the retrieved context together, allowing it to produce a coherent and meaningful answer. The retrieved context acts as a source of truth, while the LLM focuses on transforming that information into a clear and user-friendly response.

API Layer

@app.get("/ask", response_model=AskResponse)
def ask(q: str):
    return AskResponse(answer=rag(q))

By exposing the RAG pipeline through a simple REST endpoint, the system becomes easily accessible and integrable. When a request is sent to the /ask endpoint, the query is passed through the entire pipeline: embedding, retrieval, prompt construction, and answer generation. This abstraction allows the system to be seamlessly connected to web interfaces, mobile apps, or other services without modifying the core logic.

Summary

This project demonstrates how a complete retrieval-augmented AI assistant can be built without heavy cloud dependency by combining efficient local vector search with controlled language model generation. The core idea is straightforward yet powerful: convert documents into embeddings, retrieve the most relevant context using similarity search, and generate accurate responses grounded strictly in that context. This approach ensures faster performance, improved privacy, and greater reliability, especially for sensitive domains like legal or enterprise data. It also opens up opportunities to extend this system to various use cases such as internal knowledge bases, document assistants, or offline AI tools.

GitHub Project Link

The complete code, along with setup instructions and implementation details, is available on GitHub: AbhinayaPinreddy/qdrant_Edge_RAG

At Superteams.ai, we focus on building practical and scalable AI solutions tailored to real-world business needs. This project reflects that direction by showcasing how edge AI, vector search, and retrieval-based generation can be combined to create efficient, secure, and production-ready AI assistants.

Build an On-Device AI Assistant with RAG and Qdrant Edge

Why Qdrant Edge?

Project Structure

System Flow

Data Ingestion (ingest.py)

Embedding Model

Chunking Strategy

Creating the Qdrant Edge Shard

Data Insertion

RAG Pipeline (main.py)

Embedding Model

Vector Search

Prompt Engineering

Answer Generation

API Layer

Summary

GitHub Project Link

References

More from Academy

How to Build a Target Speaker Extraction System Using SpeechBrain, ECAPA-TDNN, and SepFormer

Benchmarking AI and Traditional Speech Enhancement Methods for Noise Reduction

How to Build a Sovereign AI Stack for Automated Parsing and Search through Medical Prescriptions

Your Car Is Getting a Brain: How to Build a Branded In-Car Voice Assistant

How to Build a Defect Detection AI Agent in Manufacturing (YOLO / CLIP + Qdrant Edge + DeepStream)

Data Ingestion (`ingest.py`)

RAG Pipeline (`main.py`)