Academy
Updated on
Apr 21, 2026

Build an On-Device AI Assistant with RAG and Qdrant Edge

This project builds a local AI assistant using Qdrant Edge and RAG, enabling fast, private, and accurate responses by combining vector search with LLMs without heavy cloud use.

Build an On-Device AI Assistant with RAG and Qdrant Edge
Ready to ship your own agentic-AI solution in 30 days? Book a free strategy call now.

Modern AI systems are powerful, but most of them rely heavily on cloud infrastructure. Every query is sent over the internet, processed remotely, and returned back to the user. While this works fine, the process introduces latency, privacy risks, and dependency on external services. In sensitive domains like legal or personal data, this becomes a serious limitation.

This project demonstrates how to build a fully functional AI assistant that runs locally, using Qdrant Edge for vector search and a lightweight embedding model. The system retrieves relevant information from local documents and uses an LLM only for generating responses, ensuring both efficiency and control.




Why Qdrant Edge ?

Vector databases are the backbone of any RAG system. Popular options include Pinecone, Weaviate, FAISS, and Qdrant. However, most of them either:

  • Require a cloud setup (Pinecone)
  • Need heavy configuration (Weaviate)
  • Lack structured payload support (FAISS)

Qdrant Edge stands out because:

  • It is designed specifically for on-device usage, unlike cloud-first solutions
  • It provides persistent storage, unlike FAISS which is mostly in-memory
  • It supports payload filtering + metadata, making it ideal for structured documents
  • It is lightweight yet production-ready, unlike experimental local solutions
  • It uses efficient indexing (HNSW) for fast similarity search

In simple terms:

Qdrant Edge gives you the power of a production-grade vector database, but small enough to run on your laptop.




Project Structure

project/
├── models/           # Embedding model cache
├── qdrant_data/      # Local vector DB storage
├── data.txt          # Legal dataset
├── ingest.py         # Data → Embeddings → Qdrant
├── main.py           # API + RAG pipeline
├── requirements.txt
└── README.md

This structure separates concerns clearly:

  • Data layer → data.txt
  • Processing layer → ingest.py
  • Serving layer → main.py



System Flow

This is a classic RAG pipeline, where retrieval improves the accuracy of generation.

Data Ingestion (ingest.py)

The ingestion pipeline is responsible for converting raw text into a searchable vector format. This is one of the most critical parts of the system because the quality of retrieval directly depends on how well the data is prepared.




Embedding Model

text_model = TextEmbedding(
    model_name="BAAI/bge-small-en",
    cache_dir="./models"
)

Embeddings are the foundation of semantic search. Instead of matching keywords, the system converts text into dense numerical vectors in a high-dimensional space.

The chosen model, BGE-small, is particularly effective because:

  • It balances accuracy and speed
  • Produces 384-dimensional vectors, which are compact yet expressive
  • Works efficiently on CPU, making it ideal for edge devices

When a sentence is passed through this model, it is transformed into a vector such that similar meanings lie closer in vector space. This enables semantic retrieval instead of simple keyword matching.

Chunking Strategy

def chunk_text(text: str):
    sections = re.split(r"\n\d+\.\s", text)

    chunks = []
    for sec in sections:
        sec = sec.strip()
        if sec:
            chunks.append(sec)

    return chunks

Chunking is essential because LLMs and embedding models perform better on smaller, meaningful pieces of text rather than entire documents.

Here, the document is split based on numbered sections (e.g., "1.", "2."). This ensures:

  • Each chunk represents a logical legal section
  • Context remains structured and meaningful
  • Retrieval becomes more precise and relevant

Poor chunking can lead to irrelevant or incomplete answers, so this step is crucial.

Creating Qdrant Edge Shard

EdgeShard.create(
    SHARD_PATH,
    EdgeConfig(
        vectors={
            VECTOR_NAME: EdgeVectorParams(size=384, distance=Distance.Cosine)
        }
    ),
)

This initializes the local vector database.

Key concepts:

  • Vector size = 384 → matches embedding dimension
  • Cosine similarity → measures angle between vectors (better for semantic similarity)

Qdrant internally uses advanced indexing (like HNSW), which allows fast nearest-neighbor search even with large datasets.




Data Insertion

def insert_data():
    edge_shard = create_fresh_shard()
    text = load_data()
    chunks = chunk_text(text)

    points = []

    for index, chunk in enumerate(chunks):
        stable_id = hashlib.md5(
            f"{COLLECTION}:{index}:{chunk}".encode("utf-8")
        ).hexdigest()

        point = Point(
            id=stable_id,
            vector={VECTOR_NAME: embed(chunk)},
            payload={
                "content": chunk,
                "chunk_index": index,
                "source": DATA_FILE
            }
        )
        points.append(point)

    edge_shard.update(UpdateOperation.upsert_points(points))

    print(f"Inserted {len(points)} structured sections")

Each chunk is stored as a Point, which contains:

  • ID → unique identifier (ensures no duplicates)
  • Vector → used for similarity search
  • Payload → actual text + metadata

The payload is important because after retrieval, we need to show meaningful text, not just vectors.




RAG Pipeline (main.py)

This file orchestrates the entire system: query processing, retrieval, and response generation.

MODEL_NAME = "BAAI/bge-small-en"
VECTOR_NAME = "text"
COLLECTION = "legal_docs"

TOP_K = 10
OPENROUTER_MODEL = "meta-llama/llama-3-8b-instruct"

BASE_PATH = "./qdrant_data"
SHARD_PATH = os.path.join(BASE_PATH, COLLECTION)

logging.basicConfig(level="INFO")
text_model = TextEmbedding(model_name=MODEL_NAME, cache_dir="./models")

def embed(text: str):
    return list(text_model.embed([text]))[0].tolist()
edge_shard: Optional[EdgeShard] = None
def get_edge_shard():
    global edge_shard
    if edge_shard:
        return edge_shard
    if not os.path.exists(SHARD_PATH):
        raise RuntimeError("Run ingest.py first.")

    edge_shard = EdgeShard.load(SHARD_PATH)
    return edge_shard
llm_client: Optional[OpenAI] = None
def get_llm_client():
    global llm_client
    if llm_client:
        return llm_client

    api_key = os.getenv("OPENROUTER_API_KEY") or os.getenv("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Missing API key")

    llm_client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=api_key,
    )
    return llm_client

def generate_answer(messages):
    client = get_llm_client()
    response = client.chat.completions.create(
        model=OPENROUTER_MODEL,
        messages=messages
    )
    return response.choices[0].message.content



Model Explanation

This file acts as the central orchestrator of the system, managing how a user query is processed, how relevant information is retrieved, and how the final response is generated. It connects all components of the RAG pipeline to ensure smooth and accurate functioning.

The embedding model used is BAAI/bge-small-en, which converts both user queries and stored document chunks into dense vector representations. This allows the system to understand semantic meaning rather than relying on simple keyword matching. By ensuring that both queries and documents exist in the same vector space, the system can accurately measure similarity and retrieve the most relevant information.

  • Converts text into meaningful vector representations
  • Enables semantic (meaning-based) search instead of keyword search
  • Ensures query and document alignment in the same vector space

The system uses the following LLM for response generation:

OPENROUTER_MODEL = "meta-llama/llama-3-8b-instruct"

This language model is responsible for generating the final answer in a human-readable format. Unlike traditional chatbots, it does not rely on its internal knowledge alone but strictly uses the retrieved context from the database. This approach minimizes hallucination and ensures that responses remain accurate and grounded in the provided data.

  • Generates structured, human-readable answers
  • Uses only retrieved context instead of external knowledge
  • Reduces hallucination and improves reliability

Vector Search

results = shard.query(
    QueryRequest(
        query=Query.Nearest(vector, using=VECTOR_NAME),
        limit=TOP_K,
        with_payload=True
    )
)

This step forms the core of the retrieval system, where the user query is matched against stored document embeddings to find the most relevant information. Instead of relying on keyword matching, the system uses semantic similarity, which allows it to understand the meaning behind the query.

This step forms the core of the retrieval system, where the user query is matched against stored document embeddings to find the most relevant information. Instead of relying on keyword matching, the system uses semantic similarity, which allows it to understand the meaning behind the query.

When a user enters a query, it is first converted into a vector using the same embedding model that was used during ingestion. Qdrant then performs a nearest-neighbor search in the vector space using cosine similarity, identifying the closest vectors that represent semantically similar document chunks. The TOP_K parameter ensures that only the most relevant results are retrieved, while with_payload=True allows access to the original text associated with each vector.

  • Converts query into a vector representation
  • Uses cosine similarity to find closest matches
  • Returns top-K relevant chunks with actual content

This process transforms the system from a basic chatbot into a context-aware retrieval system capable of understanding intent rather than just matching words.

Prompt Engineering

PROMPT_TEMPLATE = """You are a legal assistant.

Use ONLY the provided context.

"""

Prompt engineering plays a crucial role in controlling the behavior of the language model. Since LLMs are inherently generative and may introduce information beyond the provided data, a well-designed prompt ensures that responses remain accurate and grounded.

In this system, the prompt explicitly instructs the model to act as a legal assistant and restrict its responses to the retrieved context only. This prevents hallucination and ensures that the generated output is strictly based on the underlying documents. Additional instructions in the full prompt template guide the model on how to handle summaries, greetings, and unrelated queries, making the assistant more structured and predictable.

  • Keeps responses grounded in retrieved context
  • Prevents hallucination and external assumptions
  • Enforces domain-specific behavior (legal assistant)

This step is essential for maintaining trustworthiness and reliability in the system.

Answer Generation

response = client.chat.completions.create(
    model=OPENROUTER_MODEL,
    messages=messages
)

Once the relevant context is retrieved and structured into a prompt, the language model generates the final response. The LLM processes both the user query and the retrieved context together, allowing it to produce a coherent and meaningful answer.

Unlike traditional AI systems that rely purely on pre-trained knowledge, this approach ensures that the model’s output is grounded in real data. The retrieved context acts as a source of truth, while the LLM focuses on transforming that information into a clear and user-friendly response. This combination of retrieval and generation significantly improves accuracy and reduces the risk of incorrect or fabricated answers.

  • Combines retrieved context with user query
  • Generates structured, human-readable responses
  • Improves accuracy by grounding outputs in real data



API Layer

@app.get("/ask", response_model=AskResponse)
def ask(q: str):
    return AskResponse(answer=rag(q))

The API layer serves as the interface between the backend system and external users or applications. By exposing the RAG pipeline through a simple REST endpoint, the system becomes easily accessible and integrable.

When a request is sent to the /ask endpoint, the query is passed through the entire pipeline: embedding, retrieval, prompt construction, and answer generation. The final response is then returned in a structured format. This abstraction allows the system to be seamlessly connected to web interfaces, mobile apps, or other services without modifying the core logic.

  • Provides a simple interface to access the RAG system
  • Enables integration with frontend or external applications
  • Supports local deployment and scalability

This design makes the system modular and extensible, allowing it to evolve into a full-scale production application.




Example Outputs from the Assistant

The following examples demonstrate how the system responds to different types of user queries through the built-in interface. The assistant is able to understand the intent of the query, retrieve relevant context from the local database, and generate accurate responses grounded in the document.




Summary

This project demonstrates how a complete retrieval-augmented AI assistant can be built without heavy cloud dependency by combining efficient local vector search with controlled language model generation. The core idea is straightforward yet powerful: convert documents into embeddings, retrieve the most relevant context using similarity search, and generate accurate responses grounded strictly in that context. This approach ensures faster performance, improved privacy, and greater reliability, especially for sensitive domains like legal or enterprise data. It also opens up opportunities to extend this system to various use cases such as internal knowledge bases, document assistants, or offline AI tools.




Github Project Link

The complete code, along with setup instructions and implementation details, is available on:
https://github.com/AbhinayaPinreddy/qdrant_Edge_RAG

At Superteams.ai, we focus on building practical and scalable AI solutions tailored to real-world business needs. This project reflects that direction by showcasing how edge AI, vector search, and retrieval-based generation can be combined to create efficient, secure, and production-ready AI assistants.

References

Qdrant Docs: https://qdrant.tech/documentation/
Qdrant Edge GitHub: https://github.com/qdrant/qdrant

Authors

Want to Scale Your Business with AI Deployed on your Cloud?

Talk to our team and get a complementary agentic AI advisory session.

We use cookies to ensure the best experience on our website. Learn more