Learn how hybrid caching boosts AI app performance - track hit rates, latency, ROI, and efficiency to cut response times under 100ms.
Latency in AI applications isn't like latency in traditional web services. You're not just querying a database, serving an API, or rendering a template; you're running inference on LLMs or LRMs with billions of parameters, generating tokens sequentially, and often chaining multiple LLM calls together. The physics of the problem are different, and so are the solutions.
In this deep dive, we will go through the different approaches you can take to reduce the latency of your AI application. Typical modern enterprise-grade AI applications are powered by data, so we will assume an AI stack that works on complex data (including vectors) and see how we can reduce latency in such a stack.
Before optimizing anything, you need to know where your latency actually comes from. Most AI applications have six major bottleneck categories:
Model inference time is the unavoidable cost of running your neural network. For large language models, this breaks down into time-to-first-token (TTFT) and tokens-per-second (throughput). TTFT includes prompt processing — the model ingesting your entire input context — while throughput determines how fast tokens stream back. A 70B parameter model might give you 20-30 tokens/second on a good GPU, but that first token could take 2-3 seconds (or longer) if your prompt is long.
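To see why both numbers matter, it helps to estimate end-to-end generation time from TTFT and throughput. A rough back-of-the-envelope sketch, with illustrative figures rather than benchmarks:
def estimate_generation_ms(ttft_ms: float, tokens_per_second: float, output_tokens: int) -> float:
    """Rough end-to-end latency: prompt processing plus sequential token generation."""
    decode_ms = (output_tokens / tokens_per_second) * 1000
    return ttft_ms + decode_ms

# Illustrative numbers: a 70B model with a long prompt.
# 2500ms TTFT + 500 tokens at 25 tok/s is roughly 22.5 seconds end to end.
print(estimate_generation_ms(ttft_ms=2500, tokens_per_second=25, output_tokens=500))
Output length dominates for long answers, which is why streaming and caching matter so much more here than in traditional services.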
Query generation latency is where RAG applications start bleeding time. When a user asks "What were Q3 sales for enterprise customers in Singapore?", you can't just pass this to your vector database. You need to:
That's 300-850ms before you've retrieved anything. For complex queries requiring joins across structured and unstructured data, this balloons to 1-2 seconds. And here's the killer: if your initial query generation is wrong (due to LLM hallucination), you'll need to retry, doubling or tripling this time.
Retrieval time from hybrid data stores is more complex than simple vector similarity search. Real-world AI applications query:
The problem isn't any single query — it's the orchestration. A question like "Show me high-value customers who haven't purchased in 90 days and whose support tickets mention billing issues" requires:
That's 210ms minimum for retrieval alone. Under load, when connection pools are saturated and databases are handling concurrent queries, these numbers can easily be 5x higher.
Data fusion and re-ranking overhead happens after retrieval but before LLM inference. You've pulled results from multiple sources — now you need to:
This adds another 160-730ms. Skip it, and your LLM ends up getting irrelevant context, producing worse results. Do it wrong, and you're wasting tokens on low-quality context.
Network latency matters more than you think. If your model runs in ap-south-1 but your users are in Singapore, you're paying 200ms round-trip before any compute happens. With streaming responses, this manifests as delay before the first token appears. With non-streaming, it's pure dead time. But network latency also hits you between services: your API gateway to your query service (20ms), your query service to your vector database (10ms), your vector database to your object storage for document retrieval (40ms). Each hop adds up.
Serialization overhead hits you when you're marshaling complex data structures. JSON encoding/decoding, protobuf serialization, or worse — pickle in Python — can add milliseconds per request. This seems trivial until you're doing it thousands of times per second. With hybrid retrieval returning structured metadata plus document content plus embeddings plus relevance scores, your serialization payloads can hit megabytes. At that size, JSON parsing alone can take 50-100ms per request.
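A quick way to see whether serialization is hurting you is to time the round-trip on a representative payload. A minimal sketch, assuming the optional orjson package is installed and using an illustrative payload shape:
import json
import time

import orjson  # optional dependency, used here only for comparison

# Illustrative hybrid-retrieval payload: metadata plus documents plus scores
payload = {
    "documents": [{"id": i, "text": "lorem ipsum " * 200, "score": 0.87} for i in range(200)],
    "filters": {"tier": "enterprise", "region": "APAC"},
}

start = time.perf_counter()
blob = json.dumps(payload)
json.loads(blob)
print(f"stdlib json round-trip: {(time.perf_counter() - start) * 1000:.1f}ms")

start = time.perf_counter()
blob = orjson.dumps(payload)
orjson.loads(blob)
print(f"orjson round-trip: {(time.perf_counter() - start) * 1000:.1f}ms")
If the round-trip shows up in your p95, switching the serializer or trimming the payload is usually cheaper than any model-side optimization.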
Queue waiting time emerges under load. GPUs can only process N requests concurrently; request N+1 waits. If you are hosting your own LLMs, you will have to implement proper request batching and scheduling. If you use an external provider, you are dependent on their latency and queueing strategies. The same applies to your retrieval layer — vector databases have concurrency limits, and structured databases have connection pool limits.
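If you host your own models, even a simple concurrency gate in front of the inference client makes queueing explicit and measurable. A minimal sketch using an asyncio semaphore; llm_inference is a placeholder for your actual model client:
import asyncio
import time

MAX_CONCURRENT_INFERENCE = 8  # tune to what your GPUs actually sustain
inference_gate = asyncio.Semaphore(MAX_CONCURRENT_INFERENCE)

async def gated_inference(prompt: str) -> str:
    queued_at = time.perf_counter()
    async with inference_gate:
        wait_ms = (time.perf_counter() - queued_at) * 1000
        # Emit wait_ms to your metrics system; queue time is invisible otherwise
        print(f"queue wait: {wait_ms:.0f}ms")
        return await llm_inference(prompt)  # placeholder for your model client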
And that's when everything goes well. In practice, you may have to retry queries, handle LLM timeouts, or absorb latency spikes during embedding generation, so your final application can end up slower than you imagined.
The good news? Most of this is cacheable.
The fastest code is code that never runs. For AI applications, this means caching, and caching at scale requires rethinking your architecture.
Traditional caching strategies assume deterministic outputs. You ask for user ID 12345's profile, you get the same JSON every time until something changes.
AI applications break this assumption. LLMs are non-deterministic by default. The same prompt can yield different responses. For example, if a user asks "Summarize our Q3 performance" three times, they might get three slightly different summaries—different word choices, different emphasis, different ordering of facts.
This non-determinism makes developers hesitant to cache. They worry about serving stale responses or missing nuanced changes in model output. But here's what most developers miss: for the vast majority of use cases, you don't need that non-determinism.
Consider a RAG application answering support questions. When someone asks "How do I reset my password?", your system:
That's almost 6 seconds. But if 40% of your support queries are variations of the same 50 questions, you're doing 95% wasted work.
The key insight: you can cache at multiple stages of your pipeline, not just the final response.
In a production news search agent we recently built for an enterprise client, adding caching strategically cut cost and latency to roughly a tenth of baseline for 90% of queries. Let's first look at how to set up the caching infrastructure.
Redis is your best friend for low-latency caching at scale. There are other systems, but Redis works well in cluster mode, shards cleanly, and is highly reliable. Its in-memory storage gives you sub-millisecond reads, and distributed deployment means you can handle millions of keys without breaking a sweat. Here's how to architect it properly.
You have three main options: standalone Redis, Redis Sentinel, or Redis Cluster.
Standalone works for development and small-scale production. One instance, simple configuration, easy to reason about. But you have no redundancy. When that instance goes down, your cache disappears and your AI backend gets hammered with cold traffic.
Redis Sentinel gives you high availability through automatic failover. You run multiple Redis replicas, and Sentinel monitors them. When your primary fails, Sentinel promotes a replica. Your cache stays available, but you're still limited by the memory of a single machine for your working set.
Redis Cluster is what you want for serious AI workloads. It shards your data across multiple nodes, giving you horizontal scalability. You can store terabytes of cached responses across dozens of machines. Each node handles a subset of hash slots (16,384 total), and Redis automatically routes requests to the right node.
Here's a minimal cluster setup:
# Six nodes: three masters, three replicas
redis-server --port 7000 --cluster-enabled yes --cluster-config-file nodes-7000.conf --cluster-node-timeout 5000 --appendonly yes
redis-server --port 7001 --cluster-enabled yes --cluster-config-file nodes-7001.conf --cluster-node-timeout 5000 --appendonly yes
redis-server --port 7002 --cluster-enabled yes --cluster-config-file nodes-7002.conf --cluster-node-timeout 5000 --appendonly yes
redis-server --port 7003 --cluster-enabled yes --cluster-config-file nodes-7003.conf --cluster-node-timeout 5000 --appendonly yes
redis-server --port 7004 --cluster-enabled yes --cluster-config-file nodes-7004.conf --cluster-node-timeout 5000 --appendonly yes
redis-server --port 7005 --cluster-enabled yes --cluster-config-file nodes-7005.conf --cluster-node-timeout 5000 --appendonly yes
Here’s how you create the cluster:
# Create the cluster (one replica per master requires all six nodes)
redis-cli --cluster create 127.0.0.1:7000 127.0.0.1:7001 127.0.0.1:7002 127.0.0.1:7003 127.0.0.1:7004 127.0.0.1:7005 --cluster-replicas 1
In production, you'd deploy these across multiple availability zones, use proper monitoring, and tune your TCP keepalive settings to detect dead connections faster.
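From the application side, a cluster-aware client handles slot routing for you. Here's a minimal sketch using redis-py's asyncio cluster client (assumes redis-py 4.3+ and the local ports from the setup above); later snippets refer to a redis_client created along these lines:
from redis.asyncio.cluster import RedisCluster

# Cluster-aware client: discovers all nodes and routes each key to the right shard
redis_client = RedisCluster(host="127.0.0.1", port=7000, decode_responses=False)

async def ping_cluster():
    # Round-trips to the cluster to verify connectivity at startup
    return await redis_client.ping()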
Instead of only caching final responses, you want to cache at every expensive step. This gives you partial cache hits even when the full query is novel.
User queries often have similar intents even when phrased differently. Cache the parsed intent and generated structured queries:
import hashlib
import json

# Assumes an async Redis client (see the cluster client sketch above), created once at startup
async def cached_query_generation(user_query: str):
    # Normalize query for better cache hits
    normalized = user_query.lower().strip()
    cache_key = f"qgen:{hashlib.sha256(normalized.encode()).hexdigest()}"

    # Check cache
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Generate queries (application-specific, LLM-backed query planner)
    result = await generate_queries(user_query)
    # result = {
    #     "vector_query": embedding,
    #     "sql_filters": "customer_tier='enterprise' AND region='EMEA'",
    #     "intent": "sales_data",
    #     "time_range": {"start": "2024-07-01", "end": "2024-09-30"}
    # }
    # (store the embedding as a plain list so it survives the JSON round-trip)

    # Cache for 1 hour
    await redis_client.setex(
        cache_key,
        3600,
        json.dumps(result, default=str)
    )
    return result
This alone can save you 300-500ms on cache hits. For high-traffic applications, that's 30-50% of your query generation load eliminated.
Cache the actual retrieval results based on the generated queries. This is trickier because you're caching based on structured filters plus vector similarity:
import asyncio

async def cached_retrieval(query_params: dict):
    # Create composite cache key from all query parameters
    # (vector_query is assumed to be a float32 numpy array)
    key_parts = [
        f"v:{hashlib.sha256(query_params['vector_query'].tobytes()).hexdigest()[:16]}",
        f"sql:{hashlib.sha256(query_params['sql_filters'].encode()).hexdigest()[:16]}",
        f"t:{query_params['time_range']['start']}_{query_params['time_range']['end']}"
    ]
    cache_key = f"retrieval:{'_'.join(key_parts)}"

    # Check cache
    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Execute parallel retrieval across the vector, relational, and graph stores
    vector_results, structured_results, graph_results = await asyncio.gather(
        vector_db.search(query_params['vector_query'], top_k=20),
        postgres.query(query_params['sql_filters']),
        neo4j.traverse(query_params.get('graph_query'))
    )

    # Combine results
    results = {
        "documents": vector_results,
        "structured_data": structured_results,
        "relationships": graph_results
    }

    # Cache for 10 minutes (retrieval results can be stale-ish)
    await redis_client.setex(
        cache_key,
        600,
        json.dumps(results, default=str)
    )
    return results
Retrieval caching is powerful because:
After retrieval, you perform data fusion, re-ranking, and context construction. This is CPU-intensive and deterministic given the same inputs:
async def cached_context_fusion(retrieval_results: dict, user_query: str):
    # Cache key combines retrieval results + user query
    results_hash = hashlib.sha256(
        json.dumps(retrieval_results, sort_keys=True, default=str).encode()
    ).hexdigest()
    query_hash = hashlib.sha256(user_query.encode()).hexdigest()
    cache_key = f"context:{results_hash[:16]}_{query_hash[:16]}"

    cached = await redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Perform expensive operations
    deduplicated = deduplicate_documents(retrieval_results['documents'])
    reranked = await rerank_with_cross_encoder(deduplicated, user_query)
    filtered = apply_business_rules(reranked, retrieval_results['structured_data'])
    context = construct_prompt_context(filtered, max_tokens=4000)

    result = {
        "context": context,
        "sources": [doc['id'] for doc in filtered[:10]]
    }

    # Cache for 30 minutes
    await redis_client.setex(cache_key, 1800, json.dumps(result))
    return result
This saves 200-500ms of CPU-bound processing on every cache hit.
Finally, cache the complete LLM response. But use semantic similarity matching to handle query variations (this relies on the RediSearch module, available in Redis Stack):
import numpy as np

from redis.commands.search.field import VectorField, TextField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

# Create semantic cache index (run once at startup; await it if your client is async)
schema = (
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": "COSINE"
    }),
    TextField("query"),
    TextField("response"),
    TextField("sources")
)
redis_client.ft("semantic_cache").create_index(
    schema,
    definition=IndexDefinition(prefix=["response:"], index_type=IndexType.HASH)
)

async def semantic_cache_lookup(query: str, embedding: np.ndarray, threshold: float = 0.96):
    # Search for semantically similar cached queries
    knn_query = (
        Query("*=>[KNN 3 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("query", "response", "sources", "score")
        .dialect(2)
    )
    results = await redis_client.ft("semantic_cache").search(
        knn_query,
        query_params={"vec": embedding.astype(np.float32).tobytes()}
    )
    for doc in results.docs:
        # Redis returns cosine *distance*; convert it to a similarity before comparing
        similarity = 1 - float(doc.score)
        if similarity >= threshold:
            return {
                "response": doc.response,
                "sources": json.loads(doc.sources),
                "cache_hit": True,
                "similarity": similarity
            }
    return None
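The lookup needs a matching write path. Here's a minimal sketch of the cache_final_response helper used in the pipeline below, storing each response as a hash under the indexed response: prefix; the field names match the schema above, and the TTL is an assumption:
async def cache_final_response(query: str, embedding: np.ndarray, response: str, sources: list):
    # Key lands under the "response:" prefix so the semantic_cache index picks it up
    key = f"response:{hashlib.sha256(query.encode()).hexdigest()}"
    await redis_client.hset(key, mapping={
        "embedding": embedding.astype(np.float32).tobytes(),
        "query": query,
        "response": response,
        "sources": json.dumps(sources)
    })
    await redis_client.expire(key, 3600)  # assumed TTL: 1 hour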
Here's how it all comes together:
async def handle_query(user_query: str):
    # Stage 1: Query generation (with cache)
    query_params = await cached_query_generation(user_query)
    # Saves ~400ms on hit

    # Stage 2: Retrieval (with cache)
    retrieval_results = await cached_retrieval(query_params)
    # Saves ~200ms on hit

    # Stage 3: Context fusion (with cache)
    context = await cached_context_fusion(retrieval_results, user_query)
    # Saves ~300ms on hit

    # Stage 4: Check semantic cache for final response
    query_embedding = await embed_query(user_query)
    cached_response = await semantic_cache_lookup(user_query, query_embedding)
    if cached_response:
        # Total cache hit - return immediately
        # Saved ~5900ms (everything except the cache lookup itself)
        return cached_response

    # Partial cache hit - only need to run LLM
    # Already saved ~900ms from stages 1-3
    prompt = construct_prompt(context['context'], user_query)
    response = await llm_inference(prompt)

    # Cache the final response
    await cache_final_response(user_query, query_embedding, response, context['sources'])

    return {
        "response": response,
        "sources": context['sources'],
        "cache_hit": False
    }
With this architecture:
Your effective latency distribution shifts dramatically. Instead of p50 = 6000ms, you get p50 = 50ms with good cache warming.
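Cache warming is what makes that p50 real from day one. Here's a minimal sketch that replays your most frequent historical queries through the normal pipeline at deploy time; top_queries is a placeholder for wherever you log query frequency:
async def warm_cache(top_queries: list[str], concurrency: int = 5):
    # Replay hot queries through the normal path so every cache tier gets populated
    gate = asyncio.Semaphore(concurrency)

    async def warm_one(q: str):
        async with gate:
            try:
                await handle_query(q)
            except Exception as exc:
                # Warming is best-effort; a failure here should never block a deploy
                print(f"warm failed for {q!r}: {exc}")

    await asyncio.gather(*(warm_one(q) for q in top_queries))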
Embedding generation is sneaky and expensive. A typical Bedrock embedding model takes 50-150ms per query. Larger models like OpenAI's text-embedding-3-large take 100-200ms due to API latency. If you're re-embedding the same query repeatedly, you're wasting time and money.
class EmbeddingCache:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.local_cache = {}  # L1 cache (in-process)

    async def lookup(self, text: str, model: str = "default"):
        """Cache-only lookup: returns the embedding, or None on a miss.
        Split out so batch helpers can check the cache without triggering generation."""
        # L1: Check local memory
        cache_key = f"{model}:{text}"
        if cache_key in self.local_cache:
            return self.local_cache[cache_key]
        # L2: Check Redis
        redis_key = f"emb:{model}:{hashlib.sha256(text.encode()).hexdigest()}"
        cached = await self.redis.get(redis_key)
        if cached:
            embedding = np.frombuffer(cached, dtype=np.float32)
            self.local_cache[cache_key] = embedding
            return embedding
        return None

    async def store(self, text: str, embedding, model: str = "default"):
        # Store in both caches
        cache_key = f"{model}:{text}"
        redis_key = f"emb:{model}:{hashlib.sha256(text.encode()).hexdigest()}"
        self.local_cache[cache_key] = embedding
        await self.redis.setex(
            redis_key,
            86400 * 7,  # 1 week TTL
            embedding.tobytes()
        )

    async def get_or_create_embedding(self, text: str, model: str = "default"):
        cached = await self.lookup(text, model)
        if cached is not None:
            return cached
        # Generate the embedding, then write it through both cache tiers
        embedding = await generate_embedding(text, model)
        await self.store(text, embedding, model)
        return embedding
Embedding caches have very long TTLs (days or weeks) because the embedding for a given text doesn't change unless you change models.
When you need to embed multiple texts (e.g., for re-ranking), batch them efficiently while leveraging the cache:
async def batch_embed_with_cache(texts: list[str], model: str = "default"):
    results = {}
    uncached_texts = []
    uncached_indices = []

    # Check cache for each text (lookup only, so a miss doesn't trigger a single-item embed)
    for i, text in enumerate(texts):
        embedding = await embedding_cache.lookup(text, model)
        if embedding is not None:
            results[i] = embedding
        else:
            uncached_texts.append(text)
            uncached_indices.append(i)

    # Batch embed uncached texts in a single call
    if uncached_texts:
        new_embeddings = await batch_generate_embeddings(uncached_texts, model)
        for idx, embedding in zip(uncached_indices, new_embeddings):
            results[idx] = embedding
            # Cache new embeddings in both tiers
            await embedding_cache.store(texts[idx], embedding, model)

    # Return in original order
    return [results[i] for i in range(len(texts))]
This reduces embedding costs dramatically. For a re-ranking operation over 50 documents, you might only need to embed 5-10 new ones, cutting latency from 500ms to 100ms.
Your vector and structured databases are often the biggest hidden latency contributors. Even when your LLM is fast, the retrieval layer can bottleneck you — especially under concurrent load. The trick is to bring frequently accessed data closer to the application and precompute what you can.
A common anti-pattern in AI apps is hammering the same primary database for every query. Instead, set up read replicas for high-read workloads (like metadata retrieval or semantic joins). PostgreSQL, for instance, supports asynchronous replication out of the box. Combined with a connection pooler like PgBouncer, you can handle thousands of concurrent lightweight queries without hitting connection limits.
Then configure your retrieval logic to route low-priority or cacheable queries to replicas. You get instant horizontal scaling and lower contention.
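A minimal sketch of that routing with asyncpg: two pools, with cacheable or low-priority reads going to the replica. The DSNs and pool sizes are illustrative:
import asyncpg

async def init_pools():
    # Writes and strongly consistent reads go to the primary
    primary = await asyncpg.create_pool(dsn="postgresql://app@pg-primary/appdb", max_size=20)
    # Cacheable / low-priority reads go to an async replica (behind PgBouncer in production)
    replica = await asyncpg.create_pool(dsn="postgresql://app@pg-replica/appdb", max_size=50)
    return primary, replica

async def run_read(pools, sql: str, *args, cacheable: bool = True):
    primary, replica = pools
    pool = replica if cacheable else primary
    async with pool.acquire() as conn:
        return await conn.fetch(sql, *args)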
AI applications often run the same “fusion” query patterns repeatedly — like joining structured filters with document embeddings. Instead of executing complex joins each time, use materialized views.
A materialized view is a precomputed result set stored physically, which can be refreshed on a schedule or triggered by updates.
For example:
CREATE MATERIALIZED VIEW customer_activity_summary AS
SELECT
c.id AS customer_id,
c.tier,
COUNT(o.id) AS order_count,
MAX(o.created_at) AS last_purchase,
AVG(v.similarity_score) AS avg_semantic_match
FROM customers c
JOIN orders o ON o.customer_id = c.id
JOIN vector_matches v ON v.customer_id = c.id
GROUP BY c.id, c.tier;
Now, your AI layer just runs:
SELECT * FROM customer_activity_summary WHERE tier = 'enterprise';
Instant 100–200ms saved per query, plus reduced load on your live tables.
In modern cloud-native setups, you can even auto-refresh these using event-driven triggers or cron-based refreshes (REFRESH MATERIALIZED VIEW CONCURRENTLY ...).
For vector data, you can maintain partial materialized views or precomputed ANN indexes for your top queries (like “high-value customers” or “open tickets”). If your application tracks query frequency, you can periodically snapshot these hot queries and pre-index their results.
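Tracking query frequency is cheap with a Redis sorted set, and a periodic job can snapshot the current hot set for pre-indexing or cache warming. A minimal sketch; the key names are illustrative:
async def record_query(normalized_query: str):
    # Increment the query's score; the sorted set doubles as a frequency table
    await redis_client.zincrby("query_frequency", 1, normalized_query)

async def get_hot_queries(top_n: int = 50) -> list[str]:
    # Highest-scoring members are the candidates for precomputed ANN results
    members = await redis_client.zrevrange("query_frequency", 0, top_n - 1)
    return [m.decode() if isinstance(m, bytes) else m for m in members]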
This creates a multi-tier retrieval strategy:
We have found that there's an opportunity for caching in most AI applications, even where you think the AI response is highly tailored to the user. Consider these real-world scenarios:
Support chatbots: When someone asks "How do I reset my password?", the answer doesn't need to be creatively varied. Users want accurate, consistent information. Serving a cached response from 10 minutes ago is perfectly fine — even desirable, because it reduces response time from 6 seconds to 50 milliseconds.
Document Q&A systems: "What is our refund policy?" has one correct answer based on your documentation. The LLM might phrase it slightly differently each time, but the semantic content is identical. Users don't care about stylistic variation — they want the information fast.
Data analysis queries: "What were our top-selling products in Q3?" produces a factual answer based on structured data. The underlying data doesn't change minute-to-minute. A 10-minute cache TTL is perfectly reasonable, and you've just eliminated 95% of your most expensive operations.
Code generation: "Write a function to validate email addresses in Python" has many valid implementations, but developers don't need a novel solution every time. They want working code quickly. Caching common requests is a win.
The key insight is understanding when non-determinism adds value versus when it's just computational waste. For creative writing, brainstorming, or generating multiple alternative solutions, non-determinism is valuable. For factual retrieval, data queries, and standard procedures, it's not.
Let's quantify what caching means for your infrastructure costs and user experience:
Without caching:
With 70% cache hit rate:
The math is brutal: every cache hit saves you money and makes users happier. But the benefits compound beyond simple cost savings:
Reduced infrastructure needs: With 70% cache hit rate, you need 1/3 the GPU capacity to handle the same load. That's the difference between 6 A100s and 2 A100s—roughly $50,000/year in cloud costs.
Improved tail latencies: Cache hits are consistently fast. This means your p95 and p99 latencies drop dramatically. Instead of having some requests take 10+ seconds under load, your worst case becomes 6 seconds (uncached) while most requests complete in under 100ms.
Better user retention: Users perceive applications with sub-second response times as "instant." The psychological difference between 50ms and 6000ms is massive. Studies show conversion rates drop 7% for every additional second of latency. A 5-second improvement can mean 35% more conversions.
Graceful degradation: When your backend services have issues, cached responses keep your application responsive. Your cache becomes a buffer that maintains user experience even when your LLM provider has an outage or your vector database is struggling.
Not all queries are equally cacheable. You have to make a judgment call, based on your system architecture, about which queries should be cached. Understanding your cache hit distribution is critical for optimization:
Power law distribution: In most applications, 20% of unique queries account for 80% of traffic. These high-frequency queries should have near-100% cache hit rates. A "How do I reset my password?" query might be asked 50 times per day, but it's essentially the same question.
Long tail: The remaining 80% of unique queries might each be asked only once or twice. These queries will always miss the cache on first request, but semantic caching can still help if queries are similar even when not identical.
Temporal patterns: Query distributions change throughout the day, week, and year. Morning queries differ from evening queries. Monday patterns differ from Friday patterns. Q4 queries differ from Q1 queries. Your cache warming strategy needs to account for these patterns.
User cohorts: Different user types have different query patterns. Admin users ask about system metrics and user management. End users ask about product features and how-to questions. Your cache should optimize for the query distribution of your largest user cohort.
While caching is a powerful technique, there are legitimate cases where it's the wrong choice:
Personalized responses: If your LLM responses depend on user-specific context that changes frequently (recent browsing history, real-time recommendations), caching becomes complex. You'd need user-specific cache keys, which fragments your cache and reduces hit rates.
Security-sensitive queries: For queries involving access control or sensitive data, you must be extremely careful about cache poisoning. One user's query shouldn't leak data to another user through a shared cache.
Truly time-sensitive data: For queries about stock prices, sports scores, or breaking news, staleness matters. You'd need very short TTLs (seconds, not minutes), which reduces the effectiveness of caching.
Creative or exploratory requests: When users explicitly want variety—"Give me 5 different marketing slogans" or "Brainstorm product ideas"—serving identical cached responses defeats the purpose.
The solution isn't to avoid caching entirely — it's to be selective about what you cache and how long you cache it. Use cache tags to segregate different types of queries, and implement smart TTL logic based on query classification:
def classify_query_and_get_ttl(query: str, intent: str) -> tuple[bool, int]:
    """
    Returns (should_cache, ttl_seconds)
    """
    # Don't cache creative or personalized queries
    if intent in ['creative_writing', 'brainstorming', 'personalized_recommendation']:
        return (False, 0)
    # Short TTL for time-sensitive data
    if intent in ['stock_prices', 'sports_scores', 'breaking_news']:
        return (True, 60)  # 1 minute
    # Medium TTL for operational data
    if intent in ['user_metrics', 'system_status', 'recent_activity']:
        return (True, 300)  # 5 minutes
    # Long TTL for factual/procedural content
    if intent in ['documentation', 'how_to', 'policy_questions']:
        return (True, 3600)  # 1 hour
    # Very long TTL for historical data
    if intent in ['historical_analysis', 'archived_content']:
        return (True, 86400)  # 24 hours
    # Default: cache with medium TTL
    return (True, 600)  # 10 minutes
Here's the sophisticated approach: even when you can't cache the final response, you can cache intermediate results. This is where the multi-stage caching architecture shines:
Scenario: A user asks "What's our current runway based on the latest burn rate?"
This query is partially time-sensitive. The burn rate changes daily, so you can't cache the final response for more than a few hours. But you can cache:
Only the current month's data needs real-time retrieval. You've just reduced your latency from 6 seconds to 2 seconds even though the full response isn't cacheable.
async def handle_financial_query(query: str):
    # Cache query generation (structure doesn't change)
    sql_query = await cached_query_generation(query)  # 24hr TTL

    # Split into historical (cacheable) and current (fresh) data
    historical_data = await cached_historical_retrieval(sql_query)  # Indefinite TTL
    current_data = await fresh_data_retrieval(sql_query)  # No cache

    # Combine data
    combined_data = merge_timeseries(historical_data, current_data)

    # Use cached response template
    template = await cached_response_template(query)  # 1 week TTL

    # Only the final formatting requires the LLM (much faster than full generation)
    response = await quick_format_with_template(template, combined_data)
    return response
This hybrid caching strategy gives you the best of both worlds: freshness where it matters, speed where it doesn't. Your effective cache hit rate becomes much higher when you measure it at the operation level rather than the request level.
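One way to make that concrete: count every cached operation, not just requests whose final response came from cache. A small illustration with assumed per-stage hit rates:
# Per-stage cache hit rates (illustrative figures)
stage_hit_rates = {
    "query_generation": 0.80,
    "retrieval": 0.60,
    "context_fusion": 0.55,
    "final_response": 0.40,
}

# Request-level: only counts requests whose *final* response came from cache
request_level = stage_hit_rates["final_response"]

# Operation-level: every cached stage counts, so partial hits show up too
operation_level = sum(stage_hit_rates.values()) / len(stage_hit_rates)

print(f"request-level hit rate:   {request_level:.0%}")    # 40%
print(f"operation-level hit rate: {operation_level:.0%}")  # ~59%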
You need metrics to know if your caching strategy is working:
Cache hit rate by stage: Track hits/misses for each stage of your pipeline separately. You might have an 80% hit rate on query generation but only 40% on final responses. This tells you where to focus optimization efforts.
Latency reduction from caching: Measure p50, p95, p99 latencies for cached vs uncached requests. Your cached requests should be 10-100x faster.
Cost savings: Track inference costs with and without caching. Calculate ROI on your Redis infrastructure vs the LLM costs you're avoiding.
Cache efficiency ratio: Bytes stored vs bytes served. If you're storing gigabytes of cached data but only serving a small fraction repeatedly, your cache is poorly sized or your TTLs are too long.
Staleness incidents: Track how often users complain about outdated responses. This helps you calibrate your TTLs—too long and responses are stale, too short and you're missing optimization opportunities.
The goal isn't 100% cache hit rate—that's impossible and undesirable. The goal is maximizing the product of hit rate and value per hit. A 60% cache hit rate on expensive queries is far more valuable than a 90% cache hit rate on cheap queries.
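A minimal sketch of the per-stage instrumentation behind these metrics: increment a hit or miss counter in Redis at each cache check and read the ratios off a dashboard. The key names are illustrative:
async def record_cache_result(stage: str, hit: bool):
    # e.g. stage in {"qgen", "retrieval", "context", "response"}
    field = "hit" if hit else "miss"
    await redis_client.hincrby(f"cache_stats:{stage}", field, 1)

async def stage_hit_rate(stage: str) -> float:
    stats = await redis_client.hgetall(f"cache_stats:{stage}")
    # Keys may be bytes or str depending on decode_responses
    hits = int(stats.get(b"hit", stats.get("hit", 0)))
    misses = int(stats.get(b"miss", stats.get("miss", 0)))
    return hits / (hits + misses) if (hits + misses) else 0.0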
With proper caching architecture, you can serve the majority of requests in under 100ms while maintaining accuracy and freshness where it matters. That's the difference between an AI application that feels sluggish and one that feels magical.
As we’ve discussed numerous times before, the AI stack adds an additional layer of complexity — not just computationally, but architecturally. Latency doesn’t emerge from a single source; it’s a sum of inefficiencies across inference, retrieval, data orchestration, and network communication. The key is to approach it through first principles: understand where every millisecond goes, question every serialization, and measure before optimizing.
By combining classical distributed systems wisdom (caching, batching, locality, async I/O, and connection pooling) with AI-era innovations (vector indexing, retrieval fusion, model quantization, and semantic caching), you can achieve near real-time performance even with massive models.
Ultimately, reducing latency isn’t just a technical pursuit — it’s a user experience and business strategy. Faster AI systems feel smarter, more responsive, and more trustworthy. And when every 100ms counts in retaining user flow or closing enterprise adoption gaps, thoughtful latency engineering becomes a competitive moat.
At Superteams.ai, we help businesses understand AI through workshops, demos and AI sprints. If you are planning to bring AI into your business stack or want to simply learn more about AI’s ROI for your business, feel free to get in touch for a 30-min consultation call.