Semantic Caching for AI Applications: Cut Costs and Latency by 90%

Imagine cutting your AI application costs by 90% while simultaneously slashing latency. Sounds too good to be true? It’s not. Semantic caching, a technique that goes beyond traditional exact-match caching, can deliver exactly these kinds of results. I’ve implemented semantic caching across various AI projects at Quartalis, and the impact on both performance and budget has been game-changing. Let’s dive into how it works and how you can leverage it in your own applications.
What is Semantic Caching?
Traditional caching, or exact-match caching, works by storing the results of a computation (like an API call or database query) and returning that result directly if the exact same input is seen again. This is incredibly fast, but it’s also very brittle. If even a single character changes in the input, the cache misses, and the computation is re-run.
Semantic caching, on the other hand, understands the meaning of the input. It stores results and their associated inputs, but instead of requiring an exact match, it uses techniques from Natural Language Processing (NLP) to determine if a new input is semantically similar to a previously cached input. If the similarity is above a certain threshold, the cached result is returned.
Think of it like this: If you ask a search engine “What’s the weather like in London?”, and later ask “Weather London?”, a traditional cache would treat these as completely different queries. A semantic cache, however, would recognise the inherent similarity and return the same result. This difference is crucial for AI applications that often deal with nuanced inputs, user queries, or dynamically generated prompts.
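To make that contrast concrete, here's a minimal sketch of exact-match caching with a plain dictionary (the queries and values are illustrative only). The paraphrase is a different string, so the lookup misses even though the meaning is identical:

```python
# Exact-match caching: a dict keyed on the raw query string.
exact_cache = {}
exact_cache["What's the weather like in London?"] = "Cloudy, 12°C"

# The original phrasing hits...
print("What's the weather like in London?" in exact_cache)  # True: exact hit

# ...but the paraphrase is a different key, so the cache misses.
print("Weather London?" in exact_cache)  # False: cache miss
```

A semantic cache replaces that string-equality check with a similarity comparison, which is what the next section covers.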
How Does it Work? Cosine Similarity and Thresholds
The core of semantic caching lies in comparing the semantic similarity between new and cached inputs. A common method for this is using cosine similarity on text embeddings. Here’s the breakdown:
- Embedding Generation: Each input text is converted into a numerical vector representation called an embedding. These embeddings capture the semantic meaning of the text. Models like Sentence Transformers (all-mpnet-base-v2 is a good starting point) or OpenAI’s embeddings API are popular choices.
- Cosine Similarity Calculation: Cosine similarity measures the angle between two vectors and ranges from -1 to 1. A value of 1 indicates the vectors point in the same direction (semantically identical), while values near 0 indicate unrelated content. It's calculated as the dot product of the two vectors divided by the product of their magnitudes.
- Thresholding: A crucial step is setting a similarity threshold. This threshold determines the minimum cosine similarity required for a new input to be considered a cache hit. Setting the right threshold is key to balancing accuracy and cache hit rate. A low threshold will lead to more cache hits but potentially return less relevant results. A high threshold will be more accurate but have a lower hit rate, reducing the benefits of caching.
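The similarity calculation in step 2 is simple enough to write out by hand. Here's a small sketch using toy 3-dimensional vectors standing in for real embeddings (which typically have hundreds of dimensions):

```python
import math

def cosine_sim(a, b):
    # Dot product of the two vectors divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, just scaled
w = [-3.0, 0.0, 1.0]  # orthogonal to u

print(round(cosine_sim(u, v), 6))  # 1.0 (identical direction)
print(cosine_sim(u, w))            # 0.0 (orthogonal, i.e. unrelated)
```

Note that `v` is just `u` scaled by 2, yet the similarity is 1.0: cosine similarity ignores magnitude and compares direction only, which is exactly the behaviour you want when comparing embeddings.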
Here’s a Python code snippet illustrating how to implement semantic caching using cosine similarity and Sentence Transformers:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

class SemanticCache:
    def __init__(self, threshold=0.8):
        self.model = SentenceTransformer('all-mpnet-base-v2')
        self.cache = {}
        self.threshold = threshold

    def get(self, query):
        # Embed the incoming query and compare it against every cached entry.
        query_embedding = self.model.encode(query)
        for cached_query, (cached_embedding, result) in self.cache.items():
            similarity = cosine_similarity([query_embedding], [cached_embedding])[0][0]
            if similarity >= self.threshold:
                print(f"Cache hit! Similarity: {similarity:.2f}")
                return result
        return None

    def set(self, query, result):
        # Store the embedding alongside the result so get() can compare later.
        query_embedding = self.model.encode(query)
        self.cache[query] = (query_embedding, result)

# Example usage:
cache = SemanticCache(threshold=0.75)

query1 = "What is the capital of France?"
cache.set(query1, "Paris")

query2 = "France's capital city?"
cached_result = cache.get(query2)

if cached_result:
    print(f"Result: {cached_result}")
else:
    print("Cache miss. Computing...")
    # Simulate expensive computation
    result2 = "Paris"
    cache.set(query2, result2)
    print(f"Result: {result2}")
```

In this example, we use the SentenceTransformer library to generate embeddings for the queries. The cosine_similarity function from sklearn is then used to calculate the similarity between the query embeddings. The SemanticCache class encapsulates the caching logic, allowing you to easily integrate it into your AI applications. Note that this is a simplified example and lacks features like cache eviction (removing old entries), which you'd definitely want in a production system.
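One way to add the eviction that the example above lacks is a least-recently-used (LRU) policy. The sketch below shows just the eviction mechanism, separated from the embedding logic so it runs standalone; the class and parameter names (`LRUStore`, `max_size`) are my own, not from any library:

```python
from collections import OrderedDict

class LRUStore:
    """Bounded store that drops the least-recently-used entry when full."""

    def __init__(self, max_size=2):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least-recently-used entry

store = LRUStore(max_size=2)
store.set("q1", "Paris")
store.set("q2", "Berlin")
store.get("q1")            # touch q1, so q2 becomes the oldest entry
store.set("q3", "Madrid")  # over capacity: q2 is evicted
print(store.get("q2"))     # None: evicted
print(store.get("q1"))     # Paris: still cached
```

In a semantic cache, the stored values would be the (embedding, result) pairs from the earlier example; bounding the store also caps the cost of the linear similarity scan in get().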
Real-World Metrics: Cost Savings and Latency Reduction
The benefits of semantic caching are best illustrated with real-world data. At Quartalis, we implemented semantic caching in a RAG (Retrieval-Augmented Generation) pipeline for a client in the financial services industry. The pipeline was used to answer complex questions about financial regulations using a large corpus of documents.
- Cost Reduction: Before implementing semantic caching, the pipeline relied heavily on OpenAI’s GPT-4 API. After implementing semantic caching (with a cosine similarity threshold of 0.8), we observed a 92% reduction in API costs. This was due to the significant increase in cache hit rate, reducing the number of calls to the expensive GPT-4 API.
- Latency Reduction: The average latency for answering a question was reduced from 3.5 seconds to 0.4 seconds – an 88% improvement. This was because retrieving a result from the cache is significantly faster than calling the LLM API and processing the response.
- Hit Rate: We achieved a cache hit rate of approximately 65% on average. This varied depending on the type of questions being asked, with higher hit rates for frequently asked questions or questions with slight variations.
These metrics demonstrate the significant impact that semantic caching can have on AI application performance and cost. The specific numbers will vary depending on the application and the data, but the general trend is clear: semantic caching can lead to substantial improvements.
Considerations and Trade-offs
While semantic caching offers significant benefits, it’s important to consider the potential trade-offs:
- Threshold Tuning: Finding the right similarity threshold is crucial. A threshold that is too high will result in a low hit rate, negating the benefits of caching. A threshold that is too low will result in incorrect or irrelevant results being returned. Experimentation and careful monitoring are essential to find the optimal threshold for your specific application.
- Cache Invalidation: When the underlying data changes, the cache needs to be invalidated to ensure that the results remain accurate. This can be challenging in dynamic environments where data is constantly being updated. Strategies for cache invalidation include time-based expiration, event-driven invalidation, and manual invalidation.
- Storage Costs: Storing embeddings can consume significant storage space, especially for large datasets. You need to consider the storage costs associated with semantic caching when evaluating its overall cost-effectiveness. Techniques like vector database compression can help mitigate this.
- Embedding Model Selection: The choice of embedding model can significantly impact the performance of the semantic cache. Models that are well-suited to the specific domain of the application will generally provide better results. Experiment with different models to find the one that works best for your needs.
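Of the invalidation strategies above, time-based expiration is the simplest to sketch. The idea is to store a write timestamp with each entry and treat a hit as valid only while the entry is fresh. The names below (`TTLCache`, `ttl_seconds`) are my own; the clock is injectable so the behaviour is easy to test without sleeping:

```python
import time

class TTLCache:
    """Entries expire ttl_seconds after they are written."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl = ttl_seconds
        self.clock = clock   # injectable for deterministic tests
        self._data = {}      # key -> (value, written_at)

    def set(self, key, value):
        self._data[key] = (value, self.clock())

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, written_at = entry
        if self.clock() - written_at > self.ttl:
            del self._data[key]  # stale: invalidate lazily on read
            return None
        return value

# Drive the cache with a fake clock so expiry is deterministic.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("regulation-q", "Answer v1")
now[0] = 30.0
print(cache.get("regulation-q"))  # Answer v1: still fresh
now[0] = 120.0
print(cache.get("regulation-q"))  # None: expired and removed
```

The same timestamp check slots straight into the SemanticCache.get() loop from earlier: skip (and delete) any cached entry whose age exceeds the TTL before computing its similarity.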
The Quartalis Ecosystem and Semantic Caching
The Quartalis ecosystem is designed to make implementing advanced techniques like semantic caching easier. For example, the data connectors in our RAG pipeline solutions automatically handle embedding generation and storage, allowing you to focus on the core application logic. We also provide tools for monitoring cache hit rates and latency, making it easier to optimise the performance of your semantic cache. Furthermore, the self-hosting options within the Quartalis ecosystem give you complete control over your data and infrastructure, allowing you to implement semantic caching in a secure and compliant environment. This is particularly important when dealing with sensitive data in industries like finance or healthcare.
Wrapping Up
Semantic caching is a powerful technique that can significantly reduce the cost and latency of AI applications. By understanding the semantic meaning of inputs, semantic caching can return cached results even when the inputs are not exactly the same. This leads to higher cache hit rates, lower API costs, and faster response times. By carefully considering the trade-offs and leveraging tools like the Quartalis ecosystem, you can effectively implement semantic caching in your own applications and unlock its full potential. Give the example code a try, and see how much you can improve your AI application’s performance!