Title: Leveraging Approximate Caching for Faster Retrieval-Augmented Generation

Authors: Bergman, Shai; Ji, Zhang; Kermarrec, Anne-Marie; Randl, Mathis Benjamin Manuel; Petrescu, Diana Andreea; Pereira Pires, Rafael; de Vos, Marinus Abraham

Dates: 2025-03-18; 2025-03-18; 2025-03-11; 2025-03-31
DOI: 10.1145/3721146.3721941
Handle: https://infoscience.epfl.ch/handle/20.500.14299/247924
Language: en
Document type: text::conference item::conference proceedings::conference paper

Abstract:
Retrieval-augmented generation (RAG) improves the reliability of large language model (LLM) answers by integrating external knowledge. However, RAG increases end-to-end inference time, since retrieving relevant documents from large vector databases is computationally expensive. To address this, we introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by exploiting similarities between user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear, reducing reliance on expensive vector database lookups. We evaluate Proximity on the MMLU and MedRAG benchmarks, demonstrating that it significantly improves retrieval efficiency while maintaining response accuracy: Proximity reduces retrieval latency by up to 59% and lowers the computational burden on the vector database. We also experiment with different similarity thresholds and quantify the trade-off between speed and recall. Our work shows that approximate caching is a viable and effective strategy for optimizing RAG-based systems.

CCS Concepts: Information systems → Retrieval models and ranking; Computing methodologies → Natural language generation

Keywords: Retrieval-Augmented Generation; Large Language Models; Approximate Caching; Neural Information Retrieval; Vector Databases; Query Optimization; Latency Reduction; Machine Learning Systems
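
Illustrative sketch (not from the paper): as a rough picture of the mechanism the abstract describes, the Python below shows one way an approximate key-value cache could sit in front of a vector database, returning cached documents when a new query's embedding is sufficiently similar to a previously seen one. The class name ProximityCache, the cosine-similarity threshold, the linear scan over cached keys, and the FIFO eviction policy are all assumptions for illustration, not the authors' actual implementation.

import numpy as np

class ProximityCache:
    """Sketch of an approximate key-value cache for RAG retrieval.

    Keys are query embeddings; values are the document lists previously
    returned by the vector database. A lookup returns the cached
    documents of the most similar stored query if its cosine similarity
    exceeds `threshold`; otherwise the caller falls back to the
    (expensive) vector database. Hypothetical design, for illustration.
    """

    def __init__(self, threshold: float = 0.9, capacity: int = 1024):
        self.threshold = threshold
        self.capacity = capacity
        self.keys: list[np.ndarray] = []   # cached query embeddings
        self.values: list[list[str]] = []  # cached retrieved documents

    @staticmethod
    def _cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def get(self, query_emb: np.ndarray):
        """Return cached documents for the nearest stored query, or None."""
        best, best_sim = None, -1.0
        for i, key in enumerate(self.keys):  # linear scan for simplicity
            sim = self._cosine(query_emb, key)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            return self.values[best]  # cache hit: skip the vector DB
        return None                   # cache miss

    def put(self, query_emb: np.ndarray, docs: list[str]) -> None:
        if len(self.keys) >= self.capacity:  # simple FIFO eviction (assumption)
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(query_emb)
        self.values.append(docs)


def retrieve(query_emb, cache, vector_db_search):
    """RAG retrieval with the approximate cache in front of the vector DB.

    `vector_db_search` stands in for any expensive nearest-neighbor
    lookup (hypothetical callable, not a specific library API).
    """
    docs = cache.get(query_emb)
    if docs is None:
        docs = vector_db_search(query_emb)  # expensive lookup on a miss
        cache.put(query_emb, docs)
    return docs

Under this sketch, the threshold directly controls the speed/recall trade-off the abstract mentions: a higher threshold yields fewer cache hits (less speedup) but the reused documents more faithfully match what a fresh vector database lookup would return, while a lower threshold increases hit rate at the cost of occasionally serving documents retrieved for a merely similar query.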