Retrieval Augmented Generation (RAG) has become a popular strategy for enhancing language models by integrating external knowledge sources. Despite its strengths, RAG suffers from challenges such as retrieval latency and potential errors in document selection. These limitations motivate a more efficient paradigm: Cache Augmented Generation (CAG). CAG leverages the extended context capabilities of large language models (LLMs) by preloading relevant documents and precomputing key-value (KV) caches, enabling retrieval-free question answering.
This article explores the methodology behind CAG, its advantages over RAG, and experimental results demonstrating its efficiency and accuracy in knowledge-intensive tasks.
Table of Contents
- Challenges with RAG
- What is Cache Augmented Generation (CAG)?
- Understanding CAG’s Framework
- Experimental Results
- When to Use CAG
Let’s start by understanding the challenges associated with Retrieval Augmented Generation (RAG).
Challenges with RAG
RAG introduces complexity by dynamically retrieving and integrating external knowledge during inference. This approach has several drawbacks:
- Retrieval Latency: Real-time retrieval adds significant overhead to the inference process.
- Document Selection Errors: Mistakes in retrieval can lead to irrelevant or incorrect responses.
- System Complexity: Combining retrieval and generation components increases maintenance costs and tuning requirements.

What is Cache Augmented Generation (CAG)?
CAG eliminates the need for real-time retrieval by preloading all relevant documents into the model’s extended context and precomputing KV caches. This streamlined approach simplifies the system architecture and ensures that inference relies solely on the preloaded context, improving both speed and reliability.
Key Advantages of CAG
- Reduced Inference Time: Preloading eliminates retrieval latency, enabling faster response times.
- Improved Accuracy: Processing all relevant documents holistically yields more contextually accurate responses.
- Simplified Architecture: Without a separate retrieval pipeline, the system becomes easier to develop and maintain.

Understanding CAG’s Framework
The CAG framework operates in three phases:
External Knowledge Preloading: Relevant documents are curated and preprocessed to fit within the model’s context window, then encoded into a KV cache that captures the model’s inference state and is stored for reuse.
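A minimal sketch of this phase is shown below, assuming the Hugging Face transformers library and the Llama 3.1 8B Instruct checkpoint; the helper name `preload_kv_cache` and the file `knowledge.txt` are illustrative, not part of the original CAG code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def preload_kv_cache(documents: str):
    """Encode the curated documents once and return their KV cache."""
    prompt = f"Context:\n{documents}\n"
    context_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    cache = DynamicCache()
    with torch.no_grad():
        # A single forward pass fills the cache with keys/values for every layer.
        model(input_ids=context_ids, past_key_values=cache, use_cache=True)
    return cache, context_ids

kv_cache, context_ids = preload_kv_cache(open("knowledge.txt").read())
cache_len = context_ids.shape[-1]  # remembered for the reset phase
```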

Inference with Precomputed Cache: During inference, the precomputed KV cache is loaded alongside the query, enabling the model to generate responses without additional retrieval steps.
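Continuing the same assumptions, a sketch of the inference phase might look as follows: the question tokens are appended after the cached context, and `generate` reuses the precomputed keys and values instead of re-encoding the documents.

```python
def answer_with_cache(question: str) -> str:
    """Generate an answer from the preloaded KV cache, with no retrieval step."""
    question_ids = tokenizer(
        f"Question: {question}\nAnswer:", return_tensors="pt", add_special_tokens=False
    ).input_ids.to(model.device)
    # Pass the full sequence (cached context + question); because past_key_values
    # is supplied, only the new question tokens are actually processed.
    input_ids = torch.cat([context_ids, question_ids], dim=-1)
    output_ids = model.generate(
        input_ids=input_ids,
        past_key_values=kv_cache,
        max_new_tokens=128,
    )
    return tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
```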

Cache Reset and Management: As the KV cache grows during inference, a reset mechanism truncates the newly appended tokens, restoring the original cached context and maintaining performance without reloading the documents.
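The reset can be as simple as cropping the cache back to the length of the preloaded context. The sketch below assumes the `DynamicCache.crop` method available in recent transformers releases; older versions would need to slice each layer's key and value tensors directly.

```python
def reset_cache() -> None:
    """Drop the question/answer tokens appended during inference, keeping only the context."""
    # crop() truncates every layer's keys and values to the first `cache_len` positions,
    # so the next query starts from the original preloaded state.
    kv_cache.crop(cache_len)
```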

Comparison of Traditional RAG vs CAG Workflows
This architecture ensures that the computational cost of processing the documents is incurred only once, making subsequent queries faster and more efficient.
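Putting the phases together, a hypothetical query loop could look like this: the documents are encoded once during preloading, and each question pays only for its own tokens.

```python
questions = [
    "Who founded the company described in the report?",
    "What was the revenue in the final quarter?",
]

for q in questions:
    print(q, "->", answer_with_cache(q))
    reset_cache()  # restore the cache to the preloaded context before the next query
```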
Experimental Results
Datasets and Setup
CAG was evaluated on two benchmarks:
- SQuAD 1.0: Focuses on precise, context-aware answers within single passages.
- HotPotQA: Emphasizes multi-hop reasoning across multiple documents.
Experiments used the Llama 3.1 8B Instruct model, which supports 128k-token inputs. Reference texts were preloaded into KV caches, bypassing retrieval during inference.
Key Findings
Higher Accuracy: CAG outperformed RAG systems on BERTScore across all test configurations.
- On HotPotQA (large), CAG achieved a score of 0.7527, compared to 0.7398 for dense RAG.
Reduced Generation Time:
- On HotPotQA (large), CAG reduced generation time from 94.35 seconds (RAG) to 2.33 seconds.
Efficiency at Scale: Even with extensive reference texts, CAG maintained accuracy and speed, making it ideal for tasks involving large knowledge bases.

When to Use CAG
CAG is particularly effective in scenarios where:
- The knowledge base is manageable and can fit within the model’s extended context (a quick feasibility check follows this list).
- Latency is critical, and real time retrieval introduces unacceptable delays.
- Simplicity is a priority, reducing the need for complex retrieval and ranking systems.
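As a rough feasibility check for the first point, one can count the tokens of the candidate knowledge base against the model’s context window. The sketch below reuses the tokenizer from the earlier examples; the 128k figure reflects Llama 3.1 and is an assumption for other models.

```python
MAX_CONTEXT_TOKENS = 128_000  # Llama 3.1 context window (adjust for other models)

kb_tokens = len(tokenizer(open("knowledge.txt").read()).input_ids)
if kb_tokens < MAX_CONTEXT_TOKENS:
    print(f"{kb_tokens} tokens: fits in context, CAG is feasible")
else:
    print(f"{kb_tokens} tokens: too large to preload, consider RAG or pruning the corpus")
```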
Final Words
Cache Augmented Generation offers a compelling alternative to RAG, streamlining knowledge-intensive workflows by leveraging the extended context capabilities of modern LLMs. By eliminating retrieval latency and minimizing system complexity, CAG delivers efficient and accurate performance across a range of applications.
As long context models continue to evolve, CAG is poised to become a robust solution for knowledge integration tasks. Its efficiency and simplicity make it a promising paradigm for the future of natural language generation.