Hands-on Guide to LLM Caching with LangChain to Boost LLM Responses

LLM caching in LangChain addresses deployment challenges by storing and reusing generated responses.

LLM deployment comes with significant challenges, especially around high costs, latency, and scalability. LLM caching is one mechanism that can address these challenges by intelligently storing and reusing LLM-generated responses. LangChain, an LLM orchestration framework, provides caching mechanisms that can speed up response generation and embedding creation, with options such as in-memory and SQLite-based caches. This article provides a hands-on implementation of LLM caching using LangChain’s in-memory caching technique.

Table of Contents

  1. Understanding LLM Caching
  2. Utilities and Challenges in LLM Caching
  3. Implementing LLM Response Caching
  4. Implementing Vector Embedding Caching

Understanding LLM Caching

LLM caching is a technique used to improve the efficiency and performance of an LLM application while reducing the cost of deploying it. The idea is to store previously generated responses and reuse them for identical or similar queries or prompts. In its simplest form, LLM caching uses an in-memory cache that maps a prompt to its stored response.

When the LLM receives a prompt, the system first checks whether an identical or very similar prompt already exists in the cache. If it does, the cached response is returned instead of generating a new one with the LLM. If it does not, the prompt is sent to the LLM to generate the response, and the newly generated response is then stored in the cache for future use.
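Conceptually, the lookup flow can be illustrated with a minimal Python sketch that keys a plain dictionary on the exact prompt string (a simplified illustration, not LangChain’s internal implementation):

# Simplified cache-first flow: check the cache, otherwise call the LLM
# and store the result. A hypothetical helper, not a LangChain API.
response_cache = {}

def generate_with_cache(llm, prompt: str) -> str:
    if prompt in response_cache:           # cache hit: reuse the stored response
        return response_cache[prompt]
    response = llm.invoke(prompt)          # cache miss: call the LLM
    response_cache[prompt] = response      # store for future identical prompts
    return response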

The prime benefit of LLM caching is reduced latency: retrieving a cached response is much faster than generating an entirely new one. Caching also means fewer API calls to the LLM provider or model, thereby lowering costs and reducing overall LLM usage expenses.

LangChain caching is the mechanism implemented within the LangChain framework to speed up LLM-backed applications and avoid redundant LLM calls using this idea. LangChain primarily offers three types of caching, each of which can be enabled with a single call, as sketched after this list:

  1. In-Memory Cache – stores generated responses in memory for rapid access.
  2. SQLite Cache – uses a local SQLite database to persist cached data across sessions.
  3. Redis Cache – uses Redis-based distributed caching, which is suitable for multi-server setups.
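Depending on the LangChain version, each of these caches can typically be enabled with a single set_llm_cache call before invoking the model; the import paths below come from langchain_community and may differ slightly across releases:

from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache, RedisCache

# Pick one of the three cache types:
set_llm_cache(InMemoryCache())                                # 1. in-memory
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))   # 2. local SQLite file
# from redis import Redis
# set_llm_cache(RedisCache(redis_=Redis(host="localhost", port=6379)))  # 3. Redis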

Utilities and Challenges in LLM Caching

Utilities

Performance Improvement
  • Faster response times for cached prompts.
  • Reduced latency for frequently entered prompts.

Cost Efficiency
  • Reduction in API calls to LLM providers.
  • Overall cost savings due to optimal resource utilization.

Scalability and Consistency
  • Ability to handle higher prompt volumes with existing infrastructure resources.
  • Maintaining coherence in multi-turn conversations across multiple users.

Challenges

Context Sensitivity
  • Handling prompts that require current context or real-time information is difficult.
  • Balancing cached responses against the need for fresh, contextually appropriate answers is challenging.

Cache Storage and Management
  • Determining the optimal cache size and storage solution can be a problem.
  • Managing cache growth and implementing efficient eviction policies is tough.

Privacy and Security
  • Ensuring that cached data does not violate user privacy or data protection regulations.
  • Implementing secure storage for sensitive cached information can be an overhead.

Implementing LLM Response Caching

Let’s take a closer look at how to implement LLM response caching using LangChain:

Step 1: Install the required libraries
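The exact packages depend on the LangChain version; in a notebook, an install along these lines should cover the response-caching example (langchain-openai hosts the OpenAI integrations in recent releases):

%pip install langchain langchain-openai langchain-community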

Step 2: Set up the OpenAI API key
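One common pattern is to expose the key as an environment variable; the snippet below assumes the standard OPENAI_API_KEY variable that the OpenAI integrations read:

import os
from getpass import getpass

# Prompt for the key interactively and make it available to the OpenAI client.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")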

Step 3: Set up the OpenAI model configuration for response generation
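A minimal configuration might look like the following; the model name and temperature are illustrative choices rather than values prescribed by the article:

from langchain_openai import ChatOpenAI

# Chat model used for the caching demonstration below.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)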

Step 4: Set the LLM cache type and invoke the LLM to generate a response
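A sketch of this step, assuming LangChain’s in-memory cache and an example prompt about memory caching (the article’s exact prompt is not shown, so the one below is an assumption consistent with the output):

%%time
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

# Enable the in-memory cache, then invoke the model. The first call is a
# cache miss: the LLM generates the response, which is then stored.
set_llm_cache(InMemoryCache())
llm.invoke("Explain memory caching")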

Output:

Memory caching is a technique used in computer systems to improve the performance of accessing data. It involves storing frequently used data in a faster and closer location, such as the computer’s main memory, rather than retrieving it from a slower and more distant location, such as the hard drive. This allows for quicker access to data, reducing the time and resources needed to retrieve it. When data is requested, the system checks the cache first and if the data is found, it is retrieved from the cache instead of the original location. This results in faster data retrieval and improved overall system performance.

Step 5: Invoke the LLM again and check the response generation time
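Repeating the same call (again with the assumed prompt) should now be answered directly from the cache:

%%time
# Same prompt as before: the response is served from the in-memory cache,
# so the wall time should drop dramatically.
llm.invoke("Explain memory caching")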

Output:

Memory caching is a technique used in computer systems to improve the performance of accessing data. It involves storing frequently used data in a faster and closer location, such as the computer’s main memory, rather than retrieving it from a slower and more distant location, such as the hard drive. This allows for quicker access to data, reducing the time and resources needed to retrieve it. When data is requested, the system checks the cache first and if the data is found, it is retrieved from the cache instead of the original location. This results in faster data retrieval and improved overall system performance.

We can clearly see the difference in response generation time between Step 4 (a cache miss, where the LLM generates the response) and Step 5 (a cache hit, where the response is served from the cache).

Implementing Vector Embedding Caching

Let us now implement a vector embedding cache using LangChain’s caching mechanism.

Step 1: Install and import the required libraries
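In a notebook this might look like the following; the import paths reflect recent LangChain releases and may vary between versions, and faiss-cpu is assumed as the FAISS backend:

# FAISS backend for the vector store used later in this walkthrough
%pip install faiss-cpu

from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter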

Step 2: Implementing an embedding cache

underlying_embeddings = OpenAIEmbeddings()
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

Step 3: Using a text file and processing it for vector embedding generation (loading and chunking)
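A sketch of this step; the file name and splitter parameters are illustrative assumptions, and the resulting documents variable is what gets embedded in the next step:

# Load a plain-text file and split it into chunks for embedding.
raw_documents = TextLoader("state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)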

Step 4: Generating the vector embeddings based on FAISS and recording the time

%%time

db = FAISS.from_documents(documents, cached_embedder)

Output:

Step 5: Generating the vector embeddings again and comparing the time

%%time

db2 = FAISS.from_documents(documents, cached_embedder)

Output:

Final Words

LLM caching, as implemented in frameworks such as LangChain, marks a significant step towards optimal LLM usage and cost reduction. By intelligently storing and reusing responses generated by LLMs, these caching techniques offer a powerful solution to the challenges of cost, latency and scalability that are inherent in LLM application development and deployment. Effective caching strategies and their implementation will play an increasingly crucial role in bridging the gap between LLM capabilities and the practical realities of deploying them at scale.
