LLM deployment comes with significant challenges, particularly high costs, latency issues, and scalability concerns. LLM caching is one mechanism that can address these challenges by intelligently storing and reusing LLM-generated responses. LangChain, an LLM orchestration framework, provides a caching mechanism that can speed up response generation and embedding creation using in-memory and SQLite-based cache types. This article provides a hands-on implementation of LLM caching using LangChain’s in-memory caching technique.
Table of Contents
- Understanding LLM Caching
- Utilities and Challenges in LLM Caching
- Implementing LLM Response Caching
- Implementing Vector Embedding Caching
Understanding LLM Caching
LLM caching is a technique used to improve the efficiency and performance of LLMs while reducing the cost of deploying them. The idea is to store previously generated responses and reuse them for identical or similar queries or prompts. The cache keeps generated responses keyed by the prompt that produced them.
When the LLM application receives a prompt, the system first checks whether an identical or very similar prompt exists in the cache. If found, it returns the cached response instead of generating a new one with the LLM. If not, the prompt is sent to the LLM, and the newly generated response is stored in the cache for future use.
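This lookup flow can be sketched with a plain Python dictionary acting as the cache; call_llm below is a hypothetical placeholder for a real model call, not a LangChain API:
# Minimal sketch of the exact-match cache lookup flow (illustrative only)
def call_llm(prompt: str) -> str:
    # hypothetical placeholder for a real LLM call
    return f"<response for: {prompt}>"

cache = {}

def cached_generate(prompt: str) -> str:
    if prompt in cache:              # cache hit: reuse the stored response
        return cache[prompt]
    response = call_llm(prompt)      # cache miss: generate a new response
    cache[prompt] = response         # store it for future identical prompts
    return response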
The prime benefit of LLM caching is reduced latency: retrieving a cached response is much faster than generating an entirely new one. It also means fewer API calls to the LLM provider, which lowers usage costs.
LangChain caching is the mechanism built into the LangChain framework that applies this idea to speed up LLM performance and cut redundant LLM calls. LangChain primarily offers three types of caching (a configuration sketch follows the list):
- In-Memory Cache – stores generated responses in memory for rapid access.
- SQLite Cache – uses a local SQLite database to persist cached data.
- Redis Cache – uses Redis-based distributed caching, suitable for multi-server setups.
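For reference, each cache type is activated the same way: by passing a cache instance to set_llm_cache. Only one cache is active at a time, so the alternatives below are commented out; the sketch assumes the langchain_community import paths and uses placeholder connection details for SQLite and Redis:
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache, SQLiteCache

# In-memory cache: fastest, but lost when the process exits
set_llm_cache(InMemoryCache())

# SQLite cache: persists responses to a local file (path is a placeholder)
# set_llm_cache(SQLiteCache(database_path=".langchain.db"))

# Redis cache: distributed caching, requires a running Redis server
# import redis
# from langchain_community.cache import RedisCache
# set_llm_cache(RedisCache(redis_=redis.Redis(host="localhost", port=6379)))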
Utilities and Challenges in LLM Caching
| Utilities | Challenges |
|---|---|
| Performance Improvement | Context Sensitivity |
| Cost Efficiency | Cache Storage and Management |
| Scalability and Consistency | Privacy and Security |
Implementing LLM Response Caching
Let’s take a closer look at how to implement LLM response caching using LangChain:
Step 1: Install the required libraries
%pip install -qU langchain_openai langchain_community
Step 2: Setup the OpenAI API Key
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
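The snippet above assumes Google Colab’s userdata store. Outside Colab, the key can be read from the environment or entered interactively, for example:
import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")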
Step 3: Setup the OpenAI model configuration for response generation
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-instruct")
Step 4: Set the LLM cache type and invoke the LLM for generating response
%%time
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())
llm.invoke("What is memory caching? Explain in less than 100 words")
Output:
CPU times: user 35 ms, sys: 1.13 ms, total: 36.2 ms
Wall time: 1.8 s
Memory caching is a technique used in computer systems to improve the performance of accessing data. It involves storing frequently used data in a faster and closer location, such as the computer’s main memory, rather than retrieving it from a slower and more distant location, such as the hard drive. This allows for quicker access to data, reducing the time and resources needed to retrieve it. When data is requested, the system checks the cache first and if the data is found, it is retrieved from the cache instead of the original location. This results in faster data retrieval and improved overall system performance.
Step 5: Invoking the LLM again and checking the response generation time
%%time
llm.invoke("What is memory caching? Explain in less than 100 words")
Output:
CPU times: user 716 µs, sys: 0 ns, total: 716 µs
Wall time: 752 µs
Memory caching is a technique used in computer systems to improve the performance of accessing data. It involves storing frequently used data in a faster and closer location, such as the computer’s main memory, rather than retrieving it from a slower and more distant location, such as the hard drive. This allows for quicker access to data, reducing the time and resources needed to retrieve it. When data is requested, the system checks the cache first and if the data is found, it is retrieved from the cache instead of the original location. This results in faster data retrieval and improved overall system performance.
We can clearly see the difference in generation time between Step 4 (a cache miss, where the response is generated by the LLM and stored) and Step 5 (a cache hit, where the response is served straight from the cache).
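Note that the in-memory cache is lost once the Python process restarts. For persistence across sessions, the SQLite cache listed earlier can be swapped in; a minimal sketch, assuming the same llm object and a placeholder database path:
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache

# Cached responses are written to a local SQLite file and survive restarts
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
llm.invoke("What is memory caching? Explain in less than 100 words")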
Implementing Vector Embedding Caching
Let us now implement a vector embedding cache using LangChain’s caching mechanism.
Step 1: Install and import the required libraries
%pip install --upgrade --quiet langchain-openai faiss-cpu
from langchain.storage import LocalFileStore
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.embeddings.cache import CacheBackedEmbeddings
Step 2: Implementing an embedding cache
underlying_embeddings = OpenAIEmbeddings()
store = LocalFileStore("./cache/")
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)
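Passing the model name as the namespace keeps embeddings from different models from colliding in the same store. The cached embedder is then a drop-in replacement for the underlying embeddings object; for example:
# Document embeddings are computed once, then read back from ./cache/ on repeat calls
vectors = cached_embedder.embed_documents(["What is memory caching?"])
print(len(vectors[0]))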
Step 3: Using a text file and processing it for vector embedding generation (loading and chunking)
raw_documents = TextLoader("GOT_script.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
Step 4: Generating the vector embeddings based on FAISS and recording the time
%%time
db = FAISS.from_documents(documents, cached_embedder)
Output:
CPU times: user 819 ms, sys: 46.7 ms, total: 866 ms
Wall time: 1.71 s
Step 5: Generating the vector embeddings again and comparing the time
%%time
db2 = FAISS.from_documents(documents, cached_embedder)
Output:
CPU times: user 4.74 ms, sys: 56 µs, total: 4.79 ms
Wall time: 12 ms
The second embedding generation is much faster because the embeddings are retrieved from the cache instead of being recomputed.
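To verify what was cached, the underlying byte store can be inspected; LocalFileStore exposes the stored keys, each prefixed with the embedding model namespace:
# List a few of the cache keys written to ./cache/
print(list(store.yield_keys())[:5])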
Final Words
LLM caching, implemented through frameworks such as LangChain, marks a significant step toward optimal LLM usage and cost reduction. By intelligently storing and reusing responses generated by LLMs, these caching techniques offer a powerful solution to the cost, latency, and scalability challenges inherent in LLM application development and deployment. Effective caching strategies will play an increasingly crucial role in bridging the gap between LLM capabilities and the practical realities of deploying them at scale.