Deep Lake is an open-source lakehouse for AI applications developed by Activeloop. It goes beyond a standard data lake, excelling at storing data types such as text, images, audio and video. It eliminates the need for complex data conversion, allowing efficient storage and retrieval of high-dimensional vectors, which is crucial for AI tasks such as RAG, similarity search and recommendation systems. This article explores the Deep Lake vector store based on its utility for building Retrieval Augmented Generation systems.
Table of Contents
- Understanding Deep Lake Vector Database
- Deep Lake’s Tensor Storage Format
- Managed Tensor Database and Deep Memory
- Hands-on Implementation of Deep Lake for RAG
Understanding Deep Lake Vector Database
Deep Lake is a lakehouse specialised for deep learning workloads. It retains the advantages of a traditional data lake while adding the ability to store complex data, such as images, videos, annotations and tabular data, as tensors. These tensors enable Deep Lake to rapidly stream data downstream to deep learning frameworks over the network without sacrificing GPU utilisation.
The lakehouse implements a tensor storage format, a streaming dataloader, a tensor query language and an in-browser visualisation engine. The tensor storage format stores dynamically shaped arrays on object storage. The streaming dataloader schedules fetching, decompression and user-defined transformations, enabling optimised data transfer throughput to GPUs. The in-browser visualisation engine streams data from object storage and renders it in the browser.
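The fetch → decompress → transform stages of the streaming dataloader can be sketched with plain Python generators. This is a conceptual illustration only, not Deep Lake's actual implementation; the chunk store and the transformation function are invented for the example:

```python
import zlib

# A toy "object store": compressed binary chunks keyed by chunk id.
chunk_store = {
    0: zlib.compress(b"sample-0 sample-1"),
    1: zlib.compress(b"sample-2 sample-3"),
}

def fetch(chunk_ids):
    """Fetch raw compressed chunks (stands in for an object-storage GET)."""
    for cid in chunk_ids:
        yield chunk_store[cid]

def decompress(chunks):
    """Decompress each chunk as it arrives."""
    for blob in chunks:
        yield zlib.decompress(blob)

def transform(chunks, fn):
    """Apply a user-defined transformation per sample."""
    for chunk in chunks:
        for sample in chunk.split():
            yield fn(sample)

# Chain the stages: data flows through lazily, one chunk at a time,
# so downstream stages can consume while upstream stages keep fetching.
pipeline = transform(decompress(fetch([0, 1])), fn=lambda s: s.decode().upper())
print(list(pipeline))  # -> ['SAMPLE-0', 'SAMPLE-1', 'SAMPLE-2', 'SAMPLE-3']
```

Because each stage is a generator, samples are produced on demand rather than materialised all at once, which is the same principle that lets a streaming dataloader keep GPUs fed without loading the whole dataset into memory.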
Deep Lake Tensor Storage Format
Deep Lake datasets follow a columnar storage architecture with tensors as columns. Each tensor is a collection of binary blobs, known as chunks, that contain the data samples. An index map associated with every tensor locates the right chunk, and the position of the sample within that chunk, for a given sample index.
A dataset sample represents a single row indexed across parallel tensors. These sample elements are logically independent, enabling partial access to samples for running queries or streaming selected tensors over the network to the GPU training instances.
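With a fixed chunk capacity, the index-map lookup described above reduces to simple arithmetic. The following is a minimal sketch of the idea; the chunk layout and capacity are invented for illustration:

```python
# Toy columnar tensor: samples are packed into fixed-capacity chunks,
# and an index map records where each sample lives.
CHUNK_CAPACITY = 3  # samples per chunk (assumed fixed for simplicity)

chunks = [
    ["row0", "row1", "row2"],   # chunk 0
    ["row3", "row4"],           # chunk 1 (partially filled)
]

# index_map[sample_index] -> (chunk_id, offset within that chunk)
index_map = {i: (i // CHUNK_CAPACITY, i % CHUNK_CAPACITY) for i in range(5)}

def get_sample(sample_index):
    """Resolve a global sample index to the right chunk and offset."""
    chunk_id, offset = index_map[sample_index]
    return chunks[chunk_id][offset]

print(get_sample(4))  # -> 'row4' (chunk 1, offset 1)
```

Only the chunk containing the requested sample needs to be read, which is what makes partial access to rows cheap when querying or streaming selected tensors.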
Multiple tensors can also be grouped; these groups implement syntactic nesting, which avoids format complications for a hierarchical memory layout. Dataset schema changes are also tracked over time with version control.
Managed Tensor Database and Deep Memory
The managed tensor database is the core database system designed to store and manage data efficiently. It is optimised for large datasets, supports real-time data streaming and integrates well with machine learning frameworks.
Deep Lake Managed vs Embedded Database
Deep Memory, on the other hand, improves the accuracy of finding similar data points by tailoring the vector embeddings for better LLM performance.
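At its core, the similarity search that Deep Memory fine-tunes is a nearest-neighbour lookup over embeddings. A minimal cosine-similarity version in plain Python (the document IDs and vectors here are made up; in practice the vectors come from an embedding model):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embedding store: document id -> embedding vector.
store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}

def top_k(query, k=2):
    """Rank stored vectors by cosine similarity to the query vector."""
    scored = sorted(
        store.items(),
        key=lambda kv: cosine_similarity(query, kv[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

print(top_k([1.0, 0.05, 0.0]))  # -> ['doc_a', 'doc_b']
```

A production vector store replaces this brute-force scan with approximate nearest-neighbour indexes, but the ranking principle is the same.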
Hands-on Implementation of Deep Lake for RAG
This section implements Retrieval Augmented Generation using LlamaIndex and Deep Lake Vector Store.
Step 1: Visit https://app.activeloop.ai/ and create a Deep Lake API Key
Step 2: Installing the required libraries
- deeplake – the client library for the Deep Lake database.
- llama-index – the LlamaIndex LLM orchestration framework.
- llama-index-vector-stores-deeplake – the LlamaIndex integration for the Deep Lake vector store.
!pip3 install deeplake llama-index llama-index-vector-stores-deeplake
Step 3: Importing the libraries
- VectorStoreIndex helps in creating and managing a vector index for data.
- SimpleDirectoryReader serves as a data reader designed to load data from a directory.
- StorageContext acts as a central hub for keeping track of different storage components used by LlamaIndex.
- DeepLakeVectorStore enables creating and working with the Deep Lake database based on vector embeddings.
import os
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core import StorageContext
from google.colab import userdata
Step 4: Setting up the API keys for OpenAI and ActiveLoop (Deep Lake)
os.environ['ACTIVELOOP_TOKEN'] = userdata.get('ACTIVELOOP_TOKEN')
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
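Outside Colab, `google.colab.userdata` is unavailable. A common fallback (a sketch; the helper name `ensure_env` is our own) is to reuse variables already exported in the shell, or prompt for them once:

```python
import os
from getpass import getpass

def ensure_env(name):
    """Use an existing environment variable, or prompt for it once."""
    if not os.environ.get(name):
        os.environ[name] = getpass(f"{name}: ")
    return os.environ[name]

# Equivalent to the Colab userdata.get(...) calls above:
# ensure_env('ACTIVELOOP_TOKEN')
# ensure_env('OPENAI_API_KEY')
```

Using `getpass` keeps the keys out of the notebook output and your shell history.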
Step 5: Loading data from a directory using SimpleDirectoryReader and creating a new Deep Lake vector store for storing the data.
documents = SimpleDirectoryReader('/content/data/').load_data()
dataset_path = 'hub://sachintripathi/game_of_thrones_data'
vector_store = DeepLakeVectorStore(dataset_path=dataset_path, overwrite=True)
Step 6: Indexing the data and uploading it to the Deep Lake dataset.
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents, storage_context=storage_context,
)
Output
Output (Deep Lake Web UI)
Step 7: Executing a query and checking the results.
Query 1
query_engine = index.as_query_engine()
response = query_engine.query(
"What is this data about?",
)
print(response)
Output
The data provided is about the plot and dialogue excerpts from the television series “Game of Thrones.” It includes scenes involving various characters such as Cersei Lannister, Tywin Lannister, Tyrion Lannister, Daenerys Targaryen, Jorah Mormont, and others, set in different locations like the Riverlands and across the Narrow Sea. The excerpts depict discussions, events, and decisions related to battles, alliances, and personal struggles within the fictional world of Westeros.
Query 2
response = query_engine.query(
"Can you tell me about Jon Snow in detail?",
)
print(response)
Output
Jon Snow is a member of the Night’s Watch who is shown to be conflicted between his duty to the Night’s Watch and his personal desires. He is depicted as a skilled fighter and a loyal friend to those he cares about. Jon is determined to find his missing brother and is willing to go against the rules of the Night’s Watch to achieve his goals. Despite facing challenges and opposition from his fellow Night’s Watch brothers, Jon remains steadfast in his beliefs and is shown to be a strong and honorable character throughout the series.
The query returns accurate information based on our data (the Game of Thrones script) using RAG and vector similarity search.
Final Words
Deep Lake is a great choice for managing and storing diverse data types, and its ability to handle them complements AI research and development. Other notable features of Deep Lake include multi-cloud support (S3, GCP and Azure) and integrations with AI tools and frameworks (LangChain, LlamaIndex, Weights & Biases, MMDetection), which make it a handy choice for LLM researchers and end users.