Build a Question Answering Pipeline with Weaviate Vector Store and LangChain

Explore Weaviate Vector Store and LangChain for advanced Q&A systems

The search for efficient question-answering systems has been ongoing across the broad landscape of information retrieval and natural language processing. From classic keyword search to neural networks and deep learning, the quest for correct, context-aware responses has remained a cornerstone of artificial intelligence applications. In this hands-on article, we will explore the Weaviate vector store and LangChain, two widely used tools in this space, and build a question-answering pipeline using LangChain modules and Weaviate search.

Table of contents

  1. Understanding Vector Stores
  2. Overview of Weaviate Database
  3. Features of LangChain
  4. Using Weaviate Vector Store for Q&A Pipeline

Let’s start by understanding the vector store, which is the core of the question-answering pipeline.

Understanding Vector Stores

Vector stores are specialized systems designed to efficiently store and manage high-dimensional vectors, which are mathematical representations of data points in a multi-dimensional space. These systems excel at quickly retrieving and comparing vector data, making them ideal for applications such as similarity search and recommendation systems.

Vector stores find extensive use in domains such as e-commerce product recommendation, content-based image retrieval, and personalized advertising. Their ability to process high-dimensional data quickly makes them indispensable for tasks that demand rapid query responses. Here are some of the key characteristics of vector stores:

  1. High-Dimensional Data Handling: Vector stores are optimized to handle high-dimensional data, common in applications involving machine learning, computer vision, and natural language processing. They efficiently store and retrieve vectors with hundreds or thousands of dimensions.
  2. Fast Query Performance: Vector stores are designed to quickly respond to queries involving vector similarity searches, such as finding the nearest neighbours or calculating distances between vectors. This fast query performance is critical in applications where real-time responses are necessary.
  3. Scalability: Vector stores are designed to scale with growing data volumes and user demand. They can manage enormous volumes of data while sustaining high query throughput, making them well suited to large-scale applications.
  4. Data Retrieval and Comparison: Vector stores provide efficient mechanisms for retrieving and comparing vectors. They often employ indexing techniques and algorithms optimized for vector data to accelerate query performance.
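
To make the idea of a similarity search concrete, here is a minimal sketch of an in-memory vector store written in plain Python. The class name ToyVectorStore, the three-dimensional vectors, and the product labels are all made up for illustration. A real vector store would use an index rather than the exhaustive scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: keeps (vector, payload) pairs in a list."""

    def __init__(self):
        self._items = []

    def add(self, vector, payload):
        self._items.append((vector, payload))

    def search(self, query, k=2):
        # Exhaustive scan: score every stored vector against the query.
        scored = [
            (cosine_similarity(query, vector), payload)
            for vector, payload in self._items
        ]
        # Highest similarity first; sort on the score only.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Hand-made toy vectors, not real embeddings.
store = ToyVectorStore()
store.add([0.9, 0.1, 0.0], "running shoes")
store.add([0.8, 0.2, 0.1], "trail sneakers")
store.add([0.0, 0.1, 0.9], "coffee maker")

top_matches = store.search([1.0, 0.0, 0.0], k=2)
print([payload for _, payload in top_matches])
```

The query vector points in almost the same direction as the two footwear vectors, so those two items come back first and the unrelated item is left out.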

Overview of Weaviate Database

Weaviate is a database and search engine designed for vectorized data. Weaviate and other vector-optimized databases are contemporary innovations in data storage systems. They excel at storing unstructured data (such as text, images, and audio) as vectors in a continuous, multidimensional space.

Previously, unstructured data could only be searched via metadata or inverted indexes. Vector representations, on the other hand, allow us to search by semantic similarity. A query is projected into the same high-dimensional space as the data, and nearest-neighbour algorithms (KNN) combined with a measure such as cosine similarity return the stored items that are closest to it. This is referred to as semantic search (or ‘neural search’).

Modules play a crucial role in Weaviate. Without any modules attached, it is a pure vector-native database: it stores data as a combination of an object and a vector, which can be searched using the configured vector index. If no vectorizer module is attached, Weaviate does not know how to vectorize an object, that is, how to calculate the vector associated with it.

Depending on the type of data you want to store and search (text, images, etc.) and on the use case (search, question answering, classification, and so on, which in turn depends on the language, ML model, and training set), you may select and attach the vectorizer module that best suits your needs. Alternatively, you may “bring your own” vectors to Weaviate.
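
The difference between attaching a vectorizer and bringing your own vectors can be sketched in a few lines of plain Python. This is not Weaviate’s API: ToyCollection and length_vectorizer are hypothetical names, and rejecting an insert that has neither a vectorizer nor a vector is a design choice of this sketch, not a statement about how Weaviate itself behaves.

```python
class ToyCollection:
    """Hypothetical stand-in for a collection that stores objects with vectors."""

    def __init__(self, vectorizer=None):
        # vectorizer: optional callable mapping an object (dict) to a list of floats.
        self.vectorizer = vectorizer
        self.records = []

    def insert(self, obj, vector=None):
        if vector is None:
            if self.vectorizer is None:
                raise ValueError("no vectorizer attached: supply a vector explicitly")
            vector = self.vectorizer(obj)
        # Each record is a combination of the object and its vector.
        self.records.append({"object": obj, "vector": list(vector)})


def length_vectorizer(obj):
    """Toy vectorizer, for illustration only: text length and word count."""
    text = obj["text"]
    return [float(len(text)), float(len(text.split()))]


# Case 1: no module attached, so the caller brings its own vector.
plain = ToyCollection()
plain.insert({"text": "hello world"}, vector=[0.1, 0.2])

# Case 2: a vectorizer is attached, so the vector is computed on insert.
auto = ToyCollection(vectorizer=length_vectorizer)
auto.insert({"text": "hello world"})

# Case 3: neither a vectorizer nor a vector, so this sketch rejects the insert.
try:
    plain.insert({"text": "no vector here"})
    rejected = False
except ValueError:
    rejected = True

print(plain.records[0]["vector"], auto.records[0]["vector"], rejected)
```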

Features of LangChain

LangChain is a sophisticated framework designed to enhance natural language processing tasks by simplifying data retrieval from various sources. It aids in implementing the Retrieval Augmented Generation (RAG) pattern in applications, contributing to more contextually aware AI applications. Here are the key components of LangChain.

  • Large Language Models (LLMs): The backbone of LangChain, providing the core capability for understanding and generating language. They are trained on vast datasets to produce text that is coherent and contextually relevant.
  • Prompt Templates: Designed to efficiently interact with LLMs, structuring the input in a way that maximizes the effectiveness of the language models in understanding and responding to queries.
  • Indexes: Serve as databases, organizing and storing information in a structured manner, enabling efficient retrieval of relevant data when the system processes language queries.
  • Retrievers: Work alongside indexes, quickly fetching the relevant information from the indexes based on the input query, ensuring that the response generation is informed and accurate.
  • Output Parsers: Process the language generated by LLMs, refining the output into a format useful and relevant to the specific task at hand.
  • Vector Store: Stores the numerical vector embeddings of words, phrases, or documents and supports similarity search over them, which is crucial for tasks involving semantic analysis and understanding the nuances of language.
  • Agents: The decision-making components that determine the best course of action based on the input, the context, and the available resources within the system.
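
To see how these components fit together in the RAG pattern, here is a small sketch in plain Python that does not use LangChain at all. The word-overlap retriever, the prompt wording, and fake_llm are stand-ins invented for illustration. In a real application the retriever would query a vector store and the model call would go to an actual LLM.

```python
def retrieve(question, documents, k=1):
    """Toy retriever: rank documents by how many question words they share."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Prompt template: a fixed frame with slots for the retrieved context and the question.
PROMPT_TEMPLATE = (
    "Answer the question using only the context.\n"
    "Context: {context}\n"
    "Question: {question}\n"
)


def fake_llm(prompt):
    """Placeholder for a model call: echoes the context line, padded with spaces."""
    context_line = next(
        line for line in prompt.splitlines() if line.startswith("Context: ")
    )
    return "  " + context_line[len("Context: "):] + "  "


def parse_output(raw):
    """Output parser: normalize the raw model text."""
    return raw.strip()


documents = [
    "weaviate stores objects together with vectors",
    "langchain connects prompts, models and retrievers",
]
question = "what does weaviate store"

context = " ".join(retrieve(question, documents))
prompt = PROMPT_TEMPLATE.format(context=context, question=question)
answer = parse_output(fake_llm(prompt))
print(answer)
```

The flow is the same one the pipeline later in this article follows: retrieve context, fill the prompt template, call the model, and parse the output.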

QA Pipeline Using Weaviate Vector Store

In this experiment, we will build a question-answering pipeline using the LangChain framework, with Weaviate handling document storage and retrieval. To set up the project, we first need to install some dependencies.

!pip install -U weaviate-client
!pip install -Uqq langchain-weaviate
!pip install openai tiktoken langchain langchain-community

The “weaviate-client” package lets us connect to and interact with the Weaviate cluster to store and retrieve information. To build a pipeline that combines LangChain with Weaviate search, we also need the “langchain-weaviate” module. The other dependencies provide LangChain itself and access to OpenAI’s language models, such as GPT-3 and GPT-4, which power the pipeline.

For this experiment, we will use Weaviate Cloud Services (WCS), a fully managed vector database in the cloud that simplifies the development of AI applications. If you want to learn more about Weaviate and Weaviate Cloud Services, see the references section. To connect to WCS we need three things: the URL of the hosted Weaviate cluster, the API key to access the cluster, and an OpenAI API key, since we will be using OpenAI’s services. Here is the code snippet.

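import weaviate
from google.colab import userdata  # Colab secrets store

# Connect to the hosted cluster; the OpenAI key is passed as a request header
weaviate_client = weaviate.connect_to_wcs(
   cluster_url=userdata.get('CLUSTER_URL'),
   auth_credentials=weaviate.auth.AuthApiKey(userdata.get('WEAVIATE_API')),
   headers={
       "X-OpenAI-Api-Key": userdata.get('OPENAI_API_KEY')  # Replace with your inference API key
   }
)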

Since we are using Google Colab for this experiment, we can store the secret keys and URL without exposing them publicly. To store these values securely, click the key icon in the left-hand side panel and provide a name and a value for each secret.
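
If you run the code outside Colab, one common alternative is to read the same values from environment variables. The helper below is a sketch under that assumption: get_secret is a hypothetical name, and the demo dictionary stands in for a real environment so that the example runs without any keys configured.

```python
import os


def get_secret(name, environ=None):
    """Return a required secret from the environment, failing loudly if it is missing."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


# Demo mapping used in place of os.environ; the values are placeholders.
demo_env = {"CLUSTER_URL": "https://example.invalid", "WEAVIATE_API": ""}

cluster_url = get_secret("CLUSTER_URL", environ=demo_env)
print(cluster_url)

try:
    get_secret("WEAVIATE_API", environ=demo_env)
except RuntimeError as err:
    print(err)
```

Calling get_secret("OPENAI_API_KEY") without the second argument reads from the real process environment instead of the demo mapping.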

We will use a text file containing the transcript of a State of the Union address, the annual speech given by the President of the United States to a joint session of the United States Congress. Here is the link to the data.

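from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OpenAIEmbeddings  # import path assumed
from langchain.text_splitter import CharacterTextSplitter

# Load the speech, split it into chunks, and set up the embedding model
data = TextLoader("../state_of_the_union.txt")
text_doc = data.load()
doc_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
util_docs = doc_splitter.split_documents(text_doc)
embeddings = OpenAIEmbeddings(openai_api_key=userdata.get('OPENAI_API_KEY'))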

The code segment above loads a text file containing the State of the Union address and splits it into smaller chunks of at most roughly 1000 characters each, with no overlap between chunks. This is done with the TextLoader and CharacterTextSplitter classes of the langchain_community and langchain modules. The result is a list of smaller text chunks, stored in the variable util_docs. The OpenAIEmbeddings object is what will map these chunks into a high-dimensional space using OpenAI’s embedding service.
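
To build some intuition for what the splitting step does, here is a simplified splitter in plain Python. It is not LangChain’s implementation. It assumes paragraphs are separated by a blank line, packs them greedily into chunks of at most chunk_size characters, and ignores overlap (which matches the chunk_overlap=0 setting used here). The function name split_text and the sample text are made up for illustration.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is kept whole."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # Either the piece still fits, or it starts a new chunk on its own.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccccccccccc\n\ndd"
chunks = split_text(sample, chunk_size=10)
print(chunks)
```

Note that the third paragraph is longer than the limit and still ends up as one chunk, which is why the chunk size is best read as a target rather than a hard guarantee in this sketch.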

Next, the chunks are converted into vectors and stored in the vector store. We then use Weaviate to perform a similarity search based on the embeddings generated from the text data. Here is the code snippet.

from langchain_weaviate.vectorstores import WeaviateVectorStore

# Embed the chunks, store them in Weaviate, and run a similarity search
vector_store = WeaviateVectorStore.from_documents(util_docs, embeddings, client=weaviate_client)
query = "What did the president say about Vladimir Putin?"
response = vector_store.similarity_search(query)

# Print the first 100 characters of each result
for i, doc in enumerate(response):
   print(f"\nResult {i+1}:")
   print(doc.page_content[:100] + "...")

The above code initializes a WeaviateVectorStore object using embeddings derived from the smaller text chunks obtained earlier and interacts with a Weaviate instance via a client object. A query, asking about the president’s statements regarding Vladimir Putin, is defined and used to search for similar documents in the text data. The results are then iterated through, with the first 100 characters of each document printed for a brief preview of the content deemed similar to the query. Below are the results.

Now we can create our question-answering pipeline using LangChain’s chain feature. The pipeline consists of several components: a retriever, a prompt template, an LLM, and a string output parser. Here is the code snippet; for the complete code, refer to the Google Colab notebook in the references section.

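from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# retriever, prompt_template, and ll_model are defined in the Colab notebook
chain = (
   {"context": retriever, "question": RunnablePassthrough()}
   | prompt_template
   | ll_model
   | StrOutputParser()
)

chain.invoke("What did the president say about COVID-19?")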

Here is the output.

The president stated that COVID-19 should not control our lives and that we should not just accept living with it. He emphasized the importance of staying protected with vaccines and treatments and continuing to combat the virus. The president also highlighted the progress made in vaccinating Americans and the need to continue efforts to vaccinate more people.

Conclusion

Through this experiment, we have learned to build a question-answering pipeline using LangChain and the Weaviate vector store. The Weaviate vector store offers semantic search, which helps retrieve relevant data from the store quickly and accurately.

References

  1. Link to the above code
  2. Learn more about Weaviate

Sourabh Mehta
