Retrieval-Augmented Generation (RAG) is useful for summarising and answering questions. It combines the generative abilities of Large Language Models (LLMs) with information retrieval. However, the initial retrieval step of a RAG system usually returns several documents, not all of which are relevant to the query. This is where re-ranking helps: it reorders and filters the retrieved documents. A similarity search first finds candidate documents, and a re-ranker then orders them by relevance score. In this article, we will look at how re-ranking works and use it to rank the documents retrieved by RAG.
Table of Contents
- Retrieval-Augmented Generation (RAG)
- Understanding Re-Ranking
- How Does Re-Ranking in RAG Work?
- Using Re-Ranking to Retrieve Enhanced Responses
Now, let us dive into the re-ranking method, understand how it works, and implement it.
Retrieval-Augmented Generation (RAG)
RAG is a hybrid approach that combines retrieval-based and generation-based methods. It addresses complex queries by retrieving relevant documents from a large corpus and then generating a response based on the retrieved information.
The RAG pipeline consists of two main stages:
Retriever: This retrieves the documents associated with the input query.
Generator: This creates a coherent and contextually relevant answer from the retrieved documents.
Understanding Re-Ranking
In RAG, a key task is to find the relevant documents in a large collection. To make this tractable, we transform the documents into vectors, which can then be compared with the query vector using measures such as cosine similarity.
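As a minimal sketch of the comparison described above, the snippet below computes cosine similarity between a query vector and a few document vectors. The three-dimensional vectors and the helper name `cosine_similarity` are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real embeddings).
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # partially aligned with the query
]

similarities = [cosine_similarity(query_vec, vec) for vec in doc_vecs]
# Document indices ordered from most to least similar.
ranking = sorted(range(len(doc_vecs)), key=lambda i: similarities[i], reverse=True)
print(similarities)
print(ranking)
```

A vector store performs the same kind of comparison, only over many more vectors and usually with an approximate index.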
However, transforming documents into vectors can cause a loss of information as vectors are simplified numerical representations of content. Also, larger documents often need to be split into smaller parts to create these vectors, which can make it difficult to keep the original context intact.
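To illustrate how splitting works and why an overlap helps preserve some context across chunk boundaries, here is a simplified word-based splitter. It is only a sketch: the function `split_into_chunks` is hypothetical and counts words, whereas library splitters typically count characters or tokens.

```python
def split_into_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into chunks that share chunk_overlap words with their neighbour."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


words = "You will have classes with the rest of your house".split()
chunks = split_into_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last word of the previous one, so a sentence cut at a boundary is still partly visible in both chunks.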
When using vector search in RAG, losing context can be a problem. This happens because we usually only look at the top results from the vector search, possibly missing other relevant information. As a result, if the most relevant parts aren’t included in these top results, the language model might generate a less accurate or useful response.
Re-ranking is a technique to enhance the retrieval process. It refines the initial set of retrieved documents. This ensures that the most relevant documents are prioritized for the generation of responses.
For example, if we search for the “history of pizza”, the system might retrieve documents about bread, cheese, and Italian cuisine. These are all related topics, but they do not directly answer our question. In such cases, re-ranking helps us sort through these documents and prioritize the ones that actually cover the history of pizza.
How Does Re-Ranking in RAG Work?
Initial Retrieval
The retriever model pulls a broad set of candidate documents based on the input query. These documents are initially ranked using basic scoring methods.
Scoring and Ranking
The scores from this initial retrieval are only a rough estimate of relevance. They often fail to capture the context of the query.
Re-Ranking
A more sophisticated re-ranking model reassesses the relevance of each document. This model can leverage advanced features and techniques, such as:
- Cross-Encoders: Jointly encoding the query and document to provide a more precise relevance score.
- BERT-based Re-Rankers: Utilizing deep learning models like BERT that excel at understanding context and semantics.
Selection of Top Documents
The re-ranked documents are filtered to retain only the top ones, which are then fed into the generator.
Generation
The generator produces a final response using contextually rich and highly relevant documents.
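The stages above can be sketched end to end with toy scoring functions. This is an illustration under stated assumptions, not a real retriever: `first_stage_score` stands in for vector search, `rerank_score` stands in for a cross-encoder, and both functions, the corpus, and the parameter names are hypothetical.

```python
def first_stage_score(query, doc):
    """Cheap stand-in for vector search: count occurrences of query words in the document."""
    query_words = set(query.lower().split())
    return sum(1 for word in doc.lower().split() if word in query_words)


def rerank_score(query, doc):
    """Toy stand-in for a cross-encoder that looks at query and document together.

    An exact phrase match scores 1.0; otherwise the score is half the fraction
    of distinct query words found in the document.
    """
    query_lower, doc_lower = query.lower(), doc.lower()
    if query_lower in doc_lower:
        return 1.0
    query_words = set(query_lower.split())
    found = query_words & set(doc_lower.split())
    return 0.5 * len(found) / len(query_words)


def retrieve_and_rerank(query, corpus, n_candidates, n_final):
    # Initial retrieval: pull a broad candidate set with the cheap score.
    candidates = sorted(
        corpus, key=lambda d: first_stage_score(query, d), reverse=True
    )[:n_candidates]
    # Re-ranking: re-score only the candidates with the more careful scorer.
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    # Selection: keep the top documents that would be passed to the generator.
    return reranked[:n_final]


corpus = [
    "a history of bread and of baking and of ovens and of flour",
    "the making of cheese and the history of cheese in italy",
    "the history of pizza from early flatbreads to modern pizzerias",
    "a guide to hiking trails",
]
query = "history of pizza"

first_stage_best = max(corpus, key=lambda d: first_stage_score(query, d))
top_docs = retrieve_and_rerank(query, corpus, n_candidates=3, n_final=2)
print("First-stage favourite:", first_stage_best)
print("After re-ranking:", top_docs[0])
```

In this toy corpus the cheap score favours the bread document because it repeats common query words, while the second pass moves the pizza document to the top.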
Using Re-Ranking to Retrieve Enhanced Responses
As discussed above, re-ranking can improve the quality of RAG results. Here, we will use RAG to retrieve relevant passages from a document and then apply re-ranking to order those passages. With re-ranking, the most relevant passages come first, which gives the generator better context and can lead to more accurate answers.
To begin with, install the required packages and import all the libraries.
%pip install pypdf langchain-community langchain-openai langchain-text-splitters langchain-chroma sentence-transformers transformers torch torchvision
# OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from langchain_chroma import Chroma
from sentence_transformers import CrossEncoder
Next, load the document. It can be a local file, a dataset, or content from the web. Here, we use LangChain's PyPDFLoader to load a PDF.
loader = PyPDFLoader("./Document/Harry Potter and the Sorcerers Stone.pdf")
pages = loader.load_and_split()
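If the document is very large, we can split it into smaller chunks as part of preprocessing. Here, the chunk size and overlap are measured in tokens, counted with a Hugging Face tokenizer.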
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2"),
    chunk_size=256,
    chunk_overlap=16,
    strip_whitespace=True,
)
docs = text_splitter.split_documents(pages)
Let us use OpenAIEmbeddings to embed the document and store it in a vector database.
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectordb = Chroma.from_documents(documents=docs, embedding=embeddings)
Now, we have a knowledge base in a vector database. We can give any query and use similarity search to find the contexts that are related to our query.
query = "What are the names of four houses in Hogwarts?"
docsnew = vectordb.similarity_search(query)
print(docsnew[0].page_content)
The output will be something like this:
could hear the drone of hundreds of voices from a doorway to the right-the rest of the school must already be here — but Professor McGonagall showed the first years into a small, empty chamber off the hall. They crowded in, standing rather closer together than they would usually have done, peering about nervously. “Welcome to Hogwarts,” said Professor McGonagall. “The start-of-term banquet will begin shortly, but before you take your seats in the Great Hall, you will be sorted into your houses. The Sorting is a very important ceremony because, while you are here, your house will be something like your family within Hogwarts. You will have classes with the rest of your house, sleep in your house dormitory, and spend free time in your house common room. “The four houses are called Gryffindor, Hufflepuff, Ravenclaw, and
The similarity search returns several such chunks (four by default in LangChain's Chroma wrapper). To bring the most relevant one to the top, we will re-rank them with a cross-encoder.
cross_encoder = CrossEncoder(
"cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu"
)
By using the cross-encoder, we can score each retrieved chunk against the query and view the scores.
# Cross-encoder re-ranker
# Note: this replaces the TinyBERT model loaded above with the larger MiniLM-L-6 model
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
document_texts = [doc.page_content for doc in docsnew]
# Build one (query, chunk) pair per retrieved chunk
pairs = [[query, doc_text] for doc_text in document_texts]
# A higher score means the chunk is judged more relevant to the query
scores = cross_encoder.predict(pairs)
print("Scores:")
for score in scores:
    print(score)
The retrieved chunks can then be ordered according to these scores.
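A minimal sketch of that ordering step is shown below. The helper `rank_by_score` and the sample texts and scores are made up for illustration; in the walkthrough above the inputs would be `document_texts` and `scores`.

```python
def rank_by_score(texts, scores):
    """Return (score, text) pairs sorted from the highest to the lowest score."""
    pairs = [(float(score), text) for score, text in zip(scores, texts)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Hypothetical chunks and scores, standing in for real cross-encoder output.
sample_texts = [
    "chunk about the Sorting ceremony",
    "chunk about the train journey",
    "chunk listing the four houses",
]
sample_scores = [2.1, -4.3, 6.8]

ranked = rank_by_score(sample_texts, sample_scores)
for score, text in ranked:
    print(f"{score:.2f}\t{text}")
```

With the article's variables, the call would be `rank_by_score(document_texts, scores)`, and the first entry is the chunk the cross-encoder considers most relevant.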
We used RAG to find and retrieve the passages related to the input query. The initial retrieval returned several passages. To rank them, we used a cross-encoder and ordered them by relevance score.
Thus, re-ranking made it possible to surface the most suitable passage for the given query, which can improve the accuracy of the RAG system. Here, we used a cross-encoder to re-rank. Other options include FlashRank, the ColBERTv2 model, and many others.
Conclusion
In conclusion, re-ranking is a vital component of RAG systems: it improves the quality of search results by prioritizing the most relevant documents. It turns retrieval into a two-stage process, in which a fast first-stage search gathers candidates and a re-ranker then evaluates the relevance of each candidate to the query. Choosing a suitable re-ranking model can help a RAG system reduce hallucinations and return more dependable results.