A Hands-on Guide to Enhance RAG with Re-Ranking

Learn how re-ranking in Retrieval-Augmented Generation boosts relevance, enhancing summarization and question answering accuracy.

Retrieval-Augmented Generation (RAG) is useful for summarising and answering questions. It combines the generative abilities of Large Language Models (LLMs) with information retrieval. However, the initial retrieval step of a RAG system usually returns multiple documents, not all of which are relevant to the query. This is where re-ranking becomes important. Re-ranking reorders and filters the retrieved documents: a similarity search first finds candidate documents, and a re-ranker then orders them by relevance score. In this article, we will look at how re-ranking works and use it to rank the documents retrieved by RAG.

Table of Contents

  1. Retrieval-Augmented Generation (RAG)
  2. Understanding Re-Ranking
  3. How Does Re-Ranking in RAG Work?
  4. Using Re-Ranking to Retrieve Enhanced Responses

Now, let us dive deep into the re-ranking method, understand how it works, and implement it.

Retrieval-Augmented Generation (RAG)

RAG is a hybrid approach that combines retrieval-based and generation-based methods. It addresses complex queries by retrieving relevant documents from a large corpus and then generating a response based on the retrieved information. 

The RAG pipeline consists of two main stages:

Retriever: This retrieves the documents that are associated with the input query.

Generator: This produces a coherent, contextually relevant answer using the retrieved documents.

Understanding Re-Ranking

In RAG, the key task is to find the relevant documents within a large collection. To make this tractable, we convert the documents into vectors so that they can be compared with the query using measures such as cosine similarity.
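To make the comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the document names are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "pizza_history": [0.9, 0.1, 0.8],
    "bread_recipes": [0.2, 0.9, 0.1],
}

# Rank document names from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

A vector database performs essentially this comparison, only at scale and with approximate nearest-neighbour indexes instead of a full sort.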

However, transforming documents into vectors can cause a loss of information as vectors are simplified numerical representations of content. Also, larger documents often need to be split into smaller parts to create these vectors, which can make it difficult to keep the original context intact.

When using vector search in RAG, losing context can be a problem. This happens because we usually only look at the top results from the vector search, possibly missing other relevant information. As a result, if the most relevant parts aren’t included in these top results, the language model might generate a less accurate or useful response.

Re-ranking is a technique to enhance the retrieval process. It refines the initial set of retrieved documents. This ensures that the most relevant documents are prioritized for the generation of responses.

For example, if we search for the “history of pizza”, the system might retrieve documents about bread, cheese, and Italian cuisine. These are all related topics, but they do not directly answer our question. In such cases, re-ranking helps us sort through these documents and prioritize the ones that actually cover the history of pizza.

How Does Re-Ranking in RAG Work?

Initial Retrieval

The retriever model pulls a broad set of candidate documents based on the input query. These documents are initially ranked using basic scoring methods.

Scoring and Ranking

In this initial retrieval, the scores are only a rough estimate of relevance and often fail to capture the context of the query.

Re-Ranking

A more sophisticated re-ranking model reassesses the relevance of each document. This model can leverage advanced features and techniques, such as:

  • Cross-Encoders: Jointly encoding the query and document to provide a more precise relevance score.
  • BERT-based Re-Rankers: Utilizing deep learning models like BERT that excel at understanding context and semantics.

Selection of Top Documents

The re-ranked documents are filtered to retain only the top ones, which are then fed into the generator.

Generation

The generator produces a final response using contextually rich and highly relevant documents.
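The stages above can be sketched as a small two-stage function. This is an illustrative sketch only: `retrieve_then_rerank`, `word_overlap`, and `phrase_aware_score` are hypothetical names, and the two toy scorers merely stand in for a vector similarity search (cheap, first stage) and a cross-encoder (more precise, second stage).

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    n_candidates: int = 10,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    # Stage 1: broad retrieval with a fast, rough score.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:n_candidates]
    # Stage 2: rescore only the candidates with the slower, more precise score.
    rescored = [(doc, precise_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    # Keep only the top documents for the generator.
    return rescored[:top_k]

def word_overlap(query: str, doc: str) -> float:
    """Toy first-stage score: number of query words that appear in the document."""
    q_words = set(query.lower().split())
    d_words = {w.strip(".,") for w in doc.lower().split()}
    return float(len(q_words & d_words))

def phrase_aware_score(query: str, doc: str) -> float:
    """Toy second-stage score: word overlap plus a bonus for the exact phrase."""
    bonus = 2.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus

corpus = [
    "Bread is baked from flour and water.",
    "Cheese is made from milk.",
    "The history of pizza begins in Naples.",
    "Italian cuisine includes pizza and pasta.",
]
result = retrieve_then_rerank(
    "history of pizza", corpus, word_overlap, phrase_aware_score,
    n_candidates=3, top_k=1,
)
print(result)
```

The design point is that the expensive scorer only ever sees `n_candidates` documents, so its cost stays bounded regardless of corpus size.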

Using Re-Ranking to Retrieve Enhanced Responses

As we have seen, re-ranking can make RAG more effective. Here, we will use RAG to retrieve relevant passages from a document and then re-rank those retrieved passages. Re-ranking helps surface the most relevant passages, which improves the quality and accuracy of the generated answer.

To begin with, install the required packages and import all the libraries.

%pip install pypdf langchain-community langchain-openai langchain-text-splitters langchain-chroma sentence-transformers transformers torch
# PDF loading, chunking, embedding, and vector storage
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer
from langchain_chroma import Chroma
# Cross-encoder used for re-ranking
from sentence_transformers import CrossEncoder

Next, load the source data, which can be a local document, a dataset, or web pages. Here, we use LangChain's PyPDFLoader to load a PDF.

loader = PyPDFLoader("./Document/Harry Potter and the Sorcerers Stone.pdf")
pages = loader.load_and_split()

If the document is very large, we can split it into smaller chunks during preprocessing, measuring chunk length in tokens.

text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
   tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L12-v2"),
   chunk_size=256,
   chunk_overlap=16,
   strip_whitespace=True,
)
docs = text_splitter.split_documents(pages)
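To show what chunk_size and chunk_overlap mean, here is a simplified sliding-window sketch over a list of tokens. It is not how RecursiveCharacterTextSplitter works internally (that splitter breaks text on separators and then merges pieces up to the size limit); `chunk_tokens` is a hypothetical helper used only to illustrate the two parameters.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# Placeholder tokens; a real pipeline would use the tokenizer's output.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, which is the role the overlap plays: text that straddles a chunk boundary still appears intact in at least one neighbouring chunk, up to the overlap length.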

Let us use OpenAIEmbeddings to embed the chunks and store them in a vector database. This step requires an OpenAI API key, for example in the OPENAI_API_KEY environment variable.

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
vectordb = Chroma.from_documents(documents=docs, embedding=embeddings)

Now, we have a knowledge base in a vector database. We can give any query and use similarity search to find the contexts that are related to our query.

query = "What are the names of four houses in Hogwarts?"
docsnew = vectordb.similarity_search(query)
print(docsnew[0].page_content)

The output will be something like this:

could hear the drone of hundreds of voices from a doorway to the right - the rest of the school must already be here - but Professor McGonagall showed the first years into a small, empty chamber off the hall. They crowded in, standing rather closer together than they would usually have done, peering about nervously. “Welcome to Hogwarts,” said Professor McGonagall. “The start-of-term banquet will begin shortly, but before you take your seats in the Great Hall, you will be sorted into your houses. The Sorting is a very important ceremony because, while you are here, your house will be something like your family within Hogwarts. You will have classes with the rest of your house, sleep in your house dormitory, and spend free time in your house common room. “The four houses are called Gryffindor, Hufflepuff, Ravenclaw, and

The search returns several such chunks. To bring the most relevant one to the top, we will re-rank them with a cross-encoder.

cross_encoder = CrossEncoder(
   "cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu"
)

Using the cross-encoder, we can score each retrieved chunk against the query and view the scores.

# Cross-encoder re-ranker: reuse the cross_encoder loaded above
document_texts = [doc.page_content for doc in docsnew]
# Build one [query, chunk] pair per retrieved chunk
pairs = [[query, doc_text] for doc_text in document_texts]
# predict returns one relevance score per pair
scores = cross_encoder.predict(pairs)

print("Scores:")
for score in scores:
   print(score)

We can also order the responses according to these scores.
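A minimal sketch of that ordering step follows. `rank_by_score` is a hypothetical helper, and the sample texts and scores are invented so the snippet runs on its own; with the variables from the code above, the call would be `rank_by_score(document_texts, [float(s) for s in scores])`.

```python
def rank_by_score(texts, scores, top_k=None):
    """Return (score, text) pairs sorted from most to least relevant."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    ranked = sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)
    return ranked[:top_k]  # top_k=None keeps every pair

# Invented placeholder chunks and scores, for illustration only.
sample_texts = [
    "chunk about the Sorting ceremony",
    "chunk about the Hogwarts Express",
    "chunk about Quidditch",
]
sample_scores = [7.2, -3.1, 0.4]

for score, text in rank_by_score(sample_texts, sample_scores):
    print(f"{score:6.2f}  {text}")
```

Passing `top_k` keeps only the best few chunks, which is what would normally be handed to the generator.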

We used RAG to find and retrieve the responses related to the input query. The initial search returned several responses. To rank them, we used a cross-encoder and ordered them by relevance score.

Thus, with re-ranking, we could bring the most suitable response for the given query to the top, which in turn can improve the accuracy of the RAG system. Here, we used a cross-encoder to re-rank. Other options include FlashRank, the ColBERTv2 model, and many others.

Conclusion

In conclusion, re-ranking is a vital component of RAG systems: it improves the quality of search results by prioritizing the most relevant documents. Re-ranking turns retrieval into a two-stage process, in which the re-ranker evaluates the relevance of each candidate document to the query. Choosing a suitable re-ranking model helps reduce hallucinations and makes search results more dependable.


Shreepradha Hegde

Shreepradha is an accomplished Associate Lead Consultant at AIM, showcasing expertise in AI and data science, specifically Generative AI. With a wealth of experience, she has consistently demonstrated exceptional skills in leveraging advanced technologies to drive innovation and insightful solutions. Shreepradha's dedication and strategic mindset have made her a valuable asset in the ever-evolving landscape of artificial intelligence and data science.
