Retrieval Augmented Generation (RAG) is one of the most powerful and rapidly evolving techniques for large language models. Implementing RAG in generative AI applications improves contextual understanding and lowers the risk of hallucinations. But when it comes to complex PDFs, documents that mix different objects, irregular layouts, and diverse data formats, RAG struggles. This article showcases a practical approach to implementing RAG on complex PDFs using LlamaParse, which enables RAG to generate more accurate and contextually relevant responses.
Table of Contents
- Understanding Complex PDFs and RAG Challenges
- Understanding LlamaParse
- Features of LlamaParse
- Implementing RAG on Complex PDFs using LlamaParse
Understanding Complex PDFs and RAG Challenges
Complex PDFs are documents that go beyond traditional text and layouts. They contain elements that make them difficult for RAG to process and query. These elements include:
- Embedded Objects: Images, web pages, and audio and video files can be embedded within a PDF. These elements are not readily accessible as text data for RAG to process.
- Non-standard Layouts: Complex PDFs may have multi-column layouts, irregular text placement, or artistic elements that disrupt the linear flow of text, making it difficult for RAG to accurately understand and extract meaningful information.
- Scanned Data: Certain PDFs are scanned versions of physical documents or contain scanned pages, which often have low-quality text, OCR errors, or missing structure, making them challenging for RAG.
- Vector Graphics and Charts: Complex drawings, shapes, and diagrams may be stored as vector graphics rather than raster images, which are difficult for RAG systems to interpret and utilize.
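To see why non-standard layouts trip up naive extraction, consider a hypothetical two-column page. The sketch below uses made-up strings rather than a real PDF library: a reader that scans each visual line straight across interleaves fragments from both columns, while a layout-aware parser keeps each column intact.

```python
# Hypothetical two-column page: each list holds the visual lines of one column.
left_col = ["Cells are the basic", "units of life."]
right_col = ["Mitochondria produce", "ATP for the cell."]

# A naive extractor reads each visual line straight across the page,
# stitching together unrelated fragments from both columns.
naive_text = "\n".join(f"{l}  {r}" for l, r in zip(left_col, right_col))

# A layout-aware parser reads each column to completion before moving on,
# preserving the sentences as the author wrote them.
layout_aware = " ".join(left_col) + "\n" + " ".join(right_col)

print(naive_text)     # interleaved, hard for retrieval to use
print(layout_aware)   # coherent sentences, easy to chunk and embed
```

The interleaved output is what RAG ends up embedding when the parser ignores layout, which is why retrieval over such text often fails.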
Understanding LlamaParse
LlamaParse, a component of LlamaCloud, integrates directly with LlamaIndex and is a document parsing platform built to parse and clean data for optimal RAG implementation. It comes pre-equipped with table extraction, JSON mode, image extraction, and foreign-language support.
Features of LlamaParse
LlamaParse offers a variety of benefits for RAG-based systems:
- LlamaParse goes beyond simple text extraction, analysing the complete PDF structure, including embedded objects, layout, and even vector graphics.
- LlamaParse recognises and understands relationships between different PDF elements, such as image captions or text surrounding images.
- LlamaParse converts the information extracted from a complex PDF into a format more suitable for building an advanced generative AI model using RAG.
- LlamaParse's Python client is open source and integrates seamlessly with LLM orchestration frameworks such as LlamaIndex.
- LlamaParse is available as a Python package (llama-parse), a web UI (https://cloud.llamaindex.ai/parse), and a REST API, and it can handle different file formats such as PDF, DOCX, CSV, EPUB, etc.
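One reason markdown output matters for RAG is that tables survive as rows and columns instead of a flat word stream. The sketch below uses an invented biology table and a toy lookup helper, not actual LlamaParse output, to show why the structured form is easier to query.

```python
# Flat text from a naive extractor: row/column boundaries are lost.
raw_text = "Stage ATP yield Glycolysis 2 Krebs cycle 2 Electron transport chain ~34"

# The same table as markdown, the kind of structure a parser like
# LlamaParse aims to preserve: each row and cell is recoverable.
markdown_table = (
    "| Stage | ATP yield |\n"
    "|---|---|\n"
    "| Glycolysis | 2 |\n"
    "| Krebs cycle | 2 |\n"
    "| Electron transport chain | ~34 |"
)

def atp_for(stage: str, table_md: str):
    """Look up a value by row label -- trivial once structure survives."""
    for row in table_md.splitlines():
        cells = [c.strip() for c in row.strip("|").split("|")]
        if cells and cells[0] == stage:
            return cells[1]
    return None

print(atp_for("Electron transport chain", markdown_table))
```

Answering the same question from `raw_text` would require guessing where each row ends, which is exactly the ambiguity that degrades retrieval quality.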
Implementing RAG on Complex PDFs using LlamaParse
Step 1: Install the necessary libraries
We need llama-index and llama-parse for the directory reader, PDF parsing, vector index creation, and the query engine used to run our queries.
!pip install llama-index llama-parse python-dotenv
Step 2: Import the libraries
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os
import nest_asyncio
from dotenv import load_dotenv
from IPython.display import Markdown, display
from google.colab import userdata
nest_asyncio.apply()
Step 3: Set up the OpenAI API and LlamaCloud API Keys
Obtain a LlamaCloud API key from https://cloud.llamaindex.ai/api-key and store both keys as Colab secrets.
llamaparse_api = userdata.get("LLAMA_CLOUD_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_APIKEY")
Step 4: Setup LlamaParse
Define api_key with the LlamaCloud API key and result_type as markdown (text is also available).
parser = LlamaParse(
api_key=llamaparse_api,
result_type="markdown" # "markdown" and "text" are available
)
Step 5: Generate two sets of documents, one from the parsed PDF and one without parsing. documents1 holds the LlamaParse-parsed PDF, whereas documents2 is loaded without parsing.
file_extractor = {".pdf": parser}
documents1 = SimpleDirectoryReader(input_files=['/content/data/cellbiology.pdf'], file_extractor=file_extractor).load_data()
documents2 = SimpleDirectoryReader(input_files=['/content/data/cellbiology.pdf']).load_data()
Step 6: Create the vector index and a query engine
In the code below, index1 and query_engine1 are built on the LlamaParse-parsed documents1, while index2 and query_engine2 are built on the non-parsed documents2.
index1 = VectorStoreIndex.from_documents(documents1)
query_engine1 = index1.as_query_engine()
index2 = VectorStoreIndex.from_documents(documents2)
query_engine2 = index2.as_query_engine()
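For intuition, the retrieval step behind as_query_engine boils down to embedding the query and ranking chunks by vector similarity. The toy sketch below uses hand-made three-dimensional vectors in place of real embeddings; it is an illustration of the idea, not LlamaIndex internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" standing in for the vectors an embedding model produces.
chunks = {
    "The ETC pumps protons across the membrane.": [0.9, 0.1, 0.2],
    "Glycolysis occurs in the cytoplasm.": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.15, 0.25]  # pretend embedding of the user's question

# Retrieval: pick the chunk whose vector is most similar to the query's.
best = max(chunks, key=lambda c: cosine(chunks[c], query_vec))
print(best)
```

The better the parser preserves the document's content in each chunk, the more likely the nearest chunk actually contains the answer, which is where LlamaParse earns its keep.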
Step 7: Execute the query
The code below executes the query, “What is the equation for the overall reaction catalysed by the electron transport chain?”, using both parsed and non-parsed query engines.
query1 = "What is the equation for the overall reaction catalysed by the electron transport chain?"
response = query_engine1.query(query1)
display(Markdown(f"<b>{response}</b>"))
query2 = "What is the equation for the overall reaction catalysed by the electron transport chain?"
response = query_engine2.query(query2)
display(Markdown(f"<b>{response}</b>"))
Output for parsed PDF:
Output for non-parsed PDF:
The query executed on the parsed PDF gives a detailed, correct response that can be verified against the PDF data, whereas the query executed on the non-parsed PDF does not give the correct output.
PDF data screenshot showing the correct answer as per the query:
Final Words
LlamaParse is an excellent choice for RAG-based applications that need to analyse and understand complex PDF documents. Its ability to read and extract complex PDF content and convert the raw data into a format well suited to RAG helps RAG models build a solid understanding of the data, leading to more accurate and contextually relevant results.