A Practical Guide to Text Generation from Complex PDFs using RAG with LlamaParse

Understand and implement advanced RAG on complex PDFs with LlamaParse.

Retrieval-Augmented Generation (RAG) is one of the most powerful and fastest-evolving techniques for working with large language models. Using RAG in generative AI applications improves contextual understanding and lowers the risk of hallucinations. However, RAG struggles with complex PDFs: documents that mix different kinds of objects, irregular layouts and diverse data formats. This article walks through a practical approach to implementing RAG on complex PDFs using LlamaParse, which helps RAG generate more accurate and contextually relevant responses.

Table of Contents

  1. Understanding Complex PDFs and RAG Challenges
  2. Understanding LlamaParse
  3. Features of LlamaParse
  4. Implementing RAG on Complex PDFs using LlamaParse

Understanding Complex PDFs 

Complex PDFs are documents that go beyond plain text and simple layouts. They contain elements that make them difficult to process and query with RAG, including:

  1. Embedded Objects: Images, web pages, and audio or video files can be embedded within a PDF. These elements are not readily accessible as text for RAG to process.
  2. Non-standard Layouts: Complex PDFs may have multi-column layouts, irregular text placement, or artistic elements that disrupt the linear flow of text, making it difficult for RAG to extract meaningful information accurately.
  3. Scanned Data: Some PDFs are scans of physical documents, or contain scanned pages, with low-quality text, OCR errors or missing structure, all of which are challenging for RAG.
  4. Vector Graphics and Charts: Complex drawings, shapes and diagrams may be stored as vector graphics rather than raster images, which RAG systems find difficult to interpret.
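
To see why a multi-column layout trips up naive text extraction, here is a minimal sketch that is not tied to any PDF library. It assumes each text fragment is already available as a hypothetical (x, y, text) tuple with y growing downward, and that the page has two columns split at a known x position. The fragments, the split position and the function names are all invented for illustration.

```python
# Toy illustration: reading order on a two-column page.
# Each fragment is (x, y, text); y grows downward. The fragments and
# the column boundary are made up for this example.
fragments = [
    (50, 100, "Cells need"),
    (300, 100, "The chain"),
    (50, 120, "energy."),
    (300, 120, "makes ATP."),
]


def naive_order(frags):
    """Read strictly top to bottom, then left to right (ignores columns)."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]


def column_aware_order(frags, column_split_x):
    """Read the whole left column first, then the whole right column."""
    left = [f for f in frags if f[0] < column_split_x]
    right = [f for f in frags if f[0] >= column_split_x]
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for _, _, text in ordered]


naive_text = " ".join(naive_order(fragments))
column_text = " ".join(column_aware_order(fragments, column_split_x=200))
print(naive_text)   # sentences from the two columns are interleaved
print(column_text)  # each column is read as a unit
```

The naive ordering mixes the two columns line by line, which is the kind of scrambled context a retriever would then index.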

Understanding LlamaParse 

LlamaParse, a component of LlamaCloud, integrates directly with LlamaIndex. It is a document parsing platform built to parse and clean data for optimal RAG implementation, and it comes with table extraction, JSON mode, image extraction and foreign-language support.

LlamaParse Web UI

Features of LlamaParse

LlamaParse offers a variety of benefits for RAG-based systems:

  1. LlamaParse analyses the complete PDF structure, including embedded objects, layout, and even vector graphics, and goes beyond simple text extraction. 
  2. LlamaParse recognises and understands relationships between different PDF elements, such as image captions or text surrounding images. 
  3. LlamaParse converts the information extracted from a complex PDF into a format more suitable for building an advanced generative AI model using RAG.
  4. LlamaParse is a hosted service whose Python client (llama-parse) is open source, and it integrates seamlessly with LLM orchestration frameworks such as LlamaIndex.
  5. LlamaParse is available as a Python package (llama-parse), a Web UI (https://cloud.llamaindex.ai/parse) and a REST API, and it handles file formats such as PDF, DOCX, CSV and EPUB, among others.

Implementing RAG on Complex PDFs using LlamaParse

Step 1: Install the necessary libraries 

We need llama-index and llama-parse for document reading, PDF parsing, vector index creation and the query engine that runs our queries.

!pip install llama-index llama-parse python-dotenv

Step 2: Import the libraries

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
import os
import nest_asyncio
from dotenv import load_dotenv
from IPython.display import Markdown, display
from google.colab import userdata
nest_asyncio.apply()

Step 3: Set up the OpenAI API and LlamaCloud API Keys

Obtain a LlamaCloud API key from https://cloud.llamaindex.ai/api-key.

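# Keep the key in a variable too; the parser setup below passes it as api_key.
llamaparse_api = userdata.get("LLAMA_CLOUD_API_KEY")
os.environ["LLAMA_CLOUD_API_KEY"] = llamaparse_api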
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_APIKEY")
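
Outside Colab, userdata is not available. A minimal sketch, assuming the keys are already exported as environment variables (for instance after loading a .env file with load_dotenv()); require_env is a hypothetical helper, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (commented out so the snippet runs without real keys):
# llamaparse_api = require_env("LLAMA_CLOUD_API_KEY")
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early with the variable name in the message is usually easier to debug than an authentication error raised later by the parsing or embedding call.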

Step 4: Set up LlamaParse

Pass the LlamaCloud API key as api_key and set result_type to "markdown" ("text" is also available).

parser = LlamaParse(
   api_key=llamaparse_api,
   result_type="markdown"  # "markdown" and "text" are available
)
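
One reason to prefer result_type="markdown" is that tables can be represented as pipe-delimited rows, which downstream code can still read as a table. A minimal sketch using a hand-written Markdown table; the table content and the parse_markdown_table helper are hypothetical and are neither LlamaParse output nor part of its API:

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Turn a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]

    def cells(line: str) -> list[str]:
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2.
    return [dict(zip(header, cells(ln))) for ln in lines[2:]]


# Hand-written sample table, used only to exercise the helper.
sample = """
| Complex | Name |
|---------|------|
| I | NADH dehydrogenase |
| IV | Cytochrome c oxidase |
"""
rows = parse_markdown_table(sample)
print(rows[1]["Name"])
```

With plain-text output the same table would typically arrive as a run of words with no column boundaries, so the row and column relationships would have to be guessed.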

Step 5: Generate two sets of documents

One set is loaded through LlamaParse and the other without it: documents1 holds the parsed PDF, whereas documents2 holds the non-parsed version.

file_extractor = {".pdf": parser}

documents1 = SimpleDirectoryReader(input_files=['/content/data/cellbiology.pdf'], file_extractor=file_extractor).load_data()

documents2 = SimpleDirectoryReader(input_files=['/content/data/cellbiology.pdf']).load_data()
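
Before indexing, it can help to look at what each loader produced. A minimal sketch, assuming each loaded document exposes its content as a .text string; preview is a hypothetical helper, and the stand-in documents are invented so the snippet runs on its own:

```python
from types import SimpleNamespace


def preview(documents, max_chars=80):
    """Return (number of documents, total characters, start of first document)."""
    total = sum(len(doc.text) for doc in documents)
    head = documents[0].text[:max_chars] if documents else ""
    return len(documents), total, head


# Stand-in objects; in the notebook, pass documents1 or documents2 instead.
fake_docs = [
    SimpleNamespace(text="# Heading\n\n| a | b |"),
    SimpleNamespace(text="plain text"),
]
count, total_chars, head = preview(fake_docs)
print(count, total_chars, repr(head))
```

Comparing the previews of documents1 and documents2 gives a quick sense of whether headings and tables survived extraction.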

Step 6: Create the vector index and a query engine

In the code below, index1 and query_engine1 are built from documents1 (the PDF parsed with LlamaParse), whereas index2 and query_engine2 are built from documents2 (the non-parsed PDF).

index1 = VectorStoreIndex.from_documents(documents1)
query_engine1 = index1.as_query_engine()

index2 = VectorStoreIndex.from_documents(documents2)
query_engine2 = index2.as_query_engine()
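
Conceptually, a vector index stores one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's. The sketch below is a toy stand-in, not how VectorStoreIndex is implemented: it assumes a bag-of-words count vector in place of a learned embedding, and ToyVectorIndex is a hypothetical name.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorIndex:
    """Stores (chunk, vector) pairs and ranks them against a query vector."""

    def __init__(self, chunks):
        self.entries = [(chunk, embed(chunk)) for chunk in chunks]

    def retrieve(self, query, top_k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


index = ToyVectorIndex([
    "The electron transport chain transfers electrons to oxygen.",
    "Ribosomes translate messenger RNA into protein.",
    "The cell membrane is a lipid bilayer.",
])
top = index.retrieve("Which chain transfers electrons?", top_k=1)
print(top[0])
```

The quality of retrieval depends on the text of each chunk, which is why a cleaner parse of the PDF can change which chunks the query engine sees.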

Step 7: Execute the query

The code below executes the query, “What is the equation for the overall reaction catalysed by the electron transport chain?”, using both parsed and non-parsed query engines. 

query1 = "What is the equation for the overall reaction catalysed by the electron transport chain?"
response = query_engine1.query(query1)
display(Markdown(f"<b>{response}</b>"))

query2 = "What is the equation for the overall reaction catalysed by the electron transport chain?"
response = query_engine2.query(query2)
display(Markdown(f"<b>{response}</b>"))
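
Since both engines receive the same question, the two calls can be folded into one helper. A minimal sketch, assuming only that each engine exposes a query() method that takes a string, as query_engine1 and query_engine2 do above; compare_engines and the stub engine are hypothetical and exist only so the snippet runs without API keys:

```python
def compare_engines(question, engines):
    """Ask every engine the same question and collect the answers by label."""
    return {label: str(engine.query(question)) for label, engine in engines.items()}


class StubEngine:
    """Stand-in for a query engine; returns a canned answer."""

    def __init__(self, answer):
        self.answer = answer

    def query(self, question):
        return self.answer


answers = compare_engines(
    "What does the stub say?",
    {"parsed": StubEngine("answer A"), "non-parsed": StubEngine("answer B")},
)
for label, answer in answers.items():
    print(f"{label}: {answer}")

# With the real engines this would be:
# compare_engines(query1, {"parsed": query_engine1, "non-parsed": query_engine2})
```

Keeping the question in one place avoids the two query strings drifting apart when the comparison is repeated with other questions.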

Output for parsed PDF:

Output for non-parsed PDF: 

The query executed on the parsed PDF gives a detailed and correct response that can be verified against the PDF, whereas the query executed on the non-parsed PDF does not return the correct answer.

PDF data screenshot showing the correct answer as per the query:

Final Words

LlamaParse is an excellent choice for RAG-based applications that need to analyse and understand complex PDF documents. It reads and extracts complex PDF content and converts the raw data into a format suited to RAG, which helps RAG pipelines work from a better representation of the data and return more accurate and contextually relevant results.


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
