Hello World to LlamaCloud and LlamaParse

Complex Document Parser

Xin Cheng
3 min readApr 11, 2024

To chat with your documents, you need a robust parser that can load a document and structure it in a machine-readable way (e.g. sections, text, tables, charts). PDF is a popular document format, and traditionally it has been handled by libraries like pypdf and PyMuPDF. Now there is one more option.

LlamaParse converts complex PDFs with embedded tables and charts/figures to markdown format. The benefits of markdown: it is simple, it is plain text, and a lot of the data LLMs are trained on comes from the Internet in markdown format.

LlamaCloud: managed parsing, ingestion, and retrieval services

LlamaParse: compared with a traditional PDF parser, a RAG pipeline returns more correct answers with LlamaParse

Currently we primarily support PDFs with tables, but we are also building out better support for figures, and an expanded set of the most popular document types: .docx, .pptx, .html.

If this roadmap is realized, it will save developers from worrying about different document formats; they only need to care about the markdown output.
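
As a quick illustration, here is a minimal sketch of parsing a single PDF to markdown with LlamaParse alone (it assumes LLAMA_CLOUD_API_KEY is set in the environment, and sample.pdf is just a placeholder file name):

from llama_parse import LlamaParse

# parse one PDF into markdown Document objects
parser = LlamaParse(result_type="markdown")
docs = parser.load_data("./sample.pdf")
print(docs[0].text[:500])  # preview the first 500 characters of markdown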

Getting Started Guide

You can follow the link above to register for LlamaCloud and use LlamaParse.

Setup

pip3 install llama-index llama-parse python-dotenv llama-index-readers-file

If you encounter weird import issues, uninstall and force reinstall:

pip3 uninstall llama-index
pip3 install llama-index llama-parse python-dotenv llama-index-readers-file --upgrade --no-cache-dir --force-reinstall
pip3 install llama-index-embeddings-azure-openai llama-index-llms-azure-openai
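
To confirm the installation before running the full example, a quick sanity check (a minimal sketch; run it in the same Python environment you installed into):

# verify the freshly installed packages can be imported
import llama_index.core
import llama_parse
import dotenv
print("imports OK")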

Test PDF

Test

Configure the .env file accordingly.
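
The code below reads the following environment variables; here is a sketch of the expected .env file (all values are placeholders, substitute your own keys, endpoint, deployments, and API version):

# placeholders only; replace with your own values
LLAMA_CLOUD_API_KEY=<your-llamacloud-api-key>
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_VERSION=<your-azure-openai-api-version>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_MODEL_NAME=<your-chat-model-name>
AZURE_OPENAI_DEPLOYMENT_NAME=<your-chat-deployment-name>
AZURE_OPENAI_EMBEDDINGS_MODEL_NAME=<your-embedding-model-name>
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME=<your-embedding-deployment-name>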

# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()

# bring in deps
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings

# from llama_index import set_global_service_context

import os
import openai

azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("AZURE_OPENAI_API_VERSION")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_key = api_key

# create LLM and Embedding Model
# https://docs.llamaindex.ai/en/stable/examples/customization/llms/AzureOpenAI/
embed_model = AzureOpenAIEmbedding(
    model=os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"),
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

llm = AzureOpenAI(
    model=os.getenv("AZURE_OPENAI_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    temperature=0,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/
Settings.llm = llm
Settings.embed_model = embed_model

# set up parser
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=['libor_transition.pdf'], file_extractor=file_extractor).load_data()
print(documents)

# create an index from the parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create a query engine for the index
query_engine = index.as_query_engine()

# query the engine
query = "What Is LIBOR?"
response = query_engine.query(query)
print(response)

query = "What Is estimate value of LIBOR in the document?"
response = query_engine.query(query)
print(response)
# The estimated value of outstanding assets referencing LIBOR in the document is $24 trillion since 2016.

query = "What Is estimate value of LIBOR as of 2020?"
response = query_engine.query(query)
print(response)
# not working, The estimated value of outstanding assets referencing LIBOR as of 2020 is $24 trillion.

query = "What is LIBOR as of 2020 referenced as estimate?"
response = query_engine.query(query)
print(response)

query = "What Is Currently Outstanding Loans and Bonds?"
response = query_engine.query(query)
print(response)
# Currently, there are $6 trillion in outstanding loans and $1 trillion in outstanding bonds.

query = "What Is Value of Loans and Bonds Maturing After June 2023?"
response = query_engine.query(query)
print(response)
# The value of loans maturing after June 2023 is $3, and the value of bonds maturing after June 2023 is $0.3.

The answer for “LIBOR as of 2020” is not good (it always finds the 2016 value), but LlamaParse does load the PDF file and convert it to markdown format.
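
One way to investigate the miss is to look at the parsed markdown itself rather than the retrieved answer (a minimal sketch reusing the documents list from above; the search term "2020" is just an example):

# dump the parsed markdown chunks that mention 2020 to see how the figure was captured
for i, doc in enumerate(documents):
    if "2020" in doc.text:
        print(f"--- document chunk {i} ---")
        print(doc.text)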

Appendix

Another test PDF
