Hello World to LlamaCloud and LlamaParse

Complex Document Parser

Xin Cheng
3 min readApr 11, 2024

To chat with your documents, you need a robust parser that can load a document and structure it in a machine-readable way (e.g. sections, text, tables, charts). PDF is a popular document format, and traditionally it has been handled by libraries like pypdf and PyMuPDF. Now there is one more option.

LlamaParse converts complex PDFs with embedded tables and charts/figures to markdown format. The benefits of markdown: it is simple, it is plain text, and a lot of the data LLMs are trained on comes from the Internet in markdown format.

LlamaCloud: managed parsing, ingestion, and retrieval services

LlamaParse: compared with a traditional PDF parser, a RAG pipeline returns more correct answers with LlamaParse

Currently we primarily support PDFs with tables, but we are also building out better support for figures, and an expanded set of the most popular document types: .docx, .pptx, .html.

If this roadmap is realized, it will save developers from worrying about different document formats; they only need to care about the markdown output.
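
As a quick illustration, here is a minimal sketch of parsing a single PDF to markdown with LlamaParse alone (it assumes LLAMA_CLOUD_API_KEY is set in the environment, and sample.pdf is just a placeholder file name):

from llama_parse import LlamaParse

# parse one PDF into markdown Document objects
parser = LlamaParse(result_type="markdown")
docs = parser.load_data("./sample.pdf")
print(docs[0].text[:500])  # preview the first 500 characters of markdown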

Getting Started Guide

You can follow the link above to register for LlamaCloud and use LlamaParse.

Setup

pip3 install llama-index llama-parse python-dotenv llama-index-readers-file

If you encounter weird import issues, uninstall and force reinstall:

pip3 uninstall llama-index
pip3 install llama-index llama-parse python-dotenv llama-index-readers-file --upgrade --no-cache-dir --force-reinstall
pip3 install llama-index-embeddings-azure-openai llama-index-llms-azure-openai
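
To confirm the installation before running the full example, a quick sanity check (a minimal sketch; run it in the same Python environment you installed into):

# verify the freshly installed packages can be imported
import llama_index.core
import llama_parse
import dotenv
print("imports OK")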

Test PDF

Test

Configure the .env file accordingly.
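
The code below reads the following environment variables; here is a sketch of the expected .env file (all values are placeholders, substitute your own keys, endpoint, deployments, and API version):

# placeholders only; replace with your own values
LLAMA_CLOUD_API_KEY=<your-llamacloud-api-key>
AZURE_OPENAI_ENDPOINT=<your-azure-openai-endpoint>
AZURE_OPENAI_API_VERSION=<your-azure-openai-api-version>
AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_MODEL_NAME=<your-chat-model-name>
AZURE_OPENAI_DEPLOYMENT_NAME=<your-chat-deployment-name>
AZURE_OPENAI_EMBEDDINGS_MODEL_NAME=<your-embedding-model-name>
AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME=<your-embedding-deployment-name>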

# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()

# bring in deps
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings

# from llama_index import set_global_service_context

import os
import openai

azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("AZURE_OPENAI_API_VERSION")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_key = api_key

# create LLM and Embedding Model
# https://docs.llamaindex.ai/en/stable/examples/customization/llms/AzureOpenAI/
embed_model = AzureOpenAIEmbedding(
    model=os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"),
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

llm = AzureOpenAI(
    model=os.getenv("AZURE_OPENAI_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    temperature=0,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

# https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/
Settings.llm = llm
Settings.embed_model = embed_model

# set up parser
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)

# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=['libor_transition.pdf'], file_extractor=file_extractor).load_data()
print(documents)

# create an index from the parsed markdown
index = VectorStoreIndex.from_documents(documents)

# create a query engine for the index
query_engine = index.as_query_engine()

# query the engine
query = "What Is LIBOR?"
response = query_engine.query(query)
print(response)

query = "What Is estimate value of LIBOR in the document?"
response = query_engine.query(query)
print(response)
# The estimated value of outstanding assets referencing LIBOR in the document is $24 trillion since 2016.

query = "What Is estimate value of LIBOR as of 2020?"
response = query_engine.query(query)
print(response)
# not working, The estimated value of outstanding assets referencing LIBOR as of 2020 is $24 trillion.

query = "What is LIBOR as of 2020 referenced as estimate?"
response = query_engine.query(query)
print(response)

query = "What Is Currently Outstanding Loans and Bonds?"
response = query_engine.query(query)
print(response)
# Currently, there are $6 trillion in outstanding loans and $1 trillion in outstanding bonds.

query = "What Is Value of Loans and Bonds Maturing After June 2023?"
response = query_engine.query(query)
print(response)
# The value of loans maturing after June 2023 is $3, and the value of bonds maturing after June 2023 is $0.3.

The answer for “LIBOR as of 2020” is not good (it always finds the 2016 value), but LlamaParse does load the PDF file and convert it to markdown format.
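
One way to investigate the miss is to look at the parsed markdown itself rather than the retrieved answer (a minimal sketch reusing the documents list from above; the search term "2020" is just an example):

# dump the parsed markdown chunks that mention 2020 to see how the figure was captured
for i, doc in enumerate(documents):
    if "2020" in doc.text:
        print(f"--- document chunk {i} ---")
        print(doc.text)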

Appendix

Another test PDF
