Hello World to LlamaCloud and LlamaParse
To chat with your document, you need a robust parser that can load the document and structure it in a machine-readable way (e.g. sections, text, tables, charts). PDF is a popular document format, and traditionally you would parse it with a PDF library such as pypdf or PyMuPDF. Now there is one more option: LlamaParse.
LlamaParse converts complex PDFs with embedded tables and charts/figures to markdown format. Markdown has two benefits: it is simple plain text, and a lot of the data LLMs are trained on comes from the Internet and is in markdown format.
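To make the markdown point concrete, here is a minimal sketch of what a table looks like once it is plain markdown text. The `rows_to_markdown` helper and the placeholder rows are made up for illustration; this is not how LlamaParse works internally, only the kind of output format it aims for.

```python
def rows_to_markdown(header, rows):
    """Render a header and rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        # Separator row that marks the line above as the table header.
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


# Placeholder data, not taken from any real document.
table = rows_to_markdown(["Item", "Value"], [["A", 1], ["B", 2]])
print(table)
```

The result is ordinary text that still carries the row and column structure, which is why it can be chunked, embedded, and passed to an LLM without a special table format.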
LlamaCloud: managed parsing, ingestion, and retrieval services
LlamaParse: a PDF parser; compared with a traditional PDF parser, a RAG pipeline returns more correct answers with LlamaParse
Currently we primarily support PDFs with tables, but we are also building out better support for figures, and an expanded set of the most popular document types: .docx, .pptx, .html.
If the roadmap is realized, developers no longer need to worry about different document formats and only need to care about markdown.
Getting Started Guide
Follow the link above to register for LlamaCloud and use LlamaParse.
Setup
pip3 install llama-index llama-parse python-dotenv llama-index-readers-file
If you run into odd import errors, uninstall and force a reinstall.
pip3 uninstall llama-index
pip3 install llama-index llama-parse python-dotenv llama-index-readers-file --upgrade --no-cache-dir --force-reinstall
pip3 install llama-index-embeddings-azure-openai llama-index-llms-azure-openai
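After a reinstall it can help to confirm that the packages actually resolve before running the full script. The sketch below is one way to do that with the standard library. `missing_modules` is a hypothetical helper, not part of LlamaIndex, and the module names are the ones imported by the test script in this post.

```python
import importlib.util

# Module names imported by the test script in this post.
MODULES = [
    "dotenv",
    "llama_parse",
    "llama_index.core",
    "llama_index.llms.azure_openai",
    "llama_index.embeddings.azure_openai",
]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # For dotted names find_spec imports the parent package first,
            # which raises if the parent is absent or fails to import.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_modules(MODULES) or "all modules found")
```

Any name it prints points at the package that still needs to be installed or reinstalled.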
Test
Configure the .env file with the LlamaCloud and Azure OpenAI settings that the script below reads.
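A missing variable tends to show up later as a confusing authentication or deployment error, so a quick check up front can save time. The sketch below lists the variable names the script reads; `missing_env_vars` is a hypothetical helper added for illustration. In a real run, call it after `load_dotenv()` so the .env values are already in the environment.

```python
import os

# Variable names read by the test script in this post.
REQUIRED_VARS = [
    "LLAMA_CLOUD_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_MODEL_NAME",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDINGS_MODEL_NAME",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in `env` (default: os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing from the environment:", ", ".join(missing))
```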
# bring in our LLAMA_CLOUD_API_KEY
from dotenv import load_dotenv
load_dotenv()
# bring in deps
from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings
# from llama_index import set_global_service_context
import os
import openai
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("AZURE_OPENAI_API_VERSION")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_key = api_key
# create LLM and Embedding Model
# https://docs.llamaindex.ai/en/stable/examples/customization/llms/AzureOpenAI/
embed_model = AzureOpenAIEmbedding(
    model=os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"),
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
llm = AzureOpenAI(
    model=os.getenv("AZURE_OPENAI_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    temperature=0,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
# https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/
Settings.llm = llm
Settings.embed_model = embed_model
# set up parser
parser = LlamaParse(
    result_type="markdown"  # "markdown" and "text" are available
)
# use SimpleDirectoryReader to parse our file
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(input_files=['libor_transition.pdf'], file_extractor=file_extractor).load_data()
print(documents)
# create an index from the parsed markdown
index = VectorStoreIndex.from_documents(documents)
# create a query engine for the index
query_engine = index.as_query_engine()
# query the engine
query = "What Is LIBOR?"
response = query_engine.query(query)
print(response)
query = "What Is estimate value of LIBOR in the document?"
response = query_engine.query(query)
print(response)
# The estimated value of outstanding assets referencing LIBOR in the document is $24 trillion since 2016.
query = "What Is estimate value of LIBOR as of 2020?"
response = query_engine.query(query)
print(response)
# not working, The estimated value of outstanding assets referencing LIBOR as of 2020 is $24 trillion.
query = "What is LIBOR as of 2020 referenced as estimate?"
response = query_engine.query(query)
print(response)
query = "What Is Currently Outstanding Loans and Bonds?"
response = query_engine.query(query)
print(response)
# Currently, there are $6 trillion in outstanding loans and $1 trillion in outstanding bonds.
query = "What Is Value of Loans and Bonds Maturing After June 2023?"
response = query_engine.query(query)
print(response)
# The value of loans maturing after June 2023 is $3, and the value of bonds maturing after June 2023 is $0.3.
The answer for “LIBOR as of 2020” is not good: the engine always returns the 2016 value. Still, the pipeline loads the PDF file and converts it to markdown format.
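One way to investigate an answer like the 2020 one is to look at which chunks the retriever handed to the LLM. The sketch below assumes the response object exposes `source_nodes`, each with a `score` and a `node` whose `get_content()` returns the chunk text, as LlamaIndex responses do at the time of writing. `summarize_sources` is a hypothetical helper, not a LlamaIndex function.

```python
def summarize_sources(response, max_chars=200):
    """Return one line per retrieved chunk: rank, similarity score, text preview."""
    lines = []
    sources = getattr(response, "source_nodes", None) or []
    for rank, item in enumerate(sources, start=1):
        # The score can be None for some retrievers, so guard the formatting.
        score = "n/a" if item.score is None else f"{item.score:.3f}"
        # Collapse whitespace so each chunk previews on a single line.
        preview = " ".join(item.node.get_content().split())[:max_chars]
        lines.append(f"{rank}. score={score} {preview}")
    return lines


# Usage with the script above (assumes `response` came from query_engine.query):
# for line in summarize_sources(response):
#     print(line)
```

If the 2020 figure never appears among the retrieved chunks, retrieving more chunks, for example with `index.as_query_engine(similarity_top_k=5)`, may be worth trying. That is a guess, not something verified in this post.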
Appendix
Markdown benefit for LLM