Document Table Extraction

To Pandas DataFrame with Azure Document Intelligence and Amazon Textract

Xin Cheng
3 min read · May 19, 2024

Many documents hold structured form or table information in an unstructured format. Converting those tables to pandas DataFrames means we can query them like structured data (database, CSV, etc.).

Azure Document Intelligence (formerly Form Recognizer) and Amazon Textract can both extract text and structure from various document formats. Azure Document Intelligence also supports specific document types (e.g. health insurance cards, US tax forms, US mortgage forms), as does Amazon Textract.

Both services return their extraction results as JSON. We need to pull the useful information out of that JSON to construct a pandas DataFrame.
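To make "pulling the useful information out" concrete, here is a rough sketch that rebuilds a table grid from a Textract-style response. The field names (BlockType, RowIndex, ColumnIndex, Relationships) follow the shape of Textract's Blocks list, but the sample data is hand-written and the function name textract_blocks_to_grids is made up for this illustration.

```python
def textract_blocks_to_grids(blocks):
    """Build one row/column grid per TABLE block from a Textract-style Blocks list."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Collect the ids listed under CHILD relationships, if there are any.
        return [
            cid
            for rel in (block.get("Relationships") or [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    grids = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = [by_id[cid] for cid in child_ids(table) if by_id[cid]["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [
                by_id[cid]["Text"]
                for cid in child_ids(cell)
                if by_id[cid]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the response.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        grids.append(grid)
    return grids


# Hand-written fragment shaped like a response; not real service output.
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},  # empty cell
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Unit"},
    {"Id": "w3", "BlockType": "WORD", "Text": "price"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Apple"},
]
grids = textract_blocks_to_grids(sample_blocks)
print(grids)
```

Each grid is a plain list of rows, which pandas can take directly. Merged cells and selection elements are ignored in this sketch.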

Azure Document Intelligence

Use the layout model to extract the tables in a document.

Below is an example.

# use pip install azure-ai-documentintelligence==1.0.0b2
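The embedded example did not survive in this copy, so what follows is only a minimal sketch of what a layout call might look like with the pinned beta SDK. The helper names make_client and extract_layout_tables are hypothetical, and the endpoint and key are placeholders. The analyze_request= keyword is, to my knowledge, what the 1.0.0b2 samples use (later releases renamed it to body=), so verify it against the version you actually install.

```python
def make_client(endpoint, key):
    """Build a Document Intelligence client (hypothetical helper)."""
    # Imported lazily so the rest of the sketch can be exercised without the SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def extract_layout_tables(client, path):
    """Run the prebuilt layout model on a local file and return result.tables."""
    with open(path, "rb") as f:
        # Assumption: keyword name as in the 1.0.0b2 samples; newer SDKs use body= instead.
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            content_type="application/octet-stream",
        )
        result = poller.result()  # blocks until the service has finished
    return result.tables or []


# Usage (placeholders, not real values):
# client = make_client("https://<your-resource>.cognitiveservices.azure.com/", "<your-key>")
# tables = extract_layout_tables(client, "sample.pdf")
```

Passing the client in as an argument keeps the extraction step easy to try with a stand-in object before pointing it at a real resource.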

result.tables contains all extracted tables. The rest of the result schema (pages, paragraphs, and so on) is described in the service documentation.

Convert to pandas DataFrame

The following is a simple snippet:

def convert_azdocintl_todf(azdocintl_tables):
    """Return one row/column matrix per table; pd.DataFrame(matrix) turns each into a DataFrame."""
    tables_collected = []
    for table in azdocintl_tables:
        # Initialize an empty matrix sized to the table
        matrix = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        for cell in table.cells:
            row_index = cell.row_index
            column_index = cell.column_index

            # Skip any cell reported outside the declared table size
            if row_index < table.row_count and column_index < table.column_count:
                matrix[row_index][column_index] = cell.content
        tables_collected.append(matrix)

    return tables_collected
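The function above hands back plain row/column matrices rather than DataFrames. One possible last step is sketched below, assuming the first row of each table holds the column headers (which will not be true for every document). matrices_to_dataframes and its first_row_is_header flag are hypothetical names, not part of any SDK.

```python
import pandas as pd


def matrices_to_dataframes(matrices, first_row_is_header=True):
    """Turn each list-of-rows matrix into a pandas DataFrame."""
    frames = []
    for matrix in matrices:
        if first_row_is_header and len(matrix) > 1:
            # Promote the first row to column labels; the remaining rows become data.
            frames.append(pd.DataFrame(matrix[1:], columns=matrix[0]))
        else:
            # Keep pandas' default integer column labels.
            frames.append(pd.DataFrame(matrix))
    return frames


# Hand-written matrix standing in for one extracted table.
sample = [[["Item", "Qty"], ["Apple", "3"], ["Pear", "5"]]]
df = matrices_to_dataframes(sample)[0]
print(df[df["Item"] == "Pear"])
```

Cell contents arrive as strings, so numeric columns usually need a conversion such as pd.to_numeric before any arithmetic.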

Amazon Textract

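Amazon Textract offers an even easier way to convert a table to a pandas DataFrame.

# pip install amazon-textract-textractor
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Placeholder: path to the document image to analyze
image = "path/to/document.png"
extractor = Textractor(profile_name="default")

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
df = document.tables[0].to_pandas(use_columns=True)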

The method comes from Amazon's official package, and it is more robust than the simple Azure Document Intelligence implementation above.
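One concrete gap in the simple Azure snippet is merged cells: it writes each cell's text only at the cell's top-left position. The sketch below is my own variation, not something taken from either SDK, and copies the text across the whole span instead. It assumes each cell exposes optional row_span and column_span attributes, as the Document Intelligence table cell model does, and treats a missing value as 1. The function name and the stand-in objects are made up for illustration.

```python
from types import SimpleNamespace


def table_to_matrix_with_spans(table):
    """Fill a row/column matrix, repeating merged-cell text across its span."""
    matrix = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # A missing or None span means the cell covers a single row or column.
        row_span = getattr(cell, "row_span", None) or 1
        col_span = getattr(cell, "column_span", None) or 1
        last_row = min(cell.row_index + row_span, table.row_count)
        last_col = min(cell.column_index + col_span, table.column_count)
        for r in range(cell.row_index, last_row):
            for c in range(cell.column_index, last_col):
                matrix[r][c] = cell.content
    return matrix


# Stand-in objects mimicking only the attributes used above (not real SDK objects).
Cell = SimpleNamespace
fake_table = SimpleNamespace(
    row_count=2,
    column_count=3,
    cells=[
        Cell(row_index=0, column_index=0, content="Name"),
        Cell(row_index=0, column_index=1, column_span=2, content="Address"),
        Cell(row_index=1, column_index=0, content="Ann"),
        Cell(row_index=1, column_index=1, content="Oslo"),
        Cell(row_index=1, column_index=2, content="0150"),
    ],
)
print(table_to_matrix_with_spans(fake_table))
```

Repeating the text keeps every column populated, which is convenient for filtering. Whether a merged header should instead be left blank in the later columns depends on how the DataFrame will be used.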

Appendix

Form Recognizer offers advanced features such as the ability to train custom models to improve accuracy.

Azure, AWS and GCP all have managed offerings for document extraction.

Additional Azure Document Intelligence features include extracting formulas into a formulas collection and barcodes into a barcodes collection, detecting the language of each piece of text, and query fields (extracting key/value pairs).

The layout model can recognize document structures.
