Document Table Extraction

To Pandas Dataframe with Azure Document Intelligence, Amazon Textract

3 min read, May 19, 2024

Many documents contain structured form or table information locked in an unstructured format. Converting those tables to pandas DataFrames means we can query them like any other structured data (a database, a CSV file, etc.).

Azure Document Intelligence (formerly Form Recognizer) and Amazon Textract can both extract text and structure from a variety of document formats. Azure Document Intelligence also supports specific document types (e.g. health insurance cards, US tax forms, US mortgage forms), as does Amazon Textract.

Both services return their extraction results as JSON. We need to pull the useful information out of that JSON to construct a pandas DataFrame.
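To make the JSON step concrete, here is a minimal sketch that turns one table from a response into a plain grid of strings. It assumes the Azure REST-style field names (rowCount, columnCount, and cells carrying rowIndex, columnIndex and content); the payload itself is made up for illustration, and `table_json_to_grid` is a hypothetical helper name. Textract's JSON is organised differently (a flat list of blocks linked by relationships), so this does not apply to it as-is.

```python
import json

# Hypothetical, trimmed-down payload in the shape of one Azure table result.
SAMPLE_JSON = """
{
  "tables": [
    {
      "rowCount": 2,
      "columnCount": 2,
      "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
        {"rowIndex": 1, "columnIndex": 0, "content": "Pen"},
        {"rowIndex": 1, "columnIndex": 1, "content": "1.20"}
      ]
    }
  ]
}
"""


def table_json_to_grid(table):
    """Place each cell's text at its (row, column) position in a 2-D list."""
    grid = [["" for _ in range(table["columnCount"])] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("content", "")
    return grid


payload = json.loads(SAMPLE_JSON)
grids = [table_json_to_grid(t) for t in payload["tables"]]
print(grids[0])  # [['Item', 'Price'], ['Pen', '1.20']]
```

A grid like this can be handed straight to pandas, for example with pd.DataFrame(grid[1:], columns=grid[0]) when the first row holds the headers.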

Azure Document Intelligence

Use the layout model to extract the tables in the document.

The quickstart linked below has an example.

Quickstart: Document Intelligence (formerly Form Recognizer) client libraries - Azure AI services

Use a Document Intelligence SDK or the REST API to create a forms processing app that extracts key data and structure…

learn.microsoft.com

# install with: pip install azure-ai-documentintelligence==1.0.0b2

result.tables contains all the extracted tables. The result carries other schema elements as well.
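For reference, the call that produces such a result can be sketched as follows. This is a rough outline based on the quickstart pattern for the 1.0.0b2 package pinned above, not a definitive implementation: `analyze_layout_tables` is a hypothetical helper name, the endpoint, key and document URL are placeholders you supply, and later SDK releases may name some arguments differently.

```python
def analyze_layout_tables(endpoint: str, key: str, document_url: str):
    """Run the prebuilt layout model on a document URL and return result.tables."""
    # Imported inside the function so this sketch can be loaded
    # even where the Azure SDK is not installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    # "prebuilt-layout" is the model ID of the layout model.
    poller = client.begin_analyze_document(
        "prebuilt-layout",
        AnalyzeDocumentRequest(url_source=document_url),
    )
    result = poller.result()  # blocks until the service has finished
    return result.tables or []


# Example call (placeholders, requires a provisioned Azure resource):
# tables = analyze_layout_tables("https://<resource>.cognitiveservices.azure.com/",
#                                "<key>", "https://example.com/sample.pdf")
```

Each returned table exposes row_count, column_count and cells, which is what the conversion step works from.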

Convert to pandas dataframe

The following simple snippet does the conversion:

import pandas as pd


def convert_azdocintl_todf(azdocintl_tables):
    """Convert Azure Document Intelligence tables into a list of DataFrames."""
    dataframes = []
    for table in azdocintl_tables:
        # Start from an empty row_count x column_count grid of strings
        matrix = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        for cell in table.cells:
            row_index = cell.row_index
            column_index = cell.column_index

            # Skip cells whose indices fall outside the declared table size
            if row_index < table.row_count and column_index < table.column_count:
                matrix[row_index][column_index] = cell.content
        dataframes.append(pd.DataFrame(matrix))

    return dataframes
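The snippet above ignores header cells and merged cells. A slightly more defensive variant is sketched below. It assumes each cell also exposes kind, row_span and column_span (as the Azure table cells do), copies a merged cell's text into every position it covers, and uses the first row as column names when that row is the only one holding columnHeader cells. `table_to_dataframe` is a hypothetical helper, and the demo table is built from stand-in objects rather than a real service response.

```python
from types import SimpleNamespace

import pandas as pd


def table_to_dataframe(table):
    """Build a DataFrame, using row 0 as column names when it is the header row."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    header_rows = set()
    for cell in table.cells:
        # Spans are treated as 1 when the service omits them.
        row_span = getattr(cell, "row_span", None) or 1
        col_span = getattr(cell, "column_span", None) or 1
        row_end = min(cell.row_index + row_span, table.row_count)
        col_end = min(cell.column_index + col_span, table.column_count)
        for r in range(cell.row_index, row_end):
            for c in range(cell.column_index, col_end):
                grid[r][c] = cell.content
        if getattr(cell, "kind", None) == "columnHeader":
            header_rows.add(cell.row_index)
    if header_rows == {0}:
        return pd.DataFrame(grid[1:], columns=grid[0])
    return pd.DataFrame(grid)


def _cell(row, col, text, kind=None):
    # Stand-in for a table cell returned by the service (illustration only).
    return SimpleNamespace(row_index=row, column_index=col, content=text,
                           kind=kind, row_span=1, column_span=1)


demo_table = SimpleNamespace(row_count=3, column_count=2, cells=[
    _cell(0, 0, "Item", "columnHeader"), _cell(0, 1, "Price", "columnHeader"),
    _cell(1, 0, "Pen"), _cell(1, 1, "1.20"),
    _cell(2, 0, "Book"), _cell(2, 1, "12.50"),
])

df = table_to_dataframe(demo_table)
# Cell contents are strings, so cast before a numeric query.
expensive = df[df["Price"].astype(float) > 5]["Item"].tolist()
print(expensive)  # ['Book']
```

With named columns in place, the extracted table can be filtered and aggregated like any other DataFrame.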

Amazon Textract

Textract offers an even easier way to convert a table to a pandas DataFrame, through the amazon-textract-textractor package.

Table data extraction to Excel - amazon-textract-textractor 1.0.0 documentation

There are various sets of dependencies available to tailor your installation to your use case. The base package will…

aws-samples.github.io

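from PIL import Image
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Assumption: "document.png" is a placeholder path to the page image
image = Image.open("document.png").convert("RGB")

extractor = Textractor(profile_name="default")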

document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)

df = document.tables[0].to_pandas(use_columns=True)

The official package from Amazon provides this method. It is more robust than the simple Azure Document Intelligence implementation above.
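Note that document.tables[0] picks only the first table. When a page holds several, a small helper along these lines can convert them all. `all_tables_to_pandas` is a hypothetical name, not part of Textractor; it relies only on document.tables and the to_pandas(use_columns=...) method used in the snippet above.

```python
def all_tables_to_pandas(document, use_columns=True):
    """Return one DataFrame per table found in the document, in document order."""
    return [table.to_pandas(use_columns=use_columns) for table in document.tables]


# Example call, given the `document` returned by extractor.analyze_document(...):
# frames = all_tables_to_pandas(document)
# for index, frame in enumerate(frames):
#     print(index, frame.shape)
```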

Appendix

Comparison of AI-based Text Extraction Services

Text extraction is extracting text from documents such as PDFs or images. This process often uses artificial…

www.cloudthat.com

Form Recognizer offers advanced features such as the ability to train custom models to improve accuracy.

Azure, AWS and GCP Table Extraction

Dark Data

cogniflare.medium.com

Comparison of AI OCR Tools: Microsoft Azure AI Document Intelligence, Google Cloud Document AI, AWS…

The article compares six AI OCR tools: AWS Textract, Microsoft Azure Document Intelligence, Google Cloud Document AI…

persumi.com

Azure, AWS and GCP managed offerings for document extraction.

Add-on capabilities - Document Intelligence - Azure AI services

How to increase service limit capacity with add-on capabilities.

learn.microsoft.com

Additional features include extracting formulas into a formulas collection, barcodes into a barcodes collection, the language detected for each piece of text, and query fields (key/value pair extraction).

Document layout analysis model by Form Recognizer adds new structure insights

The new Form Recognizer 3.0's document layout analysis model extracts new structural insights like paragraphs, titles…

techcommunity.microsoft.com

The layout model can recognize document structures.

Unlocking Advanced Document Insights with Azure AI Document Intelligence

Discover Azure AI Document Intelligence's latest features for enhanced document analysis: hierarchical structure…

techcommunity.microsoft.com

Azure Document Intelligence code samples - Code Samples

Code samples for the Azure.AI.DocumentIntelligence client library.

learn.microsoft.com

ExtractThinker: AI Document Intelligence with LLMs

Unveil the future of document intelligence with ExtractThinker, combining ORMs and LLMs

pub.towardsai.net

Deep Diving into Strategies of Document AI: 30 day LLM Transformation Guide.

In the Part 1 of this series, we introduced the key problems with extracting data from PDFs and why it is a valuable…

medium.com

Part 3: Building the right benchmarks for Multimodal LLMs

This is part of the 30 day LLM Transformation guide and 3rd part of the miniseries within that — to analyze PDFs. See…

medium.com

Written by Xin Cheng
