Document Table Extraction
To pandas DataFrame with Azure Document Intelligence and Amazon Textract
Many documents carry structured form or table information in an unstructured format. Converting those tables to pandas DataFrames means that we can query them like structured data (a database, a CSV file, etc.).
Azure Document Intelligence (formerly Form Recognizer) and Amazon Textract can both extract text and structure from various document formats. Azure Document Intelligence also provides models for specific document types (e.g. health insurance cards, US tax forms, US mortgage forms), and Amazon Textract likewise supports specialized document types.
Both services return their extraction results as JSON. We need to pull the useful fields out of that JSON to construct a pandas DataFrame.
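To make that step concrete, here is a minimal sketch that walks a hand-written, heavily abridged JSON fragment shaped like the tables part of an Azure layout response. The field names (rowCount, columnCount, cells, rowIndex, columnIndex, content) are the ones the REST response uses to my knowledge; the sample values and the helper name table_json_to_df are invented for illustration, and the first row is simply assumed to be the header.

```python
import json

import pandas as pd

# Hand-written stand-in for the "tables" part of a response (values are invented).
raw = """
{"tables": [{"rowCount": 3, "columnCount": 2, "cells": [
  {"rowIndex": 0, "columnIndex": 0, "content": "Item"},
  {"rowIndex": 0, "columnIndex": 1, "content": "Price"},
  {"rowIndex": 1, "columnIndex": 0, "content": "Pen"},
  {"rowIndex": 1, "columnIndex": 1, "content": "1.50"},
  {"rowIndex": 2, "columnIndex": 0, "content": "Notebook"},
  {"rowIndex": 2, "columnIndex": 1, "content": "4.00"}
]}]}
"""


def table_json_to_df(table):
    """Place every cell in a grid, then treat the first row as the header."""
    grid = [["" for _ in range(table["columnCount"])] for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["content"]
    header, *rows = grid
    return pd.DataFrame(rows, columns=header)


df = table_json_to_df(json.loads(raw)["tables"][0])

# Extracted cells are strings, so convert before running numeric queries.
df["Price"] = pd.to_numeric(df["Price"])
cheap = df[df["Price"] < 2]
print(cheap)
```

Once the table is a DataFrame, the usual filtering, grouping and joining operations apply, which is the point of the conversion.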
Azure Document Intelligence
Use the layout model to extract tables from the document
The examples here assume the preview Python SDK:
# pip install azure-ai-documentintelligence==1.0.0b2
result.tables contains all extracted tables. The result object also exposes other parts of the schema, such as pages and paragraphs.
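A call that produces such a result might look like the sketch below. It is written against the 1.0.0b2 preview pinned above, where, as far as I know, the document is passed through the analyze_request keyword; later releases may use a different keyword, so treat the exact signature as an assumption. The helper names make_client and analyze_layout are hypothetical, and the SDK imports are done lazily so the file loads even where the package is not installed.

```python
def make_client(endpoint, key):
    """Create a Document Intelligence client from an endpoint and an API key."""
    # Imported here so the rest of the sketch does not require the SDK at import time.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    return DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))


def analyze_layout(client, path):
    """Send a local file to the prebuilt layout model and wait for the result."""
    with open(path, "rb") as f:
        # Keyword names follow the 1.0.0b2 preview (assumption; check your version).
        poller = client.begin_analyze_document(
            "prebuilt-layout",
            analyze_request=f,
            content_type="application/octet-stream",
        )
        return poller.result()


# Usage (endpoint, key and file name are placeholders):
# client = make_client("https://<resource>.cognitiveservices.azure.com/", "<key>")
# result = analyze_layout(client, "sample.pdf")
# print(len(result.tables))
```

Passing the client in as an argument keeps the analysis function easy to exercise with a stand-in object when no Azure resource is available.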
Convert to pandas dataframe
The following simple snippet collects the cells of each table into a row-by-column matrix, which can then be passed to pd.DataFrame.
def convert_azdocintl_todf(azdocintl_tables):
    """Collect the cells of each table into a row-by-column matrix of strings."""
    tablesCollected = []
    for table in azdocintl_tables:
        # Initialize an empty matrix sized to the table
        matrix = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
        for cell in table.cells:
            row_index = cell.row_index
            column_index = cell.column_index
            # Skip any cell reported outside the declared table size
            if row_index < table.row_count and column_index < table.column_count:
                matrix[row_index][column_index] = cell.content
        tablesCollected.append(matrix)
    return tablesCollected
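The function above returns plain nested lists rather than DataFrames, so one more step is needed. A minimal sketch of that step follows; the helper name matrices_to_dataframes and the sample data are hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
import pandas as pd


def matrices_to_dataframes(matrices, first_row_is_header=True):
    """Wrap each row-by-column matrix in a DataFrame."""
    frames = []
    for matrix in matrices:
        if first_row_is_header and len(matrix) > 1:
            # Promote the first row to column names and keep the rest as data.
            header, *rows = matrix
            frames.append(pd.DataFrame(rows, columns=header))
        else:
            # Fall back to default integer column labels.
            frames.append(pd.DataFrame(matrix))
    return frames


# Stand-in for the output of convert_azdocintl_todf(result.tables).
sample = [[["Item", "Qty"], ["Apple", "3"], ["Pear", "5"]]]
frames = matrices_to_dataframes(sample)
print(frames[0])
```

With a real result this would be called as matrices_to_dataframes(convert_azdocintl_todf(result.tables)).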
Amazon Textract
Amazon Textract offers an even easier way to convert a table to a pandas DataFrame.
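from textractor import Textractor
from textractor.data.constants import TextractFeatures

# `image` is a PIL image (or a path to the document) loaded beforehand
extractor = Textractor(profile_name="default")
document = extractor.analyze_document(
    file_source=image,
    features=[TextractFeatures.TABLES],
    save_image=True
)
# Convert the first detected table, using its column headers as DataFrame columns
df = document.tables[0].to_pandas(use_columns=True)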
The to_pandas method ships with Amazon's official amazon-textract-textractor package. It is more robust than the simple Azure Document Intelligence implementation above.
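One thing the simple Azure snippet ignores is merged cells and header rows: it writes each cell's text only into its top-left slot and treats every row as data. The sketch below is one possible way to narrow that gap, not a reproduction of what Textractor does internally. It assumes the table cells expose row_span, column_span and kind (with the value "columnHeader"), which is what the Python SDK provides to my knowledge; the helper names and the demo table built from SimpleNamespace stand-ins are hypothetical.

```python
from types import SimpleNamespace

import pandas as pd


def _join_labels(parts):
    """Join stacked header texts, skipping blanks and immediate repeats."""
    out = []
    for part in parts:
        if part and (not out or out[-1] != part):
            out.append(part)
    return " ".join(out)


def table_to_df_with_spans(table):
    """Build a DataFrame from one table, copying merged cells into every slot they cover."""
    n_rows, n_cols = table.row_count, table.column_count
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    header_rows = set()
    for cell in table.cells:
        # Spans may be absent or None for ordinary cells, so default to 1.
        row_span = getattr(cell, "row_span", None) or 1
        col_span = getattr(cell, "column_span", None) or 1
        rows = range(cell.row_index, min(cell.row_index + row_span, n_rows))
        cols = range(cell.column_index, min(cell.column_index + col_span, n_cols))
        if getattr(cell, "kind", None) == "columnHeader":
            header_rows.update(rows)
        for r in rows:
            for c in cols:
                grid[r][c] = cell.content
    body = [row for i, row in enumerate(grid) if i not in header_rows]
    if not header_rows:
        return pd.DataFrame(body)
    # Collapse stacked header rows into one label per column.
    columns = [_join_labels(grid[r][c] for r in sorted(header_rows)) for c in range(n_cols)]
    return pd.DataFrame(body, columns=columns)


# Demo table with a two-row header: "Region" spans two rows, "Sales" spans two columns.
cell = SimpleNamespace
demo = SimpleNamespace(row_count=3, column_count=3, cells=[
    cell(row_index=0, column_index=0, row_span=2, content="Region", kind="columnHeader"),
    cell(row_index=0, column_index=1, column_span=2, content="Sales", kind="columnHeader"),
    cell(row_index=1, column_index=1, content="Q1", kind="columnHeader"),
    cell(row_index=1, column_index=2, content="Q2", kind="columnHeader"),
    cell(row_index=2, column_index=0, content="North"),
    cell(row_index=2, column_index=1, content="10"),
    cell(row_index=2, column_index=2, content="12"),
])
df_demo = table_to_df_with_spans(demo)
print(df_demo)
```

Copying a merged cell's text into every slot it covers is a design choice that keeps each row self-describing; leaving the covered slots empty would be an equally defensible alternative.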
Appendix
Azure Document Intelligence (Form Recognizer) offers advanced features, such as training custom models to improve accuracy.
Azure, AWS and GCP all provide managed offerings for document extraction.
Additional features include extracting formulas into a formulas collection and barcodes into a barcodes collection, detecting the language of each piece of text, and query fields (extracting key/value pairs).
The layout model can recognize document structure.