Finally I have time to try out Apache Tika, a toolkit that extracts metadata and text from a wide variety of file types. Here are some findings:
- Image-only (non-text) PDF: nothing comes back
- Scanned PDF: only the text in the red box comes back
- JPG: text comes back correctly
- PNG: text comes back correctly
- Image-only PDF converted to JPG first, then run through Tika: the correct text comes back
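The findings above suggest a simple fallback rule: if Tika returns no text for a PDF, convert its pages to images and run those back through Tika. A minimal sketch of that decision logic (the `needs_image_fallback` helper is my own, not part of tika-python):

```python
def needs_image_fallback(content):
    """Return True when Tika gave back no usable text.

    tika-python returns None as the content of image-only PDFs,
    so both None and whitespace-only strings count as empty.
    """
    return content is None or not content.strip()


# Examples matching the findings above:
print(needs_image_fallback(None))                    # -> True (image-only PDF)
print(needs_image_fallback("   \n"))                 # -> True
print(needs_image_fallback("Sample Scanned Image"))  # -> False
```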
Here are the steps:
Run the Tika Docker container
# pull the Tika server image
docker pull logicalspark/docker-tikaserver
# run the downloaded image on any port (9998 is Tika's default)
docker run -d -p 9998:9998 logicalspark/docker-tikaserver
# confirm the container is running, then open a shell in it
docker ps
docker exec -it <container> bash
Install Java and the Python Tika client
apt-get update
apt-get install python3-pip
pip3 install tika
# tika-python needs a Java runtime to launch a local Tika server
apt-get install default-jdk
Parse a file
import tika
from tika import parser
parsed = parser.from_file(<file path>)
print(parsed["metadata"])
print(parsed["content"])
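The metadata dict is handy for checking what Tika detected before trusting the text; Content-Type is the most useful key, and `content` can be None for image-only documents. A small inspection sketch (the `describe` helper and the dict literal standing in for a real `parser.from_file` result are my own):

```python
def describe(parsed):
    """Summarize a tika-python parse result: detected type and text length."""
    meta = parsed.get("metadata") or {}
    content = parsed.get("content") or ""  # content is None when Tika finds no text
    return meta.get("Content-Type", "unknown"), len(content.strip())


# Stand-in for a real parser.from_file(...) result:
sample = {"metadata": {"Content-Type": "application/pdf"}, "content": "Hello\n"}
print(describe(sample))  # -> ('application/pdf', 5)
```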
Install pdf2image
# install poppler-utils
apt-get install poppler-utils
pip3 install pdf2image
Convert PDF to images
from pdf2image import convert_from_path
import tempfile

with tempfile.TemporaryDirectory() as path:
    # use the temp dir for pdf2image's intermediate files
    pages = convert_from_path(<file path>, output_folder=path)
    for i, page in enumerate(pages, start=1):
        # one JPEG per page, so later pages don't overwrite earlier ones
        page.save(f'out_{i}.jpg', 'JPEG')
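When a PDF has many pages, it helps to give each page a predictable filename; a small helper like this (my own convention, not part of pdf2image) keeps the naming in one place:

```python
import os

def page_path(out_dir, page_number, stem="page"):
    """Build a per-page JPEG path such as out_dir/page_3.jpg."""
    return os.path.join(out_dir, f"{stem}_{page_number}.jpg")


# pages = convert_from_path(pdf_path)            # as above
# for i, page in enumerate(pages, start=1):
#     page.save(page_path("/tmp/pages", i), "JPEG")
print(page_path("/tmp/pages", 1))  # -> /tmp/pages/page_1.jpg
```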
Run Tika in Spark
Install Spark
pip3 install pyspark
export PYSPARK_PYTHON=python3
pyspark
Parse binary files with Tika
df = spark.read.format("binaryFile").load("/tmp/binary/")
df.printSchema()
df.show()

def extract_content(content):
    import tika
    from tika import parser
    parsed = parser.from_buffer(content)
    return parsed["content"].strip()

df2 = df.rdd.map(lambda x: (x["path"], x["modificationTime"], x["length"], x["content"], extract_content(x["content"]))).toDF(["path", "modificationTime", "length", "content", "extractedcontent"])
df2.select("path", "extractedcontent").show()
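One caveat with the mapped `extract_content` above: `parsed["content"]` is None whenever Tika finds no text (as with the image-only PDFs noted earlier), and calling `.strip()` on None fails the whole Spark job. A null-safe variant, sketched under the assumption that a Tika server is reachable from the executors (`clean_text` and `extract_content_safe` are my own names):

```python
def clean_text(raw):
    """Normalize Tika output: None and whitespace-only both become ''."""
    return (raw or "").strip()


def extract_content_safe(content):
    # same idea as extract_content above, but returns '' instead of
    # raising when Tika extracts no text from the document;
    # assumes a Tika server is reachable from the Spark executors
    from tika import parser
    parsed = parser.from_buffer(bytes(content))
    return clean_text(parsed.get("content"))
```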
+------------------+--------------------+
|              path|    extractedcontent|
+------------------+--------------------+
|file:/tmp/binary/…|Noisy image
to te…|
|file:/tmp/binary/…| Sample Scanned Im…|
+------------------+--------------------+