Apache Tika and Spark

Xin Cheng
Nov 11, 2020


I finally had time to try out Apache Tika, a toolkit that extracts metadata and text from many file types. Here are some findings:

  1. Non-text PDF: nothing comes back
  2. Scanned PDF: only the text in the red box comes back
  3. JPG: good
  4. PNG: good
  5. Non-text PDF converted to JPG and then passed to Tika: the correct text comes back

Here are the steps:

Run Tika docker container

docker pull logicalspark/docker-tikaserver
# run the downloaded image on any port
docker run -d -p 9998:9998 logicalspark/docker-tikaserver
docker ps
docker exec -it <container> bash
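
Once the container is running, the server can also be queried over plain HTTP from the host. The following is a minimal sketch, assuming the image exposes Tika Server's standard `PUT /tika` endpoint on the mapped port 9998. The helper names `build_tika_request` and `extract_text` are hypothetical and not part of any library.

```python
import urllib.request


def build_tika_request(data, base_url="http://localhost:9998"):
    """Build a PUT request that asks Tika Server for plain-text output."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, base_url="http://localhost:9998"):
    """Send the file at `path` to the server and return the extracted text."""
    with open(path, "rb") as f:
        req = build_tika_request(f.read(), base_url)
    # Assumption: the server is reachable at base_url
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `extract_text` with a local file path should return the same kind of text that python-tika returns in `parsed["content"]`, provided the server is reachable.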

Install Java and python-tika (inside the container)

apt-get update
apt-get install python3-pip
pip3 install tika
apt-get install default-jdk

Parse file

import tika
from tika import parser

# Replace <file path> with the path to the file to parse
parsed = parser.from_file("<file path>")
print(parsed["metadata"])
print(parsed["content"])
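
For files where nothing comes back (finding 1), `parsed["content"]` is `None` rather than an empty string, so it helps to check before using it. Here is a small sketch of such a check. The helper name `summarize` is hypothetical, and it assumes the metadata carries Tika's usual `Content-Type` key.

```python
def summarize(parsed):
    """Summarize a python-tika result dict without failing on missing text."""
    metadata = parsed.get("metadata") or {}
    # "content" is None when Tika extracted no text at all
    text = (parsed.get("content") or "").strip()
    return {
        "content_type": metadata.get("Content-Type"),
        "has_text": bool(text),
        "chars": len(text),
    }
```

For example, `summarize(parser.from_file("<file path>"))` gives a quick indication of whether a file needs the image conversion step.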

Install pdf2image

# install poppler-utils
apt-get install poppler-utils
pip3 install pdf2image

Convert PDF to image

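from pdf2image import convert_from_path
import tempfile

# Replace <file path> with the path to the PDF
with tempfile.TemporaryDirectory() as path:
    # Render the pages into the temporary directory
    pages = convert_from_path("<file path>", output_folder=path)
    for i, page in enumerate(pages, start=1):
        # Number the output files so later pages do not overwrite earlier ones
        page.save(f"out_{i}.jpg", "JPEG")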
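
Findings 1 and 5 together suggest a fallback: parse the PDF directly and, only if no text comes back, convert the pages to JPG and parse those. Below is a sketch of that idea, not a definitive implementation. The function name `extract_with_image_fallback` and its two injectable parameters are hypothetical, added so the logic can be exercised without Tika or poppler installed.

```python
import os
import tempfile


def extract_with_image_fallback(pdf_path, parse_file=None, render_pages=None):
    """Return text from pdf_path, retrying via JPEG pages if the first parse is empty."""
    if parse_file is None:
        from tika import parser  # python-tika
        parse_file = parser.from_file
    if render_pages is None:
        from pdf2image import convert_from_path  # needs poppler-utils
        render_pages = convert_from_path

    # First attempt: let Tika read the PDF directly
    text = (parse_file(pdf_path).get("content") or "").strip()
    if text:
        return text

    # Fallback: save each page as a JPEG and parse the images instead
    chunks = []
    with tempfile.TemporaryDirectory() as tmp:
        for i, page in enumerate(render_pages(pdf_path), start=1):
            jpg = os.path.join(tmp, f"page_{i}.jpg")
            page.save(jpg, "JPEG")
            chunks.append((parse_file(jpg).get("content") or "").strip())
    return "\n".join(c for c in chunks if c)
```

With the defaults, `extract_with_image_fallback("<file path>")` uses `parser.from_file` and `convert_from_path` exactly as in the steps above.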

Run Tika in Spark

Install Spark

pip3 install pyspark
export PYSPARK_PYTHON=python3
pyspark

Run Tika in Spark

# Load every file under /tmp/binary/ as one row: path, modificationTime, length, content
df = spark.read.format("binaryFile").load("/tmp/binary/")
df.printSchema()
df.show()

def extract_content(content):
    # Import inside the function so it is resolved on the executors
    from tika import parser
    parsed = parser.from_buffer(content)
    # "content" is None when Tika finds no text, so guard before strip()
    return (parsed["content"] or "").strip()

df2 = df.rdd.map(lambda x: (x["path"], x["modificationTime"], x["length"], x["content"], extract_content(x["content"]))).toDF(["path", "modificationTime", "length", "content", "extractedcontent"])
df2.select("path","extractedcontent").show()

+--------------------+--------------------+
|                path|    extractedcontent|
+--------------------+--------------------+
|file:/tmp/binary/...|Noisy image
to te...|
|file:/tmp/binary/...|Sample Scanned Im...|
+--------------------+--------------------+
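
The same extraction can also be written per partition instead of per row, so the import and parser lookup happen once per partition. This is only a sketch of that variant. The function name `extract_partition` and its `parse_buffer` parameter are hypothetical, and the rows are assumed to expose `path` and `content` as in the `binaryFile` schema above.

```python
def extract_partition(rows, parse_buffer=None):
    """Yield (path, extracted text) for each row in one partition."""
    if parse_buffer is None:
        # Imported here so it is resolved on the executor, once per partition
        from tika import parser
        parse_buffer = parser.from_buffer
    for row in rows:
        parsed = parse_buffer(row["content"])
        # "content" is None when Tika finds no text
        text = (parsed.get("content") or "").strip()
        yield (row["path"], text)
```

In the pyspark shell this could be used as `df.rdd.mapPartitions(extract_partition).toDF(["path", "extractedcontent"])`.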
