Tesseract OCR tips

Xin Cheng
2 min read · Jul 23, 2020


I recently did some performance tuning for OCR applications and would like to document a few tips that we did not find online.

  1. OMP_THREAD_LIMIT
    By default, Tesseract uses 4 threads. This helps a little when you process one image at a time (a 15-20% performance improvement over 1 thread in our tests). However, if you want to process multiple images in parallel, check how many cores your system has. It is best to keep (number of images processed at one time) × (number of threads per image) close to the number of CPU cores; the OMP_THREAD_LIMIT environment variable caps the threads used by each Tesseract process. Tesseract is very resource-intensive, so you want to avoid too much resource contention.
  2. image file truncated error

All the articles we found online say to set Pillow's "ImageFile.LOAD_TRUNCATED_IMAGES = True". However, they don't explain why the error happens or whether the workaround costs OCR quality. Interestingly, the issue is somewhat related to disk space. If you do parallel processing, make sure you have enough free disk space (the error message gives no hint that this is the cause).
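Both tips matter most when running OCR in parallel, so here is a minimal sketch that combines them. It is an illustration under stated assumptions, not the setup we actually ran. The helper names (`plan_workers`, `has_free_space`, `ocr_image`, `run_batch`) are hypothetical, and the 1 GiB free-space threshold is an arbitrary placeholder. It assumes the `tesseract` command-line tool is on the PATH.

```python
import os
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor


def plan_workers(cpu_cores: int, threads_per_image: int) -> int:
    """Images to process at once so that
    workers * threads_per_image stays at or below cpu_cores."""
    if cpu_cores < 1 or threads_per_image < 1:
        raise ValueError("cpu_cores and threads_per_image must be >= 1")
    return max(1, cpu_cores // threads_per_image)


def has_free_space(path: str, min_free_bytes: int) -> bool:
    """True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes


def ocr_image(image_path: str, threads_per_image: int) -> str:
    """Run the tesseract CLI on one image with a capped thread count."""
    # OMP_THREAD_LIMIT is set only in the child process environment.
    env = dict(os.environ, OMP_THREAD_LIMIT=str(threads_per_image))
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


def run_batch(image_paths, work_dir=".", threads_per_image=1,
              min_free_bytes=1 << 30):  # 1 GiB: placeholder, tune for your data
    """OCR a list of images in parallel after a disk-space preflight check."""
    if not has_free_space(work_dir, min_free_bytes):
        raise RuntimeError(f"not enough free disk space in {work_dir!r}")
    workers = plan_workers(os.cpu_count() or 1, threads_per_image)
    # Threads are enough here: the heavy work runs in tesseract subprocesses.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: ocr_image(p, threads_per_image),
                             image_paths))
```

For example, on an 8-core machine `plan_workers(8, 4)` gives 2 concurrent images at the default 4 threads each, while `plan_workers(8, 1)` gives 8 images at 1 thread each. Which split is faster depends on your workload, so it is worth measuring both.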
