I want to perform OCR on images like this one:
It is a table of numerical data with commas as decimal separators.
It is not noisy, contrast is good, black text on white background.
As an additional preprocessing step, to get around issues with the frame borders, I cut out every cell, binarize it, pad it with a white border (to prevent edge issues), and pass only that single-cell image to tesseract.
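A minimal sketch of this kind of per-cell preprocessing, assuming the page is already loaded as a 2-D grayscale NumPy array and the cell coordinates are known. The function name `prepare_cell`, the fixed threshold of 128, and the padding width are illustrative assumptions, not the exact values from my pipeline.

```python
import numpy as np


def prepare_cell(gray, box, threshold=128, pad=10):
    """Cut one table cell out of a grayscale page, binarize it, and pad it.

    gray:      2-D uint8 array (0 = black, 255 = white)
    box:       (top, bottom, left, right) pixel bounds of the cell,
               chosen to lie inside the frame lines (hypothetical input)
    threshold: pixels darker than this become black, the rest white
    pad:       width in pixels of the white border added on every side
    """
    top, bottom, left, right = box
    cell = gray[top:bottom, left:right]

    # Global fixed-threshold binarization: dark text -> 0, background -> 255.
    binary = np.where(cell < threshold, 0, 255).astype(np.uint8)

    # White border so no glyph touches the image edge.
    return np.pad(binary, pad, mode="constant", constant_values=255)
```

The returned array can then be handed to pytesseract.image_to_string, which as far as I know accepts NumPy arrays as well as PIL images.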
I also looked at the individual cell images to make sure the cutting process works as expected and does not produce artifacts. These are two examples of the input images for tesseract:
Unfortunately, tesseract is unable to parse these consistently. I have found no configuration where all 36 values are recognized correctly.
There are a couple of similar questions here on SO, and the usual answer is a suggestion for a specific combination of the --oem and --psm parameters. So I wrote a Python script with pytesseract that loops over all combinations of --oem from 0 to 3 and --psm from 0 to 13, as well as lang=eng and lang=deu. I ignored the combinations that throw errors.
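The sweep looks roughly like the sketch below. The helper names `build_configs` and `sweep` are made up for illustration, and the OCR function is passed in as a parameter so the loop itself does not depend on tesseract being installed; in practice it would be pytesseract.image_to_string.

```python
import itertools


def build_configs(oems=range(4), psms=range(14), langs=("eng", "deu")):
    """Yield (lang, config_string) for every --oem/--psm/lang combination."""
    for lang, oem, psm in itertools.product(langs, oems, psms):
        yield lang, f"--oem {oem} --psm {psm}"


def sweep(image, ocr, configs=None):
    """Run ocr(image, lang=..., config=...) for every combination.

    Combinations that raise an error are skipped. Returns a dict mapping
    (lang, config_string) to the stripped OCR text.
    """
    if configs is None:
        configs = build_configs()

    results = {}
    for lang, config in configs:
        try:
            text = ocr(image, lang=lang, config=config)
        except Exception:
            # Some modes fail, e.g. when the traineddata for an engine
            # mode is missing; those combinations are simply ignored.
            continue
        results[(lang, config)] = text.strip()
    return results


# Intended usage (requires pytesseract and a tesseract install):
#   import pytesseract
#   results = sweep(cell_image, pytesseract.image_to_string)
```

Comparing each entry of the result against the known cell value shows which combinations get a given cell right.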
Example 1: With --psm 13 --oem 3, the above "1,7" image is misidentified as "4,7", but the "57" image is correctly recognized as "57".
Example 2: With --psm 6 --oem 3, the above "1,7" image is correctly recognized as "1,7", but the "57" image is misidentified as "o/".
Any suggestions on what else might help improve tesseract's output quality here?
My tesseract version:
tesseract v4.0.0.20190314
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.2.0
Found AVX2
Found AVX
Found SSE
question from:https://stackoverflow.com/questions/65845004/tesseract-fails-at-simple-number-detection