ABSTRACT

  • In this tutorial we will take a closer look at the pytesseract module and discover some of its powerful features. By the end, you will understand basic optical character recognition (OCR) in its simplest form.

  • We will also use the PIL library (Pillow) for some image manipulation in Python, including opening an image, displaying it, and converting between image types.

TUTORIAL

Let’s start with importing the libraries we’re going to need.

import PIL
from PIL import Image
import pytesseract
		

Here is some info about PIL

NAME
    PIL - Pillow (Fork of the Python Imaging Library)

DESCRIPTION
    Pillow is the friendly PIL fork by Alex Clark and Contributors.
        https://github.com/python-pillow/Pillow/

Here is some info about pytesseract

pytesseract is a very popular Python wrapper for the Tesseract OCR engine. Depending on your setup, you might need an extra line for pytesseract to work properly: if the Tesseract executable is not on your PATH, find your Tesseract installation directory and point pytesseract to it with the code below. Note that the directory depends on your local setup, and you may or may not have to exclude the last bit (the executable file name), as in:

r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract" or r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract\tesseract.exe"

Here is the code:

pytesseract.pytesseract.tesseract_cmd = r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract\tesseract.exe"
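If you are not sure where Tesseract lives on your machine, a small check like the one below can help. This is only a sketch: find_tesseract is a helper name made up for this example, and the candidate path is just the example path from this tutorial, so replace it with your own.

```python
import shutil
from pathlib import Path


def find_tesseract(candidates=()):
    """Return the path of a Tesseract executable, or None if none is found."""
    # First choice: whatever "tesseract" resolves to on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Fallback: the first candidate path that points to an existing file.
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None


# Example path taken from this tutorial; adjust it to your local setup.
tesseract_path = find_tesseract(
    [r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract\tesseract.exe"]
)

if tesseract_path is None:
    print("Tesseract executable not found; install it or add your own path.")
else:
    try:
        import pytesseract

        # Tell pytesseract which executable to call.
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
        print("pytesseract will use:", tesseract_path)
    except ImportError:
        print("pytesseract is not installed")
```

If Tesseract is already on your PATH, the extra tesseract_cmd line is usually not needed at all.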
		
print(dir(pytesseract.pytesseract))
		
If we look at the contents of pytesseract.pytesseract, we can see a lot of different objects to explore. In this tutorial we will focus on image_to_string.

BytesIO
Image
LooseVersion
OSD_KEYS
Output
PandasNotSupported
QUOTE_NONE
RGB_MODE
TSVNotSupported
TesseractError
TesseractNotFoundError
__builtins__
__cached__
__doc__
__file__
__loader__
__name__
__package__
__spec__
cleanup
file_to_dict
find_loader
get_errors
get_pandas_output
get_tesseract_version
iglob
image_to_boxes
image_to_data
image_to_osd
image_to_pdf_or_hocr
image_to_string
is_valid
main
ndarray
normcase
normpath
numpy_installed
os
osd_to_dict
pandas_installed
pd
prepare
realpath
run_and_get_output
run_once
run_tesseract
save_image
shlex
string
subprocess
subprocess_args
sys
tempfile
tesseract_cmd
wraps
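Many names in this listing are imported helpers (os, sys, shlex and so on) rather than OCR functions. One possible way to narrow the list down is to keep only the callables whose names start with image_to_. The helper name ocr_entry_points is invented for this sketch and is not part of pytesseract.

```python
def ocr_entry_points(module, prefix="image_to_"):
    """Return the sorted names of callables in `module` that start with `prefix`."""
    return sorted(
        name
        for name in dir(module)
        if name.startswith(prefix) and callable(getattr(module, name))
    )


try:
    import pytesseract

    # Prints only names such as image_to_string, image_to_data, ...
    print(ocr_entry_points(pytesseract.pytesseract))
except ImportError:
    print("pytesseract is not installed")
```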

The help for image_to_string is quite simple and straightforward.

help(pytesseract.pytesseract.image_to_string)
		

Help on function image_to_string in module pytesseract.pytesseract:

image_to_string(image, lang=None, config='', nice=0, output_type='string')
Returns the result of a Tesseract OCR run on the provided image to string

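# Path to the image we want to read (change this to your own file)
f = r'c:/Users/t/Desktop/default.png'
# Open the image with PIL and display it in the default image viewer
img = Image.open(f)
img.show()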
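Before running OCR, it can be useful to check what PIL actually loaded. The sketch below builds a small in-memory PNG as a stand-in for default.png (so it runs without any file on disk) and reads its format, mode and size. The describe helper is a made-up name for this example.

```python
from io import BytesIO

from PIL import Image


def describe(image):
    """Collect the basic properties PIL exposes for an opened image."""
    return {"format": image.format, "mode": image.mode, "size": image.size}


# Stand-in for a real file: a white 120x40 RGBA image saved as PNG in memory.
buffer = BytesIO()
Image.new("RGBA", (120, 40), (255, 255, 255, 255)).save(buffer, format="PNG")
buffer.seek(0)

img = Image.open(buffer)
info = describe(img)
print(info)

# Image type conversion: drop the alpha channel by converting to RGB.
rgb = img.convert("RGB")
print(rgb.mode)
```

With a real file you would simply pass the path to Image.open, as in the tutorial code.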
		

ACTUAL OCR PART

We’ve opened an image with text. Let’s start doing some OCR!

text = pytesseract.image_to_string(img)
print(text)
		

Output:

Holy Python

PYTHON HOLLINESS
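The call above relies on the defaults, but the lang and config parameters from the help text can be passed as well. Below is a rough sketch of how that might look. The helpers build_config and ocr_text are names invented for this example; --psm and --oem are Tesseract command-line options (page segmentation mode and OCR engine mode), and which values work best depends on your image.

```python
def build_config(psm=None, oem=None):
    """Build a Tesseract config string such as '--oem 3 --psm 6'."""
    parts = []
    if oem is not None:
        parts.append(f"--oem {oem}")
    if psm is not None:
        parts.append(f"--psm {psm}")
    return " ".join(parts)


def ocr_text(image, lang="eng", psm=6):
    """Run OCR on a PIL image; return None if the Tesseract binary is missing."""
    import pytesseract  # imported here so build_config works without it

    try:
        text = pytesseract.image_to_string(
            image, lang=lang, config=build_config(psm=psm)
        )
    except pytesseract.pytesseract.TesseractNotFoundError:
        # Raised when pytesseract cannot find the Tesseract executable.
        return None
    return text.strip()


print(build_config(psm=6, oem=3))
```

With the image opened earlier, a call could look like ocr_text(img, lang="eng", psm=6). Here --psm 6 asks Tesseract to treat the image as a single uniform block of text.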

CONCLUSION

Yes, OCR is that simple, thanks to Python and pytesseract.

OCR’s scope is much deeper than this quick tutorial, but this tutorial can get you started!

  • One simple technique to try when OCR is not very successful is converting the image to black and white with the PIL library. This often improves pytesseract’s reading accuracy.
  • Image modes such as "RGB", "RGBA", "RGBa", "1" and "L" dictate which methods you can and cannot use. Sometimes you will have to convert between modes using .convert(mode).
  • Text can also blend into the rest of the image, which makes it harder to extract. Preprocessing steps such as binarization (converting the image to pure black and white) help prepare the image for pytesseract.
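The points above can be sketched with a few PIL calls. The example below converts an image to grayscale (mode "L"), applies a fixed threshold, and ends up with a 1-bit black-and-white image (mode "1"). The helper name to_black_and_white and the threshold of 128 are assumptions made for illustration; real images often need a different threshold or a more adaptive method.

```python
from PIL import Image


def to_black_and_white(image, threshold=128):
    """Binarize a PIL image: pixels brighter than `threshold` become white."""
    # Step 1: convert to 8-bit grayscale (mode "L").
    gray = image.convert("L")
    # Step 2: map every pixel to pure black (0) or pure white (255).
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    # Step 3: store the result as a 1-bit image (mode "1").
    return binary.convert("1")


# Synthetic stand-in for a scanned page: light gray with one dark pixel.
sample = Image.new("RGB", (4, 4), (200, 200, 200))
sample.putpixel((0, 0), (10, 10, 10))

bw = to_black_and_white(sample)
print(sample.mode, "->", bw.mode)
```

The resulting image can then be passed to pytesseract.image_to_string in place of the original.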

We hope this quick tutorial is eye-opening and motivates you to start exploring the incredible OCR possibilities with Python.
