In this tutorial we will take a closer look at the pytesseract module and some of its powerful features. By the end you should understand basic optical character recognition (OCR) in a very simple form.
We will also use the PIL library for some image manipulation in Python, including opening images, displaying them, and converting between image types.
TUTORIAL
Let’s start with importing the libraries we’re going to need.
import PIL
from PIL import Image
import pytesseract
Here is some info about PIL
NAME
PIL - Pillow (Fork of the Python Imaging Library)
DESCRIPTION
Pillow is the friendly PIL fork by Alex Clark and Contributors.
https://github.com/python-pillow/Pillow/
Here is some info about pytesseract
pytesseract is a very popular library for optical character recognition. It is a Python wrapper around the Tesseract OCR engine, which has to be installed separately. Depending on your setup, you might need an extra line for pytesseract to work properly: find your Tesseract installation directory and assign the path of the executable to pytesseract.pytesseract.tesseract_cmd. Note that the directory depends on your local setup, and you may or may not have to include the last bit of the path, such as:
r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract" or r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract\tesseract.exe"
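As a rough sketch of that extra line: pytesseract reads the location of the executable from pytesseract.pytesseract.tesseract_cmd. The helper name find_tesseract below is made up for this example, and the candidate path is just the sample location shown above, so treat both as assumptions and replace the path with your own.

```python
import os
import shutil

try:
    import pytesseract
except ImportError:  # lets the path lookup below run even without pytesseract
    pytesseract = None


def find_tesseract(candidates):
    """Return the first candidate that exists as a file, else whatever is on PATH (or None)."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return shutil.which("tesseract")


# Sample location from this tutorial; your installation directory will likely differ.
tesseract_path = find_tesseract([r"C:\Users\USA\Anaconda3\Tesseract-OCR\tesseract.exe"])

if pytesseract is not None and tesseract_path is not None:
    # Tell pytesseract where the Tesseract executable lives.
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If Tesseract is already on your PATH, you may not need this step at all.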
If we look at the Package Contents of pytesseract, we can see a lot of different objects to explore. In this tutorial we will focus on image_to_string.
The help for image_to_string is quite simple and straightforward.
help(pytesseract.pytesseract.image_to_string)
Help on function image_to_string in module pytesseract.pytesseract:
image_to_string(image, lang=None, config='', nice=0, output_type='string')
    Returns the result of a Tesseract OCR run on the provided image to string
Now let’s open an image that contains some text and display it.
f = r'c:/Users/t/Desktop/default.png'
img = Image.open(f)
img.show()
ACTUAL OCR PART
We’ve opened an image with text. Let’s start doing some OCR!
text = pytesseract.image_to_string(img)
print(text)
Output:
Holy Python
PYTHON HOLLINESS
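The help text for image_to_string also lists optional lang and config parameters. Here is a minimal sketch of how they might be used. The helper names build_config and ocr_image are hypothetical, "eng" assumes the English language data is installed, and --psm 6 is the Tesseract page segmentation mode that assumes a single uniform block of text.

```python
def build_config(psm=6):
    """Build a Tesseract config string; --psm selects the page segmentation mode."""
    return f"--psm {psm}"


def ocr_image(img, lang="eng", psm=6):
    """Run OCR on a PIL image with an explicit language and segmentation mode."""
    # Imported here so the helper above can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(img, lang=lang, config=build_config(psm))
```

With the image opened earlier, this could be called as text = ocr_image(img, lang="eng", psm=6). Whether a different mode reads your image better is something to try out rather than assume.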
CONCLUSION
Yes, OCR is that simple, thanks to Python and pytesseract!
OCR’s scope is much deeper than this quick tutorial, but it should be enough to get you started.
One simple technique to try when OCR is not very successful is converting the image to black and white with the PIL library. This usually improves pytesseract’s reading ability.
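A minimal sketch of that black and white conversion, assuming a fixed threshold works for your image. The function name to_black_and_white and the default threshold of 128 are illustrative choices, not something prescribed by PIL or pytesseract.

```python
from PIL import Image


def to_black_and_white(img, threshold=128):
    """Convert an image to pure black and white using a fixed threshold."""
    gray = img.convert("L")  # 8-bit grayscale, values 0-255
    # Pixels brighter than the threshold become white, the rest become black.
    return gray.point(lambda p: 255 if p > threshold else 0).convert("1")
```

The result can be passed straight to OCR, for example pytesseract.image_to_string(to_black_and_white(img)). A threshold that is too high or too low can erase thin strokes, so it is worth trying a few values.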
You will discover that image modes such as "RGB", "RGBA", "RGBa", "1", and "L" dictate which methods you can and cannot use. Sometimes you will have to convert between them using .convert(mode).
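To illustrate a mode conversion, here is a sketch that checks img.mode and returns an RGB copy, flattening any transparency onto a solid background first. The helper name to_rgb and the white background are assumptions made for this example.

```python
from PIL import Image


def to_rgb(img, background=(255, 255, 255)):
    """Return an RGB copy, flattening any transparency onto a solid background."""
    if img.mode == "RGB":
        return img.copy()
    if img.mode in ("RGBA", "LA"):
        rgba = img.convert("RGBA")
        canvas = Image.new("RGB", rgba.size, background)
        # Use the alpha channel as the paste mask so transparent areas stay white.
        canvas.paste(rgba, mask=rgba.getchannel("A"))
        return canvas
    return img.convert("RGB")
```

Printing img.mode before and after the call is a quick way to see which type you are working with.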
Text can also blend into the image, which for many reasons makes it harder to extract. There are different methods and parameters for preparing the image for pytesseract, such as binarization, which converts it to a pure black and white image.
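As one possible preparation step for low-contrast images, the sketch below converts to grayscale, stretches the contrast, and enlarges the image. The function name prepare_for_ocr, the choice of steps, and the scale factor of 2 are assumptions for illustration; which steps actually help depends on the image.

```python
from PIL import Image, ImageOps


def prepare_for_ocr(img, scale=2):
    """Grayscale, stretch contrast, and enlarge an image before OCR."""
    gray = img.convert("L")
    # Remap the darkest pixel to black and the lightest to white.
    gray = ImageOps.autocontrast(gray)
    width, height = gray.size
    # Enlarging small text can make characters easier to separate.
    return gray.resize((width * scale, height * scale), Image.BICUBIC)
```

The prepared image can then be thresholded or passed directly to pytesseract.image_to_string.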
We hope this quick tutorial is eye-opening and motivates you to start exploring the incredible OCR possibilities with Python.