2025-01-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains in detail how to recognize the text in an image with Python. It is shared here as a reference, and I hope it gives you a working understanding of the topic.
1. Installation and configuration of Tesseract
To install Tesseract, go to https://digi.bib.uni-mannheim.de/tesseract/, which lists the available installers.
Many versions are offered, so choose the one that fits your needs. In the file names, w32 means a 32-bit system and w64 means a 64-bit system.
If the download is slow, you can use this link instead: pan.baidu.com/s/1jKZe_ACLQCVXiCmvHj9adw (extraction code: ayel). During installation, note where Tesseract is installed, because the installation directory has to be added to the system Path variable. Our path is D:\CodeField\Tesseract-OCR.
Right-click My Computer/This Computer -> Properties -> Advanced System Settings -> Environment Variables -> Path -> Edit -> New, and paste the installation path. After adding it, click OK in each dialog in turn to finish the configuration.
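To confirm that the Path change took effect, a short check like the following may help. It is a minimal sketch: `find_tesseract` is a hypothetical helper name, and the fallback location is an assumption based on the installation directory used in this article.

```python
import os
import shutil


def find_tesseract(fallback=r"D:\CodeField\Tesseract-OCR\tesseract.exe"):
    """Return the path to the Tesseract executable, or None if it is not found.

    `fallback` is an assumption: the install directory used in this article.
    """
    # shutil.which searches the directories listed in the Path variable
    found = shutil.which("tesseract")
    if found:
        return found
    # Fall back to the explicit install location if Path is not configured
    return fallback if os.path.isfile(fallback) else None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract not found; check the Path variable")
```

If the executable exists but pytesseract still cannot locate it, the returned path can be assigned to `pytesseract.pytesseract.tesseract_cmd` before calling any recognition function. Note that a terminal opened before the Path change usually has to be restarted to see the new value.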
2. Download a language pack
Tesseract does not support Chinese by default. To recognize Chinese or other languages, you need to download the corresponding language pack from tesseract-ocr.github.io/tessdoc/Data-Files. Open the page and scroll down to the list of languages.
There are two Chinese language packs, Chinese-Simplified and Chinese-Traditional. Download the one you need and put it in the tessdata directory under the Tesseract installation path. Our path is D:\CodeField\Tesseract-OCR\tessdata.
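A quick way to verify that the language pack landed in the right place is to look for its `.traineddata` file. The sketch below assumes the tessdata directory used in this article; `missing_language_packs` is a hypothetical helper name, not part of any library.

```python
from pathlib import Path


def missing_language_packs(tessdata_dir, langs=("chi_sim",)):
    """Return the language codes whose .traineddata file is not in tessdata_dir."""
    folder = Path(tessdata_dir)
    # Tesseract looks for one file per language, named <code>.traineddata
    return [code for code in langs if not (folder / f"{code}.traineddata").is_file()]


if __name__ == "__main__":
    # The directory below is the one used in this article; adjust it to your install
    missing = missing_language_packs(
        r"D:\CodeField\Tesseract-OCR\tessdata", ("chi_sim", "chi_tra")
    )
    print("Missing language packs:", missing or "none")
```

Here `chi_sim` and `chi_tra` are the codes for Simplified and Traditional Chinese, so an empty result means both packs are in place.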
3. Other module downloads
In addition to the steps above, we also need to install two Python modules:
pip install pytesseract
pip install pillow
The first, pytesseract, calls Tesseract to perform optical character recognition, and the second, Pillow, reads the image files. With both installed, we can start recognizing text.
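Before moving on, it can be worth checking that both modules are importable from the interpreter you plan to use. This is a minimal sketch using only the standard library; `installed` is a hypothetical helper name.

```python
from importlib.util import find_spec


def installed(modules=("pytesseract", "PIL")):
    """Map each module name to True if it can be imported, False otherwise."""
    # Pillow is installed with "pip install pillow" but imported as "PIL"
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, ok in installed().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

A "missing" result usually means pip installed the package into a different Python environment than the one running the script.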
III. Optical Character Recognition
1. Single image recognition
The next steps are much simpler. We start with an image, sentence.jpg, that contains one line of English text.
Here is the code that recognizes it:
import pytesseract
from PIL import Image
# Read the image
im = Image.open('sentence.jpg')
# Recognize the text
string = pytesseract.image_to_string(im)
print(string)
The recognition result is as follows:
Do not go gentle into that good night!
English is supported by default, so it can be recognized directly. To recognize Chinese or another language, we need a small change:
import pytesseract
from PIL import Image
# Read the image
im = Image.open('sentence.png')
# Recognize the text and specify the language
string = pytesseract.image_to_string(im, lang='chi_sim')
print(string)
Here we set lang='chi_sim', which sets the language to Simplified Chinese. This only takes effect if the Simplified Chinese language pack is present in your tessdata directory.
The recognition result is as follows:
Don't walk into that good night gently
The content of the image was recognized accurately. One thing to note is that Tesseract still recognizes English characters when the language is set to Simplified Chinese or another language.
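For images that mix Chinese and English, Tesseract also accepts several language codes joined with `+`, for example `chi_sim+eng`. The sketch below is one way to wrap this; `build_lang` and `recognize_mixed` are hypothetical helper names, and each language requested must have its language pack in the tessdata directory.

```python
def build_lang(*codes):
    """Join Tesseract language codes with '+', e.g. ('chi_sim', 'eng') -> 'chi_sim+eng'."""
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def recognize_mixed(image_path, *codes):
    """Run OCR on one image with several languages loaded at once."""
    # Imported here so build_lang can be used without the OCR packages installed
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as im:
        return pytesseract.image_to_string(im, lang=build_lang(*codes))
```

With the example image from this article, the call would be `recognize_mixed('sentence.png', 'chi_sim', 'eng')`. Whether loading both languages improves the result depends on the image, so it is worth comparing against the single-language output.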
2. Batch image recognition
pytesseract can also recognize several images in one call. To use this, prepare a text file that lists one image path per line. For example, a text.txt file with the following content:
sentence1.jpg
sentence2.jpg
We then change the code to pass the text file instead of an image:
import pytesseract
# Recognize the text of every image listed in text.txt
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)
Writing the text file by hand is tedious, however, so we can generate it from the contents of a folder:
import os
import pytesseract

# Folder that holds the images to recognize
path = 'text_img/'
# Build the list of image paths
imgs = [os.path.join(path, name) for name in os.listdir(path)]
# Write one image path per line to text.txt
with open('text.txt', 'w', encoding='utf-8') as f:
    for img in imgs:
        f.write(img + '\n')
# Recognize every image listed in text.txt
string = pytesseract.image_to_string('text.txt', lang='chi_sim')
print(string)
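Because os.listdir returns every entry in the folder, a stray non-image file would end up in text.txt as well. The following sketch filters by file extension before writing the list. The extension tuple and the helper name `write_image_list` are assumptions for illustration, not part of pytesseract.

```python
import os

# Assumed set of image extensions; extend it to match your files
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff")


def write_image_list(folder, list_file="text.txt"):
    """Write the paths of the image files in `folder` to `list_file`, one per line."""
    paths = sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    with open(list_file, "w", encoding="utf-8") as f:
        f.writelines(p + "\n" for p in paths)
    return paths
```

After calling `write_image_list('text_img/')`, the resulting text.txt can be passed to `pytesseract.image_to_string('text.txt', lang='chi_sim')` as before. Sorting the paths keeps the order of the recognized text predictable.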
That covers how to recognize the text in an image with Python. I hope the content above is of some help.