How to Use Text Recognition in Python to Crack Image CAPTCHAs

2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to use Optical Character Recognition (OCR) in Python to crack an image Captcha. The editor finds it quite practical and is sharing it here for reference. Let's take a look.

Preliminary preparation

1. Install the packages by running the pip commands directly in the terminal:

# Send browser requests
pip3 install requests
# Optical Character Recognition
pip3 install pytesseract
# Image processing
pip3 install Pillow
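After installing, it can help to confirm that the three packages are importable before moving on. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and note that Pillow is imported under the module name `PIL`.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Pillow installs as the importable module "PIL".
    required = ["requests", "pytesseract", "PIL"]
    missing = missing_modules(required)
    if missing:
        print("Not installed yet: " + ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Any name reported as missing can be installed with the matching pip3 command above.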

2. Create a new project

After the required modules are installed, create a new project named wordsDistinguish.

Create three new .py files under the project package: test_pytesseract, test_pillow, and case_verification.

test_pytesseract: basic usage test of the pytesseract module

test_pillow: basic usage test of the Pillow module

case_verification: practical case, cracking a website's image Captcha

Background knowledge

1. Image in Pillow

The most important class in the Python Imaging Library is Image, which is defined in the module of the same name.

Instances of this class can be created in several ways: by loading an image from a file, by processing other images, or by creating an image from scratch.

# -*- coding: utf-8 -*-
# Note: __future__ imports must come before all other imports, otherwise an error is raised
from __future__ import print_function
from PIL import Image

"""
Basic use of Image in the Pillow module
"""
# 1. Open the image
im = Image.open("../wordsDistinguish/test1.jpg")
print(im)
# 2. View the image file properties
print("Image file format: " + im.format)
print("Image size: " + str(im.size))
print("Image mode: " + im.mode)
# 3. Display the current image object
im.show()
# 4. Resize the image and save it in PNG format
size = (50, 50)
im.thumbnail(size)
im.save("1.png", "PNG")
# 5. Convert the image mode and save; "L" means grayscale, "RGB" means color
im = im.convert("L")
im.save("test1.jpg")
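To make step 5 more concrete: Pillow's documentation describes the RGB to "L" conversion as the ITU-R 601-2 luma transform, L = R * 299/1000 + G * 587/1000 + B * 114/1000. The sketch below applies that formula in plain Python to a tiny hand-written image. It is only an illustration: the helper names are hypothetical, Pillow is not used, and Pillow's own rounding may differ from this integer division by one gray level.

```python
def rgb_to_gray(pixel):
    """Apply the ITU-R 601-2 luma weights to one (R, G, B) pixel."""
    r, g, b = pixel
    return (r * 299 + g * 587 + b * 114) // 1000


def to_grayscale(rows):
    """Convert a list of rows of (R, G, B) pixels to rows of gray levels."""
    return [[rgb_to_gray(pixel) for pixel in row] for row in rows]


if __name__ == "__main__":
    # A hypothetical 2x2 image: white, red, blue, black.
    tiny_image = [
        [(255, 255, 255), (255, 0, 0)],
        [(0, 0, 255), (0, 0, 0)],
    ]
    print(to_grayscale(tiny_image))
```

Green contributes most to the gray level and blue least, which is why pure blue text on a dark background becomes hard to distinguish after conversion.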

2. Pytesseract based on Tesseract-OCR

Python-tesseract is an Optical Character Recognition (OCR) tool for Python. That is, it recognizes and "reads" the text embedded in an image.

Python-tesseract is a wrapper around Google's Tesseract-OCR engine.

It is also useful as a stand-alone invocation script for Tesseract, because it can read all image types supported by the Pillow and Leptonica imaging libraries, including jpeg, png, gif, bmp, tiff, and others.

In addition, when used as a script, Python-tesseract prints the recognized text instead of writing it to a file.

To use the pytesseract module on your computer, you also need to install Tesseract-OCR. On a Mac, I recommend installing it with Homebrew. After installation, enter the following command directly in the terminal:

On Windows, download the installer package directly, then add the install location to the system environment variables (that is, add it to Path). The process is straightforward; search online if you need detailed steps.
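Whichever platform you use, pytesseract only works if it can find the Tesseract executable. A quick way to check the Path setup is sketched below with the standard library; the helper name `find_tesseract` is hypothetical, and the executable name `tesseract` is assumed to be the default one.

```python
import shutil


def find_tesseract(command="tesseract"):
    """Return the full path of the Tesseract executable, or None if not on PATH."""
    return shutil.which(command)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract was not found on PATH; check the installation.")
    else:
        print("tesseract found at: " + path)
```

If the executable is installed but not on Path, the pytesseract README describes setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable instead.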

# -*- coding: utf-8 -*-
# Import the image processing class Image from Pillow
from PIL import Image
# Import pytesseract, the Tesseract-based Optical Character Recognition module
import pytesseract

"""
@pytesseract: github.com/madmaze/pytesseract
"""
# Open the image
im = Image.open("../wordsDistinguish/Resources/1.jpg")
# Recognize the text in the image
text = pytesseract.image_to_string(im)
print(text)

1. Preparation

Logging in requires three pieces of data: account, password, and Captcha. First log in once for real in the browser and press F12 to observe the login process.

Enter your account, password, and Captcha, click Login, and watch what changes in the Network panel.

2. Writing the code

The main difficulties in simulating the login process are recognizing the Captcha and submitting it.

a. Captcha recognition: based on the background above, we use the pytesseract module directly.

b. Passing the login parameters: the POST request can be sent with the requests library. The problem is how to associate the Captcha with the login.

From the previous analysis we know that the Captcha is served from "https://so.gushiwen.org/RandCode.ashx" and the login page is "https://so.gushiwen.org/user/login.aspx".

Analysis shows that when a normal browser logs in, the cookies sent to these two URLs are identical, and both carry timestamps. So as long as the cookies are the same for both requests made by our code, the Captcha and the login are tied together. The session mechanism of the requests library achieves this.
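The idea of sharing cookies across two requests can be tried out without touching the real site. The sketch below is a stand-in, not the article's code: it uses the standard library (`http.cookiejar` and `urllib.request`) instead of requests, and a hypothetical local server with made-up paths `/captcha` and `/login` and a made-up cookie `sid=abc123`. With requests, the equivalent is calling `get` and `post` on a single `requests.session()` object.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical local server standing in for the two real URLs."""

    def do_GET(self):
        self.send_response(200)
        if self.path == "/captcha":
            # First request: the server issues a session cookie.
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
            body = b"fake-image-bytes"
        else:
            # Second request: echo back the Cookie header the client sent.
            body = self.headers.get("Cookie", "").encode()
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_both(share_cookies):
    """Request /captcha, then /login; return the cookie that /login received."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    # An empty ProxyHandler keeps the demo local even if proxy variables are set.
    handlers = [urllib.request.ProxyHandler({})]
    if share_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(base + "/captcha") as resp:
            resp.read()
        with opener.open(base + "/login") as resp:
            return resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("with shared cookie jar:", repr(fetch_both(True)))
    print("without cookie jar:", repr(fetch_both(False)))
```

Only the run with a shared cookie jar sends the cookie back on the second request, which is the behavior the login flow depends on.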

# -*- coding: utf-8 -*-
# Import the image processing class Image from Pillow
from PIL import Image
# Import pytesseract, the Tesseract-based Optical Character Recognition module
import pytesseract
# Import the requests library
import requests
# Import the regular expression library re
import re

"""
Simulate login and crack an alphanumeric image Captcha
Target website: so.gushiwen.org
"""
# Request headers
headers = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
}
# Create a session with requests so that both requests carry the same cookies
session = requests.session()


# Download and recognize the Captcha image
def get_verification():
    # URL that generates the Captcha image
    url = "https://so.gushiwen.org/RandCode.ashx"
    # Send a GET request through the session to fetch the Captcha
    resp = session.get(url, headers=headers)
    # Save the Captcha locally
    with open(r"../wordsDistinguish/Resources/test.jpg", 'wb') as f:
        f.write(resp.content)
    # Open the Captcha image file
    im = Image.open(r"../wordsDistinguish/Resources/test.jpg")
    # Basic processing: convert to grayscale to improve recognition accuracy
    im = im.convert("L")
    # Save the processed image
    im.save("test.jpg")
    # Use pytesseract to recognize the image content
    text = pytesseract.image_to_string(im)
    # Remove non-alphanumeric content from the recognition result
    text = re.sub(r"\W", "", text)
    # Return the Captcha text
    return text


def do_login():
    i = 0  # Number of recognition errors
    # Get the Captcha
    captcha = get_verification()
    # Basic check: the Captcha must be 4 characters long
    while len(captcha) != 4:
        captcha = get_verification()
        i += 1
        print(" %d recognition error" % i)
    print("Start login, Captcha is: " + captcha)
    # Login parameters to submit
    data = {
        "from": "http://so.gushiwen.org/user/collect.aspx",
        "email": "your registered email",
        "pwd": "your login password",
        "code": captcha,
        "denglu": "login"
    }
    # Login address
    url = "https://so.gushiwen.org/user/login.aspx"
    # Send the POST request through the session
    response = session.post(url, headers=headers, data=data)
    # Print the status code after login
    print(response.status_code)
    # Save the returned page to further confirm whether the login succeeded
    with open("gsww.html", encoding="utf-8", mode="w") as f:
        f.write(response.content.decode())


# Start the program
if __name__ == "__main__":
    do_login()
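The cleanup and length check in the script can be isolated and tried without any network or OCR. The sketch below is a variation, not the article's exact logic: it keeps only ASCII letters and digits (the `\W` pattern also lets underscores and non-ASCII letters through), and it caps the number of attempts instead of looping forever. The names `clean_captcha`, `recognize_with_retries`, and the `recognize` callback are hypothetical; in the script above, `get_verification` would play the role of the callback.

```python
import re


def clean_captcha(raw_text):
    """Keep only ASCII letters and digits from raw OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw_text)


def recognize_with_retries(recognize, expected_length=4, max_attempts=10):
    """Call recognize() until the cleaned text has the expected length.

    Returns (text, attempts_used), or (None, max_attempts) if every
    attempt produced text of the wrong length.
    """
    for attempt in range(1, max_attempts + 1):
        text = clean_captcha(recognize())
        if len(text) == expected_length:
            return text, attempt
    return None, max_attempts


if __name__ == "__main__":
    # Simulated OCR outputs: two bad reads followed by a plausible one.
    fake_results = iter(["x", "ab 1", "A1b2\n"])
    print(recognize_with_retries(lambda: next(fake_results)))
```

A length check only filters out obviously wrong reads; a 4-character result can still be misrecognized, in which case the site rejects the login.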

3. Results

a. The console shows one successful recognition, and the returned status code is 200, meaning the request completed normally.

b. As a further check, inspect the saved page source.

Observing the site in the browser, we found that only the page shown after logging in contains an account management module.

That module includes a unique identifier for the user: the last few characters of the bound email address. Mine is 50471@qq.com.

Thank you for reading! That concludes this article on using Optical Character Recognition in Python to crack an image Captcha. I hope it is helpful; if you found it useful, feel free to share it so more people can see it.
