2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
This article explains how to handle pop-up windows and recognize simple CAPTCHAs in Python. The material is short and easy to follow, so let's work through it step by step.
Preface
When writing crawlers, a common obstacle set up by target websites is a CAPTCHA. Using Selenium, this article explains how to deal with pop-up windows and CAPTCHAs, taking the login page of an instrument reservation platform as the target website.
The verification code required for login has a fairly simple structure: colored digits in a standard font over a simple background with light interference.
Therefore, CAPTCHA recognition here does not need the help of artificial intelligence. It is enough to binarize the image and hand it to the Tesseract OCR engine (long sponsored by Google) to read the digits in the picture.
(Note: installing and configuring Selenium and Tesseract is not covered in this article; readers can look it up themselves.)
Python in practice
First, import the required modules:
import re
# Image processing
from PIL import Image
# Character recognition
import pytesseract
# Browser automation
from selenium import webdriver
import time
Solving the pop-up box problem
Try opening the sample website first:
url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(30)
Interestingly, the website shows a pop-up window that did not appear before. A brief word on pop-ups: beginners can simply divide pop-up boxes into alert and non-alert types.
Alert pop-up box
The alert(message) method displays an alert box with the specified message and an OK button. The confirm(message) method displays a dialog box with the specified message and OK and Cancel buttons. The prompt(text, defaultText) method displays a dialog box that prompts the user for input.
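For a native JavaScript dialog of this kind, Selenium exposes the open dialog through driver.switch_to.alert, and its accept(), dismiss() and send_keys() methods cover the three cases above. The following is a minimal sketch, assuming a driver that already has such a dialog open; handle_js_dialog is a hypothetical helper name, not part of Selenium.

```python
def handle_js_dialog(driver, accept=True, text=None):
    """Handle an open alert/confirm/prompt dialog and return its message.

    handle_js_dialog is a hypothetical helper; `driver` is assumed to be a
    Selenium WebDriver with a native JavaScript dialog currently open.
    """
    dialog = driver.switch_to.alert   # focus the open dialog
    message = dialog.text             # message shown in the dialog
    if text is not None:
        dialog.send_keys(text)        # only meaningful for prompt()
    if accept:
        dialog.accept()               # equivalent to clicking OK
    else:
        dialog.dismiss()              # equivalent to clicking Cancel
    return message
```

If no native dialog is open, Selenium raises NoAlertPresentException, which is one way to tell that a pop-up is not a real alert.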
Take a look at the JavaScript of the pop-up box:
It looks like an alert pop-up box, so can we use driver.switch_to.alert directly? Not so fast.
Handling non-traditional (non-alert) pop-up boxes
A non-alert pop-up usually falls into one of three cases. If it sits in a div layer, it is located like any other element. If it is a nested iframe, you need to switch into the iframe first. If it lives in a separate window handle, you need to switch windows.
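The iframe and window cases can be sketched as follows. This is only an illustration, assuming a Selenium WebDriver is passed in; click_in_iframe and switch_to_newest_window are hypothetical helper names, and the frame reference and locator are whatever the target page actually uses.

```python
def click_in_iframe(driver, frame, by, value):
    """Click an element that lives inside an iframe, then return to the page.

    Hypothetical helper: `frame` is a frame name, index or element, and
    (by, value) is a normal Selenium locator such as ("xpath", "//button").
    """
    driver.switch_to.frame(frame)              # enter the iframe
    try:
        driver.find_element(by, value).click()
    finally:
        driver.switch_to.default_content()     # always go back to the top document


def switch_to_newest_window(driver):
    """Switch to the most recently opened window and return its handle."""
    handle = driver.window_handles[-1]         # handles are listed in opening order
    driver.switch_to.window(handle)
    return handle
```

A pop-up in a plain div layer needs neither helper: it is found with an ordinary locator, which turns out to be the case on the sample site.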
So we inspect the elements of this pop-up box.
The problem then turns out to be very simple: just locate the button and click it.
url = 'http://lims.gzzoc.com/client'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(1)
driver.maximize_window()  # maximize the window
# Selenium 3 style call; Selenium 4.3+ removed it in favor of driver.find_element("xpath", ...)
driver.find_element_by_xpath("//div[@class='jconfirm-buttons']/button").click()
Get the location of the picture and take a screenshot
The basic idea of handling the CAPTCHA by binarization is as follows:
Crop the part of the screenshot that contains the CAPTCHA, binarize it so that the useful information becomes black while the background and interference become white, pass the result to the character recognition engine, then enter and submit the returned text.
Cropping out the CAPTCHA image needs a bit more thought. First, obtain the size and location attributes of the picture on the web page and compute its coordinates from them; then take a screenshot of the page; finally, use those coordinates to crop the screenshot. (Because of how the verification code's JavaScript works, you cannot simply take the URL of the img, download the picture, and recognize that: the downloaded image would not match the one shown on the page.)
img = driver.find_element_by_xpath('//img[@id="valiCode"]')
time.sleep(1)
location = img.location
size = img.size
# left = location['x']
# top = location['y']
# right = left + size['width']
# bottom = top + size['height']
left = 2 * location['x']
top = 2 * location['y']
right = left + 2 * size['width'] - 10
bottom = top + 2 * size['height'] - 10
driver.save_screenshot('valicode.png')
page_snap_obj = Image.open('valicode.png')
image_obj = page_snap_obj.crop((left, top, right, bottom))
image_obj.show()
Under normal circumstances you can use the four commented-out lines directly. However, browsers on different computers use different display scaling, so if the cropped image is offset you need to multiply by the scaling factor (2 here) and then add or subtract a few pixels to fine-tune the box.
You can see that the picture has been cropped successfully!
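The coordinate arithmetic above can be folded into one small function so the scaling factor and the fine-tuning offset are explicit. This is a sketch under the same assumptions as the code above; crop_box, scale and trim are hypothetical names introduced here for illustration.

```python
def crop_box(location, size, scale=1.0, trim=0):
    """Compute the (left, top, right, bottom) crop box for a page screenshot.

    location/size are the dicts returned by a Selenium element
    ({'x', 'y'} and {'width', 'height'}). `scale` is the display scaling
    factor and `trim` the number of pixels shaved off the right and bottom
    edges; both names are assumptions made for this sketch.
    """
    left = scale * location['x']
    top = scale * location['y']
    right = left + scale * size['width'] - trim
    bottom = top + scale * size['height'] - trim
    return (int(left), int(top), int(right), int(bottom))


# Same numbers as the hand-written version: factor 2, minus 10 pixels
box = crop_box({'x': 100, 'y': 50}, {'width': 60, 'height': 20}, scale=2, trim=10)
```

Instead of guessing the factor, one option is to read it from the browser with driver.execute_script("return window.devicePixelRatio"), although the value should still be checked against the actual screenshot.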
Further processing of CAPTCHA pictures
The threshold has to be found by experiment with Photoshop or another tool: look for a gray level in the grayscale image that separates the real digits from the background interference.
img = image_obj.convert("L")  # convert to a grayscale image
pixdata = img.load()
w, h = img.size
threshold = 205
# Iterate over all pixels: those below the threshold become black, the rest white
for y in range(h):
    for x in range(w):
        if pixdata[x, y] < threshold:
            pixdata[x, y] = 0
        else:
            pixdata[x, y] = 255
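To make the thresholding rule easy to check in isolation, here is the same mapping applied to a plain nested list of gray values instead of a Pillow image. It is only an illustrative sketch; binarize is a hypothetical helper name.

```python
def binarize(gray, threshold):
    """Map a 2D list of gray values (0-255) to pure black (0) and white (255).

    Values strictly below `threshold` become black, everything else white,
    matching the loop above.
    """
    return [[0 if p < threshold else 255 for p in row] for row in gray]


sample = [[10, 205, 204],
          [255, 0, 206]]
result_grid = binarize(sample, 205)
```

On a Pillow image in "L" mode the same mapping can be applied in one call with img.point(lambda p: 0 if p < threshold else 255), which avoids the explicit double loop.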
Regenerate the picture from the binarized pixels, removing isolated black points:
data = img.getdata()
w, h = img.size
black_point = 0
for x in range(1, w - 1):
    for y in range(1, h - 1):
        mid_pixel = data[w * y + x]
        if mid_pixel < 50:  # only look at black pixels
            top_pixel = data[w * (y - 1) + x]
            left_pixel = data[w * y + (x - 1)]
            down_pixel = data[w * (y + 1) + x]
            right_pixel = data[w * y + (x + 1)]
            # Count the black neighbours above, left, below and right
            if top_pixel < 10:
                black_point += 1
            if left_pixel < 10:
                black_point += 1
            if down_pixel < 10:
                black_point += 1
            if right_pixel < 10:
                black_point += 1
            # A black pixel with no black neighbour is noise: turn it white
            if black_point < 1:
                img.putpixel((x, y), 255)
            black_point = 0
img.show()
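The noise-removal rule (a black pixel with no black neighbour above, below, left or right is treated as interference) can also be written against a plain nested list, which makes it easy to test on a tiny grid. This is a sketch, not the author's code: remove_isolated_pixels is a hypothetical name, and unlike the loop above it writes into a copy so that every decision is based on the original pixels.

```python
def remove_isolated_pixels(pixels):
    """Return a copy of a binarized 2D grid with isolated black pixels whitened.

    `pixels` holds 0 (black) or 255 (white). Border pixels are left as-is,
    as in the loop above.
    """
    h, w = len(pixels), len(pixels[0])
    cleaned = [row[:] for row in pixels]          # do not modify the input
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if pixels[y][x] != 0:
                continue                          # only black pixels are candidates
            neighbours = (pixels[y - 1][x], pixels[y + 1][x],
                          pixels[y][x - 1], pixels[y][x + 1])
            if all(n != 0 for n in neighbours):   # no black 4-neighbour
                cleaned[y][x] = 255
    return cleaned


grid = [[255] * 5 for _ in range(5)]
grid[1][1] = 0                 # isolated speck
grid[2][2] = grid[2][3] = 0    # two adjacent pixels, part of a stroke
cleaned_grid = remove_isolated_pixels(grid)
```

Here the speck at row 1, column 1 is whitened while the two adjacent stroke pixels survive.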
The comparison before and after image processing is as follows
Character recognition
The processed image is then handed to the Tesseract character recognition engine to complete the recognition.
result = pytesseract.image_to_string(img)
# The output may contain stray symbols, so extract the digits with a regular expression
regex = r'\d+'
result = ''.join(re.findall(regex, result))
print(result)
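The clean-up step can be isolated and checked without running OCR at all. The sketch below assumes the CAPTCHA consists only of digits, as in this article; extract_digits and looks_valid are hypothetical helper names, and expected_length is an assumption the reader has to fill in for the actual site.

```python
import re


def extract_digits(ocr_text):
    """Keep only the digits from raw OCR output, dropping spaces and stray symbols."""
    return ''.join(re.findall(r'\d+', ocr_text))


def looks_valid(code, expected_length):
    """Cheap sanity check before submitting: right length and digits only."""
    return len(code) == expected_length and code.isdigit()


cleaned_code = extract_digits(" 4 8a2\n1\x0c")
```

Rejecting results of the wrong length before submitting saves a round trip to the server when the OCR output is obviously incomplete.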
The recognition results are as follows
Submit account password, verification code and other information
With the CAPTCHA recognized, we can now submit the account, password, CAPTCHA and other information needed to log in to the website.
driver.find_element_by_name('code').send_keys(result)
driver.find_element_by_name('userName').send_keys('xxx')
driver.find_element_by_name('password').send_keys('xxx')
# Finally, click OK
driver.find_element_by_xpath("//div[@class='form-group login-input'][3]").click()
Note that binarization does not recognize the CAPTCHA with a 100% success rate, so recognition errors must be handled: click the picture to get a new CAPTCHA and recognize it again. Once the code above is split into functions, you can use the following loop skeleton to retry on failure.
while True:
    try:
        ...
        break
    except Exception:
        # Recognition or login failed: click the picture to get a new CAPTCHA
        driver.find_element_by_id('valiCode').click()
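An unbounded while True loop keeps retrying forever if something other than the CAPTCHA is wrong, for example a bad password. A bounded variant is sketched below; retry_with_refresh, attempt, refresh and max_attempts are hypothetical names, where attempt would wrap the recognize-and-submit steps and refresh would click the CAPTCHA picture.

```python
def retry_with_refresh(attempt, refresh, max_attempts=5):
    """Call attempt() until it succeeds, calling refresh() after each failure.

    Re-raises the last exception once max_attempts is reached, so a
    persistent problem does not loop forever.
    """
    for i in range(1, max_attempts + 1):
        try:
            return attempt()
        except Exception:
            if i == max_attempts:
                raise            # give up and surface the real error
            refresh()            # e.g. click the picture to get a new CAPTCHA
```

With the code above split into functions, a call might look like retry_with_refresh(login_once, lambda: driver.find_element_by_id('valiCode').click()), where login_once is a function the reader would write to raise an exception when the login is rejected.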
To keep things easy to follow, the code here is not organized into functions; readers are welcome to try refactoring it!
Summary
After logging in successfully you have the session cookies. You can then keep automating the browser with Selenium, or transfer the cookies to requests, to crawl the information you need for analysis or to automate other tasks. Since this involves many more crawler topics, we will cover them in later articles of this crawler series.
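Transferring the cookies mostly comes down to reshaping them: Selenium's driver.get_cookies() returns a list of dicts with 'name' and 'value' keys (among others), while requests accepts a plain name-to-value mapping. A minimal sketch, where cookies_to_dict is a hypothetical helper name and the sample cookie values are made up for illustration:

```python
def cookies_to_dict(selenium_cookies):
    """Convert Selenium's list of cookie dicts into a {name: value} mapping."""
    return {c['name']: c['value'] for c in selenium_cookies}


# Shape of what driver.get_cookies() returns (values here are made up)
sample_cookies = [
    {'name': 'JSESSIONID', 'value': 'abc123', 'domain': 'example.com', 'path': '/'},
    {'name': 'lang', 'value': 'en', 'domain': 'example.com', 'path': '/'},
]
cookie_dict = cookies_to_dict(sample_cookies)
```

The resulting dict can then be passed as requests.get(url, cookies=cookie_dict). This drops the domain and path information, which is usually acceptable when all requests go to the same site.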
Thank you for reading. That is all for "Python pop-up window processing and CAPTCHA recognition method"; how well these techniques work on your own target still needs to be verified in practice.