2025-01-19 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains in detail how a Python crawler can recognize Captchas and handle pop-up windows. The editor finds it quite practical and shares it here for reference; I hope you gain something from reading it.
Preface
When writing a crawler, a common obstacle set up by the target website is a Captcha. This article uses a hands-on Selenium example to show how to deal with pop-up windows and Captchas. The target website is an instrument reservation platform.
The Captcha required for login is relatively simple: colored digits on a background with light interference.
Recognizing this Captcha therefore does not require artificial intelligence. The image can be binarized directly and handed to Google's recognition engine Tesseract-OCR to obtain the digits in the picture.
Python in action
Import the required modules first
    import re
    # Image processing
    from PIL import Image
    # Optical Character Recognition
    import pytesseract
    # Browser automation
    from selenium import webdriver
    import time
Resolve pop-up box issues
Try opening the sample website first
    url = 'http://lims.gzzoc.com/client'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(30)
Something interesting happens: the site shows a pop-up window that we did not see before. A brief note on pop-ups: for beginners, they can simply be divided into alert and non-alert pop-up boxes.
alert pop-up box
The alert(message) method displays an alert box with a specified message and an OK button
The confirm(message) method displays a dialog box with a specified message and OK and Cancel buttons
The prompt(text, defaultText) method displays a dialog box that prompts the user for input
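For reference, native dialogs of these three kinds are usually handled in Selenium through `driver.switch_to.alert`, which exposes `text`, `accept()`, `dismiss()` and `send_keys()`. The following is only a minimal sketch: the helper name `handle_native_dialog` and its parameters are illustrative and do not come from the article.

```python
def handle_native_dialog(driver, accept=True, reply=None):
    """Read and close a native alert/confirm/prompt dialog.

    `driver` is a Selenium WebDriver. The helper name and its
    parameters are hypothetical, chosen for this sketch.
    """
    dialog = driver.switch_to.alert   # typically raises if no dialog is open
    message = dialog.text             # the message shown in the dialog
    if reply is not None:
        dialog.send_keys(reply)       # only meaningful for prompt()
    if accept:
        dialog.accept()               # same as clicking OK
    else:
        dialog.dismiss()              # same as clicking Cancel
    return message
```

A call such as `handle_native_dialog(driver, accept=False)` would cancel a confirm() dialog and return its message.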
Look at how the js of this pop-up box is written:
It looks like an alert pop-up box, so should we use driver.switch_to.alert directly? Not so fast.
Processing of non-traditional alert pop-up box
The pop-up box is a div layer in the page: locate it the same way as any other element
The pop-up box is nested in an iframe: switch to that iframe first
The pop-up box opens under a separate window handle: switch to that window first
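The iframe and window cases can be sketched with Selenium's `switch_to.frame`, `switch_to.default_content` and `switch_to.window`. The two helper names and the XPath arguments below are assumptions made for illustration; the example site itself only needs the plain div case.

```python
def click_inside_iframe(driver, frame_xpath, button_xpath):
    """Click a button that lives inside an iframe, then switch back."""
    frame = driver.find_element("xpath", frame_xpath)
    driver.switch_to.frame(frame)
    try:
        driver.find_element("xpath", button_xpath).click()
    finally:
        driver.switch_to.default_content()  # return to the top-level page


def switch_to_other_window(driver):
    """Switch to the first window that is not the current one.

    Returns the new handle, or None if no other window exists.
    """
    current = driver.current_window_handle
    for handle in driver.window_handles:
        if handle != current:
            driver.switch_to.window(handle)
            return handle
    return None
```

Switching back in a `finally` block keeps later lookups on the main page working even if the click fails.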
So let's do an element review of this pop-up.
So the problem is actually very simple, just locate the button and click.
    url = 'http://lims.gzzoc.com/client'
    driver = webdriver.Chrome()
    driver.get(url)
    time.sleep(1)
    driver.maximize_window()  # maximize the window
    # find_element("xpath", ...) is used because newer Selenium 4 releases
    # no longer provide find_element_by_xpath
    driver.find_element("xpath", "//div[@class='jconfirm-buttons']/button").click()
Get the image location and screenshot
The simple idea of binary processing Captcha is as follows:
Cut and capture the picture where the Captcha is located
Convert it to grayscale, then binarize it: valid information becomes black, background and interference become white
The processed picture is handed over to the Optical Character Recognition engine
Enter the returned results and submit
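Step 2 above can be illustrated without any imaging library, on a small grid of grayscale values. This is a sketch only: the function names and the sample grid are made up, and whitening isolated dark pixels is just one simple way to push leftover interference to white.

```python
def binarize(gray, threshold):
    """Map each grayscale value to 0 (black) if below threshold, else 255."""
    return [[0 if v < threshold else 255 for v in row] for row in gray]


def remove_isolated(binary):
    """Whiten black pixels that have no black 4-neighbour (border skipped)."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if binary[y][x] == 0:
                neighbours = (binary[y - 1][x], binary[y + 1][x],
                              binary[y][x - 1], binary[y][x + 1])
                if 0 not in neighbours:
                    out[y][x] = 255
    return out


gray = [
    [250, 250, 250, 250, 250],
    [250,  40, 250, 250, 250],   # lone dark speck: interference
    [250, 250, 250,  30, 250],
    [250, 250, 250,  35, 250],   # two stacked dark pixels: part of a digit
    [250, 250, 250, 250, 250],
]
cleaned = remove_isolated(binarize(gray, threshold=205))
for row in cleaned:
    print("".join("#" if v == 0 else "." for v in row))
```

The lone speck disappears while the two connected dark pixels survive, which is the behaviour wanted before handing the picture to the recognition engine.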
For cutting out the Captcha picture, the strategy is as follows: first read the picture's size and location on the web page and compute its coordinates; then take a screenshot of the page; finally use these coordinates to crop the screenshot. (Because of how the Captcha's JavaScript works, you cannot simply take the image URL of the img and download the picture for recognition: the downloaded picture would not match the one displayed on the page.)
    # find_element("xpath", ...) is used because newer Selenium 4 releases
    # no longer provide find_element_by_xpath
    img = driver.find_element("xpath", '//img[@id="valiCode"]')
    time.sleep(1)
    location = img.location
    size = img.size
    # left = location['x']
    # top = location['y']
    # right = left + size['width']
    # bottom = top + size['height']
    left = 2 * location['x']
    top = 2 * location['y']
    right = left + 2 * size['width'] - 10
    bottom = top + 2 * size['height'] - 10
    # Screenshot the whole page, then crop out the Captcha region
    driver.save_screenshot('valicode.png')
    page_snap_obj = Image.open('valicode.png')
    image_obj = page_snap_obj.crop((left, top, right, bottom))
    image_obj.show()
Normally the four commented-out lines can be used directly. However, different computers and browsers use different zoom ratios, so if the cropped region is off, multiply the coordinates by a scale factor (2 in this example) and then add or subtract a few pixels for fine tuning.
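The scaling arithmetic can be isolated in a small function so that the factor and the fine-tuning offset are explicit. This is a sketch: `crop_box`, `guess_scale`, `scale` and `trim` are names invented here, and using `window.devicePixelRatio` as a starting guess for the factor is an assumption that may not hold on every setup.

```python
def crop_box(location, size, scale=1.0, trim=0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are the dicts returned by a Selenium element's
    .location and .size; scale and trim are hand-tuned values.
    """
    left = scale * location['x']
    top = scale * location['y']
    right = left + scale * size['width'] - trim
    bottom = top + scale * size['height'] - trim
    return (left, top, right, bottom)


def guess_scale(driver):
    """Ask the browser for its device pixel ratio as a first guess."""
    return driver.execute_script("return window.devicePixelRatio")


# Same numbers as the article's hard-coded version (factor 2, minus 10 px)
print(crop_box({'x': 100, 'y': 50}, {'width': 60, 'height': 20},
               scale=2, trim=10))
```

Selenium elements also offer a `screenshot(filename)` method that saves only the element itself, which may avoid the coordinate arithmetic altogether.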
You can see the picture here successfully captured!
Further processing of Captcha images
The binarization threshold has to be found by experiment, with Photoshop or another tool: the goal is a grayscale value that separates the real data from the background interference. In this example, the tested threshold is 205.
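If trial and error is inconvenient, a threshold can also be estimated from the image itself. The sketch below uses Otsu's method, which is not what the article does (it uses the hand-tested value 205): it picks the value that best splits the grayscale histogram into a dark class and a light class. The function name is illustrative, and the result should still be checked by eye.

```python
def otsu_threshold(pixels):
    """Estimate a binarization threshold from grayscale values (0-255).

    Pixels strictly below the returned value form the dark class,
    matching a `< threshold` test during binarization.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(1, 256):
        dark_count += hist[t - 1]            # values < t count as dark
        dark_sum += (t - 1) * hist[t - 1]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance, up to a constant factor
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t
```

The input can be any iterable of integer gray levels, for example the pixel values of a grayscale picture flattened into a list.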
    img = image_obj.convert("L")  # convert to grayscale
    pixdata = img.load()
    w, h = img.size
    threshold = 205
    # Traverse all pixels: below the threshold becomes black, the rest white
    for y in range(h):
        for x in range(w):
            if pixdata[x, y] < threshold:
                pixdata[x, y] = 0
            else:
                pixdata[x, y] = 255
    # Regenerate the picture from the binarized pixels:
    # whiten any dark pixel that has no dark neighbour
    data = img.getdata()
    w, h = img.size
    black_point = 0
    for x in range(1, w - 1):
        for y in range(1, h - 1):
            mid_pixel = data[w * y + x]
            if mid_pixel < 50:
                top_pixel = data[w * (y - 1) + x]
                left_pixel = data[w * y + (x - 1)]
                down_pixel = data[w * (y + 1) + x]
                right_pixel = data[w * y + (x + 1)]
                if top_pixel < 10:
                    black_point += 1
                if left_pixel < 10:
                    black_point += 1
                if down_pixel < 10:
                    black_point += 1
                if right_pixel < 10:
                    black_point += 1
                if black_point < 1:
                    img.putpixel((x, y), 255)
                black_point = 0
    img.show()
The picture before and after processing compares as follows:
Optical Character Recognition
The processed image is handed to Google's Tesseract OCR engine to complete the recognition
    result = pytesseract.image_to_string(img)
    # The output may contain stray symbols, so extract the digits with a regular expression
    regex = r'\d+'
    result = ''.join(re.findall(regex, result))
    print(result)
The identification results are as follows
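The digit extraction can be wrapped so that an implausible result raises an error instead of being submitted. This is a sketch under assumptions: `clean_code` and `expected_len` are names introduced here, the article does not state the Captcha length, and the Tesseract options in `DIGITS_CONFIG` (single-line page segmentation, digit whitelist) are an optional tuning whose support varies by Tesseract version.

```python
import re

# Assumed Tesseract options: treat the image as one text line (--psm 7)
# and restrict recognition to digits. They could be passed as
# pytesseract.image_to_string(img, config=DIGITS_CONFIG).
DIGITS_CONFIG = "--psm 7 -c tessedit_char_whitelist=0123456789"


def clean_code(raw_text, expected_len=None):
    """Keep only the digits; raise ValueError if the result looks wrong."""
    digits = "".join(re.findall(r"\d+", raw_text))
    if not digits:
        raise ValueError("OCR returned no digits")
    if expected_len is not None and len(digits) != expected_len:
        raise ValueError(f"expected {expected_len} digits, got {digits!r}")
    return digits


print(clean_code(" 48 2l7\n"))   # -> 4827
```

Raising on a bad result gives a retry loop something concrete to catch.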
Submit account password, Captcha and other information
After processing Captcha, we can now submit account password, Captcha and other information required for login to the website
    # find_element("name", ...) is used because newer Selenium 4 releases
    # no longer provide find_element_by_name
    driver.find_element("name", 'code').send_keys(result)
    driver.find_element("name", 'userName').send_keys('xxx')
    driver.find_element("name", 'password').send_keys('xxx')
    # Finally, click to submit the form
    driver.find_element("xpath", "//div[@class='form-group login-input'][3]").click()
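The submission step can be collected into one function so that it is easy to call again after a failed attempt. This is a sketch: the function name and the environment variables `LIMS_USER` and `LIMS_PASSWORD` are hypothetical, while the field names and the XPath are the ones used above.

```python
import os


def submit_login(driver, code, username=None, password=None):
    """Fill the login form fields and click the submit element.

    Credentials default to the (hypothetical) environment variables
    LIMS_USER and LIMS_PASSWORD so they need not be hard-coded.
    """
    username = username or os.environ.get("LIMS_USER", "")
    password = password or os.environ.get("LIMS_PASSWORD", "")
    for name, value in (("code", code),
                        ("userName", username),
                        ("password", password)):
        field = driver.find_element("name", name)
        field.clear()              # drop text left over from a failed attempt
        field.send_keys(value)
    driver.find_element(
        "xpath", "//div[@class='form-group login-input'][3]").click()
```

Clearing each field first matters when the form is filled a second time, since `send_keys` appends to whatever is already there.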
Note that binarization does not recognize the Captcha with a 100% success rate, so recognition errors must be handled: when recognition fails, click the Captcha picture to get a new one and try again. The code above can be split into several functions and retried with the following loop skeleton:
    while True:
        try:
            ...
            break
        except Exception:
            # Click the Captcha picture to request a new one
            driver.find_element("id", 'valiCode').click()
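A bounded variant of the retry loop avoids spinning forever when something other than the Captcha is wrong. This is a sketch: `login_with_retry`, `attempt`, `refresh` and `max_tries` are names introduced here, with `attempt` standing for the screenshot, recognition and submission steps and `refresh` for the click on the Captcha picture.

```python
def login_with_retry(attempt, refresh, max_tries=5):
    """Call attempt() until it succeeds or max_tries is reached.

    attempt: runs screenshot -> OCR -> submit, raising on failure.
    refresh: clicks the Captcha picture to request a new one.
    Returns the number of tries used.
    """
    last_error = None
    for tries in range(1, max_tries + 1):
        try:
            attempt()
            return tries
        except Exception as exc:   # a narrower exception type is preferable
            last_error = exc
            refresh()
    raise RuntimeError(f"gave up after {max_tries} tries") from last_error
```

Here `refresh` could be as small as `lambda: driver.find_element("id", "valiCode").click()`.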