How can a Python crawler handle CAPTCHAs?

Updated 2025-01-16. From: SLTechnology News&Howtos

This article covers ways to handle CAPTCHAs that a Python crawler runs into. Many people are unsure how to deal with them in day-to-day work, so the editor went through various sources and collected some simple, easy-to-use methods. I hope they help answer the question. Let's go through them.

Learn to call Baidu's aip interface:

1. First, you need to sign up for an account:

https://login.bce.baidu.com/

Log in after registering.

2. Create a project

Find "Text Recognition" among the listed technologies, then click to create a project.

Once the project has been created, its page shows the AppID, API Key and Secret Key. All three are needed later.
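
One way to keep these three values out of the source file is to read them from environment variables. The following is only a minimal sketch: the variable names BAIDU_APP_ID, BAIDU_API_KEY and BAIDU_SECRET_KEY and the helper load_baidu_credentials are assumptions made for illustration, not something the Baidu SDK defines.

```python
import os


def load_baidu_credentials(env=None):
    """Return (app_id, api_key, secret_key) read from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    names = ("BAIDU_APP_ID", "BAIDU_API_KEY", "BAIDU_SECRET_KEY")
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return tuple(env[name] for name in names)
```

The returned tuple is in the same order as the three constructor arguments used later in this article, so it could be unpacked as AipOcr(*load_baidu_credentials()).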

Next, you can consult the official documentation, or use the code from this article directly.

3. Install the dependency library: pip install baidu-aip

The function below is only a single method. It relies on the earlier setup: the client must already have been created in __init__ with your own credentials.

```python
def return_ocr_by_baidu(self, test_image):
    """
    ps: first set your own baidu_aip parameters in the __init__ function.
    This uses the high-precision version. If it is too slow, switch back
    to the normal version: self.client.basicGeneral(image, options)
    Reference: https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
    :param test_image: name of the file to be tested
    :return: the recognition result for this CAPTCHA.
             If there is an error, you can call it several times.
    """
    image = self.return_image_content(test_image=self.return_path(test_image))

    # Call general text recognition (high-precision version)
    # self.client.basicAccurate(image)

    # Optional parameters are described at the URL above
    options = {}
    options["detect_direction"] = "true"
    options["probability"] = "true"

    # Call with the optional parameters
    result = self.client.basicAccurate(image, options)
    # Guard against a response with no recognized lines
    words = result.get('words_result') or []
    result_s = words[0]['words'] if words else ''
    # Comment out the next line to disable printing
    print(result_s)
    if result_s:
        return result_s.strip()
    else:
        raise Exception("The result is None, try it!")
```
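
The docstring above suggests calling the function again when recognition fails. A minimal sketch of that idea follows, assuming the response is a dict shaped like {'words_result': [{'words': ...}]} as used above. The helpers extract_first_words and call_with_retry are hypothetical names introduced here; they are not part of baidu-aip.

```python
import time


def extract_first_words(result):
    """Return the first recognized line from an OCR response, or None.

    Assumes a dict shaped like {'words_result': [{'words': '...'}, ...]}.
    """
    lines = result.get("words_result") or []
    if not lines:
        return None
    text = (lines[0].get("words") or "").strip()
    return text or None


def call_with_retry(recognize, attempts=3, delay=0.0):
    """Call recognize() until it returns a non-empty value or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            value = recognize()
            if value:
                return value
        except Exception as exc:  # e.g. a network error or an empty response
            last_error = exc
        time.sleep(delay)
    raise RuntimeError(f"no result after {attempts} attempts") from last_error
```

Inside the class, this could be used as call_with_retry(lambda: self.return_ocr_by_baidu(path), attempts=3, delay=1.0). A non-zero delay is probably sensible for a remote API, although the right value depends on your quota.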

Extension: Baidu's pornography-detection interface:

Writing code should be a little fun; it does not have to be dry.

The pornography-detection interface is part of Baidu's content review service, so look for it there.

Source code of the calling method:

```python
# -*- coding: utf-8 -*-
# @Time: 2020-10-22 17:30
# @Author: the hourglass is raining
# @Software: PyCharm
# @CSDN: https://me.csdn.net/qq_45906219

from aip import AipContentCensor
from ocr import MyOrc


class Auditing(MyOrc):
    """
    Wrapper around Baidu's content review aip API, used to review
    pornographic, terrorism-related and disgusting content.
    Website: https://ai.baidu.com/ai-doc/ANTIPORN/tk3h7xgkn
    """

    def __init__(self):
        # super().__init__()
        APP_ID = 'your ID'
        API_KEY = 'your KEY'
        SECRET_KEY = 'your SECRET_KEY'

        self.client = AipContentCensor(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        return super().return_path(test_image)

    def return_image_content(self, test_image):
        return super().return_image_content(test_image)

    def return_Content_by_baidu_of_image(self, test_image, mode=0):
        """
        Reuses some methods from ocr, so there is less code here.
        Content review: checks whether the image contains illegal or
        harmful content. The service can also review text, but I find
        that of little use, so it is not wrapped here.
        url: https://ai.baidu.com/ai-doc/ANTIPORN/Wk3h7xg56
        :param test_image: the image to be tested, a local file or a URL
        :param mode: mode=0 (default) means a local file,
                     mode=1 means an image URL
        :return: the recognition result
        """
        if mode == 0:
            filepath = self.return_image_content(self.return_path(test_image=test_image))
        elif mode == 1:
            filepath = test_image
        else:
            raise Exception(f"The mode is 0 or 1 but your mode is {mode}")
        # Call the pornography-detection API
        result = self.client.imageCensorUserDefined(filepath)

        # If the image is a URL, call it like this:
        # result = self.client.imageCensorUserDefined('http://www.example.com/image.jpg')
        print(result)

        return result


a = Auditing()
a.return_Content_by_baidu_of_image("test_image/2.jpg", mode=0)
```
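
The method above relies on the caller passing mode=0 for a local file and mode=1 for a URL. As an alternative, the choice could be inferred from the string itself. This is only a sketch of that variation, not the author's method, and resolve_image_source is a hypothetical helper name.

```python
import os
from urllib.parse import urlparse


def resolve_image_source(test_image):
    """Return an http(s) URL unchanged, or the bytes of a local file."""
    scheme = urlparse(test_image).scheme.lower()
    if scheme in ("http", "https"):
        # Treat it as a remote image and pass the URL through as-is.
        return test_image
    # Anything else is treated as a local path (relative or absolute).
    with open(os.path.abspath(test_image), "rb") as fr:
        return fr.read()
```

The return value plays the same role as filepath in the method above: either a URL string or the raw image bytes.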

Learn the muggle_ocr recognition interface:

This package has become popular recently. It is easy to use and does not have many other features.

Install it with pip install muggle-ocr. The download is somewhat slow, and a mobile hotspot may work better. At the time of writing, the mirror sites (Tsinghua / Ali) had not yet synced this package, because it is a recently released OCR model.
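
Because installation can fail or take a long time, it may help to check which of the two packages is actually present before choosing a recognition method. A small sketch using only the standard library follows; is_installed and available_ocr_backends are hypothetical helper names.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_ocr_backends():
    """List which of the two OCR packages used in this article are installed."""
    # Maps the import name to the name used with pip install.
    candidates = {"muggle_ocr": "muggle-ocr", "aip": "baidu-aip"}
    return [pip_name for module, pip_name in candidates.items() if is_installed(module)]
```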

Calling the interface:

```python
def return_ocr_by_muggle(self, test_image, mode=1):
    """
    Call this function to recognize an image with muggle_ocr
    :param test_image: name of the file to be tested, preferably an absolute path
    :param mode: mode=0 is ModelType.OCR, for ordinary printed text;
                 mode=1 (default) is ModelType.Captcha, for simple
                 4-6 character English CAPTCHAs
    Project page: https://pypi.org/project/muggle-ocr/
    :return: the recognition result for this CAPTCHA.
             If there is an error, you can call it several times.
    """
    # Choose the model
    if mode == 1:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
    elif mode == 0:
        sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
    else:
        raise Exception(f"The mode is 0 or 1, but your mode = {mode}")

    filepath = self.return_path(test_image=test_image)

    with open(filepath, 'rb') as fr:
        captcha_bytes = fr.read()
        result = sdk.predict(image_bytes=captcha_bytes)
        # Comment out the next line to disable printing
        print(result)
        return result.strip()
```
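
Since the CAPTCHA model targets 4-6 character codes, a result can be sanity-checked before it is submitted, and a second recognizer tried if the first one fails. The sketch below assumes the codes consist of ASCII letters and digits, which is an assumption about the target site rather than something the package guarantees. The names looks_like_captcha and recognize_with_fallback are hypothetical.

```python
import re

# 4 to 6 ASCII letters or digits (assumed CAPTCHA alphabet).
CAPTCHA_PATTERN = re.compile(r"[A-Za-z0-9]{4,6}")


def looks_like_captcha(text):
    """Return True if text is a plausible 4-6 character CAPTCHA answer."""
    return bool(text) and CAPTCHA_PATTERN.fullmatch(text.strip()) is not None


def recognize_with_fallback(image_path, recognizers):
    """Try each recognizer in order; return the first plausible answer or None."""
    for recognize in recognizers:
        try:
            text = recognize(image_path)
        except Exception:
            # This recognizer failed; move on to the next one.
            continue
        if looks_like_captcha(text):
            return text.strip()
    return None
```

With an instance of the class that holds both methods, the list of recognizers could be [obj.return_ocr_by_muggle, obj.return_ocr_by_baidu], so the remote API is only called when the local model gives an implausible answer.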

Complete wrapped source code:

```python
# -*- coding: utf-8 -*-
# @Time: 2020-10-22 14:12
# @Author: the hourglass is raining
# @Software: PyCharm
# @CSDN: https://me.csdn.net/qq_45906219

import muggle_ocr
import os
from aip import AipOcr

"""
    PS: this module wraps two commonly used image / CAPTCHA recognition
    methods. How you combine them is up to you.
    Interface 1: muggle_ocr
        pip install muggle-ocr
        The download is a bit slow; a mobile hotspot may work better.
        The mirror sites (Tsinghua / Ali) have not yet synced this
        package, because it is a recently released OCR model.
    Interface 2: baidu-aip
        pip install baidu-aip
        Many people probably know this one (muggle is the newer package).
        To call it, follow the official documentation:
        https://cloud.baidu.com/doc/OCR/index.html
        or use the methods below; either works.
    :param image_path: path of the image to be recognized; if the
        directory is deeply nested, an absolute path is recommended
"""


class MyOrc:
    def __init__(self):
        # Required settings: use your own Baidu aip credentials
        APP_ID = 'your ID'
        API_KEY = 'your KEY'
        SECRET_KEY = 'your SECRET_KEY'

        self.client = AipOcr(APP_ID, API_KEY, SECRET_KEY)

    def return_path(self, test_image):
        """:return: absolute image path"""
        # Resolve the path
        if os.path.isabs(test_image):
            filepath = test_image
        else:
            filepath = os.path.abspath(test_image)
        return filepath

    def return_image_content(self, test_image):
        """:return: the image content as bytes"""
        with open(test_image, 'rb') as fr:
            return fr.read()

    def return_ocr_by_baidu(self, test_image):
        """
        ps: first set your own baidu_aip parameters in the __init__ function.
        This uses the high-precision version. If it is too slow, switch back
        to the normal version: self.client.basicGeneral(image, options)
        Reference: https://cloud.baidu.com/doc/OCR/s/3k3h7yeqa
        :param test_image: name of the file to be tested
        :return: the recognition result for this CAPTCHA.
                 If there is an error, you can call it several times.
        """
        image = self.return_image_content(test_image=self.return_path(test_image))

        # Call general text recognition (high-precision version)
        # self.client.basicAccurate(image)

        # Optional parameters are described at the URL above
        options = {}
        options["detect_direction"] = "true"
        options["probability"] = "true"

        # Call with the optional parameters
        result = self.client.basicAccurate(image, options)
        # Guard against a response with no recognized lines
        words = result.get('words_result') or []
        result_s = words[0]['words'] if words else ''
        # Comment out the next line to disable printing
        print(result_s)
        if result_s:
            return result_s.strip()
        else:
            raise Exception("The result is None, try it!")

    def return_ocr_by_muggle(self, test_image, mode=1):
        """
        Call this function to recognize an image with muggle_ocr
        :param test_image: name of the file to be tested, preferably an absolute path
        :param mode: mode=0 is ModelType.OCR, for ordinary printed text;
                     mode=1 (default) is ModelType.Captcha, for simple
                     4-6 character English CAPTCHAs
        Project page: https://pypi.org/project/muggle-ocr/
        :return: the recognition result for this CAPTCHA.
                 If there is an error, you can call it several times.
        """
        # Choose the model
        if mode == 1:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.Captcha)
        elif mode == 0:
            sdk = muggle_ocr.SDK(model_type=muggle_ocr.ModelType.OCR)
        else:
            raise Exception(f"The mode is 0 or 1, but your mode = {mode}")

        filepath = self.return_path(test_image=test_image)

        with open(filepath, 'rb') as fr:
            captcha_bytes = fr.read()
            result = sdk.predict(image_bytes=captcha_bytes)
            # Comment out the next line to disable printing
            print(result)
            return result.strip()


# a = MyOrc()
# a.return_ocr_by_baidu(test_image='test_image/digit_img_1.png')
```

That concludes this look at how a Python crawler can handle CAPTCHAs. I hope it has answered your questions. Theory sticks better when combined with practice, so go and try it!
