How to implement simulated login, automatic cookie acquisition and CAPTCHA recognition in a Python crawler

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

This article shows how a Python crawler can implement simulated login, automatic acquisition of cookie values, and CAPTCHA recognition. The content is easy to understand and clearly organized; I hope it helps resolve your doubts. Let the editor lead you through it step by step.

1. Analysis of the web page to crawl

The target URL for crawling is: https://www.gushiwen.cn/

On the login page, the work to be done is to obtain the CAPTCHA image and recognize the CAPTCHA in order to log in.

Using the browser's packet-capture tool (DevTools), you can see that the login request headers include Cookie and User-Agent, so both values are needed when sending the request. The User-Agent can be added to the request headers manually, while the cookie value needs to be obtained automatically.
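The automatic cookie handling described above can be sketched with `requests.Session`. Note that the helper name `make_login_session` and the dummy cookie below are illustrative assumptions, not part of the article's code:

```python
import requests

def make_login_session(user_agent):
    # A Session stores every Set-Cookie value the server returns in
    # session.cookies and re-sends it on later requests, so the Cookie
    # header never has to be copied into the code by hand.
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    return session

if __name__ == "__main__":
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36")
    s = make_login_session(ua)
    # Simulate a cookie the server would set on the first request:
    s.cookies.set("acw_tc", "dummy-value")
    print(s.cookies.get("acw_tc"))
```

Every later `s.get(...)` or `s.post(...)` call then carries both the User-Agent and the captured cookies automatically.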

After the analysis, the practice begins!

2. CAPTCHA recognition

(1) CAPTCHA recognition here is based on an online coding (human-assisted recognition) platform.

(2) Common coding platforms include Super Eagle (Chaojiying), Yundama, and Damatu.

This project uses the Super Eagle (Chaojiying) coding platform. Website: https://www.chaojiying.com/

Download the platform's sample client source code and encapsulate it for reuse.
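One way to encapsulate the downloaded client is a small wrapper that checks the platform's JSON reply before using it. The `PostPic` method and the `err_no`/`pic_str`/`err_str` fields come from the Chaojiying SDK; the `recognize` helper and its error policy are our own sketch, not part of the SDK:

```python
def recognize(client, img_bytes, code_type):
    """Submit a CAPTCHA image to the coding platform and return the text.

    client    -- a Chaojiying_Client-like object exposing PostPic()
    img_bytes -- raw bytes of the CAPTCHA picture
    code_type -- numeric CAPTCHA type such as 1902 (see the platform's price page)
    """
    result = client.PostPic(img_bytes, code_type)
    if result.get("err_no") == 0:  # 0 means success in the platform's JSON
        return result["pic_str"]
    raise RuntimeError("recognition failed: %s" % result.get("err_str"))
```

On failure, the reply's `pic_id` can be passed to the SDK's `ReportError` method so the platform can refund the misrecognized image.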

3. Automatic cookie acquisition

```python
import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
    # Manual cookie handling would mean pasting the captured value here, e.g.
    # "Cookie": "acw_tc=2760820816186635434807019e3f39e1bf4a8a9b9ad20b50586fb6c8184f56; ...",
    # but such a value expires, so the Session below captures it automatically.
}

session = requests.Session()  # create a session object

# The first request made through the session captures the cookie set by the server
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = session.get(url=url, headers=headers).text

# Parse the CAPTCHA image address
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.cn/' + tree.xpath('//*[@id="imgCode"]/@src')[0]

# Save the CAPTCHA image locally
img_data = session.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
```

4. Program source code

chaojiying.py (the client downloaded from the Super Eagle platform):

```python
#!/usr/bin/env python
# coding:utf-8
import requests
from hashlib import md5


class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: picture bytes
        codetype: CAPTCHA type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php',
                          data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the ID of the incorrectly recognized picture
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php',
                          data=params, headers=self.headers)
        return r.json()


if __name__ == '__main__':
    # User Center >> generate a Software ID and replace 96001 with it
    chaojiying = Chaojiying_Client('Super Eagle username', 'Super Eagle password', '96001')
    im = open('a.jpg', 'rb').read()  # local image file path; replace a.jpg
    print(chaojiying.PostPic(im, 1902))  # 1902 is the CAPTCHA type, see website >> price system
```

sign in.py:

```python
# Simulated login
# Process:
#   1. Send the POST request that corresponds to clicking the login button
#   2. Process the request parameters:
#      username, password, CAPTCHA, other hidden form parameters

import requests
from lxml import etree
from chaojiying_Python.chaojiying import Chaojiying_Client

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36",
}


# Encapsulated CAPTCHA recognition function
def tranformImgCode(imgPath, imgType):
    # User Center >> generate a Software ID and fill it in here
    chaojiying = Chaojiying_Client('username', 'password', 'software ID')
    im = open(imgPath, 'rb').read()  # local image file path
    return chaojiying.PostPic(im, imgType)['pic_str']


# Automatically obtain the cookie
session = requests.Session()

url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = session.get(url=url, headers=headers).text

# Parse the CAPTCHA image address
tree = etree.HTML(page_text)
img_src = 'https://so.gushiwen.cn/' + tree.xpath('//*[@id="imgCode"]/@src')[0]

# Save the CAPTCHA image locally
img_data = session.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)

# Recognize the CAPTCHA
code_text = tranformImgCode('./code.jpg', 1902)
print(code_text)

login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    '__VIEWSTATE': 'frn5Bnnr5HRYCoJJ9fIlFFjsta310405ClDr+hy0/V9dyMGgBf34A2YjI8iCAaXHZarltdz1LPU8hGWIAUP9y5eLjxKeYaJxouGAa4YcCPC+qLQstMsdpWvKGjg=',
    '__VIEWSTATEGENERATOR': 'C93BE1AE',
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': 'username',  # replace with your username
    'pwd': 'password',    # replace with your password
    'code': code_text,
    'denglu': 'login',
}

# Send the request that corresponds to clicking the login button,
# then save the page source obtained after a successful login
page_text_login = session.post(url=login_url, data=data, headers=headers).text
with open('./gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text_login)
```

The above is the entire content of "how to implement simulated login, automatic cookie acquisition and CAPTCHA recognition in a Python crawler". Thank you for reading! I hope the shared content helps you; if you want to learn more, welcome to follow the industry information channel!
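As a quick sanity check on the saved page, one heuristic (our own assumption, not from the article) is that a failed login re-renders the form, which still contains the CAPTCHA element `id="imgCode"`, while a successful login redirects to the collect page, which does not:

```python
def login_succeeded(page_html):
    # Heuristic assumption: the login form's CAPTCHA element id="imgCode"
    # only appears when the login page is shown again after a failure.
    return 'id="imgCode"' not in page_html

# Example usage after running sign in.py:
# with open('./gushiwen.html', encoding='utf-8') as fp:
#     print(login_succeeded(fp.read()))
```

This is only a rough check; inspecting the saved gushiwen.html by eye remains the reliable way to confirm the login worked.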
