This article describes how to simulate login to Zhihu with Scrapy. The walkthrough is fairly detailed; interested readers are welcome to follow along, and I hope it proves useful.

Starting today I will be publishing a series of articles about crawling Zhihu. I have been optimizing the code recently, but the only proxy IPs that actually work are paid ones, so I am not sure how to optimize further. I am posting the code as a reference and welcome any suggestions.
Zhihu is fairly friendly to crawlers, so getting the data is actually not hard.

When you first open Zhihu you are forced to log in, and sometimes a CAPTCHA is required and sometimes not, so the spider needs to handle both cases.

Our ultimate goal is to build the headers and form data required for the login POST request.
Looking at the Request Headers and comparing the login POST with the GET request on the login page, you will find three extra authentication fields in the POST headers; testing shows that x-xsrftoken is the one that is actually required.

x-xsrftoken is the token used to protect against XSRF (cross-site request forgery); its value can be found in the Set-Cookie field of the Response Headers when you visit the home page.
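For illustration, here is a minimal sketch of pulling that token out of the home-page response and attaching it to the POST headers. The helper name get_xsrf_headers and the base_headers argument are assumptions made for this sketch, not part of the original spider; it only assumes the token arrives in a cookie named _xsrf, as described above.

import re

def get_xsrf_headers(response, base_headers):
    # copy the GET headers and add the x-xsrftoken field extracted from Set-Cookie
    headers = dict(base_headers)
    for cookie in response.headers.getlist('Set-Cookie'):
        match = re.search(rb'_xsrf=([^;]+)', cookie)
        if match:
            headers['x-xsrftoken'] = match.group(1).decode('utf-8')
            break
    return headers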
For the specific form-data fields, refer to this Zhihu column: https://zhuanlan.zhihu.com/p/34073256

My code does not handle the CAPTCHA that asks you to click the upside-down Chinese characters. If you are interested, you can try implementing it yourself.
    # the snippets below assume that base64, json, and scrapy are imported at the top of the spider file
    def start_requests(self):
        # request the CAPTCHA API and hand the response to start_login();
        # meta={'cookiejar': 1} tells Scrapy to keep the login cookies in jar 1
        yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=self.headers, callback=self.start_login, meta={'cookiejar': 1})
This requests the CAPTCHA directly. The lang=en parameter in the URL asks for the English CAPTCHA only, so the upside-down Chinese characters never appear.
    def start_login(self, response):
        # determine whether a CAPTCHA is required
        need_cap = json.loads(response.body)['show_captcha']
        # alternatively: re.search for the true value in response.text
        print(need_cap)
        if need_cap:
            print('CAPTCHA required')
            yield scrapy.Request('https://www.zhihu.com/api/v3/oauth/captcha?lang=en', headers=self.headers, callback=self.capture, method='PUT', meta={'cookiejar': response.meta['cookiejar']})
        else:
            print('No CAPTCHA required')
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                'client_id': self.client_id,
                'grant_type': self.grant_type,
                'timestamp': self.timestamp,
                'source': self.source,
                'signature': self.get_signnature(self.grant_type, self.client_id, self.source, self.timestamp),
                'username': '+86177777777',
                'password': '123456',
                'captcha': '',
                # change to 'cn' for the upside-down Chinese character CAPTCHA
                'lang': 'en',
                'ref_source': 'other_',
                'utm_source': ''
            }
            yield scrapy.FormRequest(url=post_url, formdata=post_data, headers=self.headers, meta={'cookiejar': response.meta['cookiejar']})
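The code above calls self.get_signnature, which is not shown in this post. According to the Zhihu column linked earlier, the signature is an HMAC-SHA1 over grant_type + client_id + source + timestamp. Below is a minimal sketch under that assumption; the key used here is a placeholder, not a value taken from this article.

    # sketch of a spider method; requires "import hmac" and "from hashlib import sha1"
    def get_signnature(self, grant_type, client_id, source, timestamp):
        # HMAC-SHA1 over the concatenated login parameters; the key below is a placeholder
        h = hmac.new(b'replace_with_the_client_key', digestmod=sha1)
        h.update((grant_type + client_id + source + str(timestamp)).encode('utf-8'))
        return h.hexdigest()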
A small problem with start_login is that the CAPTCHA endpoint is requested every time, so occasionally you are asked for a CAPTCHA even when one should not be needed. To improve this you would need a regular expression that checks whether the response really contains true (see the sketch below). In practice it does not matter much, because you only need to log in once and can then save the cookies.
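As a sketch of that check, a regular expression that only treats an explicit true as "CAPTCHA required" could look like this (need_captcha is a hypothetical helper, not part of the original code):

import re

def need_captcha(response):
    # only report a CAPTCHA when the API response literally contains "show_captcha": true
    return re.search(r'"show_captcha"\s*:\s*true', response.text) is not None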
    def capture(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except ValueError:
            print('failed to get the value of img_base64!')
        else:
            img = img.encode('utf-8')
            img_data = base64.b64decode(img)
            with open('zhihu.gif', 'wb') as f:
                f.write(img_data)
            captcha = input('Please enter CAPTCHA: ')
            post_data = {
                'client_id': self.client_id,
                'grant_type': self.grant_type,
                'timestamp': self.timestamp,
                'source': self.source,
                'signature': self.get_signnature(self.grant_type, self.client_id, self.source, self.timestamp),
                'username': '+8617777777777777',
                'password': '123456',
                'captcha': captcha,
                'lang': 'en',
                'ref_source': 'other_',
                'utm_source': '',
                '_xsrf': '0sQhRIVITLlEX8kQWA09VOqsPlSqRJQT'
            }
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/signin',
                formdata=post_data,
                callback=self.after_login,
                headers=self.headers,
                meta={'cookiejar': response.meta['cookiejar']}
            )

COOKIES_ENABLED = True

Set this in settings.py; it enables Scrapy's cookie handling so the cookiejar passed in meta is actually carried between requests.
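For reference, the relevant part of settings.py could look like the sketch below. COOKIES_DEBUG and ROBOTSTXT_OBEY are standard Scrapy settings added here as assumptions; the first just logs the cookies being sent and received, which helps when debugging the login flow.

# settings.py (excerpt)
COOKIES_ENABLED = True    # let Scrapy manage cookies so the cookiejar meta key works
COOKIES_DEBUG = True      # optional: log Cookie / Set-Cookie headers while debugging
ROBOTSTXT_OBEY = False    # assumption: do not let robots.txt block the login API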
    def after_login(self, response):
        if response.status == 200:
            print("login successful")
            # after login completes, start crawling the first user's data
            return [scrapy.Request(
                self.start_url,
                meta={'cookiejar': response.meta['cookiejar']},
                callback=self.parse_people,
                errback=self.parse_err
            )]
        else:
            print("login failed")
On a successful login the next method is requested; on failure you can print the response body or prompt for the credentials again. I did not write that part out here.
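The request above also names an errback, parse_err, which the post never shows. A minimal sketch under that assumption could simply log the failed request for inspection:

    def parse_err(self, failure):
        # hypothetical errback: log the failed request URL and the error
        self.logger.error('request to %s failed: %r', failure.request.url, failure.value)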
That is all I want to share about simulating login to Zhihu with Scrapy. I hope the content above was helpful. If you found the article useful, feel free to share it so more people can see it.