This article introduces how to use a Python crawler to simulate logging in to Zhihu. Many people run into this problem in real projects, so let's walk through how to handle it step by step. I hope you read it carefully and get something out of it!
Login principle
The principle behind cookies is simple: HTTP is a stateless protocol, so to maintain a session on top of it and let the server know which client it is currently dealing with, cookie technology was introduced. A cookie is essentially an identity token that the server assigns to the client.
When the browser sends its first HTTP request, it carries no cookie information.
The server returns the HTTP response together with a cookie to the browser.
On its second request, the browser sends the cookie it received back to the server.
When the server receives the request and finds the Cookie field in the request header, it knows it has dealt with this client before.
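To make that flow concrete, here is a minimal standalone sketch (not part of the Zhihu login itself, and using httpbin.org, a public echo service, as an assumed endpoint) showing a requests session storing the cookie a server sets and sending it back automatically on the next request:

import requests

demo_session = requests.Session()
# first request: the server's response carries a Set-Cookie header
demo_session.get("https://httpbin.org/cookies/set?sessionid=abc123")
# second request: the session automatically attaches the stored cookie
print(demo_session.get("https://httpbin.org/cookies").json())
# prints: {'cookies': {'sessionid': 'abc123'}}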
Practical application
Anyone who has used Zhihu knows that you can log in with just a user name, password, and verification code. Of course, that is only what we see on the surface; the hidden technical details have to be dug out with the browser's help. Let's use Chrome to see what happens when we submit the form.
(If you are already logged in, log out first.) Go to Zhihu's login page at https://www.zhihu.com/#signin, open Chrome's developer tools (press F12), and deliberately enter an incorrect CAPTCHA to see how the browser sends the request.
Several key pieces of information can be found in the browser's request:
The login URL is https://www.zhihu.com/login/email
Four form fields are required for login: user name (email), password (password), verification code (captcha), and _xsrf.
The URL for fetching the CAPTCHA is https://www.zhihu.com/captcha.gif?r=1490690391695&type=login
What is _xsrf? If you are familiar with CSRF (cross-site request forgery) attacks, you know what it is for: _xsrf is a pseudorandom string used to prevent cross-site request forgery, and it usually lives in a form tag of the page. To confirm this, search the page source for "xsrf"; sure enough, _xsrf sits in a hidden input tag.
Once you have figured out what data the browser needs to log in, you can start writing Python code that simulates the browser and logs in. The two third-party libraries we rely on are requests and BeautifulSoup; install them first:
pip install beautifulsoup4==4.5.3
pip install requests==2.13.0
The http.cookiejar module handles HTTP cookies automatically. Its LWPCookieJar class is a wrapper around a collection of cookies that supports saving them to and loading them from a file.
The requests session object provides cookie persistence and connection pooling; all requests are sent through it.
First, load the cookie information from the cookies.txt file. On the first run the file does not exist yet, so loading it raises an exception (FileNotFoundError, or LoadError for a malformed file), which we simply report. A browser-like User-Agent header is also defined here, since every request below passes headers.
import time
import requests
from bs4 import BeautifulSoup
from http import cookiejar

# the User-Agent value is a placeholder for a browser-like UA string
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
session = requests.session()
session.cookies = cookiejar.LWPCookieJar(filename='cookies.txt')
try:
    session.cookies.load(ignore_discard=True)
except (cookiejar.LoadError, FileNotFoundError):
    print("load cookies failed")
Get xsrf
We have already located the tag that holds _xsrf, and with BeautifulSoup's find method it is easy to extract the value.
def get_xsrf():
    response = session.get("https://www.zhihu.com", headers=headers)
    soup = BeautifulSoup(response.content, "html.parser")
    xsrf = soup.find('input', attrs={"name": "_xsrf"}).get("value")
    return xsrf
Get the CAPTCHA
The CAPTCHA is returned by the /captcha.gif endpoint. Here we download the CAPTCHA image, save it to the current directory, and recognize it manually. Of course, you could also recognize it automatically with a third-party library such as pytesser (a sketch using a different OCR library follows after the function below).
def get_captcha():
    """
    Save the CAPTCHA image to the current directory and recognize it manually
    :return:
    """
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    captcha = input("CAPTCHA: ")
    return captcha
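As mentioned above, the manual step can be replaced with OCR. A minimal sketch, assuming the pytesseract and Pillow packages are installed (pytesseract is a different OCR wrapper than the pytesser mentioned earlier, and get_captcha_auto is a hypothetical helper that reuses the session, headers, and time import from the setup code):

import pytesseract
from PIL import Image

def get_captcha_auto():
    # hypothetical variant of get_captcha: download the image, then let Tesseract read it
    t = str(int(time.time() * 1000))
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    # OCR on distorted CAPTCHAs is unreliable; fall back to manual input if nothing is read
    text = pytesseract.image_to_string(Image.open('captcha.jpg')).strip()
    return text or input("CAPTCHA: ")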
Login
Once all the parameters are ready, we can call the login endpoint.
def login(email, password):
    login_url = 'https://www.zhihu.com/login/email'
    data = {
        'email': email,
        'password': password,
        '_xsrf': get_xsrf(),
        'captcha': get_captcha(),
        'remember_me': 'true',
    }
    response = session.post(login_url, data=data, headers=headers)
    login_code = response.json()
    print(login_code['msg'])
    for i in session.cookies:
        print(i)
    session.cookies.save()
After a successful request, the session automatically stores the cookie information returned by the server in the session.cookies object, and the client automatically carries these cookies on subsequent requests to pages that require login.
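To tie it together, a minimal usage sketch (the email, password, and the profile URL below are placeholders, not values from the article):

if __name__ == '__main__':
    # placeholder credentials; replace with a real Zhihu account
    login('your_email@example.com', 'your_password')
    # with cookies now saved, the same session can fetch pages that require login
    response = session.get('https://www.zhihu.com/settings/profile', headers=headers)
    print(response.status_code)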
"how to use Python crawler to achieve simulated Zhihu login" content is introduced here, thank you for reading. If you want to know more about the industry, you can follow the website, the editor will output more high-quality practical articles for you!