This article walks through a Python web crawler example. The content is simple and easy to follow; read along with the editor and study "Python web crawler example analysis" step by step.
Let's start with a simple piece of code.
import requests  # import the requests package

url = 'https://www.cnblogs.com/LexMoon/'
strhtml = requests.get(url)  # use GET to fetch the page data
print(strhtml.text)
First, import requests imports the package used for network requests. We then define a string url pointing to the target web page, and use the imported requests package to request the content of that page.
requests.get(url) is used here; this get is not "get" in the everyday sense of taking something, but the name of an HTTP request method.
There are many kinds of requests on the web; the most common are GET and POST, while others such as PUT and DELETE are rarely seen.
requests.get(url) sends a GET request to the url page and returns a result: the response to this request.
The response is divided into response headers and response content.
The response headers tell you whether the request succeeded, what type of data was returned, and much more.
The response content is the source code of the web page you fetched.
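As a small illustration building on the snippet above, you can look at the response headers and the response content separately through the attributes of the requests response object:

import requests

url = 'https://www.cnblogs.com/LexMoon/'
strhtml = requests.get(url)

# Response headers: how the request went and what kind of data came back
print(strhtml.status_code)                  # e.g. 200 means the request succeeded
print(strhtml.headers.get('Content-Type'))  # the type of data returned

# Response content: the page source itself
print(strhtml.text[:200])                   # first 200 characters of the HTML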
All right, with that you have taken the first step into Python crawlers, but a few questions remain.
1. What is the difference between GET and POST requests?
2. Why do some pages I crawl contain none of the data I want?
3. Why is the content I crawl from some sites different from what I actually see in the browser?
What is the difference between GET and POST requests?
The main difference between GET and POST lies in where the parameters are placed. For example, on a site that requires logging in, where should the account and password go when we click "log in"?
The most visible trait of a GET request is that its parameters are placed directly in the URL.
For example, if you search Baidu for the keyword Python, you will find that the URL looks like this:
https://www.baidu.com/s?wd=Python&rsv_spt=1
The wd=Python here is one of the parameters. GET parameters start after the ? and are separated by &.
If a site where we have to enter a password used a GET request, our personal information would be easily exposed, so a POST request is needed instead.
In a post request, the parameters are placed in the request body.
For example, the following is my request when logging in to the W3C website; you can see that the Request Method is POST.
At the bottom of the request is the login information we sent: the encrypted account and password, which are sent to the server for verification.
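A minimal sketch of the difference using requests (the login URL and form field names below are made-up placeholders, not the real W3C endpoint):

import requests

# GET: parameters end up in the URL, e.g. https://www.baidu.com/s?wd=Python
resp_get = requests.get('https://www.baidu.com/s', params={'wd': 'Python'})
print(resp_get.url)  # wd=Python is visible in the final URL

# POST: parameters travel in the request body and do not appear in the URL
# ('https://example.com/login', 'username' and 'password' are hypothetical)
resp_post = requests.post('https://example.com/login',
                          data={'username': 'alice', 'password': 'secret'})
print(resp_post.status_code)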
Why do some pages I crawl contain none of the data I want?
Sometimes our crawler fetches a page, and when we inspect the result it is indeed the target page, yet the data we want is not in it.
This mostly happens on pages where the target data takes the form of a list. For example, a classmate of mine asked me about this the other day: when he was crawling Ctrip's flight information, the page he fetched contained no flight data, although everything else on it was fine.
As shown below:
This is a very common problem. When he calls requests.get, it fetches only the URL passed to get; the page does live at that address, but the data inside it does not come from that address.
That may sound confusing, but think about it from the point of view of Ctrip's developers: the flight list being loaded can be huge. If it were put directly into this page, users might wait so long for the page to open that they assume it has frozen and close it. So the developers put only the main frame of the page at this URL, letting users enter the page quickly, and load the flight data afterwards, so that users don't leave because of a long wait.
In the end this is all done for the sake of user experience. So how do we solve the problem?
If you have studied front-end development, you will know about Ajax asynchronous requests; if you don't, that's fine, since front-end technology is not the topic here.
We just need to know that the page we originally requested, https://flights.ctrip.com/itinerary/oneway/cgq-bjs?date=2019-09-14, contains a JS script that is executed after the page is requested, and the purpose of that script is to request the flight information we want to crawl.
At this point we can open the browser console (Google Chrome or Firefox recommended): press F to enter the tank... no, press F12 to open the developer tools, then click Network.
Here we can see all the network requests and responses made by this page.
Among them we can find that the request that actually fetches the flight information goes to this URL: https://flights.ctrip.com/itinerary/api/12808/products.
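A rough sketch of calling that endpoint directly with requests. The exact method, headers, and payload have to be copied from what the Network tab shows for this request; the JSON fields below are assumed placeholders, not Ctrip's documented API:

import requests

# The flight data comes from this API endpoint, not from the page URL itself
api_url = 'https://flights.ctrip.com/itinerary/api/12808/products'

# Placeholder payload: copy the real request body and headers from the Network tab
payload = {'flightWay': 'Oneway', 'dcity': 'CGQ', 'acity': 'BJS', 'date': '2019-09-14'}
headers = {'Content-Type': 'application/json'}

resp = requests.post(api_url, json=payload, headers=headers)
print(resp.status_code)
print(resp.text[:200])  # the flight data, typically returned as JSON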
Why is the content I crawl from some sites different from what I actually see?
The last question is why the content crawled from some sites is different from what we actually see in the browser.
The main reason for this is that your crawler is not logged in.
Just as in normal browsing, some information can only be seen after logging in, and the same applies to a crawler.
This involves a very important concept: our usual viewing of web pages is based on HTTP requests, and HTTP is stateless.
What does stateless mean? You can think of it as the server not recognizing people: when your request reaches the server, the server has no idea who you are.
If that is the case, why can we keep browsing a site for a long time after logging in just once?
This is because, although HTTP is stateless, the server issues us a kind of ID card: the cookie.
When we enter a page for the first time, if we have never visited it before, the server gives us a cookie, and every later request we make to the site carries that cookie, so the server can tell who we are from it.
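A minimal sketch of this mechanism using requests.Session, which stores whatever cookie the server sets on the first request and sends it back automatically on later ones (the URLs are placeholders):

import requests

session = requests.Session()

# First visit: the server sets a cookie (our "ID card") in its response
first = session.get('https://example.com/')
print(first.cookies)  # cookies the server just issued

# Later requests made through the same session carry that cookie back,
# so the server can recognize us as the same visitor
second = session.get('https://example.com/profile')
print(second.status_code)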
For example, the relevant cookies for Zhihu can be found in the browser's developer tools.
For this kind of website, we can copy the existing cookie straight from the browser and put it into the code (requests expects cookies as a dict, as in the sketch below), or we can have the crawler simulate logging in to the site to obtain the cookie.
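A minimal example of the first approach; the cookie name and value here are placeholders standing in for whatever you copy out of the browser:

import requests

# Cookie name/value copied from the browser's developer tools (placeholders here)
cookies = {'z_c0': 'aidnwinfawinf'}

resp = requests.get('https://www.zhihu.com/', cookies=cookies)
print(resp.status_code)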
Thank you for reading. That is the content of "Python web crawler example analysis". After studying this article, I believe you have a deeper understanding of the topic, though the specifics still need to be verified in practice.