Many newcomers are not clear on how to use requests together with re in Python. To help, this article walks through the process in detail with a small Douban crawler; anyone who needs it can follow along, and I hope you gain something from it.
After learning the basics of Python, I focused on getting started with web crawling, since crawling is the reason I learned Python in the first place. I picked the Douban Movie Top250 list as my target. All right, let's cut the chatter and get to the point.
1. Find the web page and analyze its structure
First, open the Douban Movie Top250 page and press F12 to open the browser's developer tools.
Then start analyzing the page. Click the element-picker arrow in the upper-left corner of the developer tools and point it at the data you want. Here I found that all of the information for one movie sits inside a single container tag, so we can first use a regular expression to pull out each movie's block, and then extract the individual fields from each block. That covers one page, but each URL only lists 25 movies, so how do we get the rest? Every page carries a link to the next one, so we can follow that link in a loop to collect the remaining pages.
Using the same element-picker arrow, click the next-page link and the developer tools will highlight its markup on the right. Another regular expression can capture that link, and the rest is just a loop. That's the analysis done; time to write some code!
2. Crawling the data with an object-oriented method
First, request the page with requests to get its HTML. To get past the site's basic anti-crawler checks, I added a request header. (Remember to import requests before using it; if it is not installed, run pip install requests on the command line.)
The request header values can be copied from the developer tools.
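Below is a minimal sketch of this request step. The URL is the Top250 page discussed above, but the exact User-Agent the author copied from the developer tools is not shown, so the one here is an assumption:

```python
import requests

# First page of the Douban Top250 list.
START_URL = 'https://movie.douban.com/top250'

# A browser-like User-Agent so the site does not reject the request;
# the actual header the author used is not shown, so this value is
# an assumption copied from a typical desktop browser.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0 Safari/537.36'
}

def fetch_html(url):
    """Request a page and return its HTML text."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    response.encoding = 'utf-8'
    return response.text
```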
Then use regular expressions to get the data.
First, match each movie's block and the next-page link (the regular-expression library is re).
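Here is a sketch of those two patterns. They assume the usual Top250 markup, where each movie sits in a `<div class="item">` block inside its own `<li>` and the next-page link lives in a `<span class="next">` element; verify both against the live page source:

```python
import re
from html import unescape

# Each movie's markup starts at <div class="item"> and runs to the
# </li> that closes its list entry; re.S lets '.' span newlines.
MOVIE_PATTERN = re.compile(r'<div class="item">(.*?)</li>', re.S)

# The next-page link sits inside <span class="next">.
NEXT_SPAN_PATTERN = re.compile(r'<span class="next">(.*?)</span>', re.S)
HREF_PATTERN = re.compile(r'<a href="([^"]*)"')

def split_movies(html):
    """Return the raw HTML block for every movie on the page."""
    return MOVIE_PATTERN.findall(html)

def find_next_link(html):
    """Return the relative link to the next page, or None on the last
    page (whose 'next' span contains no <a> element)."""
    span = NEXT_SPAN_PATTERN.search(html)
    if not span:
        return None
    href = HREF_PATTERN.search(span.group(1))
    # The href is HTML-escaped in the source (e.g. '&amp;'), so unescape it.
    return unescape(href.group(1)) if href else None
```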
Next, get the data for each movie.
Note: some of these fields can be empty on the page, so we need to check whether each match succeeded. To keep the code tidy, I use ternary expressions for the check and store the results in a dictionary, as sketched below.
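A sketch of the per-movie extraction with those ternary guards. The field patterns (title, rating, one-line quote) are assumptions based on the usual Top250 markup and may need adjusting:

```python
import re

# Field patterns within one movie block; assumed from typical
# Top250 markup rather than taken from the article.
TITLE_PATTERN = re.compile(r'<span class="title">(.*?)</span>')
RATING_PATTERN = re.compile(r'property="v:average">(.*?)</span>')
QUOTE_PATTERN = re.compile(r'<span class="inq">(.*?)</span>')

def parse_movie(block):
    """Extract one movie's fields into a dictionary, using ternary
    expressions so a missing field becomes an empty string."""
    title = TITLE_PATTERN.search(block)
    rating = RATING_PATTERN.search(block)
    quote = QUOTE_PATTERN.search(block)   # some movies have no quote
    return {
        'title': title.group(1) if title else '',
        'rating': rating.group(1) if rating else '',
        'quote': quote.group(1) if quote else '',
    }
```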
The next step is to loop through the remaining pages.
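Reusing the helpers sketched above, the paging loop might look like this; urljoin resolves the relative href that the next-page link contains:

```python
from urllib.parse import urljoin

def crawl_all(start_url=START_URL):
    """Walk the Top250 pages, following the next-page link until
    the last page, which has none."""
    movies, url = [], start_url
    while url:
        html = fetch_html(url)
        movies.extend(parse_movie(block) for block in split_movies(html))
        next_link = find_next_link(html)
        url = urljoin(url, next_link) if next_link else None
    return movies
```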
3. If you have some database experience, you can also store the results in a database. Here I store the data in MySQL; the code is below. You need to create the database and table yourself first.
This is the class that handles the database (the library used is pymysql).
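A minimal sketch of such a class. The host, credentials, database name, and table layout are all placeholders; as noted above, you must create the database and table yourself before running it:

```python
import pymysql

class MovieDB:
    """Thin pymysql wrapper for saving scraped movies. Connection
    details and the table layout are placeholder assumptions."""

    def __init__(self):
        self.conn = pymysql.connect(
            host='localhost', user='root', password='your_password',
            database='douban', charset='utf8mb4'
        )

    def save(self, movie):
        # Parameterized query so quotes in titles cannot break the SQL.
        sql = ('INSERT INTO movies (title, rating, quote) '
               'VALUES (%s, %s, %s)')
        with self.conn.cursor() as cursor:
            cursor.execute(sql, (movie['title'], movie['rating'],
                                 movie['quote']))
        self.conn.commit()

    def close(self):
        self.conn.close()
```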
Then go back to the crawler and store the data in the database.
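Wiring the crawler and the database class together, under the same assumptions as the sketches above:

```python
if __name__ == '__main__':
    db = MovieDB()
    try:
        for movie in crawl_all():
            db.save(movie)
    finally:
        db.close()
```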
4. Once the crawler finishes successfully, the scraped data will appear in the database.