In this issue, the editor shows you how to use Python to crawl the top 100 Douban movies. The article is rich in content and analyzes the topic from a professional point of view. I hope you can get something out of it after reading.
What is Python? Python is a cross-platform, interpreted, compiled, interactive, and object-oriented scripting language. It was originally designed for writing automated scripts, but as new versions keep adding features, it is now often used to develop independent, large-scale projects as well.
[Flowchart of the crawling process]
To implement this project, we need the following knowledge points.
First, get the web page
1. Find the pattern in the page URLs
2. Use a for loop to build the links to the first four pages of the site
3. Use the Network tab to find the Headers information
4. Use the requests.get() function to request the web pages with the Headers (see the sketch after this list).
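A minimal sketch of this step, reusing the URL pattern and the User-Agent string from the full implementation further below:

import requests

# Headers copied from a real browser so the request is not rejected
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'}

# Each page shows 25 movies, so the first four pages start at offsets 0, 25, 50, 75
for page_number in range(4):
    url = 'https://movie.douban.com/top250?start={}&filter='.format(page_number * 25)
    res = requests.get(url, headers=headers)
    print(url, res.status_code)  # 200 means the page was fetched successfully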
Second, analyze the web page
1. Use BeautifulSoup to parse the web page
2. Call the find_all() method on the BeautifulSoup object to locate the tags that each contain all the information about a single movie
3. Use Tag.text to extract the serial number, movie name, rating, and recommendation
4. Use Tag['attribute name'] to extract the movie details link (see the sketch after this list).
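A self-contained sketch of these parsing calls, run on a hard-coded HTML fragment modeled loosely on the shape of a Douban list item (the fragment itself is invented for illustration):

from bs4 import BeautifulSoup

# A made-up fragment in the same shape as one movie entry on the list page
html = '''
<div class="item">
  <em>1</em>
  <a href="https://movie.douban.com/subject/1292052/">
    <span class="title">The Shawshank Redemption</span>
  </a>
  <span class="rating_num">9.7</span>
  <span class="inq">Hope is a good thing.</span>
</div>
'''

bs = BeautifulSoup(html, 'html.parser')
# find_all() returns every tag matching the element name and class
for movie in bs.find_all('div', class_='item'):
    print(movie.find('em').text)                         # Tag.text: serial number
    print(movie.find('span', class_='title').text)       # Tag.text: movie name
    print(movie.find('span', class_='rating_num').text)  # Tag.text: rating
    print(movie.find('a')['href'])                       # Tag['attribute name']: details link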
Third, store the data
1. Use with open(...) as ... to create a csv file to write to
2. Use csv.DictWriter() to convert the file object into a DictWriter object
3. Use the fieldnames parameter to set the header of the csv file
4. Write the header using writeheader()
5. Use writerows() to write the contents to the csv file (see the sketch after this list).
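A minimal sketch of these five calls, with a hypothetical two-row data set standing in for the crawler's output:

import csv

# Hypothetical rows in the same shape the crawler produces
rows = [
    {'serial number': '1', 'movie name': 'Movie A', 'rating': '9.7', 'recommendation': '...', 'link': 'https://example.com/a'},
    {'serial number': '2', 'movie name': 'Movie B', 'rating': '9.6', 'recommendation': '...', 'link': 'https://example.com/b'},
]

# newline='' avoids blank lines on Windows; utf-8-sig adds a BOM so Excel reads it correctly
with open('movies.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f_csv = csv.DictWriter(f, fieldnames=['serial number', 'movie name', 'rating', 'recommendation', 'link'])
    f_csv.writeheader()    # writes the fieldnames as the first line
    f_csv.writerows(rows)  # writes one line per dictionary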
Implementation code:

import csv
import requests
from bs4 import BeautifulSoup

# Set up a list to store the information for each movie
data_list = []

# Set the request headers
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'}

# Use a for loop to iterate over the first four pages
for page_number in range(4):
    # Set the link of the page to request
    url = 'https://movie.douban.com/top250?start={}&filter='.format(page_number * 25)
    # Request the page
    movies_list_res = requests.get(url, headers=headers)
    # Parse the requested page content
    bs = BeautifulSoup(movies_list_res.text, 'html.parser')
    # Search the page for all the Tags that contain the information about a single movie
    movies_list = bs.find_all('div', class_='item')
    # Iterate over the search results with a for loop
    for movie in movies_list:
        # Extract the serial number of the movie
        movie_num = movie.find('em').text
        # Extract the movie name
        movie_name = movie.find('span').text
        # Extract the rating of the movie
        movie_score = movie.find('span', class_='rating_num').text
        # Extract the movie recommendation
        movie_instruction = movie.find('span', class_='inq').text
        # Extract the movie details link
        movie_link = movie.find('a')['href']
        # Add the information to a dictionary
        movie_dict = {'serial number': movie_num,
                      'movie name': movie_name,
                      'rating': movie_score,
                      'recommendation': movie_instruction,
                      'link': movie_link}
        # Print the movie information
        print(movie_dict)
        # Store the information for this movie
        data_list.append(movie_dict)

# Create a new csv file to store the movie information
with open('movies.csv', 'w', encoding='utf-8-sig') as f:
    # Convert the file object into a DictWriter object
    f_csv = csv.DictWriter(f, fieldnames=['serial number', 'movie name', 'rating', 'recommendation', 'link'])
    # Write the header and the data
    f_csv.writeheader()
    f_csv.writerows(data_list)

Code analysis:
(1) By observing one page of the website, we can see that a single page holds the information for only 25 movies.
In other words, we need to crawl the movie information from the first 4 pages of the site (100 = 25 × 4).
Here we use a loop to crawl the first four pages of data.
(2) Open the browser's developer tools with the shortcut key (Windows users can press Ctrl + Shift + I, or simply F12, on the browser page; the shortcut for Mac users is Command + Option + I).
Then use the pointer tool in the developer tools to take a rough look at where the information we need sits in the first two movies, to see whether there is a pattern.
You can find the serial number, movie title, rating, recommendation, and details link of the first movie inside a tag whose class attribute has the value "item".
(3) In the robots protocol of Douban movie Top250, there is no Disallow: /top250 rule, which means the page can be crawled.
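As a side note, the standard library's robotparser can perform this check programmatically; a sketch, assuming Douban's robots.txt sits at the usual /robots.txt path:

from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt and ask whether a generic crawler may fetch the list page
rp = RobotFileParser()
rp.set_url('https://movie.douban.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'https://movie.douban.com/top250'))  # True if no rule disallows it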
(4) In the Internet world, a web request carries the browser's information in its Request Headers.
As long as we copy that browser information and set the corresponding request header parameters when initiating the request, the crawler can successfully disguise itself as a browser.
(5) Coding tips
1) Skillful use of the pointer tool in the developer tools makes it easy to locate the data.
2) After locating each piece of data with the pointer tool, check for patterns.
3) If the tag you want to extract has attributes, you can extract it with Tag.find(element name, attribute name='value'); if it has no attributes, you can find a nearby tag that does have attributes, and then extract from there with find() (see the sketch after this list).
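A small sketch of both patterns, again on an invented HTML fragment (the class names and text are assumptions for illustration):

from bs4 import BeautifulSoup

html = '<div class="item"><span class="rating_num">9.7</span><span>1234 people rated</span></div>'
bs = BeautifulSoup(html, 'html.parser')

# Tag with an attribute: locate it directly by element name and attribute value
rating = bs.find('span', class_='rating_num')
print(rating.text)

# Tag without attributes: find a nearby attributed tag first, then navigate from it
votes = rating.find_next_sibling('span')
print(votes.text)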
After crawling the information through the steps above, we come to the last step of our crawler: storing the data.
(6) Storing data
1) The syntax for the DictWriter class in the csv module is csv.DictWriter(f, fieldnames). The parameter f is the file object opened by the open() function; the parameter fieldnames sets the header of the file.
2) Executing csv.DictWriter(f, fieldnames) returns a DictWriter object.
3) The resulting DictWriter object can call the writeheader() method to write the fieldnames into the first line of the csv file.
4) Finally, the writerows() method is called to write multiple dictionaries into the csv file.
Running result: the console prints one dictionary per movie.
The generated CSV file: movies.csv contains the header row followed by one row per crawled movie.
The above is what the editor has shared on how to use Python to crawl the Douban top 100 movies. If you happen to have similar doubts, you may refer to the analysis above. If you want to know more, you are welcome to follow the industry information channel.