This article shares how to crawl the 123fans ranking data with Python. The approach is quite practical, so it is shared here as a reference; follow along to take a look.
I. The original information on the website
Let's first take a look at the original website page.
If we were to copy these data one by one and then analyze them, processing the star ranking data would probably take a whole day, would be exhausting, and could easily introduce human errors.
With a crawler, the data can be processed in less than half an hour. Let's take a look at how to crawl the data with Python.
II. Some screenshots of the crawled data
1 Data from the male star popularity list
III. How to get the crawler parameters from the 123fans site
Here are the specific steps to get the information used by the code:
Step1: open the 123fans page in a browser (Firefox or Chrome are commonly used).
Step2: press F12, then Ctrl+R to reload the page and capture its network requests.
Step3: click results.php and find the parameters required by the code under Headers. (A quick check of these values is sketched below.)
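Before the full code is shown, here is a minimal sanity check (my own sketch, not part of the original article) that the Request URL and User-Agent copied from the browser's developer tools actually work; the values are the same ones used in the code later in this article.

import requests  # assumes the requests library is installed

url = "https://123fans.cn/lastresults.php?c=1"  # Request URL copied from DevTools
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
resp = requests.get(url, timeout=30, headers=headers)
print(resp.status_code)   # 200 means the page is reachable with these parameters
print(resp.text[:200])    # preview the beginning of the returned HTML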
IV. Step-by-step crawler code analysis
1 Obtain the web page information with the Requests library in Python
# crawl the current page and parse it into a standard format with BeautifulSoup
import requests  # import the requests module
import bs4

url = "https://123fans.cn/lastresults.php?c=1"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
           'Request Method': 'Get'}
req = requests.get(url, timeout=30, headers=headers)
soup = bs4.BeautifulSoup(req.text, "html.parser")
Code parsing:
url: the URL of the page to be crawled, which tells the crawler where to go. Fill in the Request URL value found in Step3 above.
headers: the header information used when requesting the page. Fill in the values after the corresponding keywords in the Headers found in Step3 above.
req: use the get method to fetch all the information of the page to be crawled.
soup: use BeautifulSoup to parse the crawled content into a standard format so the data is easier to process.
Note 1: some websites can only be visited with browser-like header information and will return an error if no headers are supplied, which is why some header information is added in this example. I tried this link without the headers and it still worked, returning exactly the same result as with the headers.
Note 2: if you are not familiar with the Requests library, you can refer to the article [Python][Crawler] The Requests Library on this official account for details.
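Before building the full table, it can also help to preview one parsed cell to confirm the selectors match the page. This is a small sketch of my own, assuming the code above has already run and soup is available; the td tags and the name/ballot class names are the same ones used in the article's code below.

first_name_cell = soup.find("td", {"class": "name"})      # first name cell on the page
if first_name_cell is not None:
    print(first_name_cell.a.text)                         # the star's name in the first row
first_ballot_cell = soup.find("td", {"class": "ballot"})  # first popularity cell on the page
if first_ballot_cell is not None:
    print(first_ballot_cell.text)                         # the raw popularity value, possibly with commas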
2 Integrate the crawled data into a DataFrame
# integrate the crawled data into a DataFrame
import re  # regular expression library
import numpy as np
import pandas as pd

period_data = pd.DataFrame(np.zeros((400, 5)))  # construct a 400-row, 5-column all-zero matrix as a placeholder
period_data.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']  # name the columns of the zero matrix

# fill the current period's data into the table
# name information
i = 0
name = soup.findAll("td", {"class": "name"})
for each in name:
    period_data['name'][i] = each.a.text  # add the name
    i += 1

# popularity information
j = 0
popularity = soup.findAll("td", {"class": "ballot"})
for each in popularity:
    period_data['popularity_value'][j] = float(each.text.replace(",", ""))  # add the popularity value
    j += 1

# period information
period_num = int(re.findall('[0-9]+', str(soup.h3.text))[0])
period_data['period_num'] = period_num

# end date
end_time_0 = str(re.findall('end date.+[0-9]+', str(soup.findAll("div", {"class": "results"})))).split('.')
end_time = ''
for str_1 in end_time_0:
    end_time = end_time + re.findall('[0-9]+', str_1)[0]
period_data['end_time'] = end_time

# rank number, to make it easy to take the top entries
period_data_1 = period_data.sort_values(by='popularity_value', ascending=False)
period_data_1['rank'] = range(period_data_1.shape[0])
Code parsing:
period_data: construct a 400-row, 5-column matrix to store one period's ranking data. (Earlier periods ranked the popularity of up to 341 stars, so 400 rows are used in case some past periods contain a few more entries.)
period_data.columns: add column names to the data.
name: use the findAll function to extract all the name information.
for each in name: use a loop to store the name information in period_data.
popularity: use the findAll function to extract all the popularity value information.
for each in popularity: use a loop to store the popularity information in period_data.
period_num: get the period number information.
end_time: get the end date.
period_data_1['rank']: add rank numbers in the last column to make it easy to slice off the top entries (see the sketch after this list).
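For illustration only (not from the original article), once the rank column has been added, the top of the list can be sliced like this; the variable names follow the code above, and dropping the unused all-zero placeholder rows is an extra step I am assuming is wanted.

period_data_2 = period_data_1[period_data_1['name'] != 0]  # keep only the rows that were actually filled in
top10 = period_data_2.head(10)                             # rows are already sorted by popularity_value, descending
print(top10[['rank', 'name', 'popularity_value']])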
Next, the batch crawler code.
V. Batch crawler code parsing
1 Define the crawler function
import requests  # import the requests module
import bs4
import re  # regular expression library
import numpy as np
import pandas as pd
import warnings
import time
import random

warnings.filterwarnings('ignore')  # ignore warnings

# headers: the contents can be found under Headers in the browser developer tools
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
           'Request Method': 'Get'}

def crawler(url):
    req = requests.get(url, timeout=30, headers=headers)  # get the web page information
    soup = bs4.BeautifulSoup(req.text, "html.parser")  # parse it with BeautifulSoup
    period_data = pd.DataFrame(np.zeros((400, 5)))  # construct a 400-row, 5-column all-zero matrix as a placeholder
    period_data.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']  # name the columns
    # fill the current period's data into the table
    # name information
    i = 0
    name = soup.findAll("td", {"class": "name"})
    for each in name:
        period_data['name'][i] = each.a.text  # add the name
        i += 1
    # popularity information
    j = 0
    popularity = soup.findAll("td", {"class": "ballot"})
    for each in popularity:
        period_data['popularity_value'][j] = float(each.text.replace(",", ""))  # add the popularity value
        j += 1
    # period information
    period_num = int(re.findall('[0-9]+', str(soup.h3.text))[0])
    period_data['period_num'] = period_num
    # end date
    end_time_0 = str(re.findall('end date.+[0-9]+', str(soup.findAll("div", {"class": "results"})))).split('.')
    end_time = ''
    for str_1 in end_time_0:
        end_time = end_time + re.findall('[0-9]+', str_1)[0]
    period_data['end_time'] = end_time
    # rank number, to make it easy to take the top entries
    period_data_1 = period_data.sort_values(by='popularity_value', ascending=False)
    period_data_1['rank'] = range(period_data_1.shape[0])
    return period_data_1
This code integrates the segmented crawler code above into a single function so it can be called repeatedly. A usage sketch follows.
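A usage sketch (my own example, not from the article): call the function on the latest-results URL used earlier and save the result to a CSV file; the output file name here is made up for the example.

latest = crawler("https://123fans.cn/lastresults.php?c=1")  # crawl the latest period
print(latest.head())                                        # preview the top rows
latest.to_csv("123fans_latest.csv", index=False, encoding="utf-8-sig")  # hypothetical output file name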
2 Call the function repeatedly to crawl in batches
period_data_final = pd.DataFrame(np.zeros((1, 5)))  # construct a 1-row, 5-column all-zero matrix as a placeholder
period_data_final.columns = ['name', 'popularity_value', 'period_num', 'end_time', 'rank']  # name the columns
for qi in range(538, 499, -1):  # from period 538 down to period 500
    print("crawling period", qi)
    if qi == 538:
        url = "https://123fans.cn/lastresults.php?c=1"
    else:
        url = "https://123fans.cn/results.php?qi={}&c=1".format(qi)
    time.sleep(random.uniform(1, 2))  # pause for a random 1-2 seconds between requests
    date = crawler(url)
    period_data_final = period_data_final.append(date)
period_data_final_1 = period_data_final.iloc[1:, :]  # remove the first row of useless placeholder data
This code calls the crawler function repeatedly to get each page's data and merges the results into one DataFrame with append. (See the note on newer pandas versions below.)
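One caveat worth noting: DataFrame.append was deprecated and removed in pandas 2.0, so the loop above fails on recent pandas versions. Below is my adaptation of the same batch loop (same URLs, same crawler function and imports as above) that collects the results in a list and concatenates them once at the end.

frames = []
for qi in range(538, 499, -1):  # from period 538 down to period 500, as in the loop above
    print("crawling period", qi)
    if qi == 538:
        url = "https://123fans.cn/lastresults.php?c=1"
    else:
        url = "https://123fans.cn/results.php?qi={}&c=1".format(qi)
    time.sleep(random.uniform(1, 2))  # polite random pause between requests
    frames.append(crawler(url))
period_data_final_1 = pd.concat(frames, ignore_index=True)  # one concat instead of repeated append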
Thank you for reading! This concludes the article on how Python crawls the 123fans data list. I hope the content above is helpful and helps you learn more. If you found the article useful, feel free to share it so more people can see it!