This article explains how to use Python to crawl university data. It is quite practical, so it is shared here in the hope that you will take something away from it.
Preface
Today, the 2020 college entrance examination results are being released across the country, and filling in application choices is now on candidates' agenda.
As the saying goes, the exam is seven tenths of the battle and the application is the other three. No student wants to waste a high score on a lower-tier school, nor to miss out on a university by aiming too high with a lower score.
How to get data
Using Python, we obtained data on a total of 2,904 universities from the China Education Online website; part of the data-acquisition code is shown below.
https://gkcx.eol.cn/school/search
The specific ideas are as follows:
By analyzing the web page, we find that the data is loaded dynamically as the pages turn, so we use Chrome's developer tools to capture the network requests, obtain the real request URL, and determine the request method (GET or POST)
Use requests to send the request for the data
Use json to parse and extract the data
Use pandas to save the data locally
First, open the URL, use Chrome's developer tools, and switch to Network > XHR; clicking through the pages captures the network requests. It is easy to find that the data is returned as JSON.
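Before writing the full crawler, it can help to probe the captured endpoint once and confirm the JSON layout. This is only a minimal sketch: it assumes the POST endpoint https://api.eol.cn/gkcx/api/ and the uri/page/size/sort form fields observed in the capture (they also appear in the full code below), and sends just enough parameters to get one page back.
# Minimal probe of the endpoint assumed from the devtools capture above
import requests

url = 'https://api.eol.cn/gkcx/api/'
payload = {
    'uri': 'apigkcx/api/school/hotlists',  # API route observed in the capture
    'page': 1,                             # page number to fetch
    'size': 20,                            # records per page
    'request_type': 1,
    'sort': 'view_total',                  # sort by popularity
}
resp = requests.post(url, data=payload, timeout=10)
items = resp.json()['data']['item']
print(len(items))        # expect 20 records per page
print(items[0].keys())   # field names available for extraction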
Switch to Headers to confirm that the request method is POST and to get the request URL. The page parameter in the form data is the page number, so all the data can be obtained by iterating over it. The code is as follows:
# Import packages
import numpy as np
import pandas as pd
import requests
import json
from fake_useragent import UserAgent
import time

# Get one page of data
def get_one_page(page_num):
    # Request URL
    url = 'https://api.eol.cn/gkcx/api/'
    # Construct headers
    headers = {
        'User-Agent': UserAgent().random,
        'Origin': 'https://gkcx.eol.cn',
        'Referer': 'https://gkcx.eol.cn/school/search?province=&schoolflag=&recomschprop=',
    }
    # Construct form data
    data = {
        'access_token': "",
        'admissions': "",
        'central': "",
        'department': "",
        'dual_class': "",
        'f211': "",
        'f985': "",
        'is_dual_class': "",
        'keyword': "",
        'page': page_num,
        'province_id': "",
        'request_type': 1,
        'school_type': "",
        'size': 20,
        'sort': "view_total",
        'type': "",
        'uri': "apigkcx/api/school/hotlists",
    }
    # Send the request, retrying once after a short pause on failure
    try:
        response = requests.post(url=url, data=data, headers=headers)
    except Exception as e:
        print(e)
        time.sleep(3)
        response = requests.post(url=url, data=data, headers=headers)
    # Parse the JSON response
    school_data = json.loads(response.text)['data']['item']
    # School name
    school_name = [i.get('name') for i in school_data]
    # Affiliated department
    belong = [i.get('belong') for i in school_data]
    # University level (double first-class)
    dual_class_name = [i.get('dual_class_name') for i in school_data]
    # Whether 985
    f985 = [i.get('f985') for i in school_data]
    # Whether 211
    f211 = [i.get('f211') for i in school_data]
    # School level
    level_name = [i.get('level_name') for i in school_data]
    # College type
    type_name = [i.get('type_name') for i in school_data]
    # Whether public
    nature_name = [i.get('nature_name') for i in school_data]
    # Popularity
    view_total = [i.get('view_total') for i in school_data]
    # Province
    province_name = [i.get('province_name') for i in school_data]
    # City
    city_name = [i.get('city_name') for i in school_data]
    # District / county
    county_name = [i.get('county_name') for i in school_data]
    # Assemble one page of data into a DataFrame
    df_one = pd.DataFrame({
        'school_name': school_name,
        'belong': belong,
        'dual_class_name': dual_class_name,
        'f985': f985,
        'f211': f211,
        'level_name': level_name,
        'type_name': type_name,
        'nature_name': nature_name,
        'view_total': view_total,
        'province_name': province_name,
        'city_name': city_name,
        'county_name': county_name,
    })
    return df_one

# Get multiple pages
def get_all_page(all_page_num):
    # Storage table
    df_all = pd.DataFrame()
    # Loop over the pages
    for i in range(all_page_num):
        # Print progress
        print(f'Getting university information on page {i + 1}')
        # Fetch one page
        df_one = get_one_page(page_num=i + 1)
        # Append it to the result table
        df_all = pd.concat([df_all, df_one], ignore_index=True)
        # Random pause between requests
        time.sleep(np.random.uniform(2))
    return df_all

if __name__ == '__main__':
    # Run the crawler
    df = get_all_page(all_page_num=148)
The above program obtains a total of 2,904 records. A preview of the data:
df.head()
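The outline above lists saving the data locally with pandas as the final step, which the code does not show. A minimal sketch, assuming df is the DataFrame returned by get_all_page() and using an arbitrary file name:
# Save the crawled data locally (school_data.csv is an arbitrary file name)
# utf-8-sig keeps Chinese characters readable when the file is opened in Excel
df.to_csv('school_data.csv', index=False, encoding='utf-8-sig')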
That is how to use Python to crawl university data. Some of these points may come up in everyday work, and hopefully this article has given you something more to take away.