How to Use a Python Crawler to Crawl University Ranking Information

This article explains in detail how to use a Python crawler to collect university ranking information. The approach is quite practical, so it is shared here for reference; I hope you gain something from reading it.
1. Search for the "Afanti" website to find the data source (purely for technical discussion).
2. On the website, select the option to view colleges and leave all other filters at their defaults.
3. The information crawled this time is the university list shown on the page. In the browser's developer tools, click the XHR tab to find the API interface; its response contains the data we need.
4. First build the request headers by copying them directly from the browser:
# Build the request headers
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'contentType': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': 'cfm-major=true',
    'Host': 'gaokao.afanti100.com',
    'media': 'PC',
    'Referer': 'http://gaokao.afanti100.com/university.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
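Before wiring up the full crawler, it can help to probe the endpoint once to confirm these headers are accepted and to inspect the shape of the JSON response. The sketch below is not part of the original tutorial; it requests page 1 of the API used in the next step and prints the status code and the top-level keys under data:

import requests

# Probe page 1 of the API to verify the headers are accepted (illustrative sketch)
test_url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
           '&university_type=0&location_province=0&speciality=0&page=1'
resp = requests.get(test_url, headers=headers, timeout=10)
print(resp.status_code)                        # 200 means the request went through
print((resp.json().get('data') or {}).keys())  # should include university_lst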
5. Next, request the URL, splicing the page number into it with the format method to page through the results. Looking at the interface's response, we find it is JSON: the university records live in university_lst under the data key, so we need to take out that key. university_lst is a list.
def get_index():
    page = 1
    while True:
        if page > 188:
            break
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        # increment page to move to the next page
        page += 1
        # request the url and parse the response as JSON
        resp = requests.get(url, headers=headers).json()
        # take out the list of universities
        university_lsts = resp.get('data').get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        else:
            continue
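The loop above assumes every request succeeds. For real-world use it is safer to add a timeout, skip pages that fail, and pause briefly between requests. The following variant is a sketch of that idea, not the author's original code; it reuses the headers, get_info, and page range defined above:

import time

def get_index_safe():
    # defensive variant: timeout, error handling, and a polite delay per page
    for page in range(1, 189):
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        try:
            resp = requests.get(url, headers=headers, timeout=10).json()
        except (requests.RequestException, ValueError):
            continue  # skip pages that time out or return invalid JSON
        university_lsts = (resp.get('data') or {}).get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        time.sleep(0.5)  # be polite to the server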
6. After taking out the list in the previous step, we can traverse it and extract the fields we want:
def get_info(university_lsts):
    # make sure the list is not empty
    if university_lsts:
        # traverse the list and extract each university's information
        for university_lst in university_lsts:
            # declare a dictionary to store the data
            data_dict = {}
            # university name
            data_dict['name'] = university_lst.get('name')
            # university ranking
            data_dict['ranking'] = university_lst.get('ranking')
            # university tags
            data_dict['tag_lst'] = university_lst.get('tag_lst')
            # number of key majors
            data_dict['key_major_count'] = university_lst.get('key_major_count')
            # number of master's programs
            data_dict['graduate_program_count'] = university_lst.get('graduate_program_count')
            # number of doctoral programs
            data_dict['doctoral_program_count'] = university_lst.get('doctoral_program_count')
            # whether it is a 211 university
            data_dict['is_211'] = university_lst.get('is_211')
            # whether it is a 985 university
            data_dict['is_985'] = university_lst.get('is_985')
            # province
            data_dict['location_province'] = university_lst.get('location_province')
            # city
            data_dict['location_city'] = university_lst.get('location_city')
            # type of university
            data_dict['university_type'] = university_lst.get('university_type')
            data_list.append(data_dict)
            print(data_dict)
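Because every field is fetched with the same .get pattern, the body of this function can also be written as a dict comprehension over a list of field names. The version below is an equivalent sketch; the field list matches the keys used above:

# the JSON keys extracted for each university, as used in get_info above
FIELDS = ['name', 'ranking', 'tag_lst', 'key_major_count',
          'graduate_program_count', 'doctoral_program_count',
          'is_211', 'is_985', 'location_province', 'location_city',
          'university_type']

def get_info_compact(university_lsts):
    # equivalent to get_info: pull the same keys from each record
    for university in university_lsts or []:
        data_dict = {field: university.get(field) for field in FIELDS}
        data_list.append(data_dict)
        print(data_dict)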
7. Finally, save the information to files:
def save_file():
    # save the data as a JSON file
    with open('University Ranking Info.json', 'w', encoding='utf-8') as f:
        json.dump(data_list, f, ensure_ascii=False, indent=4)
    print('json file saved successfully')
    # save the data as a CSV file
    # header row
    title = data_list[0].keys()
    with open('University Ranking Info.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, title)
        # write the header
        writer.writeheader()
        # write the data rows
        writer.writerows(data_list)
    print('csv file saved successfully')
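If pandas happens to be installed, the same two files can be written with less code. This alternative is a sketch assuming the data_list built above, not part of the original tutorial:

import pandas as pd

def save_file_pandas():
    # alternative to save_file using pandas (assumes pandas is installed)
    df = pd.DataFrame(data_list)
    df.to_json('University Ranking Info.json', orient='records',
               force_ascii=False, indent=4)
    df.to_csv('University Ranking Info.csv', index=False, encoding='utf-8')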
8. This crawler is very simple, so beginners can use it for practice. The full code follows:
import requests
import json
import csv

# Build the request headers
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'contentType': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': 'cfm-major=true',
    'Host': 'gaokao.afanti100.com',
    'media': 'PC',
    'Referer': 'http://gaokao.afanti100.com/university.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# declare a list to store the data dictionaries
data_list = []


def get_index():
    page = 1
    while True:
        if page > 188:
            break
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        # increment page to move to the next page
        page += 1
        # request the url and parse the response as JSON
        resp = requests.get(url, headers=headers).json()
        # take out the list of universities
        university_lsts = resp.get('data').get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        else:
            continue


def get_info(university_lsts):
    # make sure the list is not empty
    if university_lsts:
        # traverse the list and extract each university's information
        for university_lst in university_lsts:
            # declare a dictionary to store the data
            data_dict = {}
            # university name
            data_dict['name'] = university_lst.get('name')
            # university ranking
            data_dict['ranking'] = university_lst.get('ranking')
            # university tags
            data_dict['tag_lst'] = university_lst.get('tag_lst')
            # number of key majors
            data_dict['key_major_count'] = university_lst.get('key_major_count')
            # number of master's programs
            data_dict['graduate_program_count'] = university_lst.get('graduate_program_count')
            # number of doctoral programs
            data_dict['doctoral_program_count'] = university_lst.get('doctoral_program_count')
            # whether it is a 211 university
            data_dict['is_211'] = university_lst.get('is_211')
            # whether it is a 985 university
            data_dict['is_985'] = university_lst.get('is_985')
            # province
            data_dict['location_province'] = university_lst.get('location_province')
            # city
            data_dict['location_city'] = university_lst.get('location_city')
            # type of university
            data_dict['university_type'] = university_lst.get('university_type')
            data_list.append(data_dict)
            print(data_dict)


def save_file():
    # save the data as a JSON file
    with open('University Ranking Info.json', 'w', encoding='utf-8') as f:
        json.dump(data_list, f, ensure_ascii=False, indent=4)
    print('json file saved successfully')
    # save the data as a CSV file
    # header row
    title = data_list[0].keys()
    with open('University Ranking Info.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, title)
        # write the header
        writer.writeheader()
        # write the data rows
        writer.writerows(data_list)
    print('csv file saved successfully')


def main():
    get_index()
    save_file()


if __name__ == '__main__':
    main()
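To run it, save the code as, say, crawler.py (the filename is arbitrary) and execute python crawler.py. The script walks pages 1 through 188 of the API, prints each university record as it is collected, and writes the JSON and CSV files to the current working directory.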
About "how to use python crawler to crawl university ranking information" this article is shared here, I hope the above content can be of some help to everyone, so that you can learn more knowledge, if you think the article is good, please share it to let more people see.