How to Use a Python Crawler to Crawl University Ranking Information

This article explains in detail how to use a Python crawler to collect university ranking information. The approach is quite practical, so it is shared here for reference; I hope you gain something from reading it.
1. Search for the "Afanti" website to find the data source (purely for technical discussion).
2. On the website, select the option to view colleges and leave all other filters at their defaults.
3. The information crawled this time is the university list shown on the page. In the browser's developer tools, click the XHR tab to find the API interface; its response contains the data we need.
4. First build the request headers by copying them directly from the browser:
# Build the request headers
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'contentType': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': 'cfm-major=true',
    'Host': 'gaokao.afanti100.com',
    'media': 'PC',
    'Referer': 'http://gaokao.afanti100.com/university.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
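Before wiring up the full crawler, it can help to probe the endpoint once to confirm these headers are accepted and to inspect the shape of the JSON response. The sketch below is not part of the original tutorial; it requests page 1 of the API used in the next step and prints the status code and the top-level keys under data:

import requests

# Probe page 1 of the API to verify the headers are accepted (illustrative sketch)
test_url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
           '&university_type=0&location_province=0&speciality=0&page=1'
resp = requests.get(test_url, headers=headers, timeout=10)
print(resp.status_code)                        # 200 means the request went through
print((resp.json().get('data') or {}).keys())  # should include university_lst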
5. Next, request the URL, splicing the page number into it with the format method to page through the results. Looking at the interface's response, we find it is JSON: the university records live in university_lst under the data key, so we need to take out that key. university_lst is a list.
def get_index():
    page = 1
    while True:
        if page > 188:
            break
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        # increment page to move to the next page
        page += 1
        # request the url and parse the response as JSON
        resp = requests.get(url, headers=headers).json()
        # take out the list of universities
        university_lsts = resp.get('data').get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        else:
            continue
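The loop above assumes every request succeeds. For real-world use it is safer to add a timeout, skip pages that fail, and pause briefly between requests. The following variant is a sketch of that idea, not the author's original code; it reuses the headers, get_info, and page range defined above:

import time

def get_index_safe():
    # defensive variant: timeout, error handling, and a polite delay per page
    for page in range(1, 189):
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        try:
            resp = requests.get(url, headers=headers, timeout=10).json()
        except (requests.RequestException, ValueError):
            continue  # skip pages that time out or return invalid JSON
        university_lsts = (resp.get('data') or {}).get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        time.sleep(0.5)  # be polite to the server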
6. After taking out the list in the previous step, we can traverse it and extract the fields we want:
def get_info(university_lsts):
    # make sure the list is not empty
    if university_lsts:
        # traverse the list and extract each university's information
        for university_lst in university_lsts:
            # declare a dictionary to store the data
            data_dict = {}
            # university name
            data_dict['name'] = university_lst.get('name')
            # university ranking
            data_dict['ranking'] = university_lst.get('ranking')
            # university tags
            data_dict['tag_lst'] = university_lst.get('tag_lst')
            # number of key majors
            data_dict['key_major_count'] = university_lst.get('key_major_count')
            # number of master's programs
            data_dict['graduate_program_count'] = university_lst.get('graduate_program_count')
            # number of doctoral programs
            data_dict['doctoral_program_count'] = university_lst.get('doctoral_program_count')
            # whether it is a 211 university
            data_dict['is_211'] = university_lst.get('is_211')
            # whether it is a 985 university
            data_dict['is_985'] = university_lst.get('is_985')
            # province
            data_dict['location_province'] = university_lst.get('location_province')
            # city
            data_dict['location_city'] = university_lst.get('location_city')
            # type of university
            data_dict['university_type'] = university_lst.get('university_type')
            data_list.append(data_dict)
            print(data_dict)
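Because every field is fetched with the same .get pattern, the body of this function can also be written as a dict comprehension over a list of field names. The version below is an equivalent sketch; the field list matches the keys used above:

# the JSON keys extracted for each university, as used in get_info above
FIELDS = ['name', 'ranking', 'tag_lst', 'key_major_count',
          'graduate_program_count', 'doctoral_program_count',
          'is_211', 'is_985', 'location_province', 'location_city',
          'university_type']

def get_info_compact(university_lsts):
    # equivalent to get_info: pull the same keys from each record
    for university in university_lsts or []:
        data_dict = {field: university.get(field) for field in FIELDS}
        data_list.append(data_dict)
        print(data_dict)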
7. Finally, save the information to files:
def save_file():
    # save the data as a JSON file
    with open('University Ranking Info.json', 'w', encoding='utf-8') as f:
        json.dump(data_list, f, ensure_ascii=False, indent=4)
    print('json file saved successfully')
    # save the data as a CSV file
    # header row
    title = data_list[0].keys()
    with open('University Ranking Info.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, title)
        # write the header
        writer.writeheader()
        # write the data rows
        writer.writerows(data_list)
    print('csv file saved successfully')
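If pandas happens to be installed, the same two files can be written with less code. This alternative is a sketch assuming the data_list built above, not part of the original tutorial:

import pandas as pd

def save_file_pandas():
    # alternative to save_file using pandas (assumes pandas is installed)
    df = pd.DataFrame(data_list)
    df.to_json('University Ranking Info.json', orient='records',
               force_ascii=False, indent=4)
    df.to_csv('University Ranking Info.csv', index=False, encoding='utf-8')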
8. This crawler is very simple, so beginners can use it for practice. The full code follows:
import requests
import json
import csv

# Build the request headers
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'contentType': 'application/x-www-form-urlencoded; charset=utf-8',
    'Cookie': 'cfm-major=true',
    'Host': 'gaokao.afanti100.com',
    'media': 'PC',
    'Referer': 'http://gaokao.afanti100.com/university.html',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}

# declare a list to store the data dictionaries
data_list = []


def get_index():
    page = 1
    while True:
        if page > 188:
            break
        url = 'http://gaokao.afanti100.com/api/v1/universities/?degree_level=0&directed_by=0' \
              '&university_type=0&location_province=0&speciality=0&page={}'.format(page)
        # increment page to move to the next page
        page += 1
        # request the url and parse the response as JSON
        resp = requests.get(url, headers=headers).json()
        # take out the list of universities
        university_lsts = resp.get('data').get('university_lst')
        if university_lsts:
            get_info(university_lsts)
        else:
            continue


def get_info(university_lsts):
    # make sure the list is not empty
    if university_lsts:
        # traverse the list and extract each university's information
        for university_lst in university_lsts:
            # declare a dictionary to store the data
            data_dict = {}
            # university name
            data_dict['name'] = university_lst.get('name')
            # university ranking
            data_dict['ranking'] = university_lst.get('ranking')
            # university tags
            data_dict['tag_lst'] = university_lst.get('tag_lst')
            # number of key majors
            data_dict['key_major_count'] = university_lst.get('key_major_count')
            # number of master's programs
            data_dict['graduate_program_count'] = university_lst.get('graduate_program_count')
            # number of doctoral programs
            data_dict['doctoral_program_count'] = university_lst.get('doctoral_program_count')
            # whether it is a 211 university
            data_dict['is_211'] = university_lst.get('is_211')
            # whether it is a 985 university
            data_dict['is_985'] = university_lst.get('is_985')
            # province
            data_dict['location_province'] = university_lst.get('location_province')
            # city
            data_dict['location_city'] = university_lst.get('location_city')
            # type of university
            data_dict['university_type'] = university_lst.get('university_type')
            data_list.append(data_dict)
            print(data_dict)


def save_file():
    # save the data as a JSON file
    with open('University Ranking Info.json', 'w', encoding='utf-8') as f:
        json.dump(data_list, f, ensure_ascii=False, indent=4)
    print('json file saved successfully')
    # save the data as a CSV file
    # header row
    title = data_list[0].keys()
    with open('University Ranking Info.csv', 'w', encoding='utf-8', newline='') as f:
        writer = csv.DictWriter(f, title)
        # write the header
        writer.writeheader()
        # write the data rows
        writer.writerows(data_list)
    print('csv file saved successfully')


def main():
    get_index()
    save_file()


if __name__ == '__main__':
    main()
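To run it, save the code as, say, crawler.py (the filename is arbitrary) and execute python crawler.py. The script walks pages 1 through 188 of the API, prints each university record as it is collected, and writes the JSON and CSV files to the current working directory.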
About "how to use python crawler to crawl university ranking information" this article is shared here, I hope the above content can be of some help to everyone, so that you can learn more knowledge, if you think the article is good, please share it to let more people see.