2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/02 Report--
How do you crawl the Chinese university rankings with Python and save them to Excel? Many beginners are not clear on how to do this, so this article explains it in detail. I hope you gain something from it.
Preface
Today's Python crawler scrapes the Chinese university rankings and saves them to Excel. The code is simple and took me about half an hour to write. The overall structure is clear, so you can use it directly, and I hope beginners can pick up some crawler knowledge from it. I am still learning myself, so if anything is wrong, corrections from more experienced readers are welcome. Thank you!
Crawling the ranking of Chinese universities
URL: http://m.gaosan.com/gaokao/265440.html
The script works in four steps:
request: get the HTML
BeautifulSoup: parse the web page
re: match the content with regular expressions
create and save the Excel file

    from bs4 import BeautifulSoup  # web page parsing, data extraction
    import re  # regular expressions for text matching
    import urllib.request, urllib.error  # build the URL request, fetch page data
    import xlwt  # write .xls files


    def main():
        baseurl = "http://m.gaosan.com/gaokao/265440.html"
        # 1. Crawl the web page
        datalist = getData(baseurl)
        savepath = "Chinese university rankings.xls"
        saveData(datalist, savepath)


    # Regular expressions: one compiled pattern per column.
    # The HTML tag literals inside these patterns are missing from this listing;
    # restore them to match the page markup before running.
    paiming = re.compile(r'(.*).*')      # rank
    xuexiao = re.compile(r'.*(.*).*')    # school name
    defen = re.compile(r'.*.*(.*).*.*')  # score
    xingji = re.compile(r'.*(.*).*')     # star rating
    cengci = re.compile(r'.*(.*)')       # level


    # 2. Crawl the web page
    def getData(baseurl):
        datalist = []
        html = askURL(baseurl)  # the fetched page source
        # print(html)
        # Parse the data row by row (the single page is parsed once)
        soup = BeautifulSoup(html, "html.parser")  # soup is the parsed tree structure
        for item in soup.find_all('tr'):  # every table row
            # print(item)  # inspect the whole row
            data = []  # all information about one school
            item = str(item)
            # Rank
            paiming1 = re.findall(paiming, item)  # all matches of the pattern in this row
            # print(paiming1)
            if not paiming1:
                continue  # not a data row, skip it
            print(paiming1[0])
            data.append(paiming1[0])
            # School name
            xuexiao1 = re.findall(xuexiao, item)[0]
            # print(xuexiao1)
            data.append(xuexiao1)
            # Score
            defen1 = re.findall(defen, item)[0]
            # print(defen1)
            data.append(defen1)
            # Star rating
            xingji1 = re.findall(xingji, item)[0]
            # print(xingji1)
            data.append(xingji1)
            # Level
            cengci1 = re.findall(cengci, item)[0]
            # print(cengci1)
            data.append(cengci1)
            # print('-' * 80)
            datalist.append(data)  # add one processed school to datalist
        return datalist


    # 3. Get the content of a specified URL
    def askURL(url):
        head = {  # browser header information, so the request looks like a normal browser
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36"
        }
        # The user agent tells the server what kind of client we are
        # and what kind of content we can accept
        request = urllib.request.Request(url, headers=head)
        html = ""
        try:
            response = urllib.request.urlopen(request)  # send the prepared request object
            html = response.read().decode("utf-8")  # decode the bytes to avoid garbled text
            # print(html)
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)  # print the error code
            if hasattr(e, "reason"):
                print(e.reason)  # print the cause of the error
        return html


    # 4. Save the data
    def saveData(datalist, savepath):
        book = xlwt.Workbook(encoding="utf-8", style_compression=0)  # create the workbook object
        sheet = book.add_sheet('China University ranking', cell_overwrite_ok=True)  # create the worksheet; cells may be overwritten
        for i in range(len(datalist)):
            print("%d" % (i + 1))
            data = datalist[i]
            # print(data)
            for j in range(0, 5):  # five fields per row
                sheet.write(i, j, data[j])
        book.save(savepath)  # save the data table


    # Main entry point
    if __name__ == "__main__":  # runs when the program is executed directly
        main()
        print("crawl complete!")
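The per-column matching in the script can be illustrated on a single table row. The following is a minimal sketch, not the author's exact patterns: it assumes each data row holds five `<td>` cells in the order rank, school name, score, star rating, level (the order the script appends them), and the names `ROW_PATTERN` and `parse_row` as well as the sample rows are invented for illustration. It uses non-greedy groups so that each capture stops at the first closing tag.

```python
import re

# Hypothetical pattern: five <td> cells per row, one non-greedy group per cell.
# The real page markup may differ (attributes, nested tags, whitespace).
ROW_PATTERN = re.compile(
    r"<td[^>]*>(.*?)</td>\s*" * 4 + r"<td[^>]*>(.*?)</td>",
    re.S,
)


def parse_row(row_html):
    """Return [rank, school, score, star, level], or None if the row does not match."""
    match = ROW_PATTERN.search(row_html)
    if match is None:
        return None  # e.g. a header row built from <th> cells
    return [cell.strip() for cell in match.groups()]


# Made-up sample rows, for illustration only.
sample = ("<tr><td>1</td><td>Example University</td><td>100</td>"
          "<td>8 stars</td><td>Top tier</td></tr>")
header = "<tr><th>Rank</th><th>School</th></tr>"

print(parse_row(sample))
print(parse_row(header))
```

Matching the whole row once, rather than running five separate patterns over it, keeps the columns aligned and makes it easy to skip rows that are not data rows.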
The results are as follows:
There are more than 600 rows of data.
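Because saveData reads five fields from every row, a quick check before saving can catch a malformed row early and confirm the row count. This is a minimal sketch under that assumption; `check_rows` is a hypothetical helper and is not part of the original script.

```python
def check_rows(datalist, fields=5):
    """Return the number of rows; raise ValueError if a row has the wrong field count."""
    for index, row in enumerate(datalist):
        if len(row) != fields:
            raise ValueError(
                "row %d has %d fields, expected %d" % (index, len(row), fields)
            )
    return len(datalist)


# Made-up rows, for illustration only.
rows = [
    ["1", "Example University", "100", "8 stars", "Top tier"],
    ["2", "Another University", "99", "8 stars", "Top tier"],
]
print("rows scraped:", check_rows(rows))
```

In the script it could be called on the result of getData before the list is passed to saveData.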
The process is explained in the code comments. If anything is unclear, leave a comment; if you see something that can be improved, corrections are welcome. Thank you!