Shulou (Shulou.com), SLTechnology News & Howtos, 2025-01-14 Update
This article explains in detail how to scrape a subway line map with Python. The approach is quite practical, so it is shared here as a reference; hopefully you will get something out of it.
Introduction to the BeautifulSoup (bs4) library
BeautifulSoup (the name comes from the "beautiful soup" song in Alice in Wonderland) is a Python library for extracting data from HTML and XML files. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. In effect it is a toolbox that hands you the data you want to scrape by parsing the document, and because it is so simple, a complete application takes very little code.
BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8, so you normally do not need to think about encodings at all. The only exception is a document that does not declare its encoding and whose encoding BeautifulSoup cannot detect; in that case you simply specify the original encoding yourself.
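As a minimal illustration of this automatic decoding (a standalone sketch, not part of the subway scraper), a UTF-8 byte string is parsed here and its text comes back as a Unicode `str`:

```python
from bs4 import BeautifulSoup

# A UTF-8 encoded fragment, as a response body might arrive over the network.
raw = '<p>深圳地铁</p>'.encode('utf-8')

# The stdlib 'html.parser' needs no extra install; bs4 detects the
# encoding and hands back Unicode text.
soup = BeautifulSoup(raw, 'html.parser')
print(soup.p.string)  # 深圳地铁
```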
You can install it directly with pip (the package is published as beautifulsoup4):

```shell
pip install beautifulsoup4
```
BeautifulSoup supports the HTML parser in the Python standard library (html.parser), as well as third-party parsers such as lxml (which also has an XML mode) and html5lib, but the corresponding libraries need to be installed first:

```shell
pip install lxml
```

If no third-party parser is installed, BeautifulSoup falls back to Python's built-in parser. The lxml parser is more powerful and faster, so installing it is recommended.
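A quick way to see why the parser matters: real pages often have unclosed tags, and BeautifulSoup's parsers repair them. This small sketch (the markup and href are made up) uses the built-in html.parser; passing 'lxml' instead works the same way once lxml is installed:

```python
from bs4 import BeautifulSoup

broken = '<td><a href="/x/line1">Line 1</a>'  # unclosed <td>

# The parser tolerates the broken markup and still builds a tree
# we can navigate.
soup = BeautifulSoup(broken, 'html.parser')
print(soup.a['href'])   # /x/line1
print(soup.a.string)    # Line 1
```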
Next, we take scraping the Shenzhen subway line data from the 8684 website as a worked example.
Getting the subway line URLs
First, import the libraries requests, json, bs4, and lxml:

```python
import requests
import json
from bs4 import BeautifulSoup
import lxml.html
```

Then fetch the URL of each subway line:

```python
etree = lxml.html.etree  # defined in the original script, though unused below

def get_urls():
    url = 'https://dt.8684.cn/so.php?k=pp&dtcity=sz&q=wedwed'
    response = requests.get(url=url)
    soup = BeautifulSoup(response.content, 'lxml').find(attrs={'class': 'fs-fdLink'})
    routes = soup.find('tr').find('td').findAll('a')  # find all relevant link nodes
    route_list = []
    for i in routes:
        per = {}
        per['key'] = i.string                                # line name
        per['value'] = 'https://dt.8684.cn' + i.get('href')  # line detail page URL
        route_list.append(per)
    return route_list
```
Here,

```python
soup = BeautifulSoup(response.content, 'lxml')
```

creates a BeautifulSoup object: the response body returned by the network request is parsed with the lxml parser into a soup object that can then be navigated and searched.
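To make the chain of find() calls concrete, here is the same navigation pattern run against a hand-written fragment that mimics the structure get_urls() expects (the class name fs-fdLink is from the article; the line names and hrefs are invented). The stdlib html.parser is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<div class="fs-fdLink">
  <table><tr><td>
    <a href="/sz/rail/1">Line 1</a>
    <a href="/sz/rail/2">Line 2</a>
  </td></tr></table>
</div>
"""

# Same pattern as get_urls(): locate the container by class,
# then walk tr -> td -> all <a> links.
container = BeautifulSoup(html, 'html.parser').find(attrs={'class': 'fs-fdLink'})
routes = container.find('tr').find('td').findAll('a')
route_list = [{'key': a.string, 'value': 'https://dt.8684.cn' + a.get('href')}
              for a in routes]
print(route_list)
```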
Getting the schedule data for a specific line

```python
def get_schedule(url):
    response = requests.get(url['value'])
    soup = BeautifulSoup(response.content, 'lxml').find(attrs={'class': 'ib-mn'})
    table_head = soup.find(attrs={'class': 'pi-table tl-table'})
    per_info = table_head.findAll('tr')
    route_list = []
    if url['key'].find('inner ring') == -1 and url['key'].find('outer ring') == -1:
        # The line name contains neither "inner ring" nor "outer ring":
        # split the one table into an inner-ring and an outer-ring record.
        route_info = {}            # inner-ring line dictionary
        route_info_wai = {}        # outer-ring line dictionary
        stations_nei = []          # inner-ring station list
        route_info['name'] = url['key'] + ' inner ring'      # inner-ring name key
        route_info_wai['name'] = url['key'] + ' outer ring'  # outer-ring name key
        time_nei_1 = []  # inner-ring first-train times
        time_nei_2 = []  # inner-ring last-train times
        time_wai_1 = []  # outer-ring first-train times
        time_wai_2 = []  # outer-ring last-train times
        for i in per_info:
            if i != []:
                j = i.findAll('td')
                if j != []:
                    for k in j[0]:
                        stations_nei.append(k.text)  # station names (shared by both rings)
                    for k in j[3]:
                        time_nei_2.append(k)  # inner-ring last-train time
                    for k in j[1]:
                        time_nei_1.append(k)  # inner-ring first-train time
                    for k in j[4]:
                        time_wai_2.append(k)  # outer-ring last-train time
                    for k in j[2]:
                        time_wai_1.append(k)  # outer-ring first-train time
        try:
            # Skip '-'/'--' placeholder cells, falling back to the neighbouring cell.
            if time_nei_1[0] != '--' and time_nei_1[0] != '-':
                route_info['startTime'] = time_nei_1[0]
            else:
                route_info['startTime'] = time_nei_1[1]
            if time_nei_2[-1] != '--' and time_nei_2[-1] != '-':
                route_info['endTime'] = time_nei_2[-1]
            else:
                route_info['endTime'] = time_nei_2[-2]
            if time_wai_1[-1] != '--' and time_wai_1[-1] != '-':
                route_info_wai['startTime'] = time_wai_1[-1]
            else:
                route_info_wai['startTime'] = time_wai_1[-2]
            if time_wai_2[0] != '--' and time_wai_2[0] != '-':
                route_info_wai['endTime'] = time_wai_2[0]
            else:
                route_info_wai['endTime'] = time_wai_2[1]
        except IndexError:
            # The times cannot be found: catch the index error and assign defaults.
            route_info['startTime'] = '06:00'
            route_info['endTime'] = '23:00'
            route_info_wai['startTime'] = '06:00'
            route_info_wai['endTime'] = '23:00'
        route_info['stations'] = stations_nei
        # The outer ring runs the reverse direction, so reverse the station list.
        route_info_wai['stations'] = list(reversed(stations_nei))
        route_list.append(route_info)
        route_list.append(route_info_wai)
    else:
        # The line name already contains "inner ring" or "outer ring".
        route_info = {}
        stations = []
        route_info['name'] = url['key']
        time_1 = []  # first-train times
        time_2 = []  # last-train times
        for i in per_info:
            if i != []:
                j = i.findAll('td')
                if j != []:  # filter out header rows with no data cells
                    for k in j[0]:
                        stations.append(k.text)  # station names
                    for k in j[1]:
                        time_1.append(k)  # first-train time
                    for k in j[2]:
                        time_2.append(k)  # last-train time
        if time_1[0] != '--' and time_1[0] != '-':
            route_info['startTime'] = time_1[0]
        else:
            route_info['startTime'] = time_1[1]
        if time_2[-1] != '--' and time_2[-1] != '-':
            route_info['endTime'] = time_2[-1]
        else:
            route_info['endTime'] = time_2[-2]
        route_info['stations'] = stations
        route_list.append(route_info)
    return route_list
```
Here, the if statement (most of the code above) handles the unconventional line data, which confirms the old saying that "80% of the code handles 20% of the cases". Regular line data can be scraped with just the first few lines; finally the function returns the line list route_list.
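The repeated '--' checks in get_schedule() all follow one small rule: take the cell at the start (or end) of the list unless it is a placeholder, otherwise fall back to its neighbour. A hypothetical helper (pick_time is not in the original script) makes that rule testable in isolation:

```python
def pick_time(times, from_end=False):
    """Return the first (or last) time cell that is not a '-'/'--'
    placeholder, falling back to the adjacent cell, mirroring the
    if/else chains in get_schedule()."""
    idx, alt = (-1, -2) if from_end else (0, 1)
    return times[idx] if times[idx] not in ('-', '--') else times[alt]

print(pick_time(['06:30', '06:32']))                 # regular first-train time
print(pick_time(['--', '06:32']))                    # placeholder, use neighbour
print(pick_time(['22:58', '23:00'], from_end=True))  # last-train time
```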
Integrating the data

```python
for i in get_urls():           # iterate over the city's subway line URLs
    for j in get_schedule(i):
        # Serialize each record to JSON; ensure_ascii=False keeps
        # non-ASCII station names readable instead of escaping them.
        json_str = json.dumps(j, ensure_ascii=False)
        print(json_str)
```

This is the end of the article on how Python scrapes a subway line map. I hope the content above is helpful and that you learned something from it; if you think the article is good, please share it so more people can see it.
© 2024 shulou.com SLNews company. All rights reserved.