In this issue, the editor looks at how to use Python to crawl travel data from Mafengwo. The article is rich in content and analyzes the topic from a practical point of view; I hope you get something out of it after reading.
It is the height of the summer vacation, and WeChat Moments have been flooded with everyone's travel footprints. I am genuinely amazed by those friends who have traveled across every province of the country, and it also prompted this travel-related analysis. The data comes from a crawler-friendly travel guide website: Mafengwo.
I. Obtaining the city codes
Every city, scenic spot and other entity on Mafengwo has its own five-digit code. The first thing we need to do is collect the codes of the cities (municipalities directly under the central government plus prefecture-level cities) for further analysis.
The above two pages are the source of our city codes: first get each province's code from the destination page, then go to that province's city list to collect the codes of its cities.
Selenium is required along the way to crawl the dynamically loaded data. Part of the code is as follows:
# Imports used throughout Part I
import time
import pandas as pd
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from selenium import webdriver

def find_cat_url(url):
    """Collect the provincial destination links from the destination page."""
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = Request(url, headers=headers)
    html = urlopen(req)
    bsObj = BeautifulSoup(html.read(), 'html.parser')
    bs = bsObj.find('div', attrs={'class': 'hot-list clearfix'}).find_all('dt')
    cat_url = []
    cat_name = []
    for i in range(len(bs)):
        for j in range(len(bs[i].find_all('a'))):
            cat_url.append(bs[i].find_all('a')[j].attrs['href'])
            cat_name.append(bs[i].find_all('a')[j].text)
    cat_url = ['http://www.mafengwo.cn' + cat_url[i] for i in range(len(cat_url))]
    return cat_url
def find_city_url(url_list):
    """Walk each province's city list with Selenium and collect city names and codes."""
    city_name_list = []
    city_url_list = []
    for i in range(len(url_list)):
        driver = webdriver.Chrome()
        driver.maximize_window()
        url = url_list[i].replace('travel-scenic-spot/mafengwo', 'mdd/citylist')
        driver.get(url)
        while True:
            try:
                time.sleep(2)
                bs = BeautifulSoup(driver.page_source, 'html.parser')
                # destination links in the city list (the attribute value is
                # '目的地', i.e. 'destination', on the live site)
                url_set = bs.find_all('a', attrs={'data-type': '目的地'})
                city_name_list = city_name_list + [url_set[i].text.replace('\n', '').split()[0] for i in range(len(url_set))]
                city_url_list = city_url_list + [url_set[i].attrs['data-id'] for i in range(len(url_set))]
                js = "var q=document.documentElement.scrollTop=800"
                driver.execute_script(js)
                time.sleep(2)
                driver.find_element_by_class_name('pg-next').click()
            except:
                # no next page: move on to the next province
                break
        driver.close()
    return city_name_list, city_url_list
url = 'http://www.mafengwo.cn/mdd/'
url_list = find_cat_url(url)
city_name_list, city_url_list = find_city_url(url_list)
city = pd.DataFrame({'city': city_name_list, 'id': city_url_list})
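The Selenium pass over every provincial city list is slow, so it can be worth caching the resulting table straight away; a small optional sketch (the filename below is arbitrary):

# Optional: persist the city codes so the Selenium crawl need not be repeated.
city.to_csv('mafengwo_city_codes.csv', index=False, encoding='utf-8')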
II. Obtaining city information
City data is obtained from the following pages:
(a) the snacks page
(b) the attractions page
(c) the tags page
We encapsulate the process of obtaining each city's data into a function; each call passes in a city code obtained earlier. Part of the code is as follows:
def get_city_info(city_name, city_code):
    """Fetch the base stats, snack data and attraction data for one city."""
    this_city_base = get_city_base(city_name, city_code)
    this_city_jd = get_city_jd(city_name, city_code)
    this_city_jd['city_name'] = city_name
    this_city_jd['total_city_yj'] = this_city_base['total_city_yj']
    try:
        this_city_food = get_city_food(city_name, city_code)
        this_city_food['city_name'] = city_name
        this_city_food['total_city_yj'] = this_city_base['total_city_yj']
    except:
        # some cities have no snack page
        this_city_food = pd.DataFrame()
    return this_city_base, this_city_food, this_city_jd
def get_city_base(city_name, city_code):
    """Collect the tag counts and the total number of travel notes for one city."""
    url = 'http://www.mafengwo.cn/xc/' + str(city_code) + '/'
    bsObj = get_static_url_content(url)
    node = bsObj.find('div', {'class': 'm-tags'}).find('div', {'class': 'bd'}).find_all('a')
    tag = [node[i].text.split()[0] for i in range(len(node))]
    tag_node = bsObj.find('div', {'class': 'm-tags'}).find('div', {'class': 'bd'}).find_all('em')
    tag_count = [int(k.text) for k in tag_node]
    # the two-letter prefix of each tag link tells us its category (jd/cy/gw/yl)
    par = [k.attrs['href'][1:3] for k in node]
    tag_all_count = sum([int(tag_count[i]) for i in range(len(tag_count))])
    tag_jd_count = sum([int(tag_count[i]) for i in range(len(tag_count)) if par[i] == 'jd'])
    tag_cy_count = sum([int(tag_count[i]) for i in range(len(tag_count)) if par[i] == 'cy'])
    tag_gw_yl_count = sum([int(tag_count[i]) for i in range(len(tag_count)) if par[i] in ['gw', 'yl']])
    url = 'http://www.mafengwo.cn/yj/' + str(city_code) + '/2-0-1.html'
    bsObj = get_static_url_content(url)
    total_city_yj = int(bsObj.find('span', {'class': 'count'}).find_all('span')[1].text)
    return {'city_name': city_name,
            'tag_all_count': tag_all_count,
            'tag_jd_count': tag_jd_count,
            'tag_cy_count': tag_cy_count,
            'tag_gw_yl_count': tag_gw_yl_count,
            'total_city_yj': total_city_yj}
def get_city_food(city_name, city_code):
    """Collect the snack ranking for one city."""
    url = 'http://www.mafengwo.cn/cy/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    food = [k.text for k in bsObj.find('ol', {'class': 'list-rank'}).find_all('h4')]
    food_count = [int(k.text) for k in bsObj.find('ol', {'class': 'list-rank'}).find_all('span', {'class': 'trend'})]
    return pd.DataFrame({'food': food[0:len(food_count)], 'food_count': food_count})
def get_city_jd(city_name, city_code):
    """Collect the top scenic spots and their comment counts for one city."""
    url = 'http://www.mafengwo.cn/jd/' + str(city_code) + '/gonglve.html'
    bsObj = get_static_url_content(url)
    node = bsObj.find('div', {'class': 'row-top5'}).find_all('h4')
    jd = [k.text.split('\n')[2] for k in node]
    node = bsObj.find_all('span', {'class': 'rev-total'})
    # strip the '条点评' (reviews) suffix from the comment count
    jd_count = [int(k.text.replace('条点评', '')) for k in node]
    return pd.DataFrame({'jd': jd[0:len(jd_count)], 'jd_count': jd_count})
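The functions above call a helper get_static_url_content() that the excerpt never shows, and the article does not include the loop that drives them either. Below is a minimal sketch of both, assuming the helper simply fetches a page with a browser User-Agent and returns a parsed BeautifulSoup object, and reusing the city table built in Part I; error handling and rate limiting are kept to a minimum.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

def get_static_url_content(url):
    # assumed helper: fetch a static page and return its BeautifulSoup tree
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
    req = Request(url, headers=headers)
    html = urlopen(req)
    return BeautifulSoup(html.read(), 'html.parser')

# Hypothetical driver loop over the city DataFrame from Part I,
# collecting one record (or frame) per city.
base_list, food_list, jd_list = [], [], []
for _, row in city.iterrows():
    this_base, this_food, this_jd = get_city_info(row['city'], row['id'])
    base_list.append(this_base)
    food_list.append(this_food)
    jd_list.append(this_jd)
    time.sleep(1)  # be polite to the site
city_base = pd.DataFrame(base_list)
city_food = pd.concat(food_list, ignore_index=True)
city_jd = pd.concat(jd_list, ignore_index=True)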
III. Data analysis
PART1: city data
First of all, let's look at the cities with the largest number of travel notes:
The top cities by travel-note count are basically consistent with the popular cities we already know. Based on each city's travel-note count, we can further draw a heat map of travel destinations across the country:
Seeing this, do you get a sense of déjà vu? If the footprints you posted to Moments match this picture, then Mafengwo's data coincides with yours.
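The heat map itself is an image in the original post; a hedged sketch of how such a map could be produced with the Geo chart from the same pyecharts 0.x family used later in this article, assuming the city_base frame from the driver-loop sketch above:

from pyecharts import Geo  # pyecharts 0.x API

# Hypothetical heat map of travel-note counts per city. Cities whose names
# pyecharts cannot geo-locate would need to be filtered out first.
geo = Geo("Travel-note heat map", title_pos="center")
attr = list(city_base['city_name'])
value = list(city_base['total_city_yj'])
geo.add("", attr, value, type="heatmap",
        visual_range=[0, max(value)], is_visualmap=True,
        visual_text_color="#fff")
geo.render('travel_note_heatmap.html')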
Next, let's look at everyone's impression of each city by extracting the attributes in its tags. We divide the attributes into three groups: leisure, food and scenic spots, and look at the cities that left the deepest impression under each group:
It seems that for Mafengwo users, Xiamen has left a very deep impression: not only does it have plenty of travel notes, but many valid tags can be extracted from them. Chongqing, Xi'an and Chengdu also left a deep impression on foodies. Part of the code is as follows:
from pyecharts import Bar, Grid  # pyecharts 0.x API

bar1 = Bar("Catering tag ranking")
bar1.add("Catering tag score",
         city_aggregate.sort_values('cy_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('cy_point', 0, False)['cy_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30)
bar2 = Bar("Scenic spot tag ranking", title_top="30%")
bar2.add("Scenic spot tag score",
         city_aggregate.sort_values('jd_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('jd_point', 0, False)['jd_point'][0:15],
         legend_top="30%", is_splitline_show=False, xaxis_rotate=30)
bar3 = Bar("Leisure tag ranking", title_top="67.5%")
bar3.add("Leisure tag score",
         city_aggregate.sort_values('xx_point', 0, False)['city_name'][0:15],
         city_aggregate.sort_values('xx_point', 0, False)['xx_point'][0:15],
         legend_top="67.5%", is_splitline_show=False, xaxis_rotate=30)
grid = Grid(height=800)
grid.add(bar1, grid_bottom="75%")
grid.add(bar2, grid_bottom="37.5%", grid_top="37.5%")
grid.add(bar3, grid_top="75%")
grid.render('city classification tags.html')
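Note that the plotting code above assumes a city_aggregate DataFrame with cy_point, jd_point and xx_point columns, which the excerpt never constructs. A plausible sketch, normalising each tag-group count by the city's travel-note count so that sheer volume does not dominate the ranking:

# Hypothetical construction of city_aggregate (not shown in the original article).
city_aggregate = city_base.copy()
city_aggregate['cy_point'] = city_aggregate['tag_cy_count'] / city_aggregate['total_city_yj']    # food tags
city_aggregate['jd_point'] = city_aggregate['tag_jd_count'] / city_aggregate['total_city_yj']    # scenic spot tags
city_aggregate['xx_point'] = city_aggregate['tag_gw_yl_count'] / city_aggregate['total_city_yj'] # leisure tags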
PART2: scenic spot data
We extract the number of comments on each scenic spot and compare it with the number of travel notes for its city, obtaining both an absolute and a relative figure, from which we calculate two scores for each spot: popularity and representativeness. The final scenic spot rankings are as follows:
Mafengwo netizens really have a soft spot for Xiamen, and Gulangyu has become a top popular scenic spot, while Xitang Ancient Town and Yamdrok Lake (Yang Zhuo Yong Cuo) rank among the best for city representativeness. During the summer vacation, if you are worried that the top-ranked spots will be too crowded, you might as well dig into the lower-ranked spots with fewer visitors and beautiful scenery.
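The excerpt does not show how the two scores are computed; below is a hedged sketch consistent with the rq_point / db_point column names used by the plotting code later in this article, deriving popularity from absolute comment volume (log-scaled) and representativeness from comments relative to the city's travel-note count, using the city_jd frame from the driver-loop sketch above.

import numpy as np

# Hypothetical scoring scheme (the article does not specify the formulas).
city_jd_com = city_jd.copy()
city_jd_com['rq_point'] = np.log1p(city_jd_com['jd_count'])
city_jd_com['db_point'] = city_jd_com['jd_count'] / city_jd_com['total_city_yj']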
PART3: snack data
Finally, let's look at the food-related data that everyone cares about most. The processing method is similar to that of the PART2 scenic spot data; let's look at the most popular snacks and the snacks most representative of each city.
Unexpectedly, Mafengwo netizens really do love Xiamen: among the popular snacks, shacha noodles (sha cha mian) turn out to be more popular than hot pot, roast duck and roast buns.
In terms of city representativeness, seafood appears very frequently, which coincides with my own perception. Part of the code for PART2 and PART3 is as follows:
bar1 = Bar("Scenic spot popularity ranking")
bar1.add("Scenic spot popularity score",
         city_jd_com.sort_values('rq_point', 0, False)['jd'][0:15],
         city_jd_com.sort_values('rq_point', 0, False)['rq_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30)
bar2 = Bar("Scenic spot representativeness ranking", title_top="55%")
bar2.add("Scenic spot representativeness score",
         city_jd_com.sort_values('db_point', 0, False)['jd'][0:15],
         city_jd_com.sort_values('db_point', 0, False)['db_point'][0:15],
         is_splitline_show=False, xaxis_rotate=30, legend_top="55%")
grid = Grid(height=800)
grid.add(bar1, grid_bottom="60%")
grid.add(bar2, grid_top="60%", grid_bottom="10%")
grid.render('scenic spot ranking.html')

That is the editor's walkthrough of crawling Mafengwo travel data with Python. If you have similar questions, you might as well refer to the above analysis; if you want to learn more, you are welcome to follow the industry information channel.