Many newcomers are not quite sure how to use Python to crawl the information of every Pizza Hut restaurant in China. To help with that, this article walks through the process in detail; anyone who needs it can follow along, and I hope you get something out of it.
When I first came into contact with Python, I was immediately fascinated by it. What attracts me to Python is not only writing web crawlers but also data analysis: being able to present large amounts of data graphically makes the data far easier to interpret.
The premise of data analysis is having data to analyze. What if there is none? One option is to download datasets from data sites, although their content may not be what you want; the other is to crawl the data from a website yourself.
Today, I will crawl all the Pizza Hut restaurant information across the country for follow-up data analysis.
01 The crawl target
The target we are going to crawl is Pizza Hut China. Open the Pizza Hut China home page and go to the "Restaurant Enquiry" page.
The data we want to crawl are the city, the restaurant name, the restaurant address and the restaurant's contact number. Since the page shows a map, it must also contain the longitude and latitude of each restaurant, so the coordinates are part of the data we want as well.
As for the list of cities with Pizza Hut restaurants in the country, we can get it through the "switch cities" on the page.
02 Analyzing the page
Before writing a crawler, I like to analyze the page briefly and then lay out the crawling approach; analyzing the page structure often brings unexpected gains.
We use the developer tools of the browser to simply analyze the page structure.
We can find the data we need in the StoreList page, which lets us work out the XPath expressions for data extraction.
The response content of the StoreList page is quite long. Before closing it, let's scroll further down to see whether there is anything else useful. In the end, we find the JavaScript function that the page calls to get the restaurant list.
Let's go on to search the GetStoreList function to see how the browser gets the restaurant list information.
From the code we can see that the page fetches the data with Ajax: it sends a POST request to http://www.pizzahut.com.cn/StoreList/Index carrying the parameters pageIndex and pageSize.
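As a quick, standalone check of this behaviour, a request sketch along the following lines could be used (the endpoint and the pageIndex/pageSize fields come from the analysis above; the User-Agent value here is just a placeholder):

import requests

# a minimal sketch of the Ajax call observed in the browser
resp = requests.post(
    'http://www.pizzahut.com.cn/StoreList/Index',
    data={'pageIndex': 1, 'pageSize': 50},
    headers={'User-Agent': 'Mozilla/5.0'},  # placeholder UA
)
print(resp.status_code)
print(resp.text[:200])  # an HTML fragment containing the restaurant list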
03 Crawling approach
After this page-structure analysis, we can lay out the crawling approach: first get the city information, then use each city as a parameter to build an HTTP request to the Pizza Hut server and fetch the data of all restaurants in that city.
In order to facilitate data crawling, I wrote all the cities to cities.txt. When we want to crawl the data, we read the city information from the file.
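The contents of cities.txt are not shown in the article; I assume it is simply one city name per line, for example:

Beijing
Shanghai
Shenzhen
Guangzhou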
The approach looks fine, but one problem remains: every time we open the Pizza Hut site, the page automatically locates to our own city. If we cannot get around this location detection, we can only grab the data of a single city.
So we scanned the home page again for anything usable and eventually found an iplocation field in the page's cookies. URL-decoding it yields a value like Shenzhen|0|0.
Seeing this, it dawned on me: Pizza Hut sets the initial city based on our IP address. If we can forge the iplocation field, we can switch to any city we like.
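Here is a small sketch of that idea, assuming the iplocation value is nothing more than the URL-encoded string "<city>|0|0":

from urllib.parse import quote, unquote

# forge the value for the city we want to crawl
raw = quote('Shenzhen|0|0')
print(raw)            # Shenzhen%7C0%7C0
print(unquote(raw))   # decodes back to Shenzhen|0|0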
04 Code implementation
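The original snippets do not show their import section; they rely on roughly the following preamble (the exact module list is my assumption):

# assumed imports for the snippets that follow
import time
import json
import requests
from lxml import etree
from urllib.parse import quote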
The first step is to read the city information from the file.
# all the cities in the country with Pizza Hut restaurants; I put them in a file
cities = []

def get_cities():
    """Read the cities from the file."""
    file_name = 'cities.txt'
    with open(file_name, 'r', encoding='UTF-8-sig') as file:
        for line in file:
            city = line.replace('\n', '')
            cities.append(city)
The second step is to iterate through the cities list in turn, taking each city as a parameter to construct the iplocation field of the Cookies.
# traverse the restaurants in every city
for city in cities:
    restaurants = get_stores(city, count)
    results[city] = restaurants
    count += 1
    time.sleep(2)
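The snippets never show how results and count are initialized; a minimal assumption consistent with the loop above is:

# assumed initialization, not shown in the original article
results = {}     # city -> list of restaurant dicts
count = 1        # simple progress counter passed to get_stores()
get_cities()     # fill the cities list before the loop runs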
Then we send a POST request to the Pizza Hut server carrying that cookie, and finally extract the data from the returned page.
def get_stores(city, count):
    """Get the restaurant information for a given city."""
    session = requests.Session()
    # URL-encode "city|0|0"
    city_urlencode = quote(city + '|0|0')
    # cookie jar used to store the home page cookies
    cookies = requests.cookies.RequestsCookieJar()
    headers = {
        'User-agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Host': 'www.pizzahut.com.cn',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
    }
    print('=== city No.', count, ':', city, '===')
    resp_from_index = session.get('http://www.pizzahut.com.cn/', headers=headers)
    # print(resp_from_index.cookies)
    # replace the iplocation field of the original cookies with the city we want to crawl
    cookies.set('AlteonP', resp_from_index.cookies['AlteonP'], domain='www.pizzahut.com.cn')
    cookies.set('iplocation', city_urlencode, domain='www.pizzahut.com.cn')
    # print(cookies)
    page = 1
    restaurants = []
    while True:
        data = {
            'pageIndex': page,
            'pageSize': "50",
        }
        response = session.post('http://www.pizzahut.com.cn/StoreList/Index',
                                headers=headers, data=data, cookies=cookies)
        html = etree.HTML(response.text)
        # get the div tags that hold the restaurant entries
        divs = html.xpath("//div[@class='re_RNew']")
        temp_items = []
        for div in divs:
            item = {}
            content = div.xpath('./@onclick')[0]
            # e.g. ClickStore('22.538912,114.09803|City Square|second floor, CITIC City Plaza, Shennan Middle Road|0755-25942012','GZH519')
            # strip the function name, the parentheses and the second argument
            content = content.split("('")[1].split("')")[0].split("','")[0]
            if len(content.split('|')) == 4:
                item['coordinate'] = content.split('|')[0]
                item['restaurant_name'] = content.split('|')[1] + ' restaurant'
                item['address'] = content.split('|')[2]
                item['phone'] = content.split('|')[3]
            else:
                item['restaurant_name'] = content.split('|')[0] + ' restaurant'
                item['address'] = content.split('|')[1]
                item['phone'] = content.split('|')[2]
            print(item)
            temp_items.append(item)
        if not temp_items:
            break
        restaurants += temp_items
        page += 1
        time.sleep(5)
    return restaurants
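For the example onclick value shown in the comment above, the parsing would yield roughly the following item (the comma in the coordinate is my assumption about the original separator):

# {'coordinate': '22.538912,114.09803',
#  'restaurant_name': 'City Square restaurant',
#  'address': 'second floor, CITIC City Plaza, Shennan Middle Road',
#  'phone': '0755-25942012'}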
The third step is to write each city and all the restaurants in that city into a JSON file.
with open('results.json', 'w', encoding='UTF-8') as file:
    file.write(json.dumps(results, indent=4, ensure_ascii=False))
05 Crawl results
After the program is run, a file named "results.json" is generated in the current directory.