The editor shares with you how to use the Scrapy framework to crawl data from the Meituan website. I believe most people do not know much about it, so this article is shared for your reference; I hope you gain a lot from reading it. Let's take a look!
Recently I have been exploring how to crawl data from Meituan's website with the Scrapy framework.
The first step is to crawl the region information. Open Meituan's official site, click to switch regions, press F12, and select the XHR tab, which filters out the asynchronous requests. This surfaces the JSON data with Meituan's region information; copy its link: http://www.meituan.com/ptapi/getprovincecityinfo/
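As a quick check outside Scrapy, that endpoint can be fetched directly. Below is a minimal sketch using the requests library; that the response body is a JSON list of provinces is an assumption based on the description above, not a guaranteed schema.

import json
import requests

# Endpoint discovered via the XHR filter in the browser developer tools
URL = "http://www.meituan.com/ptapi/getprovincecityinfo/"

resp = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()

provinces = resp.json()  # assumed: a list of provinces with nested city info
print(json.dumps(provinces[:1], ensure_ascii=False, indent=2))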
Parsing this JSON yields region information down to the district and county level, but that is not convenient for the later crawl and would lead to duplicate crawling. I keep only the city-level information and then crawl using the region classification shown in the middle of the page.
Save the obtained data to a MongoDB database: first the provinces, then the cities, then the districts, then the streets, and finally crawl the data according to each street's URL.
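The save itself can live in a Scrapy item pipeline. The sketch below uses pymongo; the database name, the one-collection-per-level layout, and the "level" field on each item are assumptions for illustration, not the article's actual code.

import pymongo

class MeituanMongoPipeline:
    """Persist crawled region items into MongoDB, one collection per level."""

    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017/")
        self.db = self.client["meituan"]  # assumed database name

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # item["level"] is assumed to be one of: province, city, district, street
        self.db[item["level"]].insert_one(dict(item))
        return item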
The code to get the provinces and cities works along the following lines.
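Here is a minimal sketch of such a spider, assuming the endpoint returns a JSON list of provinces and that each province carries a cityInfoList field; these field names are assumptions about the schema, not confirmed ones.

import json
import scrapy

class RegionSpider(scrapy.Spider):
    name = "meituan_region"
    start_urls = ["http://www.meituan.com/ptapi/getprovincecityinfo/"]

    def parse(self, response):
        for province in json.loads(response.text):
            # provinceName / cityInfoList / name / url are assumed field names
            yield {"level": "province", "name": province.get("provinceName")}
            for city in province.get("cityInfoList", []):
                yield {"level": "city", "name": city.get("name"), "url": city.get("url")}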
Looking at the JSON data obtained, the entries at the front of each province's list are the city-level ones, so I build a configuration file recording the number of cities in each province and read that count back from it.
By reading the configuration file, the districts and counties are filtered out, leaving only the city-level information. The configuration is read with the configparser module, and the result is saved to the database.
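A sketch of that step with configparser; the file name, the one-section-per-province layout, and the count option are assumptions for illustration.

import configparser

config = configparser.ConfigParser()
config.read("city_count.ini", encoding="utf-8")

# Assumed layout: one section per province, whose "count" option says how many
# entries at the head of that province's list are city-level
city_counts = {p: config.getint(p, "count") for p in config.sections()}

def keep_city_level(province, entries):
    """Drop district/county entries, keeping only the leading city-level ones."""
    return entries[: city_counts[province]]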
The Scrapy framework obeys robots.txt rules by default, so access is denied; disable this in settings.py:
ROBOTSTXT_OBEY = False
To avoid 403 errors on requests, also adjust the following in settings.py:

# Forge a user agent string to prevent 403 responses
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

# Allow 403 responses through instead of letting them abort the crawl
HTTPERROR_ALLOWED_CODES = [403]
The above is the full content of "How to use the Scrapy framework to crawl data from Meituan's website". Thank you for reading! I hope the shared content has been helpful to you.