This article mainly explains how to use the Python Scrapy framework to crawl the data of a food forum. Interested readers may want to take a look: the method introduced here is simple, fast, and practical. Let's learn how to use the Python Scrapy framework to crawl the data of a food forum!
I. Preface
A web crawler (also known as a web spider or web robot) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less frequently used names include ant, automatic indexer, and worm.
In plain terms, crawlers are used to acquire data on a large scale and on a regular schedule, after which the data is processed and applied; this is one of the essential supporting capabilities in big data, finance, machine learning, and so on.
At present, in first-tier cities, the salary and benefits for crawler engineers are quite attractive, and moving up later to mid-level or senior crawler engineer, data analyst, or big data development positions makes for a very good career transition.
II. Project objectives
In fact, the project introduced here does not need to be overly complicated. The ultimate goal is to crawl every reply of each post into the database, keep the data up to date, and take measures such as preventing duplicate crawling and dealing with anti-crawler protections.
III. Project preparation
Software: PyCharm
Required libraries: Scrapy, selenium, pymongo, user_agent, datetime
Target website: http://bbs.foodmate.net
Plug-in: chromedriver (the version must match the installed Chrome browser; a quick check is sketched below)
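A quick way to sanity-check the chromedriver pairing (a hedged sketch, not part of the project's code; the capability keys are the ones chromedriver normally reports, but treat the details as assumptions):

    # Hedged sanity check that chromedriver and the installed Chrome are compatible.
    # The path below is a placeholder; the capability keys are assumptions based on
    # what chromedriver usually reports.
    from selenium import webdriver

    driver = webdriver.Chrome(executable_path='chromedriver')  # placeholder path
    caps = driver.capabilities
    print(caps.get('browserVersion') or caps.get('version'))       # Chrome version
    print(caps.get('chrome', {}).get('chromedriverVersion'))       # chromedriver version
    driver.quit()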
IV. Project analysis
1. Determine the structure of the target website
In short: determine how the site is loaded, how to navigate correctly into each post to grab its data, what format to save the data in, and so on.
Second, observe the site's hierarchical structure, that is, how to drill down section by section into the post pages. This matters a great deal for this crawling task and is where most of the code-writing effort goes.
2. How to choose the right way to crawl data?
The crawling approaches I know of so far are roughly the following (an incomplete list, but these are the most commonly used):
1) requests library: this HTTP library lets you fetch the required data very flexibly; it is simple, although the workflow is slightly tedious, and it can be combined with a packet-capture tool to work out what to request. However, you need to get the headers and the corresponding request parameters right, otherwise no data comes back. For a lot of app crawling and image/video crawling it is comparatively lightweight and flexible, and high concurrency and distributed deployment are also very flexible, so the required functionality can be implemented well. (A minimal sketch appears right after this list.)
2) scrapy framework: Scrapy is arguably the most commonly used and the most comfortable crawler framework, and it has many advantages: it is asynchronous; it uses the more readable XPath instead of regular expressions; it has a powerful statistics and logging system; it can crawl different URLs at the same time; it supports a shell mode for convenient standalone debugging; it supports writing middleware, which makes it easy to add unified filters; items can be stored in a database through pipelines; and so on. This is also the framework (combined with the selenium library) used in this article.
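As mentioned in 1) above, here is a minimal sketch of the requests approach (purely illustrative and not part of this project's code; the User-Agent string and the selector are placeholder assumptions):

    # Hedged sketch of the requests approach; headers and selector are placeholders.
    import requests
    from bs4 import BeautifulSoup

    headers = {'User-Agent': 'Mozilla/5.0'}   # without a plausible header many sites return nothing
    resp = requests.get('http://bbs.foodmate.net', headers=headers)
    resp.encoding = resp.apparent_encoding    # guard against a mis-detected encoding
    soup = BeautifulSoup(resp.text, 'html.parser')
    for a in soup.select('a[href]'):          # placeholder selector; adjust to the real page structure
        print(a.get_text(strip=True), a['href'])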
V. Project implementation
1. Step 1: determine the type of website
First, an explanation of what this means: when sizing up a website, start with how it loads, whether it is statically loaded, dynamically loaded (via JS), or something else, because different loading methods have to be handled differently. Looking at the site we are crawling today, we find it is a forum with a certain sense of age, so the first guess is that it is statically loaded. We then open the page with a plug-in that blocks JS from loading, as shown in the figure below.
After refreshing, it turns out that this is indeed a static website (if the content still loads normally with JS blocked, the page is basically statically loaded).
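As a quick cross-check (a hedged sketch, not part of the project's code), you can fetch the raw HTML with requests and see whether the listing content is already present without any JavaScript execution; the marker string is an assumption:

    # Hedged static-loading check: if the content is already in the raw HTML
    # returned by a plain GET, the page is statically loaded for our purposes.
    import requests

    html = requests.get('http://bbs.foodmate.net').text
    # 'forum.php' is an assumed marker; substitute any text visible on the rendered page
    print('statically loaded?', 'forum.php' in html)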
2. Step 2: determine the hierarchical relationship
Second, the website we are crawling today is the Food Forum site, which is statically loaded, as we already established in the analysis above. Next comes its hierarchical structure: a first-level section list, then the post list within each section, then the content of each post.
Part of the code is shown below:
First-level interface:
def parse(self, response):
    self.logger.info("Entered the site homepage!")
    self.logger.info("Getting the section list!")
    column_path_list = response.css('#ct > div.mn > div:nth-child(2) > div')[:-1]
    for column_path in column_path_list:
        col_paths = column_path.css('div > table > tbody > tr > td > div > a').xpath('@href').extract()
        for path in col_paths:
            block_url = response.urljoin(path)
            yield scrapy.Request(
                url=block_url,
                callback=self.get_next_path
            )
Second-level interface:
def get_next_path(self, response):
    self.logger.info("Entered the board!")
    self.logger.info("Getting the article list!")
    if response.url == 'http://www.foodmate.net/know/':
        pass
    else:
        try:
            nums = response.css(
                '#fd_page_bottom > div > label > span::text').extract_first().split(' ')[-2]
        except:
            nums = 1
        for num in range(1, int(nums) + 1):
            tbody_list = response.css('#threadlisttableid > tbody')
            for tbody in tbody_list:
                if 'normalthread' in str(tbody):
                    item = LunTanItem()
                    item['article_url'] = response.urljoin(
                        tbody.css('* > tr > th > a.s.xst').xpath('@href').extract_first())
                    item['type'] = response.css(
                        '#ct > div > div.bm.bml.pbn > div.bm_h.cl > h2 > a::text').extract_first()
                    item['title'] = tbody.css('* > tr > th > a.s.xst::text').extract_first()
                    item['spider_type'] = "forum"
                    item['source'] = "food forum"
                    if item['article_url'] != 'http://bbs.foodmate.net/':
                        yield scrapy.Request(
                            url=item['article_url'],
                            callback=self.get_data,
                            meta={'item': item, 'content_info': []}
                        )
        try:
            callback_url = response.css('#fd_page_bottom > div > a.nxt').xpath('@href').extract_first()
            callback_url = response.urljoin(callback_url)
            yield scrapy.Request(url=callback_url, callback=self.get_next_path)
        except IndexError:
            pass
Third-level interface:
def get_data(self, response):
    self.logger.info("Crawling forum post data!")
    item = response.meta['item']
    content_list = []
    divs = response.xpath('//*[@id="postlist"]/div')
    user_name = response.css('div > div.pi > div:nth-child(1) > a::text').extract()
    publish_time = response.css('div.authi > em::text').extract()
    floor = divs.css('* strong > a > em::text').extract()
    s_id = divs.xpath('@id').extract()
    for i in range(len(divs) - 1):
        content = ''
        try:
            strong = response.css('#postmessage_' + s_id[i].split('_')[-1]).xpath('string(.)').extract()
            for s in strong:
                content += s.split(' ')[-1].lstrip('\r\n')
            datas = dict(
                content=content,                               # post content
                reply_id=0,                                    # floor being replied to, default 0
                user_name=user_name[i],                        # username
                publish_time=publish_time[i].split('于')[-1],  # strip the "posted at" prefix, %Y-%m-%d %H:%M:%S
                id='#' + floor[i]                              # floor number
            )
            content_list.append(datas)
        except IndexError:
            pass
    item['content_info'] = response.meta['content_info']
    item['scrawl_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    item['content_info'] += content_list
    data_url = response.css('#ct > div.pgbtn > a').xpath('@href').extract_first()
    if data_url is not None:
        data_url = response.urljoin(data_url)
        yield scrapy.Request(
            url=data_url,
            callback=self.get_data,
            meta={'item': item, 'content_info': item['content_info']}
        )
    else:
        item['scrawl_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        self.logger.info("Storing!")
        print('Saved successfully')
        yield item
3. Step 3: determine the crawling method
Since it is a static web page, my first decision was to use the Scrapy framework to fetch the data directly, and initial tests showed this was indeed feasible. But, young and reckless at the time, I underestimated the site's protection measures: being short on patience, I did not add a timer to limit the crawling speed, so the site rate-limited me. On top of that, the site switched from a statically loaded page to one that runs a dynamic verification step before serving the page, so direct requests are rejected by the backend.
But how could a problem like this stump someone as clever as me? After some brief thinking (one day), I switched the program to the Scrapy framework + selenium library approach: by calling chromedriver and simulating a browser visit, the page is crawled only after the site has finished loading. Subsequent experience proved that this method is indeed feasible and efficient.
The code for the implementation is as follows:
def process_request(self, request, spider):
    chrome_options = Options()
    chrome_options.add_argument('--headless')      # run Chrome in headless mode
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--no-sandbox')
    # specify the chromedriver path
    self.driver = webdriver.Chrome(
        chrome_options=chrome_options,
        executable_path='E:/pycharm/workspace/crawler/scrapy/chromedriver')
    if request.url != 'http://bbs.foodmate.net/':
        self.driver.get(request.url)
        html = self.driver.page_source
        time.sleep(1)
        self.driver.quit()
        return scrapy.http.HtmlResponse(
            url=request.url, body=html.encode('utf-8'),
            encoding='utf-8', request=request)
4. Step 4: determine the storage format of the crawled data
Not much needs to be said here: define the fields you need to crawl in items.py according to your own needs, and save in that format by referencing it in the project:
class LunTanItem(scrapy.Item):
    """
    Forum fields
    """
    title = Field()         # str | post title
    content_info = Field()  # list | format: [LunTanContentInfoItem1, LunTanContentInfoItem2]
    article_url = Field()   # str: url | post link
    scrawl_time = Field()   # str: time format like 2019-08-01 10:20:00 | time the data was crawled
    source = Field()        # str | forum name, e.g. unknown BBS, Shuimu Community, Tianya Forum
    type = Field()          # str | section type, e.g. 'finance', 'sports', 'society'
    spider_type = Field()   # str | always "forum"
5. Step 5: determine the database for storage
The database chosen for this project is MongoDB. As a non-relational database its advantages are obvious: the format requirements are loose, and it can flexibly store multi-dimensional data, which generally makes it a crawler's first-choice database (and don't bring up Redis with me, I'd use it too if I knew how, it's mainly that I don't).
Code:
import pymongo


class FMPipeline():
    def __init__(self):
        super(FMPipeline, self).__init__()
        # client = pymongo.MongoClient('139.217.92.75')
        client = pymongo.MongoClient('localhost')
        db = client.scrapy_FM
        self.collection = db.FM

    def process_item(self, item, spider):
        query = {'article_url': item['article_url']}
        self.collection.update_one(query, {"$set": dict(item)}, upsert=True)
        return item
At this point, a smart reader will ask: what happens if the same data is crawled twice? (in other words, the deduplication question)
I hadn't thought about this question before either; it was only later, when I asked the boss, that I learned it had already been handled when saving the data, with this one statement:
query = {'article_url': item['article_url']}
self.collection.update_one(query, {"$set": dict(item)}, upsert=True)
The post's URL is used to determine whether this piece of data has already been crawled; if it has, the new record simply overwrites the old one, which also means the stored data gets updated.
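As an extra safeguard (not part of the original project, just an assumption about how one might harden it), a unique index on article_url can be created once so that MongoDB itself rejects accidental duplicates:

    # Optional hardening (an assumption, not in the original project): a unique
    # index on article_url makes MongoDB enforce deduplication at the database level.
    import pymongo

    client = pymongo.MongoClient('localhost')
    collection = client.scrapy_FM.FM
    collection.create_index([('article_url', pymongo.ASCENDING)], unique=True)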
6. Other settings
Matters such as concurrency, the headers sent with requests, and the order in which pipelines run are all configured in the settings.py file; you can look at the project itself for the details, which won't be repeated here.
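As a rough sketch of what those entries usually look like (the module paths, header values, and numbers below are illustrative assumptions, not the project's actual configuration):

    # settings.py -- a hedged sketch; module paths, header values and limits are
    # illustrative assumptions rather than the project's real configuration.
    CONCURRENT_REQUESTS = 16       # how many requests Scrapy issues concurrently
    DOWNLOAD_DELAY = 1             # throttle requests a little to avoid bans

    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.9',
    }

    DOWNLOADER_MIDDLEWARES = {
        'foodmate.middlewares.ChromedriverMiddleware': 543,  # the selenium middleware shown above (path assumed)
    }

    ITEM_PIPELINES = {
        'foodmate.pipelines.FMPipeline': 300,                # the MongoDB pipeline shown above (path assumed)
    }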
VI. Effect display
1. Click run, and the result is displayed in the console, as shown in the following figure.
2. Along the way, a large number of posts pile up in the crawl task queue and are then processed concurrently; I set it to 16 concurrent requests, and the speed is quite respectable.
All the replies in each post, together with the public information of the users involved, are stored in content_info.
At this point, I believe you have a deeper understanding of how to use the Python Scrapy framework to crawl the data of a food forum. You might as well try it out in practice.