Shulou(Shulou.com)06/03 Report--
The editor would like to share with you how Python crawlers are classified. Most people probably don't know much about this, so this article is shared for your reference; I hope you learn a lot from reading it. Let's take a look!
1. By purpose, crawlers can be divided into functional crawlers and incremental data crawlers.
2. By whether the URL changes when new content appears, incremental data crawlers can be divided into those where new data shows up at new URLs (the address changes) and those where the address stays the same but the page content changes.
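Both kinds of incremental crawler shown below rely on the same mechanism: remember what has already been seen in a Redis set, and let sadd decide whether a value is new (it returns 1 when the member is added for the first time and 0 when it already exists). Here is a minimal sketch of the two deduplication strategies outside of Scrapy, assuming a local Redis instance on 127.0.0.1:6379 and the redis-py package; the key names dedup_urls and dedup_fingerprints are only illustrative:

import hashlib

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379)


def is_new_url(url):
    # sadd returns 1 if the member was added (never seen before), 0 otherwise
    return conn.sadd('dedup_urls', url) == 1


def is_new_record(author, content):
    # when the url never changes, fingerprint the parsed data instead
    fingerprint = hashlib.sha256((author + content).encode()).hexdigest()
    return conn.sadd('dedup_fingerprints', fingerprint) == 1


if is_new_url('https://example.com/detail/1.html'):
    print('new detail page, request it')

if is_new_record('some author', 'some joke text'):
    print('new record, store it')

Because the Redis set survives between runs, a restarted spider only requests or stores what it has not seen before, which is what makes the crawl incremental.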
Example 1: incremental crawling keyed on URL change (new data appears at new detail-page URLs)
# 1. spider file
import scrapy
from movieAddPro.items import MovieaddproItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis


class MovieaddSpider(CrawlSpider):
    name = 'movieadd'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/frim/index1.html']

    # follow the paginated list pages, e.g. /frim/index1-2.html
    link = LinkExtractor(allow=r'/frim/index1-\d+\.html')
    rules = (
        Rule(link, callback='parse_item', follow=True),
    )

    # create a redis connection object shared by the spider and the pipeline
    conn = Redis(host='127.0.0.1', port=6379)

    # parse the movie title and the url of its detail page
    def parse_item(self, response):
        li_list = response.xpath('/html/body/div[1]/div/div/div/div[2]/ul/li')
        for li in li_list:
            title = li.xpath('./div/a/@title').extract_first()
            # build the absolute url of the detail page
            detail_url = 'https://www.4567tv.tv' + li.xpath('./div/a/@href').extract_first()
            item = MovieaddproItem()
            item['title'] = title
            # decide whether the detail page still needs to be requested:
            # sadd returns 1 if the url was not yet in the redis set, 0 if it was
            ex = self.conn.sadd('movieadd_detail_urls', detail_url)
            if ex == 1:
                # detail_url was not in the redis set before, so request it
                print('New data found, crawling...')
                yield scrapy.Request(url=detail_url, callback=self.parse_detail,
                                     meta={'item': item})
            else:
                print('No new data to crawl yet.')

    def parse_detail(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[3]/text()').extract_first()
        item['desc'] = desc
        yield item

----------------------------------------

# 2. pipelines file
class MovieaddproPipeline(object):
    def process_item(self, item, spider):
        dic = {'title': item['title'], 'desc': item['desc']}
        print(dic)
        # reuse the redis connection created on the spider;
        # serialize the dict, since newer redis-py versions only accept str/bytes/numbers
        conn = spider.conn
        conn.lpush('movieadd_data', str(dic))
        return item

----------------------------------------

# 3. items file
import scrapy


class MovieaddproItem(scrapy.Item):
    title = scrapy.Field()
    desc = scrapy.Field()

----------------------------------------

# 4. settings file
BOT_NAME = 'movieAddPro'

SPIDER_MODULES = ['movieAddPro.spiders']
NEWSPIDER_MODULE = 'movieAddPro.spiders'

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36')

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'ERROR'

ITEM_PIPELINES = {
    'movieAddPro.pipelines.MovieaddproPipeline': 300,
}
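To try the example, start Redis, run the spider with scrapy crawl movieadd, and run it again later: on the second run only detail pages whose URLs are not yet in the movieadd_detail_urls set are requested. Below is a small sketch, assuming the key names used above and a local Redis instance, for inspecting what ended up in Redis afterwards:

# inspect what the spider and pipeline stored (run after `scrapy crawl movieadd`)
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, decode_responses=True)

# detail urls that have been seen so far
print(conn.scard('movieadd_detail_urls'), 'detail urls recorded')

# scraped records, newest first because lpush prepends
for raw in conn.lrange('movieadd_data', 0, -1):
    print(raw)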
Example 2: incremental crawling keyed on content change (the URLs repeat, but new data appears)

Requirement: crawl the jokes and their authors from the Qiushibaike text section. The list-page URLs stay the same, so each record is deduplicated by a fingerprint of its content rather than by its URL.

# 1. spider file
import hashlib

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from redis import Redis

from incrementByDataPro.items import IncrementbydataproItem


class QiubaiSpider(CrawlSpider):
    name = 'qiubai'
    start_urls = ['https://www.qiushibaike.com/text/']
    rules = (
        Rule(LinkExtractor(allow=r'/text/page/\d+/'), callback='parse_item', follow=True),
        Rule(LinkExtractor(allow=r'/text/$'), callback='parse_item', follow=True),
    )

    # create a redis connection object
    conn = Redis(host='127.0.0.1', port=6379)

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            item = IncrementbydataproItem()
            item['author'] = div.xpath('./div[1]/a[2]/h3/text() | ./div[1]/span[2]/h3/text()').extract_first()
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first()

            # build a unique fingerprint of the parsed record for redis storage
            source = item['author'] + item['content']
            source_id = hashlib.sha256(source.encode()).hexdigest()
            # store the fingerprint in the redis set 'data_id';
            # sadd returns 1 only if this fingerprint has not been seen before
            ex = self.conn.sadd('data_id', source_id)
            if ex == 1:
                print('This record has not been crawled yet, crawling...')
                yield item
            else:
                print('This record has already been crawled, no need to crawl it again.')

----------------------------------------

# 2. pipelines file
from redis import Redis


class IncrementbydataproPipeline(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {'author': item['author'], 'content': item['content']}
        print(dic)
        # serialize the dict, since newer redis-py versions only accept str/bytes/numbers
        self.conn.lpush('qiubaiData', str(dic))
        return item

The above is all the content of the article "How to classify Python crawlers". Thank you for reading! I believe you now have some understanding of the topic, and I hope what was shared here helps you. If you want to learn more, welcome to follow the industry information channel!