How Python Crawls CSDN Hot-List Title Hot Words with the Scrapy Framework

This article shows how to use Python with the Scrapy framework to crawl CSDN hot-list titles and extract their hot words. I hope you learn something from it; let's work through it together!
Environment setup
Scrapy installation
pip install scrapy -i https://pypi.douban.com/simple
Selenium installation
pip install selenium -i https://pypi.douban.com/simple
Jieba installation
pip install jieba -i https://pypi.douban.com/simple
IDE: PyCharm
ChromeDriver: check your Chrome browser version first, then download the matching driver version from the ChromeDriver download page.
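A minimal sketch to confirm Selenium can drive the driver you downloaded; the executable_path below is an assumption taken from the spider later in this article, so point it at wherever you unpacked your own copy:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver

# Path is illustrative; adjust it to your own chromedriver binary.
browser = webdriver.Chrome(executable_path="E:\\chromedriver_win32\\chromedriver.exe")
browser.get('https://blog.csdn.net/rank/list')
print(browser.title)  # prints the page title if driver and browser versions match
browser.quit()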
Implementation process
Let's get started.
Create a project
Create our project using the scrapy command.
scrapy startproject csdn_hot_words
The generated project structure follows the official Scrapy layout.
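For reference, this is the layout scrapy startproject generates; note that the tools package (holding analyse_sentence.py) and the top-level main.py are additions made by hand later in this walkthrough, not part of the generated template:

csdn_hot_words/
    scrapy.cfg
    main.py                  (added later)
    csdn_hot_words/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        tools/
            analyse_sentence.py   (added later)
        spiders/
            __init__.py
            csdn.py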
Define Item entities
Following the logic above, the item's main attribute is a dictionary mapping title keywords to their occurrence counts. The code is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()

Keyword extraction tool
Use jieba word segmentation to build the keyword extraction tool.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021/11/5
# @Author : supreme treasure
# @Site   :
# @File   : analyse_sentence.py

import jieba.analyse


def get_key_word(sentence):
    result_dic = {}
    # extract the top 3 weighted keywords; allowPOS=() applies no POS filter
    words_lis = jieba.analyse.extract_tags(sentence, topK=3, withWeight=True, allowPOS=())
    for word, flag in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
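A quick, hypothetical smoke test of the extractor; the title string is made up for illustration, and the actual keywords returned depend on jieba's built-in TF-IDF dictionary:

from csdn_hot_words.tools.analyse_sentence import get_key_word

# A made-up title string, for illustration only.
print(get_key_word('基于Scrapy框架的Python爬虫实战'))
# e.g. {'Scrapy': 1, 'Python': 1, '爬虫': 1}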
Crawler construction

Here you need to initialize a browser instance for the crawler so that dynamically loaded page content can be rendered.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021/11/5
# @Author : supreme treasure
# @Site   :
# @File   : csdn.py

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # headless Chrome mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item
Code description
1. Chrome's headless mode is used here, so no browser window needs to open; everything runs in the background.
2. You need to provide the path to the chromedriver executable.
3. In parse, the XPath extracts each hot-list title, and the keyword extractor is called to construct the item object (a quick way to verify the XPath is sketched below).
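Because the list is loaded by JavaScript, a plain scrapy shell fetch will not show the titles; a sketch for checking the XPath against a Selenium-rendered page instead (the chromedriver path is the same assumption as in the spider, and the hosetitem-title class reflects the page markup at the time of writing):

from selenium import webdriver
from scrapy.http import HtmlResponse

browser = webdriver.Chrome(executable_path="E:\\chromedriver_win32\\chromedriver.exe")
browser.get('https://blog.csdn.net/rank/list')
# Wrap the rendered source so the same XPath the spider uses can run on it.
response = HtmlResponse(url=browser.current_url, body=browser.page_source, encoding='utf-8')
print(response.xpath("//div[@class='hosetitem-title']/a/text()").getall()[:5])
browser.quit()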
Middleware code construction
Add the JS execution logic. The complete middleware code:
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Scroll 500 px every 500 ms so lazily loaded list entries render;
        # the interval is cleared after 20 s, matching the sleep below.
        js = '''
            let height = 0;
            let interval = setInterval(() => {
                window.scrollTo({
                    top: height,
                    behavior: "smooth"
                });
                height += 500;
            }, 500);
            setTimeout(() => {
                clearInterval(interval);
            }, 20000);
        '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            # Returning an HtmlResponse here short-circuits Scrapy's own
            # downloader and hands the rendered page source to the spider.
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8",
                                request=request)
        except TimeoutException as e:
            print('timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            # There is a single start URL, so the window can be closed
            # once the rendered response has been built.
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
Create a custom pipeline

The pipeline aggregates the word-frequency statistics and writes the final result to a file. The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CsdnHotWordsPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []

    def process_item(self, item, spider):
        self.all_words.append(item)
        return item

    def close_spider(self, spider):
        # merge per-title keyword counts into one case-insensitive dictionary
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        # sort by count, highest first, and write one "word,count" per line
        word_count_sort = sorted(key_word_dic.items(), key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{},{}\n'.format(word[0], word[1]))
        self.file.close()

Settings configuration
Some adjustments should be made to the configuration. Adjust as follows:
# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'csdn_hot_words'

SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#     'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

Execute the main program
The spider can be run directly with the scrapy command, but to make the logs easier to read, a main program is added.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021/11/5
# @Author : supreme treasure
# @Site   :
# @File   : main.py

from scrapy import cmdline

cmdline.execute('scrapy crawl csdn'.split())

Execution result
Part of the execution log (shown as a screenshot in the original post). Running the spider produces the result.txt output.
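result.txt holds one word,count pair per line, sorted by count in descending order; a quick sketch for inspecting the top entries (the file name and format come from the pipeline above):

# Print the five most frequent hot words from result.txt
# (format written by the pipeline: "word,count" per line).
with open('result.txt', encoding='utf-8') as f:
    for line in list(f)[:5]:
        word, count = line.rstrip('\n').rsplit(',', 1)
        print(word, count)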
After reading this article, you should have a solid understanding of how Python crawls CSDN hot-list title hot words through the Scrapy framework. If you want to learn more, welcome to follow the industry information channel. Thank you for reading!