How to crawl hot words from CSDN hot-list titles with the Scrapy framework in Python


In this article, the editor shares how to crawl the hot words from CSDN hot-list titles in Python using the Scrapy framework. I hope you will learn something after reading it; let's work through it together!

Environment deployment

Scrapy installation

pip install scrapy -i https://pypi.douban.com/simple

Selenium installation

pip install selenium -i https://pypi.douban.com/simple

Jieba installation

pip install jieba -i https://pypi.douban.com/simple

IDE: PyCharm

ChromeDriver: download the version that matches your installed Chrome browser from the ChromeDriver download page.

Check your browser version first, then download the corresponding driver version.
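As an optional sanity check after installation, the short sketch below (my addition, not part of the original article) simply confirms that the three packages import and that the chromedriver path used later in this article exists; adjust the path to your own environment.

import os

import jieba
import scrapy
import selenium

print('scrapy version:', scrapy.__version__)
print('selenium version:', selenium.__version__)
print('jieba version:', jieba.__version__)
# The path below is the example path used later in this article; change it to yours.
print('chromedriver found:', os.path.exists('E:\\chromedriver_win32\\chromedriver.exe'))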

Implementation process

Let's get started.

Create a project

Create our project using the scrapy command.

scrapy startproject csdn_hot_words

The generated project structure follows the official Scrapy layout.
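For reference, this is roughly the layout scrapy startproject generates; the spiders/csdn.py and tools/ entries are files we add by hand in the following steps, not part of the generated skeleton.

csdn_hot_words/
    scrapy.cfg
    csdn_hot_words/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            csdn.py                  # the spider written below (added by hand)
        tools/
            __init__.py              # needed so the import in the spider resolves
            analyse_sentence.py      # keyword extraction helper (added by hand)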

Define Item entities

Following the earlier logic, the item's only field is a dictionary that maps each title keyword to its number of occurrences. The code is as follows:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class CsdnHotWordsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    words = scrapy.Field()

Keyword extraction tool

Use jieba to segment each title and extract its keywords.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021-11-5 23
# @Author : supreme treasure
# @Site   :
# @File   : analyse_sentence.py

import jieba.analyse


def get_key_word(sentence):
    # Return a dict mapping each extracted keyword to its occurrence count.
    result_dic = {}
    words_lis = jieba.analyse.extract_tags(sentence, topK=3, withWeight=True, allowPOS=())
    for word, flag in words_lis:
        if word in result_dic:
            result_dic[word] += 1
        else:
            result_dic[word] = 1
    return result_dic
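As a quick illustration of what get_key_word returns, here is a hypothetical call; the title string is made up and the exact keywords depend on jieba's dictionary, so the output in the comment is only indicative.

from csdn_hot_words.tools.analyse_sentence import get_key_word

title = 'Python crawls the CSDN hot list with Scrapy'  # made-up example title
print(get_key_word(title))
# With topK=3, something like: {'Scrapy': 1, 'Python': 1, 'CSDN': 1}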

Crawler construction

Here we need to initialize a browser instance for the crawler so that dynamically loaded pages can be rendered.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021-11-5 23
# @Author : supreme treasure
# @Site   :
# @File   : csdn.py

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

from csdn_hot_words.items import CsdnHotWordsItem
from csdn_hot_words.tools.analyse_sentence import get_key_word


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    # allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net/rank/list']

    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')  # headless Chrome mode
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--no-sandbox')
        self.browser = webdriver.Chrome(chrome_options=chrome_options,
                                        executable_path="E:\\chromedriver_win32\\chromedriver.exe")
        self.browser.set_page_load_timeout(30)

    def parse(self, response, **kwargs):
        titles = response.xpath("//div[@class='hosetitem-title']/a/text()")
        for x in titles:
            item = CsdnHotWordsItem()
            item['words'] = get_key_word(x.get())
            yield item

Code description

1. Headless Chrome mode is used, so no visible browser window needs to open; everything runs in the background.

2. You need to supply the path to the chromedriver executable (see the Selenium 4 note after this list).

3. In parse, the XPath shown above extracts each title; the keyword extraction function is then called on it to build the item object.
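Note that the chrome_options= and executable_path= keyword arguments above follow the older Selenium 3 style; executable_path was deprecated in Selenium 4 and removed in later 4.x releases. A roughly equivalent initialization for Selenium 4 would look like the sketch below (an assumption, not tested against this project).

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument('--no-sandbox')
# Same driver path as above; adjust to your own environment.
browser = webdriver.Chrome(service=Service('E:\\chromedriver_win32\\chromedriver.exe'),
                           options=chrome_options)
browser.set_page_load_timeout(30)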

Middleware code construction

The downloader middleware executes a piece of JavaScript that scrolls the page so the dynamically loaded entries are rendered before the HTML is handed back to the spider. The complete middleware code:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import time

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class CsdnHotWordsSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class CsdnHotWordsDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Scroll the page in 500px steps every 500ms, then stop after 20 seconds,
        # so dynamically loaded content is rendered before the source is captured.
        js = '''
            let height = 0;
            let interval = setInterval(() => {
                window.scrollTo({top: height, behavior: "smooth"});
                height += 500;
            }, 500);
            setTimeout(() => {
                clearInterval(interval);
            }, 20000);
        '''
        try:
            spider.browser.get(request.url)
            spider.browser.execute_script(js)
            time.sleep(20)
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8",
                                request=request)
        except TimeoutException as e:
            print('timeout exception: {}'.format(e))
            spider.browser.execute_script('window.stop()')
        finally:
            spider.browser.close()

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Create a custom pipeline

The pipeline collects the per-title keyword dictionaries, aggregates them by word frequency, and writes the final result to a file. The code is as follows:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class CsdnHotWordsPipeline:
    def __init__(self):
        self.file = open('result.txt', 'w', encoding='utf-8')
        self.all_words = []

    def process_item(self, item, spider):
        self.all_words.append(item)
        return item

    def close_spider(self, spider):
        key_word_dic = {}
        for y in self.all_words:
            print(y)
            for k, v in y['words'].items():
                if k.lower() in key_word_dic:
                    key_word_dic[k.lower()] += v
                else:
                    key_word_dic[k.lower()] = v
        word_count_sort = sorted(key_word_dic.items(), key=lambda x: x[1], reverse=True)
        for word in word_count_sort:
            self.file.write('{}, {}\n'.format(word[0], word[1]))
        self.file.close()
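To make the aggregation in close_spider easier to follow, here is the same merge-and-sort logic in isolation, run on two hypothetical keyword dictionaries (made-up data, shown only to illustrate the lowercasing, summing and sorting):

# Standalone sketch of the aggregation done in close_spider.
all_words = [{'Python': 2, 'Scrapy': 1}, {'python': 1, 'selenium': 1}]

key_word_dic = {}
for words in all_words:
    for k, v in words.items():
        key_word_dic[k.lower()] = key_word_dic.get(k.lower(), 0) + v

word_count_sort = sorted(key_word_dic.items(), key=lambda x: x[1], reverse=True)
print(word_count_sort)  # [('python', 3), ('scrapy', 1), ('selenium', 1)]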

Settings configuration

The generated configuration needs a few adjustments, as follows:

# Scrapy settings for csdn_hot_words project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'csdn_hot_words'

SPIDER_MODULES = ['csdn_hot_words.spiders']
NEWSPIDER_MODULE = 'csdn_hot_words.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'csdn_hot_words (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543,
}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
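Since most of the file above is the commented-out template, the values that actually differ from the generated defaults are, in summary:

# Settings changed from the scrapy startproject defaults (extracted from the full file above).
USER_AGENT = 'Mozilla/5.0'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36',
}
SPIDER_MIDDLEWARES = {'csdn_hot_words.middlewares.CsdnHotWordsSpiderMiddleware': 543}
DOWNLOADER_MIDDLEWARES = {'csdn_hot_words.middlewares.CsdnHotWordsDownloaderMiddleware': 543}
ITEM_PIPELINES = {'csdn_hot_words.pipelines.CsdnHotWordsPipeline': 300}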

Execute the main program

The crawl can be started with the scrapy command directly, but to make the logs easier to read, a small main program is added.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time   : 2021-11-5 22
# @Author : supreme treasure
# @Site   :
# @File   : main.py

from scrapy import cmdline

cmdline.execute('scrapy crawl csdn'.split())

Execution result

Part of the execution log:

The result.txt output file is obtained.

After reading this article, I believe you have a certain understanding of how to crawl hot words from CSDN hot-list titles in Python with the Scrapy framework. If you want to know more, welcome to follow the industry information channel. Thank you for reading!
