This article introduces the basics of how to use the Python Scrapy crawler framework. Many people run into difficulties with this in real projects, so let's walk through how to handle these situations. I hope you read it carefully and get something out of it!
1. Project creation
Creating a Scrapy project is very easy. To create one quickly, run the following command in a terminal:
scrapy startproject zhuanti_new
If you want to create the project in a different directory, you need to change to the corresponding path first. You can also create it directly from PyCharm: open the Terminal panel in the lower left corner and run the command there, and the Scrapy project will be created inside the current project directory.
2. Introduction to the Scrapy project files
From the project screenshot you can see which files a Scrapy project already has and which files still need to be created by hand; they are described one by one below, and a rough sketch of the layout follows.
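A minimal sketch of the layout, showing only the files the article mentions (the result file name and the placement of the spider under the default spiders/ folder are assumptions):

zhuanti_new/                  # top-level folder named after the Scrapy project
    main.py                   # hand-written entry file used to run the whole project
    result.txt                # txt file for the crawl results (name assumed)
    scrapy.cfg                # configuration file pointing at zhuanti_new.settings
    zhuanti_new/              # the crawler package with the same name as the project
        __init__.py
        items.py              # definitions of the crawled fields
        pipelines.py          # data processing, cleaning and storage
        settings.py           # request headers and other settings
        spiders/
            __init__.py
            zhuantispider.py  # the newly created spider code file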
(1) The topmost zhuanti_new folder carries the name of the Scrapy project.
(2) The second level contains 4 items:
First: the folder with the same name as the project is what we usually call the crawler package; all of the crawler code lives in this package.
Second: the main file, the entry-point code file used to run the project. After the code is written, the whole project is run through this file.
Third: the configuration file (scrapy.cfg), which states that the default settings module is the settings file under the zhuanti_new package and defines the project name as zhuanti_new.
Fourth: a txt file for storing the crawl results.
Let's introduce the key code files inside the crawler package one by one:
(1) items.py: defines the fields that the crawler extracts (a minimal sketch of this file follows this list)
(2) pipelines.py: mainly used for processing, cleaning and storing the data (also sketched after this list)
(3) settings.py: mainly used for setting request headers, handling alerts, and other related configuration
(4) zhuantispider.py: the code file that carries out the actual crawling process; this is also a file that has to be created by hand
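Neither items.py nor pipelines.py is shown in the article, but their shape can be inferred from the spider code below. A minimal sketch of items.py, assuming only the six fields the spider fills in:

import scrapy

class ZhuantiNewItem(scrapy.Item):
    # one Field per value extracted by the spider
    title = scrapy.Field()
    summary = scrapy.Field()
    author = scrapy.Field()
    comments = scrapy.Field()
    likes = scrapy.Field()
    money = scrapy.Field()

And a minimal sketch of pipelines.py, assuming each item is written as one line to the result txt file (the file name is an assumption):

import json

class ZhuantiNewPipeline(object):
    def open_spider(self, spider):
        # the file name is assumed; the article only says a txt file stores the results
        self.file = open('result.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each crawled item as one JSON line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

For this pipeline to run, it also has to be enabled in settings.py, for example ITEM_PIPELINES = {'zhuanti_new.pipelines.ZhuantiNewPipeline': 300}.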
The code is as follows:
from zhuanti_new.items import ZhuantiNewItem
import scrapy
from scrapy.selector import Selector


class JianshuSpiderSpider(scrapy.Spider):
    name = 'zhuantispider'
    allowed_domains = ['jianshu.com']
    start_urls = ['https://www.jianshu.com/recommendations/collections?page=1&order_by=hot']
    # print(start_urls)

    def parse(self, response):
        selector = Selector(response)
        # NOTE: the original regular expression was lost when the article was extracted;
        # a pattern that captures the relative collection links, e.g. r'href="(/c/[^"]+)"',
        # is assumed here
        partical_urls = selector.re(r'href="(/c/[^"]+)"')
        for url in partical_urls:
            print(url)
            right_url = response.urljoin(url)
            # print(right_url)
            parts = ['?order_by=added_at&page={0}'.format(k) for k in range(1, 11)]
            for part in parts:  # crawl the first 10 pages of articles for each topic
                real_url = right_url + part
                # print(real_url)
                yield scrapy.Request(real_url, callback=self.parse_detail)

        links = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(i)
                 for i in range(2, 3)]
        for link in links:
            print(link)
            request = scrapy.Request(link, callback=self.parse)
            yield request

    def parse_detail(self, response):
        selector = Selector(response)
        content = selector.xpath('//div[@class="content"]')
        for detail in content:
            try:
                title = detail.xpath('a[1]/text()').extract_first()
                summary = detail.xpath('p/text()').extract_first().strip()
                author = detail.xpath('div/a[1]/text()').extract_first()
                # extract_first() does not work for the comment count here, so take the
                # second matched text node instead:
                # comments = detail.xpath('div/a[2]/text()').extract_first()
                comments = detail.xpath('div/a[2]/text()').extract()[1].strip()
                likes = detail.xpath('div/span[1]/text()').extract_first().strip()
                money = detail.xpath('div/span[2]/text()').extract_first()

                item = ZhuantiNewItem()
                if money is not None:
                    item['title'] = title
                    item['summary'] = summary
                    item['author'] = author
                    item['comments'] = comments
                    item['likes'] = likes
                    item['money'] = money.strip()
                else:
                    item['title'] = title
                    item['summary'] = summary
                    item['author'] = author
                    item['comments'] = comments
                    item['likes'] = likes
                print(item)
                yield item
            except:
                pass
The files above are the most important ones. For beginners, completing the code in these files according to the corresponding requirements is enough to crawl the data.
3. Running the project and the results
To run the whole project, you now need to create the main file mentioned above. Its code is as follows:
The crawler being run is zhuantispider (the spider's name attribute); be careful not to write the Scrapy project name here instead.
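The main file itself only appears as a screenshot in the original article; a minimal sketch, assuming the usual scrapy cmdline entry point and the spider name defined above:

from scrapy import cmdline

# run the spider by its name 'zhuantispider', not by the project name
cmdline.execute('scrapy crawl zhuantispider'.split())

Running this file (for example from PyCharm, or with python main.py in the project root) starts the crawl.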
The result of the crawl run is as follows:
That's all for "how to use the Python Scrapy crawler framework". Thank you for reading. If you want to learn more about the field, you can follow this site, where the editor will keep publishing more high-quality practical articles.