How do you use Python and a Scrapy pipeline to crawl expert data? This article walks through the problem in detail and gives the corresponding analysis and solution, in the hope of helping readers who want to solve this problem find a simple, workable approach.
Target site analysis
The target site for this collection is https://www.zaih.com/falcon/mentors, and the target data is the profile information of industry experts (mentors).
The data will be saved to a MySQL database this time. Based on the target data, the table structure is designed as shown below.
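As a rough sketch of such a table (the column types, lengths, and the auto-increment id here are assumptions, not the original design), the users table that the pipeline writes to later in this article can be created like this:

import pymysql

# Rough sketch only: column names follow the pipeline's insert statement;
# types, lengths, and the id primary key are assumptions.
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                       password="123456", db="zaihang")
create_sql = """
CREATE TABLE IF NOT EXISTS users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(64),      -- expert name
    city VARCHAR(32),      -- city
    industry VARCHAR(64),  -- industry
    price FLOAT,           -- price
    chat_nums INT,         -- number of chats
    score FLOAT            -- score
)
"""
with conn.cursor() as cursor:
    cursor.execute(create_sql)
conn.commit()
conn.close()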
Matching that table structure, the items.py file of the Scrapy project can be written directly.
import scrapy


class ZaihangItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()       # name
    city = scrapy.Field()       # city
    industry = scrapy.Field()   # industry
    price = scrapy.Field()      # price
    chat_nums = scrapy.Field()  # number of chats
    score = scrapy.Field()      # score

Coding time
The project creation process can follow the earlier cases in this series; this article starts directly from developing the collection file, zh.py.
The paging address of the target data has to be stitched together manually, so declare an instance variable (field) named page in advance. After each response, check whether the returned data is empty; if it is not, increment page by 1.
The request address template is as follows:
https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={} (first_tag_name is the URL-encoded form of 心理, i.e. psychology)
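For example, formatting the template with successive page values yields the URLs the spider will request (a tiny illustration using the same template string):

url_format = 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={}'
for page in (1, 2, 3):
    print(url_format.format(page))  # ...&page=1, ...&page=2, ...&page=3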
When the page number exceeds the maximum number of pages, the site returns an empty-state page, so the data is empty; you only need to check whether a section element with class="empty" exists.
For parsing and cleaning the data, refer directly to the code below.
import scrapy
from zaihang_spider.items import ZaihangItem


class ZhSpider(scrapy.Spider):
    name = 'zh'
    allowed_domains = ['www.zaih.com']
    page = 1  # starting page number
    url_format = 'https://www.zaih.com/falcon/mentors?first_tag_id=479&first_tag_name=%E5%BF%83%E7%90%86&page={}'  # URL template
    start_urls = [url_format.format(page)]

    def parse(self, response):
        empty = response.css("section.empty")  # check whether the data is empty
        if len(empty) > 0:
            return  # the empty tag exists, return directly

        mentors = response.css(".mentor-board a")  # all mentor hyperlinks
        for m in mentors:
            item = ZaihangItem()  # instantiate an item
            name = m.css(".mentor-card__name::text").extract_first()
            city = m.css(".mentor-card__location::text").extract_first()
            industry = m.css(".mentor-card__title::text").extract_first()
            price = self.replace_space(m.css(".mentor-card__price::text").extract_first())
            chat_nums = self.replace_space(m.css(".mentor-card__number::text").extract()[0])
            score = self.replace_space(m.css(".mentor-card__number::text").extract()[1])

            # fill in the item
            item["name"] = name
            item["city"] = city
            item["industry"] = industry
            item["price"] = price
            item["chat_nums"] = chat_nums
            item["score"] = score
            yield item

        # generate the next request
        self.page += 1
        next_url = self.url_format.format(self.page)
        yield scrapy.Request(url=next_url, callback=self.parse)

    def replace_space(self, in_str):
        in_str = in_str.replace("\n", "").replace("\r", "").replace("¥", "")
        return in_str.strip()
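Once zh.py is in place, the spider is normally started with the scrapy crawl zh command from the project directory. As a minimal alternative sketch, it can also be launched programmatically, assuming the script is run from the project root so that the project settings can be found:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# run from the project root so get_project_settings() picks up settings.py
process = CrawlerProcess(get_project_settings())
process.crawl('zh')  # 'zh' is the spider's name attribute
process.start()      # blocks until the crawl finishes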
Enable ITEM_PIPELINES in the settings.py file. Note that the class name has been changed to ZaihangMySQLPipeline.
ITEM_PIPELINES = {
    'zaihang_spider.pipelines.ZaihangMySQLPipeline': 300,
}
Modify the pipelines.py file to save the data to the MySQL database
In the code below, you first need to understand the class method from_crawler, which acts as a proxy for __init__. If it is defined, Scrapy calls it when the class is instantiated and passes in the global crawler object; through crawler.settings you can then read every configuration item defined in settings.py.
In addition, there is a from_settings method, which is generally used in the official plugins, as in the following example.
@classmethod
def from_settings(cls, settings):
    host = settings.get('HOST')
    return cls(host)

@classmethod
def from_crawler(cls, crawler):
    # FIXME: for now, stats are only supported from this constructor
    return cls.from_settings(crawler.settings)
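As a toy illustration of this pattern (the HostPipeline class and the settings values below are made up for the example, not part of the project), the chain is: Scrapy prefers from_crawler when it exists, which here simply delegates to from_settings, which calls the constructor.

from scrapy.settings import Settings


class HostPipeline:
    def __init__(self, host):
        self.host = host

    @classmethod
    def from_settings(cls, settings):
        host = settings.get('HOST')
        return cls(host)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy uses this constructor when it is defined
        return cls.from_settings(crawler.settings)


# exercising the pattern standalone, without a running crawler
settings = Settings({'HOST': '127.0.0.1'})
pipeline = HostPipeline.from_settings(settings)
print(pipeline.host)  # 127.0.0.1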
Before writing the following code, you need to write the configuration items in settings.py in advance.
settings.py file code
HOST = "127.0.0.1"
PORT = 3306
USER = "root"
PASSWORD = "123456"
DB = "zaihang"
pipelines.py file code
import pymysql


class ZaihangMySQLPipeline:
    def __init__(self, host, port, user, password, db):
        self.host = host
        self.port = port
        self.user = user
        self.password = password
        self.db = db
        self.conn = None
        self.cursor = None

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            host=crawler.settings.get('HOST'),
            port=crawler.settings.get('PORT'),
            user=crawler.settings.get('USER'),
            password=crawler.settings.get('PASSWORD'),
            db=crawler.settings.get('DB')
        )

    def open_spider(self, spider):
        self.conn = pymysql.connect(host=self.host, port=self.port, user=self.user,
                                    password=self.password, db=self.db)

    def process_item(self, item, spider):
        # store the item in MySQL
        name = item["name"]
        city = item["city"]
        industry = item["industry"]
        price = item["price"]
        chat_nums = item["chat_nums"]
        score = item["score"]
        sql = ("insert into users(name, city, industry, price, chat_nums, score) "
               "values ('%s', '%s', '%s', %f, %d, %f)"
               % (name, city, industry, float(price), int(chat_nums), float(score)))
        print(sql)
        self.cursor = self.conn.cursor()  # get a cursor
        try:
            self.cursor.execute(sql)  # execute the sql
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
The three important functions in the pipeline file are open_spider, process_item, and close_spider.
# executed only once, when the crawler starts
def open_spider(self, spider):
    # spider.name = "Eraser"  # an instance variable can be added dynamically to the spider object;
    # its value can then be read in the spider module, e.g. through self in parse(self, response)
    # place initialization actions here
    pass

# where the data-processing and storage code is written
def process_item(self, item, spider):
    pass

# executed only once, when the crawler is closed; if the crawler crashes abnormally,
# close_spider will not run
def close_spider(self, spider):
    # close the database and release resources
    pass

Crawl result display
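To confirm the crawl result, a minimal check (assuming the same connection settings as in settings.py above) is to count the stored rows directly:

import pymysql

# count the mentor records the pipeline has written
conn = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                       password="123456", db="zaihang")
with conn.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM users")
    total = cursor.fetchone()[0]
    print(f"{total} mentor records stored")
conn.close()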
That is how to use Python and a Scrapy pipeline to crawl expert data and store it in MySQL. Hopefully the content above is of some help to you.