How to Use the PySpider Framework for Python Crawlers

Updated 2025-01-16. Source: SLTechnology News&Howtos (shulou), Development

This article explains how to use PySpider, a crawler framework for Python. It covers the framework's features and architecture, installation, and an annotated example script.

PySpider is a crawler framework written in Python with a distributed architecture. It supports task monitoring, project management, multiple database backends, and a WebUI. Its features in detail:

A web interface with a script editor, task monitor, project manager and result viewer

Database backends: MySQL, MongoDB, Redis, SQLite, Elasticsearch, and PostgreSQL via SQLAlchemy

Message queues: RabbitMQ, Beanstalk, Redis, Kombu

Crawling of JavaScript-rendered pages

Replaceable components; standalone, distributed and Docker deployment

Fine-grained scheduling control, including re-crawling on timeout and task priorities

Support for both Python 2 and Python 3

PySpider has three main components: the Scheduler, the Fetcher, and the Processor. A Monitor watches the whole crawling process, and a Result Worker handles the extracted results.

The basic flow is as follows: the Scheduler dispatches a task, the Fetcher downloads the page, and the Processor parses it. The Processor then sends any newly generated requests back to the Scheduler and outputs the extracted results, which are saved.
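The flow just described can be illustrated with a toy, single-process sketch. This is not PySpider's implementation: the in-memory FAKE_WEB dictionary and the functions fetch, process and run are hypothetical names chosen for illustration, and a dictionary lookup stands in for real HTTP requests.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/b"]),
    "http://example.com/b": ("Page B", []),
}


def fetch(url):
    """Fetcher: return the page for a URL (a dict lookup stands in for HTTP)."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def process(page):
    """Processor: extract a result and the follow-up requests from a page."""
    result = {"url": page["url"], "title": page["title"]}
    return result, page["links"]


def run(start_url):
    """Scheduler loop: queue tasks, skip URLs already seen, collect results."""
    queue = deque([start_url])
    seen = {start_url}
    results = []  # stands in for the Result Worker's storage
    while queue:
        url = queue.popleft()
        result, new_urls = process(fetch(url))
        results.append(result)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return results


if __name__ == "__main__":
    for item in run("http://example.com/"):
        print(item)
```

In the real framework these roles run as separate components connected by message queues, which is what allows distributed deployment; the sketch only shows the order in which work passes between them.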

PySpider and Scrapy compare as follows:

Visualization. PySpider: has a WebUI in which crawlers can be written and debugged. Scrapy: works through code and the command line; visualization requires integrating Portia.

JavaScript rendering. PySpider: supports collecting JavaScript-rendered pages using PhantomJS. Scrapy: requires the Scrapy-Splash component.

Parsing. PySpider: PyQuery is built in as the selector. Scrapy: XPath, CSS selectors and regular-expression matching.

Extensibility. PySpider: weak. Scrapy: loosely coupled modules and strong extensibility; components such as Middleware and Pipeline can be plugged in for more powerful features.

In general, PySpider is more convenient and Scrapy is more extensible. Choose PySpider to get a crawl running quickly; choose Scrapy when the crawl is large or the target site has strong anti-crawling measures.

Installation

PySpider is no longer maintained and only supports Python up to 3.6, so installing it on a later version may produce errors. It can be installed as follows:

Install wheel (skip if already installed)

conda install wheel

Install pycurl (skip if already installed)

conda install pycurl

Install pyspider

pip install pyspider

Install PhantomJS

Unpack the PhantomJS archive, then copy phantomjs.exe from its bin directory into the directory that contains python.exe.

Open ../Python/Lib/python3.7/site-packages/pyspider/run.py, ../Python/Lib/site-packages/pyspider/fetcher/tornado_fetcher.py and ../Python/Lib/site-packages/pyspider/webui/app.py, and rename every occurrence of the name async in these files to a name that is not a keyword, such as asynch. This is needed because async is a reserved keyword from Python 3.7 onward.
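The reason for the rename can be checked directly, and the edit itself can be sketched as a whole-word substitution. The helper rename_async below is a hypothetical name, not part of PySpider, and the sample source string is invented for illustration.

```python
import keyword
import re
import sys

# "async" is a reserved keyword from Python 3.7 onward, so older code that
# uses it as a parameter or variable name fails with a SyntaxError.
print(sys.version_info[:2], "async is a keyword:", keyword.iskeyword("async"))


def rename_async(source, new_name="asynch"):
    """Replace the whole word `async` with `new_name` (hypothetical helper)."""
    return re.sub(r"\basync\b", new_name, source)


# Invented sample resembling code written before async became a keyword.
sample = "def run(ctx, async=True):\n    return async\n"
print(rename_async(sample))
```

A plain substitution like this also rewrites the word inside strings and comments, so it is worth reviewing the changed lines by hand and keeping a backup of each file.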

In ../Python/Lib/site-packages/pyspider/webui/webdav.py, find the line

'domaincontroller': NeedAuthController(app),

and change it to:

'http_authenticator': {'HTTPAuthenticator': NeedAuthController(app)},

Downgrade dependencies (wsgidav, werkzeug)

python -m pip uninstall wsgidav  # uninstall
python -m pip install werkzeug==1.0.0  # install version 1.0.0

Run

Enter pyspider or pyspider all in a cmd window to start all components.

Then open http://localhost:5000/ in a browser. If the page loads, PySpider is running.
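The same check can be scripted instead of done in a browser. The sketch below uses only the standard library; the function name webui_is_up and the 3-second timeout are assumptions made for illustration, and the URL is the one given above.

```python
import urllib.error
import urllib.request


def webui_is_up(url="http://localhost:5000/", timeout=3.0):
    """Return True if an HTTP GET to `url` answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("WebUI reachable:", webui_is_up())
```

If the WebUI is configured to require a login, the request is answered with an error status and this sketch reports False even though the server is running.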

Create a project

In the WebUI, click Create, enter a Project Name and a Start URL (the URL can be left blank and set later in the code), then click Create.

Debug code

Interpretation of the source code

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # Created on 2021-07-04
    # Project: tripadvisor

    from pyspider.libs.base_handler import *


    class Handler(BaseHandler):
        crawl_config = {
            # Global settings: request headers, cookies, etc. can be set here
            # as keyword/value pairs.
        }

        def __init__(self):
            # Initial configuration
            self.base_url = 'https://www.tripadvisor.cn/Attractions-g298555-Activities?subcategory=49&subtype=0'

        # @every sets how often on_start runs (24 * 60 minutes = once a day).
        @every(minutes=24 * 60)
        def on_start(self):
            # Entry point of the crawl.
            # self.crawl works much like requests; GET (the default) and POST
            # are supported. Common parameters:
            #   data: data to submit
            #   callback: function called with the response after the fetch
            #   method: HTTP method to use
            #   files: files to upload, {'key': {'file.name': 'content'}}
            #   headers: request headers, dict
            #   cookies: request cookies, dict
            #   timeout: maximum seconds to fetch the content (default: 120)
            #   connect_timeout: connection timeout in seconds (default: 20)
            #   proxy: proxy server; currently only HTTP proxies are supported
            # fetch_type='js' fetches pages whose data is loaded asynchronously.
            self.crawl(self.base_url, callback=self.index_page, fetch_type='js')

        # index_page and detail_page are only the callback names used by the
        # initial script; apart from on_start, function names can be chosen freely.
        #
        # @config parameters:
        #   age: validity period of the task in seconds; within this period the
        #        page is treated as unmodified and is not crawled again
        #   priority: task priority; higher values run first (default: 0)
        #   auto_recrawl: whether to crawl again every `age` seconds (default: False)
        #   retries: number of retries after a task fails (default: 3)
        #   itag: task marker compared on each crawl; if it changes, the page is
        #         crawled again regardless of age. Mostly used to detect modified
        #         content or to force a re-crawl (default: None)
        @config(age=10 * 24 * 60 * 60)  # valid for ten days: no repeat crawl within ten days
        def index_page(self, response):
            for each in response.doc('a[href^="http"]').items():
                self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')

        @config(priority=2)
        def detail_page(self, response):
            # The object passed to a callback is a Response:
            #   response.url: the final URL
            #   response.text: response body as text (if response.encoding is None
            #                  and the chardet module is available, the encoding
            #                  is guessed automatically)
            #   response.doc: a PyQuery object built from the response content
            #   response.json: the response content parsed as JSON
            #   response.status_code: the response status code
            #   response.headers: response headers, dict
            #   response.cookies: response cookies
            #   response.time: time spent on the fetch
            return {
                "url": response.url,
                "title": response.doc('title').text(),  # text() returns the plain text
                "html": response.doc('title').html(),   # html() returns the inner HTML, markup included
            }
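The age and itag settings described in the script comments can be read as a simple decision rule. The sketch below is a simplified illustration of that rule, not PySpider's scheduler code: the function should_crawl and its parameters are hypothetical, and treating a negative age as "never expires" is an assumption of this sketch.

```python
import time


def should_crawl(last_crawl_time, age, now=None, old_itag=None, new_itag=None):
    """Decide whether a URL should be fetched again (simplified sketch).

    last_crawl_time: Unix timestamp of the previous fetch, or None if never fetched
    age: validity period in seconds; a negative value is taken as "never expires"
    """
    if now is None:
        now = time.time()
    if last_crawl_time is None:
        return True   # never fetched before
    if new_itag != old_itag:
        return True   # marker changed: re-crawl regardless of age
    if age < 0:
        return False  # assumption: negative age means the page never expires
    return now - last_crawl_time >= age


TEN_DAYS = 10 * 24 * 60 * 60  # same value as @config(age=10 * 24 * 60 * 60)
print(should_crawl(last_crawl_time=0, age=TEN_DAYS, now=TEN_DAYS - 1))
print(should_crawl(last_crawl_time=0, age=TEN_DAYS, now=TEN_DAYS))
```

Under these assumptions the first call reports that the page is still within its validity period and the second that it is due again; passing a changed new_itag forces a re-crawl at any time.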
