
How to use Python pyspider


This article introduces how to use pyspider, a Python crawler framework. The content is detailed, easy to follow, and quick to put into practice, so it should serve as a useful reference.

1 Introduction

pyspider is a crawler framework written in Python, with a distributed architecture, that supports task monitoring, project management, multiple databases, and a WebUI. Its main features are as follows:

Web-based script editing interface, task monitor, project manager, and result viewer;

Database support for MySQL, MongoDB, Redis, SQLite, Elasticsearch, PostgreSQL, and SQLAlchemy;

Message queue support for RabbitMQ, Beanstalk, Redis, and Kombu;

Support for crawling JavaScript-rendered pages;

Replaceable components; supports standalone, distributed, and Docker deployment;

Powerful scheduling control, with support for re-crawling on timeout and priority settings;

Supports both Python 2 and 3.

pyspider consists mainly of three components: the Scheduler, the Fetcher, and the Processor. The whole crawling process is monitored by the Monitor, and the captured results are handled by the Result Worker. The basic flow is: the Scheduler initiates task scheduling, the Fetcher downloads the web page content, and the Processor parses it, sending newly generated requests back to the Scheduler for scheduling while passing the extracted results on for output and storage.
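To make this flow concrete, the sketch below maps a handler's callbacks onto those components; the URL and selectors are placeholders, and a full working example follows in section 4.2.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        # A new task is handed to the Scheduler, which passes it to the Fetcher
        self.crawl('https://example.com/list', callback=self.index_page)

    def index_page(self, response):
        # The Processor runs this callback on the fetched page;
        # each self.crawl() here sends a new Request back to the Scheduler
        for each in response.doc('a').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # The yielded dict is the extraction result handled by the Result Worker
        yield {'title': response.doc('title').text()}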

2 pyspider vs Scrapy

pyspider has a WebUI, so crawlers can be written and debugged directly in the browser; Scrapy uses code and the command line, and needs to be paired with Portia for visualization.

pyspider supports collecting JavaScript-rendered pages via PhantomJS (see the sketch after this comparison); Scrapy needs to be paired with the Scrapy-Splash component.

pyspider has PyQuery built in as its selector (see "Python crawler (5): PyQuery framework"); Scrapy works with XPath, CSS selectors, and regular expression matching.

pyspider has limited extensibility; Scrapy's modules are loosely coupled and highly extensible, for example Middleware and Pipeline components can be added to achieve more powerful functionality.

In general, pyspider is more convenient and Scrapy is more extensible. If you want to crawl something quickly, pyspider is the better choice; for large-scale crawls against strong anti-crawling mechanisms, Scrapy is preferred.
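As mentioned above, pyspider can render JavaScript pages through PhantomJS before parsing them. A minimal sketch follows; the URL and selectors are placeholders, and PhantomJS must be installed and on the PATH.

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    def on_start(self):
        # fetch_type='js' asks pyspider's PhantomJS fetcher to render the page
        # before the callback receives it
        self.crawl('https://example.com/js-rendered-page',
                   callback=self.index_page,
                   fetch_type='js',
                   validate_cert=False)

    def index_page(self, response):
        # The rendered HTML can now be parsed as usual
        yield {'title': response.doc('title').text()}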

3 Installation

Method 1

pip install pyspider

Command "python setup.py egg_info" failed with error... I encountered this problem when installing on my Windows system, so I chose the second way below to install.

Method 2

Use wheel installation. The procedure is as follows:

Install wheel: pip install wheel;

Open the URL https://www.lfd.uci.edu/~gohlke/pythonlibs/, use Ctrl + F to search for pycurl, and download the build matching your installed Python version. For example, I use Python 3.6, so I chose the file tagged cp36, as shown in the red box below:

Install the downloaded file with pip, for example: pip install E:\pycurl-7.43.0.3-cp36-cp36m-win_amd64.whl;

Finally, use pip install pyspider.

After performing the above installation steps, we enter pyspider in the console, as shown in the figure:

The output above indicates that pyspider started successfully. If startup gets stuck at result_worker starting..., open another console window, run pyspider there, and close the previous window once it starts successfully.

After startup succeeds, we verify it again: open a browser and visit http://localhost:5000, as shown in the figure:

The WebUI loads, which confirms that pyspider is running.

4 Getting Started

4.1 Creating a Project

First, we click the Create button in the graphical interface to start creating the project, as shown in the red box in the figure:

Then the information filling window will pop up, as shown in the figure:

Project Name: the name of the project

Start URL(s): the URL(s) to start crawling from

We need to fill in Project Name and Start URL(s). Here we take the second-hand housing listings on Lianjia.com as an example: https://hz.lianjia.com/ershoufang. Click the Create button after filling them in. The result is as follows:

4.2 Crawler Implementation

pyspider will report certificate errors (usually HTTP 599) when visiting HTTPS sites, so we need to add the parameter validate_cert=False to the crawl method to skip certificate validation. As shown in the figure:

We plan to obtain each listing's unit price (unit_price), title (title), and selling points (sell_point). The implementation is as follows:

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://hz.lianjia.com/ershoufang/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.title').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        yield {
            'unit_price': response.doc('.unitPrice').text(),
            'title': response.doc('.main').text(),
            'sell_point': response.doc('.baseattribute > .content').text()
        }

@every(minutes=24 * 60): tells the Scheduler to run on_start once a day.

@config(age=10 * 24 * 60 * 60): sets the validity period of the tasks; within this period a page is considered unchanged and is not re-crawled.

@config(priority=2): sets the task priority (these options can also be passed directly to self.crawl; see the sketch after this list).

on_start(self): the entry point of the script.

self.crawl(url, callback): the main method used to create a crawl task.

index_page(self, response): retrieves the data matching the given selector from the returned HTML document.

detail_page(self, response): returns a dict object as the result.
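For example, on_start from the script above could pass age and priority per task instead of relying only on the @config decorators; this is just a sketch of an equivalent variant.

    @every(minutes=24 * 60)
    def on_start(self):
        # age and priority given per task instead of via @config
        self.crawl('https://hz.lianjia.com/ershoufang/',
                   callback=self.index_page,
                   age=10 * 24 * 60 * 60,   # do not re-crawl while the result is younger than 10 days
                   priority=2,
                   validate_cert=False)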

We click the Run button, as shown in the figure:

After clicking, we see that a notification has appeared on the follow button, as shown in the figure:

Click the follow button, and the result is as shown in the figure:

Click the triangle button circled in the red box in the above picture, and the result is as shown in the figure:

We pick any detail_page and click the triangle button on its right. The result is as shown in the figure:

As a result, we can already crawl the information we need.

4.3 Data Storage

Once the information is retrieved, it needs to be stored. We plan to store the data in a MySQL database.

First, install pymysql with the following command:

pip install pymysql
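The save code below writes into a table named house. If that table does not exist yet, it can be created first with pymysql; the snippet below is a minimal sketch, and the connection parameters, database name, and column sizes are assumptions to adapt to your own setup.

import pymysql

# One-off helper: create the target table used by the crawler below.
# host/user/password/database are placeholders for your own MySQL settings.
db = pymysql.connect(host='localhost', user='root', password='your_password',
                     database='test', charset='utf8')
try:
    with db.cursor() as cursor:
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS house (
                id INT AUTO_INCREMENT PRIMARY KEY,
                title VARCHAR(255),
                unit_price VARCHAR(64),
                sell_point TEXT
            ) DEFAULT CHARSET=utf8
        """)
    db.commit()
finally:
    db.close()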

Then add the code for saving the data. The complete code is as follows:

from pyspider.libs.base_handler import *
import pymysql


class Handler(BaseHandler):
    crawl_config = {}

    def __init__(self):
        # Modify the following parameters to your own MySQL connection information
        self.db = pymysql.connect(host=host, user=username, password=password,
                                  database=db, charset='utf8')

    def add_Mysql(self, title, unit_price, sell_point):
        # Save one record to the house table
        try:
            cursor = self.db.cursor()
            sql = 'insert into house(title, unit_price, sell_point) values ("%s", "%s", "%s")' % (title, unit_price, sell_point)
            print(sql)
            cursor.execute(sql)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://hz.lianjia.com/ershoufang/', callback=self.index_page, validate_cert=False)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('.title').items():
            self.crawl(each.attr.href, callback=self.detail_page, validate_cert=False)

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('.main').text()
        unit_price = response.doc('.unitPrice').text()
        sell_point = response.doc('.baseattribute > .content').text()
        self.add_Mysql(title, unit_price, sell_point)
        yield {
            'title': title,
            'unit_price': unit_price,
            'sell_point': sell_point
        }
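Note that building the SQL string with %-formatting will break if a title contains a quote character and is open to SQL injection. A safer drop-in variant of add_Mysql (a sketch using the same table and columns) lets pymysql escape the values itself:

    def add_Mysql(self, title, unit_price, sell_point):
        # Parameterized insert: pymysql escapes the values
        try:
            cursor = self.db.cursor()
            sql = 'insert into house(title, unit_price, sell_point) values (%s, %s, %s)'
            cursor.execute(sql, (title, unit_price, sell_point))
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()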

First, test whether the data can be saved to MySQL by again selecting a detail_page, as shown in the figure:

Click the triangle button on the right side, and the result is as shown in the figure:

From the output result, we can see that the save operation has been performed. Let's take a look at MySQL, as shown in the figure:

The data is stored in MySQL.

Above, we saved the data manually; next, let's look at how to save it by scheduling the task.

Click the pyspider button in the upper left corner of the current page, as shown in the figure:

Return to the dashboard interface, as shown in the figure:

We click on the position circled by the red box under status, change the status to RUNNING or DEBUG, and then click the run button under actions.

About "Python pyspider how to use" The content of this article is introduced here, thank you for reading! I believe everyone has a certain understanding of "how to use Python pyspider" knowledge. If you still want to learn more knowledge, please pay attention to the industry information channel.
