
How to deploy web crawlers with Scrapy+Gerapy

2025-01-17 Update From: SLTechnology News&Howtos


This article focuses on how to deploy web crawlers with Scrapy and Gerapy; interested readers may want to take a look. The method introduced here is simple, fast, and practical. Let's walk through how to deploy web crawlers with Scrapy+Gerapy.

Screenshot: what the crawler management interface looks like.

Dependency packages

File: requirements.txt

The contents of the file are posted directly here:

appdirs==1.4.4
APScheduler==3.5.1
attrs==20.1.0
Automat==20.2.0
beautifulsoup4==4.9.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
constantly==15.1.0
cryptography==3.0
cssselect==1.1.0
Django==1.11.29
django-apscheduler==0.3.0
django-cors-headers==3.2.0
djangorestframework==3.9.2
furl==2.1.0
gerapy==0.9.5
gevent==20.6.2
greenlet==0.4.16
hyperlink==20.0.1
idna==2.10
incremental==17.5.0
itemadapter==0.1.0
itemloaders==1.0.2
Jinja2==2.10.1
jmespath==0.10.0
lxml==4.5.2
MarkupSafe==1.1.1
orderedmultidict==1.0.1
parsel==1.6.0
Protego==0.1.16
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
PyDispatcher==2.0.5
pyee==7.0.2
PyHamcrest==2.0.2
pymongo==3.11.0
PyMySQL==0.10.0
pyOpenSSL==19.1.0
pyppeteer==0.2.2
pyquery==1.4.1
python-scrapyd-api==2.1.2
pytz==2020.1
pywin32==228
queuelib==1.5.0
redis==3.5.3
requests==2.24.0
Scrapy==1.8.0
scrapy-redis==0.6.8
scrapy-splash==0.7.2
scrapyd==1.2.1
scrapyd-client==1.1.0
service-identity==18.1.0
six==1.15.0
soupsieve==2.0.1
tqdm==4.48.2
Twisted==20.3.0
tzlocal==2.1
urllib3==1.25.10
w3lib==1.22.0
websocket==0.2.1
websockets==8.1
wincertstore==0.2
zope.event==4.4
zope.interface==5.1.0

Project file

Project file: qiushi.zip

What it implements: a crawler for jokes from Qiushibaike (the 糗事百科 jokes site).

This is a Scrapy project; its dependencies are the packages listed above.

Steps to run the project

After extracting the project file, install the dependencies:

pip install -r requirements.txt

Then run the spider (--nolog suppresses log output):

scrapy crawl duanzi --nolog
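
For orientation, here is a minimal sketch of what a spider like duanzi might look like. The actual code lives in qiushi.zip and is not reproduced in this article, so the start URL and CSS selectors below are assumptions, not the project's real implementation:

import scrapy


class DuanziSpider(scrapy.Spider):
    # The name must match the `scrapy crawl duanzi` command above.
    name = "duanzi"
    # Assumed start page; the real project may crawl a different URL.
    start_urls = ["https://www.qiushibaike.com/text/"]

    def parse(self, response):
        # Illustrative selectors; the real page structure may differ.
        for article in response.css("div.article"):
            yield {
                "author": article.css("h2::text").get(default="").strip(),
                "content": "".join(article.css(".content span::text").getall()).strip(),
            }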

Configure Scrapyd

Scrapyd can be thought of as a manager for the Scrapy projects we write. Once it is set up, you can control a crawler remotely: start it, cancel it, and so on.
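
Under the hood, that control happens over Scrapyd's HTTP JSON API. As a small sketch (assuming Scrapyd is running locally on its default port 6800, and using this article's project and spider names), scheduling and cancelling a run from Python looks roughly like this:

import requests

SCRAPYD = "http://127.0.0.1:6800"

# Schedule a run of the "duanzi" spider in the "qiushi" project.
resp = requests.post(f"{SCRAPYD}/schedule.json",
                     data={"project": "qiushi", "spider": "duanzi"}).json()
job_id = resp["jobid"]

# Cancel that run again using the job id Scrapyd returned.
requests.post(f"{SCRAPYD}/cancel.json",
              data={"project": "qiushi", "job": job_id})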

We won't go into the rest of its features here, since this workflow doesn't use them much. All we need to do is start it.

Start the Scrapyd service

1. Change into the qiushi crawler project's directory (Scrapy commands must be executed from inside the project directory).

2. Execute the command: scrapyd

3. Open http://127.0.0.1:6800/ in a browser; if the Scrapyd status page appears, the service is running correctly.
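
If you prefer to verify from code rather than a browser, Scrapyd's daemonstatus.json endpoint reports whether the service is up (a minimal sketch, assuming the default local address):

import requests

# Returns something like {"status": "ok", "running": 0, "pending": 0, "finished": 0, ...}
status = requests.get("http://127.0.0.1:6800/daemonstatus.json").json()
print(status["status"])  # "ok" means Scrapyd is up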

Package the Scrapy project and upload it to Scrapyd

The above only starts Scrapyd; it does not deploy our Scrapy project to it. For that, you need to configure the Scrapy project's scrapy.cfg file.

The configuration is as follows.
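
A typical deploy section for scrapy.cfg, using the target name qb and project name qiushi that appear in the example command below, looks roughly like this (the URL assumes the local Scrapyd started earlier):

[deploy:qb]
url = http://127.0.0.1:6800/
project = qiushi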

The packaging command takes a deploy target and a project name:

scrapyd-deploy <target> -p <project>

For example:

scrapyd-deploy qb -p qiushi

If the command completes and Scrapyd reports status "ok", the deployment succeeded.

Note: you may run into problems during this step; the solution is at the end of this article!

Go back to the browser and you will see one more project, qiushi. At this point, Scrapyd is fully configured.
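
The same check can be done through the API (again assuming the default local Scrapyd):

import requests

# listprojects.json lists every project deployed to this Scrapyd instance.
projects = requests.get("http://127.0.0.1:6800/listprojects.json").json()
print(projects["projects"])  # should now include "qiushi"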

Configure Gerapy

With all of the above configured, we can set up Gerapy. Scrapyd can actually do more than what we covered above, but it is driven entirely by commands and raw API calls, which is not very friendly.

Gerapy is a visual crawler-management framework. To use it, Scrapyd must be started and left running in the background: essentially, Gerapy just sends requests to the Scrapyd service and wraps everything in a visual interface.

It is developed on top of Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js.

Configuration steps

Gerapy is independent of the Scrapy project, so you can set it up in any folder; here I created a gerapyDemo folder.

Execute the command to initialize Gerapy:

gerapy init

1. A gerapy folder is generated.

2. Go into the generated gerapy folder: cd gerapy

3. Execute the following command to generate the database tables:

gerapy migrate

4. Start the Gerapy service. The default is port 8000; you can also specify a port to start on:

gerapy runserver
gerapy runserver 127.0.0.1:9000 (starts on local port 9000 instead)

5. Open http://127.0.0.1:8000/ in the browser; if the interface loads, the service started successfully.

In general, though, you will be greeted by a login screen first, so we need to create an account and password.

Stop the service, run gerapy createsuperuser, follow the prompts to create an account and password, then restart the service and log in with that account.

Add a crawler project to Gerapy

With all of the above configured, we can now add the crawler project and, step by step, get the crawler running.

Click Host Management -> Create. The IP is the host where the Scrapyd service runs, and the port is Scrapyd's port (6800 by default). Fill these in and click Create.

Then, in the host list, click Schedule and you can run the crawler.

Run the crawler

The run produces results, which are written to a local file.
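
The article does not show how the qiushi project writes its output, so here is a hypothetical sketch of a Scrapy item pipeline that appends each scraped item to a local JSON-lines file (the class and file names are illustrative, not the project's actual code):

import json


class JsonWriterPipeline:
    # Hypothetical pipeline, not taken from qiushi.zip.

    def open_spider(self, spider):
        self.file = open("duanzi.jl", "a", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # One JSON object per line; keep non-ASCII text readable.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

Such a pipeline would be enabled through ITEM_PIPELINES in the project's settings.py.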

Package and upload the crawler

With the process above we can run crawlers, but the setup is not yet complete: logically, we still need a packaging step. Only once the crawler is packaged can the whole thing be considered properly integrated.

Steps

1. First, copy the crawler project into the projects folder under the gerapy directory.

2. Refresh the page and click Project Management; you can see that the project's Configurable and Packaged statuses are both still marked with an x.

3. Click Deploy, write a description, and click Package.

4. Go back to the main interface and you will find that the project has been packaged correctly.

At this point, basically the whole process is over.

Fixing "scrapyd-deploy is not recognized as an internal or external command"

Normally, the first time you execute scrapyd-deploy on Windows, you will be told that scrapyd-deploy is not recognized as an internal or external command. This is expected.

Resolution steps

1. Find the Scripts folder under your Python interpreter and create two new files there: scrapy.bat and scrapyd-deploy.bat.

Give the two files the following contents.

scrapy.bat

@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapy %*

scrapyd-deploy.bat

@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapyd-deploy %*

Note: the paths above point at my hy_spider environment's interpreter; substitute the location of your own Python interpreter. Each file contains exactly two lines, @echo off followed by the command, and the trailing %* forwards all command-line arguments to the real script.

Summary of Gerapy usage process

1. gerapy init: initialize; this creates a gerapy folder under the current directory.
2. cd gerapy
3. gerapy migrate
4. gerapy runserver: defaults to 127.0.0.1:8000.
5. gerapy createsuperuser: create an account and password (there is none by default).
6. Open 127.0.0.1:8000, log in with that account, and you land on the home page.
7. Carry out the various operations: adding hosts, packaging projects, scheduled tasks, and so on.

I believe you now have a deeper understanding of "how to deploy web crawlers with Scrapy+Gerapy"; you might as well try it out in practice! Follow us and keep learning for more related content.
