Shulou (Shulou.com), 06/03 report. Updated 2025-01-17, from SLTechnology News&Howtos.
This article explains how to deploy web crawlers with Scrapy and Gerapy. The method described is simple, fast, and practical.
What the crawler management interface looks like
Dependency packages
File: requirements.txt
The contents of the file are reproduced here:
appdirs==1.4.4
APScheduler==3.5.1
attrs==20.1.0
Automat==20.2.0
beautifulsoup4==4.9.1
certifi==2020.6.20
cffi==1.14.2
chardet==3.0.4
constantly==15.1.0
cryptography==3.0
cssselect==1.1.0
Django==1.11.29
django-apscheduler==0.3.0
django-cors-headers==3.2.0
djangorestframework==3.9.2
furl==2.1.0
gerapy==0.9.5
gevent==20.6.2
greenlet==0.4.16
hyperlink==20.0.1
idna==2.10
incremental==17.5.0
itemadapter==0.1.0
itemloaders==1.0.2
Jinja2==2.10.1
jmespath==0.10.0
lxml==4.5.2
MarkupSafe==1.1.1
orderedmultidict==1.0.1
parsel==1.6.0
Protego==0.1.16
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.20
PyDispatcher==2.0.5
pyee==7.0.2
PyHamcrest==2.0.2
pymongo==3.11.0
PyMySQL==0.10.0
pyOpenSSL==19.1.0
pyppeteer==0.2.2
pyquery==1.4.1
python-scrapyd-api==2.1.2
pytz==2020.1
pywin32==228
queuelib==1.5.0
redis==3.5.3
requests==2.24.0
Scrapy==1.8.0
scrapy-redis==0.6.8
scrapy-splash==0.7.2
scrapyd==1.2.1
scrapyd-client==1.1.0
service-identity==18.1.0
six==1.15.0
soupsieve==2.0.1
tqdm==4.48.2
Twisted==20.3.0
tzlocal==2.1
urllib3==1.25.10
w3lib==1.22.0
websocket==0.2.1
websockets==8.1
wincertstore==0.2
zope.event==4.4
zope.interface==5.1.0
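Because the list pins exact versions, it can help to confirm that the active environment actually matches it before going further. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper names parse_pins and find_mismatches are made up for this illustration and are not part of pip, Scrapy, or Gerapy.

```python
from importlib import metadata


def parse_pins(text):
    """Return {package: version} for every 'name==version' line in text."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not a pin
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Map each package whose installed version differs to (wanted, installed)."""
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems


# Small illustrative sample taken from the list above.
sample = "Scrapy==1.8.0\nscrapyd==1.2.1\ngerapy==0.9.5\n"
print(parse_pins(sample))
```

To check a real environment, pass the contents of requirements.txt to parse_pins and the result to find_mismatches; an empty dictionary means every pinned package is installed at the listed version.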
Project file
Project file: qiushi.zip
Function: a crawler that collects jokes (duanzi) from Qiushibaike (the "Embarrassing Encyclopedia")
This is a Scrapy project, and its dependencies are the packages listed above
Steps to run the project
After extracting the project file, install the dependencies: pip install -r requirements.txt
Then run the command scrapy crawl duanzi --nolog
Configure Scrapyd
Think of Scrapyd as a manager for the Scrapy projects we write. Once it is set up, the crawler can be controlled through commands: starting it, stopping it, and so on.
We will not go into its other features here, since they are rarely needed. All we need to do is start it.
Start the Scrapyd service
1. Change to the qiushi crawler project directory. Commands for a Scrapy project must be run from inside the project directory.
2. Run the command scrapyd
3. Open http://127.0.0.1:6800/ in a browser. If the Scrapyd page appears, the service started correctly.
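The same check can be made without a browser. Scrapyd exposes a JSON endpoint, daemonstatus.json, that reports whether the service is up. The sketch below uses only the standard library; the function names are invented for this example, and the address is assumed to be the default one used above.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def daemon_status_url(base_url="http://127.0.0.1:6800/"):
    """Build the URL of Scrapyd's daemonstatus.json endpoint."""
    return urljoin(base_url, "daemonstatus.json")


def is_scrapyd_ok(payload):
    """Interpret the decoded JSON body returned by daemonstatus.json."""
    return payload.get("status") == "ok"


def check_scrapyd(base_url="http://127.0.0.1:6800/", timeout=5):
    """Return True if Scrapyd answers and reports status 'ok'."""
    try:
        with urlopen(daemon_status_url(base_url), timeout=timeout) as resp:
            return is_scrapyd_ok(json.load(resp))
    except (OSError, ValueError):
        # URLError is an OSError (connection refused, timeout);
        # ValueError covers a body that is not valid JSON.
        return False
```

Calling check_scrapyd() while the scrapyd command is running should return True; if the service has not been started, it returns False instead of raising.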
Package the Scrapy project and upload it to Scrapyd
The steps above only start Scrapyd; they do not deploy the Scrapy project to Scrapyd. To do that, edit the scrapy.cfg file of the Scrapy project.
The configuration is as follows
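The original configuration screenshot is not reproduced here, so the following is only a sketch of what such a scrapy.cfg might contain. The deploy target name qb and the project name qiushi are taken from the sample command in this article; the settings module qiushi.settings and the URL http://127.0.0.1:6800/ are assumptions based on Scrapy's usual project layout and Scrapyd's default address.

```python
import configparser
import io


def build_scrapy_cfg(target="qb", project="qiushi",
                     url="http://127.0.0.1:6800/"):
    """Return the text of a scrapy.cfg with one named deploy target."""
    cfg = configparser.ConfigParser()
    # Assumed settings module: <project>.settings, as Scrapy generates it.
    cfg["settings"] = {"default": f"{project}.settings"}
    # The section name carries the target name used by scrapyd-deploy.
    cfg[f"deploy:{target}"] = {"url": url, "project": project}
    buffer = io.StringIO()
    cfg.write(buffer)
    return buffer.getvalue()


print(build_scrapy_cfg())
```

The printed text has a [settings] section and a [deploy:qb] section holding the url and project keys. Compare it with the scrapy.cfg in your own project rather than overwriting that file blindly.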
Packaging command
scrapyd-deploy <deploy target name> -p <project name>
The command for this example
scrapyd-deploy qb -p qiushi
Output like that shown in the figure indicates success
Note: problems may come up during this step; the solution is given later in this article.
If you go back to the browser, there is now one more project, qiushi. At this point, Scrapyd is configured.
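The presence of the new project can also be confirmed through Scrapyd's listprojects.json endpoint. This is a minimal standard-library sketch; the function names are hypothetical, and the base URL is assumed to be the default address used earlier.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def project_is_deployed(payload, project="qiushi"):
    """Check a decoded listprojects.json response for the project name."""
    return (payload.get("status") == "ok"
            and project in payload.get("projects", []))


def fetch_projects(base_url="http://127.0.0.1:6800/", timeout=5):
    """Ask Scrapyd which projects it currently holds (needs a running Scrapyd)."""
    url = urljoin(base_url, "listprojects.json")
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

With Scrapyd running, project_is_deployed(fetch_projects()) should return True once the scrapyd-deploy command has succeeded.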
Configure Gerapy
With everything above in place, you can configure Gerapy. Scrapyd can in fact do much more than what is described above, but it is operated through commands, which is not very friendly.
Gerapy is a visual crawler management framework. Scrapyd must be started and left running in the background while Gerapy is in use, because Gerapy essentially sends requests to the Scrapyd service and puts a visual interface on top.
It is built on Scrapy, Scrapyd, Scrapyd-Client, Scrapy-Redis, Scrapyd-API, Scrapy-Splash, Jinja2, Django, and Vue.js.
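To make "a request to the Scrapyd service" concrete, here is a sketch of the kind of HTTP call involved in starting a spider: a POST to Scrapyd's schedule.json endpoint with the project and spider names. This is not Gerapy's own code, only an illustration with the standard library; the function names are invented, and the project qiushi and spider duanzi come from the commands earlier in this article.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def build_schedule_request(base_url="http://127.0.0.1:6800/",
                           project="qiushi", spider="duanzi"):
    """Prepare the POST request that asks Scrapyd to start one spider run."""
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    return Request(urljoin(base_url, "schedule.json"), data=body, method="POST")


def schedule_spider(timeout=5, **kwargs):
    """Send the request and return Scrapyd's decoded JSON reply."""
    with urlopen(build_schedule_request(**kwargs), timeout=timeout) as resp:
        return json.load(resp)
```

With Scrapyd running and the project deployed, schedule_spider() returns a small JSON document that includes a job id on success. A visual tool such as Gerapy saves you from issuing calls like this by hand.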
Configuration steps
Gerapy is independent of the Scrapy project, so any folder will do. Here I created a gerapyDemo folder.
Run the command to initialize Gerapy
gerapy init
1. A gerapy folder is generated
2. Go into the generated gerapy folder
3. Run the following command to generate the database tables
gerapy migrate
4. Start the Gerapy service. The default port is 8000; you can also specify the address and port.
gerapy runserver
gerapy runserver 127.0.0.1:9000 (starts on local port 9000)
5. Open a browser and go to http://127.0.0.1:8000/. If the following interface appears, the service is running.
In most cases, though, you will see a login page, so an account and password need to be created.
Stop the service, run the command gerapy createsuperuser, follow the prompts to create an account and password, then start the service again and log in with that account.
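The non-interactive part of these steps can be scripted. The sketch below builds the command sequence and runs it with subprocess; the helper names are hypothetical, and it assumes the gerapy command is on the PATH. The createsuperuser step is left out because it prompts for input.

```python
import subprocess
from pathlib import Path


def gerapy_setup_commands(workdir, host="127.0.0.1", port=8000):
    """Return (argv, cwd) pairs for the init, migrate and runserver steps."""
    workdir = Path(workdir)
    project_dir = workdir / "gerapy"  # folder created by `gerapy init`
    return [
        (["gerapy", "init"], workdir),
        (["gerapy", "migrate"], project_dir),
        (["gerapy", "runserver", f"{host}:{port}"], project_dir),
    ]


def run_setup(workdir):
    """Run the steps in order; runserver blocks until it is interrupted."""
    for argv, cwd in gerapy_setup_commands(workdir):
        subprocess.run(argv, cwd=cwd, check=True)
```

For example, run_setup("gerapyDemo") would initialize Gerapy inside the gerapyDemo folder and then serve it on the default port. Because check=True is set, a failing step stops the sequence with an exception instead of continuing silently.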
Add a crawler project to Gerapy
With the configuration above done, we can add the crawler project, and after a few clicks the crawler can be run.
Click Host Management -> Create. The IP is the host running the Scrapyd service, and the port is the Scrapyd port, 6800 by default. Fill these in and click Create.
Then, from Schedule in the host list, you can run the crawler.
Run the crawler
The results are retrieved and have been written to local storage
Package and upload the crawler
The process above only lets us run crawlers, which is not the whole picture. Properly speaking, a packaging step is still needed; only once the crawler is packaged through Gerapy are the two truly integrated.
Steps
1. First, copy the crawler project into the projects folder under gerapy.
2. Refresh the page and click Project Management. The Configurable and Packaged statuses both show x.
3. Click Deploy, write a description, and click Package
4. Back on the main interface, you can see that the project is now packaged correctly.
At this point, the whole process is essentially complete.
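Step 1 above, copying the project into the projects folder, can be sketched in a few lines. The function name is made up for this example, and the check for scrapy.cfg is only a sanity test added here, on the assumption that the project root contains that file as Scrapy projects normally do.

```python
import shutil
from pathlib import Path


def copy_into_gerapy(project_dir, gerapy_dir):
    """Copy a Scrapy project folder into <gerapy_dir>/projects/<name>."""
    project_dir = Path(project_dir)
    if not (project_dir / "scrapy.cfg").is_file():
        raise ValueError(f"{project_dir} does not look like a Scrapy project")
    target = Path(gerapy_dir) / "projects" / project_dir.name
    # copytree creates missing parent folders and fails if target exists,
    # so an earlier copy is never silently overwritten.
    shutil.copytree(project_dir, target)
    return target
```

For instance, copy_into_gerapy("qiushi", "gerapyDemo/gerapy") would place the project at gerapyDemo/gerapy/projects/qiushi, after which refreshing the Project Management page should list it.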
Resolving "scrapyd-deploy is not an internal or external command"
On Windows, running scrapyd-deploy usually produces the message that scrapyd-deploy is not recognized as an internal or external command. This is expected.
Resolution steps
1. Find the Scripts folder under the Python interpreter and create two new files there, scrapy.bat and scrapyd-deploy.bat
Edit the two files as follows
scrapy.bat
@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapy %*
scrapyd-deploy.bat
@echo off
D:\programFiles\miniconda3\envs\hy_spider\python D:\programFiles\miniconda3\envs\hy_spider\Scripts\scrapyd-deploy %*
Note: the path marked with a red box in the original figure is the location of the interpreter; replace it with your own. Each file's content may paste as a single line; it must be split into two lines as shown.
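Since only the interpreter path differs from machine to machine, the wrapper text can be generated rather than typed. The sketch below is one possible way to do that; the function name wrapper_text is invented here, and the leading @ on echo off and the quoting of both paths (useful when a path contains spaces) are choices made for this example rather than something confirmed by the original files.

```python
from pathlib import PureWindowsPath


def wrapper_text(python_exe, script_name):
    """Return the two-line body of a .bat wrapper for a script in Scripts."""
    python_exe = PureWindowsPath(python_exe)
    # The Scripts folder sits next to the interpreter in this layout.
    script = python_exe.parent / "Scripts" / script_name
    return f'@echo off\n"{python_exe}" "{script}" %*\n'


# Interpreter path taken from the example above; substitute your own.
interpreter = r"D:\programFiles\miniconda3\envs\hy_spider\python"
for name in ("scrapy", "scrapyd-deploy"):
    print(f"--- {name}.bat ---")
    print(wrapper_text(interpreter, name))
```

PureWindowsPath handles the backslash-separated paths on any operating system, so the text can be prepared elsewhere and then saved into the Scripts folder as scrapy.bat and scrapyd-deploy.bat.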
Summary of the Gerapy workflow
1. gerapy init: initializes Gerapy and creates a gerapy folder in the current folder
2. cd gerapy
3. gerapy migrate
4. gerapy runserver: the default address is 127.0.0.1:8000
5. gerapy createsuperuser: creates an account and password; none exists by default
6. Open 127.0.0.1:8000 in a browser, enter the account and password, and go to the home page
7. Carry out the various operations, such as adding hosts, packaging projects, and setting up scheduled tasks
By now you should have a better understanding of how to deploy web crawlers with Scrapy+Gerapy. The best way to consolidate it is to try it out yourself.