How to quickly build a practical crawler management platform


This article explains how to quickly build a practical crawler management platform. The content is simple, clear, and easy to learn; let's follow the editor's train of thought and learn how to quickly build a practical crawler management platform.

How important are crawlers?

For search engines, crawlers are indispensable; for public opinion monitoring companies, crawlers are the foundation; for NLP, crawlers supply corpora; and for startups, crawlers provide initial content. However, crawler technology is varied and complex, and different crawling scenarios call for different techniques. For example, simple static pages can be handled directly with an HTTP request plus an HTML parser; dynamic pages require browser automation tools such as Puppeteer or Selenium; sites with anti-crawling measures call for proxies, CAPTCHA solving, and similar techniques; and so on. A mature crawler management platform is therefore needed to help enterprises or individuals handle a large variety of crawlers.
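As a rough illustration of the static-page case above, here is a minimal Python sketch using the requests and BeautifulSoup libraries; the URL and CSS selector are made-up placeholders.

import requests
from bs4 import BeautifulSoup

# Static page: a plain HTTP request plus an HTML parser is enough.
# The URL and selector below are placeholders for illustration only.
resp = requests.get('https://example.com/articles')
resp.raise_for_status()

soup = BeautifulSoup(resp.text, 'html.parser')
titles = [a.get_text(strip=True) for a in soup.select('h2 a')]
print(titles)

A dynamic page, by contrast, would need a tool such as Puppeteer or Selenium to render the JavaScript before the same parsing step could run.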

Understanding the definition of a crawler management platform

A crawler management platform is a one-stop management system that integrates crawler deployment, task scheduling, task monitoring, result display, and other modules. It usually provides a visual UI, and crawlers can be managed effectively by interacting with that UI in the browser. Generally speaking, a crawler management platform supports distributed operation and can run collaboratively across multiple machines.

Of course, this definition is a narrow one, and it usually applies to technicians, developers, or technical managers. Enterprises generally develop their own internal crawler management systems to cope with complex crawler management requirements; such a system is the narrowly defined crawler management platform described above.

Crawler management platforms in the broad sense

And what is a crawler management platform in the broad sense? You may have heard of Archer (later rebranded as the Houyi Collector) and Octopus. The former is a cloud-based service on which crawlers can be written, run, and monitored online; among broad-sense crawler platforms it comes closest to the narrowly defined crawler management platform. The latter is a popular commercial crawling tool that lets novice users write and run crawlers by drag and drop and export the data. You may also have seen various API aggregation service providers, such as JuHe Data, which let you obtain data by calling a ready-made website interface directly; these are really another variation of the crawler platform, one that does the crawler writing for you. In between sits a foreign company called Kimono Labs, which developed a Chrome extension called Kimono: users could visually click elements on a page to generate crawling rules, and a crawler program would be generated on the Kimono Labs site; once a task was submitted, the back end would crawl the data automatically. Kimono was a great crawler application, but unfortunately Kimono Labs was acquired by the big-data company Palantir, and the service can no longer be used.

This article focuses on the narrowly defined crawler management platform, so any mention of a crawler management platform below refers to the narrow definition.

Modules of a crawler management platform

The following diagram shows the architecture of a typical crawler management platform.

Architecture of a crawler management platform

The modules of a typical crawler management platform mainly include the following:

Task management: how crawling tasks are executed and scheduled, and how they are monitored, including log monitoring and so on.

Crawler management: crawler deployment, that is, packaging or copying a developed crawler to the appropriate nodes, as well as crawler configuration and version management.

Node management: registration and monitoring of nodes (servers/machines), communication between nodes, how node performance is monitored, and so on.

Front-end application: a visual UI that lets users interact with the back-end application.

Of course, some crawler management platforms include more modules than these, and may offer other practical features such as configurable crawling rules, visual rule configuration, a proxy pool, a cookie pool, exception monitoring, and so on.

Why do you need a crawler management platform

With a crawler management platform, developers, and crawler engineers in particular, can easily add crawlers, execute tasks, and view results without switching back and forth between command lines, which is very error-prone. A common scenario: a crawler engineer starts out managing crawler tasks with scrapy and crontab. He has to choose the scheduled intervals carefully so that the server's CPU and memory are not exhausted. The thornier problem is that he also has to save the logs generated by scrapy to files, and when a crawler fails he has to dig through the logs one by one with shell commands to locate the cause, which can easily eat a whole day. An even more serious problem arises as the company's business grows: he may need to write hundreds of crawlers to meet the company's needs, and managing them with scrapy and crontab becomes a complete nightmare. The poor crawler engineer could in fact solve all of this by choosing a suitable crawler management platform.
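For a sense of how quickly this becomes unmanageable, a crontab-based setup might look roughly like the sketch below; the project path, spider names, and log locations are made up for illustration.

# run each spider nightly and append its output to a per-spider log file
0 2 * * * cd /home/crawler/project && scrapy crawl site_a >> /var/log/spiders/site_a.log 2>&1
30 2 * * * cd /home/crawler/project && scrapy crawl site_b >> /var/log/spiders/site_b.log 2>&1
# ...and so on for every spider, hand-tuning start times so runs do not overlap

Every new spider means another line, another log file to inspect by hand, and another start time to juggle against server load.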

How to choose a suitable crawler management platform

Suppose you want to solve the problems the crawler engineer above ran into and have decided to adopt a suitable crawler management platform.

The first question you should answer is: do we really need to develop a system from scratch? To answer it, first answer the following questions:

Are our requirements complex enough to justify fully customized development of a new system (for example, complex permission management)?

Does our team have the technical strength to develop this system (for example, experienced front-end and back-end engineers)?

Do we have enough time to develop the system (for example, a project planning cycle of a year)?

If the answer to any of these three questions is "no", you should consider using an existing open-source crawler management platform to meet your needs.

The following open-source crawler management platforms are currently available on the market:

Generally speaking, SpiderKeeper is probably the earliest crawler management platform, but its functionality is relatively limited. Gerapy is feature-complete and has an attractive interface, but it still has many bugs to work through; users who need it are advised to wait for version 2.0. ScrapydWeb is a fairly complete crawler management platform, but like the previous two it is based on scrapyd and can therefore only run Scrapy crawlers. Crawlab is a very flexible, full-featured crawler management platform that can run crawlers written in Python, Node.js, Java, PHP, and Go; it is more troublesome to deploy than the first three, although for Docker users a single deployment is enough (more on this later).

Therefore, developers who rely heavily on Scrapy crawlers and don't want to fuss should consider ScrapydWeb, while crawler developers dealing with a mix of complex technology stacks should give priority to the more flexible Crawlab. It's not that Crawlab doesn't support Scrapy well; Crawlab integrates with Scrapy very well, as discussed later.

As the author of Crawlab, I don't want to toot my own horn; I just want to recommend the best technology to developers so that they can decide which crawler management platform to use according to their own needs.

A brief introduction to the crawler management platform Crawlab

Crawlab is a distributed crawler management platform based on Golang, which supports Python, NodeJS, Java, Go, PHP and other programming languages as well as a variety of crawler frameworks.

Crawlab has been well received by crawler enthusiasts and developers since its launch in March of this year, and many users have said they will use Crawlab to build their company's crawler platform. After several months of iteration, Crawlab has successively added scheduled tasks, data analysis, site information, configurable crawlers, automatic field extraction, result download, crawler upload, and other features, making the platform more practical and comprehensive and genuinely helping users solve the difficulties of crawler management. Today Crawlab has nearly 1k stars on GitHub, related communities have been established, and a quarter of users say they have applied Crawlab to crawler management in their companies. Clearly, Crawlab has earned the attention and affection of developers.

The problems Crawlab solves

Crawlab mainly solves the difficulty of managing large numbers of crawlers. For example, a project that mixes scrapy and selenium crawlers to monitor hundreds of websites is hard to manage all at once; the cost of managing it from the command line is high, and mistakes are easy to make. Crawlab supports any language and any framework, and with task scheduling and task monitoring it becomes easy to monitor and manage large-scale crawler projects effectively.

Interface and use

The following is a screenshot of the Crawlab crawler list page.

Crawlab crawler list

Users only need to upload a crawler to Crawlab, configure the execution command, and click the "Run" button to run a crawler task. Crawler tasks can be run on any node. As the figure above shows, Crawlab has modules for node management, crawler management, task management, scheduled tasks, user management, and so on.

Overall architecture

The following is the overall architecture diagram of Crawlab, which consists of five parts:

Master node (Master Node): responsible for task dispatch, the API, deploying crawlers, etc.

Worker node (Worker Node): responsible for executing crawler tasks.

MongoDB database: stores day-to-day operational data such as nodes, crawlers, and tasks.

Redis database: stores information such as the task message queue and node heartbeats.

Front-end client: a Vue application responsible for front-end interaction and requesting data from the back end.

Crawlab architecture

How to use Crawlab and detailed principles are beyond the scope of this article. If you are interested, you can refer to the Github home page or related documentation.

Github address and Demo

View the demo (linked from the GitHub page).

Github: https://github.com/tikazyq/crawlab

Install Crawlab with Docker

The Crawlab Docker image

Docker is the most convenient and concise way to deploy Crawlab. There are other deployment methods, including direct deployment, but they are not recommended for developers who want to get the platform up quickly. Crawlab's images are published on Docker Hub, and developers only need to execute the command docker pull tikazyq/crawlab to download the Crawlab image.

Readers can view the Crawlab image on Docker Hub; it is less than 300 MB. Address: https://hub.docker.com/r/tikazyq/crawlab/tags

Dockerhub Page

Install Docker

To deploy Crawlab with Docker, you must first make sure Docker is installed. Please refer to the official documentation to install it.
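On a typical Linux machine, one common way (not the only one) is Docker's convenience install script; treat the following as a sketch and prefer the official instructions for your distribution.

# download and run Docker's convenience install script (inspect the script before running it)
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# verify the installation
docker --version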

Install Docker Compose

Docker Compose is a lightweight tool for running multi-container Docker applications; we will use Docker Compose to deploy Crawlab with one click.

Docker's official website already has tutorials on how to install Docker Compose. Click the link to see it. Here is a brief introduction.

For Linux users, please install with the following command.

# download docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.24.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose

# make docker-compose executable
sudo chmod +x /usr/local/bin/docker-compose

Pull the image

Before pulling the image, you need to configure a registry mirror. In China, pulling from the original registry is not very fast, so it is necessary to use a domestic Docker Hub accelerator. Create the file /etc/docker/daemon.json and enter the following.

{"registry-mirrors": ["https://registry.docker-cn.com"]}"

Pulling the image will then be much faster. Of course, you can also use other registry mirrors, which you can find online. Execute the following command to pull the Crawlab image.

docker pull tikazyq/crawlab:latest

The following picture shows the command line interface when pulling the image.

Docker pull

Start Crawlab

We will start Crawlab and its dependent databases MongoDB and Redis with Docker Compose. First, we need to modify Docker Compose's yaml configuration file, docker-compose.yml. This configuration file defines the container service (Container Services) and network configuration (Network Configuration) that need to be started. Here we use the docker-compose.yml that comes with Crawlab.

version: '3.3'  # Docker Compose version number (see note later)
services:  # services
  master:  # service name
    image: tikazyq/crawlab:latest  # service image name
    container_name: master  # service container name
    environment:  # environment variables passed to the container
      CRAWLAB_API_ADDRESS: "localhost:8000"  # API address called by the front end, default localhost:8000
      CRAWLAB_SERVER_MASTER: "Y"  # whether this is the master node, Y/N
      CRAWLAB_MONGO_HOST: "mongo"  # MongoDB host; inside Docker Compose the service name can be referenced
      CRAWLAB_REDIS_ADDRESS: "redis"  # Redis host; inside Docker Compose the service name can be referenced
    ports:  # mapped ports
      - "8080:8080"  # front-end port
      - "8000:8000"  # back-end port
    depends_on:  # dependent services
      - mongo  # MongoDB
      - redis  # Redis
  worker:  # worker node, configured similarly to the master node (comments not repeated)
    image: tikazyq/crawlab:latest
    container_name: worker
    environment:
      CRAWLAB_SERVER_MASTER: "N"
      CRAWLAB_MONGO_HOST: "mongo"
      CRAWLAB_REDIS_ADDRESS: "redis"
    depends_on:
      - mongo
      - redis
  mongo:  # MongoDB service name
    image: mongo:latest  # MongoDB image name
    restart: always  # restart policy is "always"
    ports:  # mapped ports
      - "27017:27017"
  redis:  # Redis service name
    image: redis:latest  # Redis image name
    restart: always  # restart policy is "always"
    ports:  # mapped ports
      - "6379:6379"

Readers can adjust docker-compose.yml to their own requirements. Pay particular attention to the environment variable CRAWLAB_API_ADDRESS; many beginners cannot log in because this variable is configured incorrectly. In most cases you do not need to change any configuration. For common problems, please refer to the Q&A, and see the detailed documentation on environment variables for configuring Crawlab for your environment.

Then run the following command to start Crawlab. You can add the -d parameter to make Docker Compose run in the background.

docker-compose up

After running the above command, Docker Compose will pull the images of MongoDB and Redis, which may take several minutes. After pulling, the four services will start in turn, and you will see the following on the command line.

Docker-compose

Normally, you should see all four services start successfully and print their logs without errors.

If you started Docker Compose on your local machine, enter http://localhost:8080 in your browser and you should see the login screen; if you started Docker Compose on another machine, enter http://<ip>:8080 in the browser, where <ip> is the IP address of that machine (make sure port 8080 is open to the outside).

Login

The initial login username and password are admin/admin; you can log in with these. If your environment variable CRAWLAB_API_ADDRESS is not set correctly, the login button may just keep spinning after you click it, with no further hint. In that case, set CRAWLAB_API_ADDRESS correctly in docker-compose.yml (replace localhost with the IP address of the machine running Crawlab), restart with docker-compose up, and then enter http://<ip>:8080 in the browser again.
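As a sketch, the relevant fragment of docker-compose.yml would then look like the following; 192.168.1.10 is a made-up example address, so substitute your own server's IP.

master:
  image: tikazyq/crawlab:latest
  environment:
    CRAWLAB_API_ADDRESS: "192.168.1.10:8000"  # the address the browser uses to reach the back end
    CRAWLAB_SERVER_MASTER: "Y"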

After logging in, you will see the home page of Crawlab.

Home

This article focuses on how to build the crawler management platform Crawlab, so it does not describe how to use Crawlab in detail (that may be covered in a separate article; follow along if you are interested). If anything is unclear, please check the documentation to learn how to use it.

How to integrate crawlers such as Scrapy into Crawlab

As we all know, Scrapy is a very popular crawler framework; its flexible design, high concurrency, ease of use, and extensibility have led many developers and enterprises to adopt it widely. Almost all crawler management platforms on the market support Scrapy crawlers, and Crawlab is no exception; Crawlab can also run puppeteer, selenium, and other crawlers. The following shows how to run a Scrapy crawler in Crawlab.

The basic principle of how Crawlab executes crawlers

The principle by which Crawlab executes a crawler is very simple: it is really just a shell command. The user enters the shell command that runs the crawler in the crawler's settings, for example scrapy crawl some_spider, and the Crawlab executor reads the command and executes it directly in a shell. Each run of a crawler task is therefore the execution of one shell command (the real situation is of course more complicated than this; see the official documentation if you are interested). Crawlab also supports displaying and exporting crawler results, but this requires a little extra work.
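Conceptually, from the crawler process's point of view a run looks something like the sketch below: Crawlab launches the configured shell command and exposes run-specific information through environment variables. CRAWLAB_TASK_ID and CRAWLAB_COLLECTION are the variables used in the pipeline example later; the rest is illustrative.

import os

# a crawler started by Crawlab can read run-specific context from environment variables
task_id = os.environ.get('CRAWLAB_TASK_ID')        # ID of the current task run
collection = os.environ.get('CRAWLAB_COLLECTION')  # name of the result collection
print('running task %s, results go to collection %s' % (task_id, collection))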

Write Pipeline

To integrate a Scrapy crawler, all we really need to do is store the data the crawler scrapes in Crawlab's database and associate it with a task ID. Every time a crawler task is executed, the task ID is passed to the crawler through an environment variable, so all we need to do is save each result to the database together with the task ID. (Crawlab only supports MongoDB for now; relational databases such as MySQL, SQL Server, and Postgres will be supported later, so users who need them can follow the project.)

In Scrapy, we need to write storage logic. The schematic code is as follows:

# import related libraries; pymongo is the MongoDB client library
import os
from pymongo import MongoClient

# MongoDB configuration parameters
MONGO_HOST = '192.168.99.100'
MONGO_PORT = 27017
MONGO_DB = 'crawlab_test'

class JuejinPipeline(object):
    mongo = MongoClient(host=MONGO_HOST, port=MONGO_PORT)  # mongo connection instance
    db = mongo[MONGO_DB]  # database instance
    col_name = os.environ.get('CRAWLAB_COLLECTION')  # collection name, passed via the environment variable CRAWLAB_COLLECTION
    if not col_name:
        col_name = 'test'  # if CRAWLAB_COLLECTION is not set, the default collection name is 'test'
    col = db[col_name]  # collection instance

    # called for every item yielded by the spider; the parameters are item and spider
    def process_item(self, item, spider):
        item['task_id'] = os.environ.get('CRAWLAB_TASK_ID')  # task ID, passed via the environment variable CRAWLAB_TASK_ID
        self.col.save(item)  # save the item in the database
        return item

At the same time, you also need to add a task_id field to items.py so that the value can be assigned (this is important).
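For reference, a minimal items.py along these lines might look as follows; the item class name and the other fields are placeholders, and only the task_id field is the part Crawlab needs.

import scrapy

class JuejinItem(scrapy.Item):
    # fields scraped by the spider (placeholders for illustration)
    title = scrapy.Field()
    url = scrapy.Field()
    # field required so the pipeline can attach the Crawlab task ID
    task_id = scrapy.Field()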

Upload and configure crawler

Before running the crawler, you need to upload the crawler file to the master node. The steps are as follows:

Package the crawler files into a zip archive (be sure to zip from directly inside the root directory).

Click "Crawlers" in the sidebar to open the crawler list, click the "Add Crawler" button, and select "Custom Crawler".

Click the upload button and select the zip file you just packaged.

After the upload succeeds, the newly added custom crawler appears in the crawler list, and the upload is complete.

You can click the "File" tab in the crawler details, select a file, and edit the code in the file.

Next, enter the crawler's shell execution command in the "Execute Command" field of the "Overview" tab. Scrapy is built into Crawlab's Docker image, so Scrapy crawlers can be run directly. The command is scrapy crawl <spider_name>, with your spider's name substituted in. Click the "Save" button to save the crawler configuration.

Run a crawler task

Then it's time to run a crawler task. This is very simple: click the "Run" button in the "Overview" tab, and the crawler task starts. If the log says the scrapy command cannot be found, change scrapy to the absolute path /usr/local/bin/scrapy and it will run successfully.

The task's running status is displayed on the "Tasks" page and in the crawler's "Overview", and is refreshed every 5 seconds, so you can check it there. In the crawler's "Results" tab you can preview the result details and export the data to a CSV file.

Building a continuous integration (CI) workflow

For enterprises, software development is generally an automated process that goes through requirements, development, deployment, testing, and release. This process is iterative and needs to be updated and released continuously.

Take a crawler as an example. You put a crawler into production that periodically scrapes a website's data. One day you suddenly find that no data is being collected; you quickly track down the cause and discover that the website has been redesigned, so you need to change the crawler's extraction rules to cope with the new layout. In short, you need to release a code update. The quickest option is to change the code directly on the server, but that is very dangerous: first, you cannot test the updated code, and can only verify whether the crawl succeeds by repeatedly adjusting the live code; second, the change is not recorded anywhere, and if something goes wrong later you are likely to overlook it, causing bugs. What you need is to manage your crawler code with a version control tool. There are many such tools; the most commonly used are git and subversion, and version management platforms include GitLab, Bitbucket, self-hosted Git repositories, and so on.

When we update the code, we need to publish the updated code to the production server. You can write your own deployment scripts or, more conveniently, use Jenkins as a continuous integration (Continuous Integration) platform. Jenkins can pull code from the version repository and deploy the update; it is a very useful tool found in many enterprises. The figure below is an example of how a Crawlab crawler can fit into a continuous integration workflow.

CI

There are two ways to create or update a crawler in Crawlab:

Upload the packaged zip file

Change the crawler files in the CRAWLAB_SPIDER_PATH directory on the master node.

For continuous integration, we use the second approach. The steps are as follows:

Set up a code repository on GitLab or another platform.

Create a project in Jenkins and point its code source to the repository created earlier.

Write the build steps in the Jenkins project so that the publish destination points to Crawlab's CRAWLAB_SPIDER_PATH; if Crawlab runs in Docker, mount that path to the host file system.

The Jenkins project's build steps can be written directly in the UI, or you can use a Jenkinsfile; a rough sketch follows below, and you can check the relevant materials for details.
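As a sketch of what such a Jenkinsfile might contain, assuming a repository that holds the spider in a my_spider directory and a host path /opt/crawlab/spiders mounted as CRAWLAB_SPIDER_PATH (both names are made up for illustration):

pipeline {
    agent any
    stages {
        stage('Checkout') {
            steps {
                // pull the latest crawler code from the repository configured in the Jenkins project
                checkout scm
            }
        }
        stage('Deploy to Crawlab') {
            steps {
                // copy the spider into the directory mounted as CRAWLAB_SPIDER_PATH on the master node
                sh 'cp -r my_spider /opt/crawlab/spiders/'
            }
        }
    }
}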

In this way, every time a code update is pushed to the repository, Jenkins publishes the updated code to Crawlab, and the Crawlab master node synchronizes the crawler code to the worker nodes for crawling.

Thank you for reading. The above is the content of "how to quickly build a practical crawler management platform". After studying this article, I believe you have a deeper understanding of how to quickly build a practical crawler management platform; the specifics, of course, still need to be verified in practice. The editor will push more related articles to you; welcome to follow!
