
What is the core principle of Crawlab?


This article explains the core principles of Crawlab in detail. It is shared here as a reference, and after reading it you should have a good understanding of the relevant concepts.

Why do you need a crawler management platform

For the average crawler developer, writing a stand-alone crawler script is usually enough; an excellent crawler framework such as Scrapy makes it easy to write and debug a simple crawler, and scheduled runs can be set up directly in Crontab in a few minutes. Enterprises, however, generally have higher requirements for crawlers, and the core issue is scale. Scale here comes in two forms: one is that the crawler needs to fetch a large amount of data (Volume), for example crawling Taobao products across the whole web; the other is that the crawler needs to cover a large number of websites (Coverage), as a search engine does.

Different scales call for different architecture strategies (as shown in the figure below):

When there is only one website and the amount of crawled data is small, a single machine is enough; there is no need for a distributed crawler.

However, when the volume of crawled data needs to increase, for example crawling all of Taobao across the whole web, a distributed crawler becomes necessary, because the bandwidth and computing resources of a single machine are not enough to crawl the entire web, and large numbers of proxy IPs are needed to cope with anti-crawling measures.

Similarly, when the number of websites to be crawled increases, for example when building a news search engine, multiple machines are also needed to provide enough bandwidth and computing resources.

Applications that require both high Volume and high Coverage are beyond what ordinary small companies or individuals can handle, because the demands on both human and machine resources are very high.

A crawler management platform is a distributed management platform aimed at situations (2), (3) and (4): it lets users easily manage multiple crawlers, or crawlers running across multiple machines.

Crawlab has been oriented toward distributed crawlers since its birth. At first it used Celery as the distributed task scheduling engine, Redis as the message queue and HTTP requests as the medium for node communication, which was a simple way to achieve distributed management. However, as people kept using Crawlab, this approach turned out to be inconvenient: users had to specify each node's IP address and API port, and could not choose which node a task would run on. Because of these problems, when the back end was rewritten in Golang in the latest version v0.3.0, Celery was dropped and the monitoring and communication of distributed nodes was implemented in-house, which is more flexible and efficient. This article introduces the core principles, focusing on the distributed architecture of Crawlab (the Golang version).

Overall architecture

The overall architecture of Crawlab, shown in the figure below, consists of five parts:

Master node (Master Node): responsible for task dispatching, the API, deploying crawlers, etc.

Worker node (Worker Node): responsible for executing crawler tasks

MongoDB database: stores routine data such as nodes, crawlers and tasks

Redis database: stores information such as the task message queue and node heartbeats

Front-end client: a Vue application responsible for front-end interaction and requesting data from the back end.

Let's take the execution of a crawler task, a common usage scenario in Crawlab, as an example and see how it works:

The front end sends a request to the master node, specifying that the task should be run on a certain worker node

The master node receives the request and pushes the task data onto the Redis task queue

The worker node continuously listens to the Redis task queue and fetches the task with LPOP

The worker node executes the task and writes the results to the result storage database

This is the general process of executing a crawler task. Of course this is not the whole story; details such as log handling, concurrent execution and task cancellation also need to be considered. For the specifics, please check the relevant documentation and source code.
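To make this flow concrete, below is a minimal Go sketch of the queue hand-off between the master and a worker, using the go-redis client. The queue key tasks:<node_id> and the Task fields are illustrative assumptions, not Crawlab's actual schema.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// Task is a hypothetical task payload; Crawlab's real schema may differ.
type Task struct {
	ID       string `json:"id"`
	SpiderID string `json:"spider_id"`
	NodeID   string `json:"node_id"`
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Master side: serialize the task and push it onto the node's queue.
	task := Task{ID: "t1", SpiderID: "s1", NodeID: "node-1"}
	payload, _ := json.Marshal(task)
	if err := rdb.RPush(ctx, "tasks:"+task.NodeID, payload).Err(); err != nil {
		panic(err)
	}

	// Worker side: keep polling the queue with LPOP and execute whatever arrives.
	for {
		raw, err := rdb.LPop(ctx, "tasks:node-1").Result()
		if err == redis.Nil {
			time.Sleep(time.Second) // queue is empty: wait and retry
			continue
		} else if err != nil {
			panic(err)
		}
		var t Task
		_ = json.Unmarshal([]byte(raw), &t)
		fmt.Printf("executing spider %s for task %s\n", t.SpiderID, t.ID)
		break // a real worker would keep looping forever
	}
}
```

In practice the worker would also update the task status in MongoDB and write crawl results to the result storage, as described above.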

Broadly speaking, the master node can be regarded as the control center of Crawlab's overall architecture, its brain; the worker nodes are the parts that actually do the work, its limbs; and MongoDB and Redis handle communication and can be seen as its blood and nervous system. Together these modules form a complete, self-consistent and cooperating system.

Node registration and monitoring

Node monitoring is mainly done through Redis (as shown in the figure below).

The worker node constantly refreshes its heartbeat information in Redis via HSET on the nodes hash; the heartbeat includes the node's MAC address, IP address and the current timestamp.

The master node periodically fetches the worker nodes' heartbeat information from Redis. If a worker node's heartbeat timestamp is more than 60 seconds old, the node is considered offline: its information is deleted from Redis and its status is set to "offline" in MongoDB. If the timestamp is within the past 60 seconds, the node's information is kept and its status is set to "online" in MongoDB.
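As a minimal sketch of this heartbeat mechanism, the snippet below (using the go-redis client, with an illustrative heartbeat payload rather than Crawlab's exact format) shows a worker writing to the nodes hash and the master checking the 60-second threshold.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

// heartbeat is an illustrative payload; Crawlab's actual fields may differ.
type heartbeat struct {
	Mac string `json:"mac"`
	IP  string `json:"ip"`
	Ts  int64  `json:"ts"` // unix timestamp of the last heartbeat
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Worker side: refresh this node's entry in the "nodes" hash.
	hb := heartbeat{Mac: "aa:bb:cc:dd:ee:ff", IP: "10.0.0.2", Ts: time.Now().Unix()}
	data, _ := json.Marshal(hb)
	if err := rdb.HSet(ctx, "nodes", hb.Mac, data).Err(); err != nil {
		panic(err)
	}

	// Master side: scan all heartbeats and mark nodes older than 60s as offline.
	entries, err := rdb.HGetAll(ctx, "nodes").Result()
	if err != nil {
		panic(err)
	}
	for mac, raw := range entries {
		var h heartbeat
		if err := json.Unmarshal([]byte(raw), &h); err != nil {
			continue
		}
		if time.Now().Unix()-h.Ts > 60 {
			fmt.Printf("node %s is offline\n", mac) // would also delete from Redis and update MongoDB
		} else {
			fmt.Printf("node %s is online\n", mac)
		}
	}
}
```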

Advantages of this architecture

In this way, a node registration system is implemented to monitor whether nodes are online. The advantage of this architecture is that nodes do not need to know each other's IP addresses or ports, as they would with HTTP or RPC; knowing the address of Redis is enough to complete node registration and monitoring. This reduces the configuration users have to do and simplifies the workflow, and it is also more secure because IP addresses and ports are not exposed. In addition, compared with the Celery-based monitoring, the Flower service has been removed, so a separate Flower process no longer needs to run, which reduces overhead.

The figure below shows the relationship between nodes (the topology diagram) in the Crawlab UI.

Shortcomings of the architecture

Compared with some mature distributed architectures such as ZooKeeper, Crawlab still has some imperfections.

High availability (High Availability) is something Crawlab has not yet done well. For example, when the master node goes down, the whole system is paralysed, because the master node is the brain of Crawlab and is responsible for many functions: if it goes down, the front end cannot fetch API data, tasks cannot be scheduled, and of course nodes cannot be monitored. Although ZooKeeper does not handle Availability perfectly either, its election mechanism guarantees a certain degree of high availability. If Crawlab is to improve on this, it would need some way to elect a new master node after the current one goes down, to ensure high availability.

Node communication

If you look closely at the overall architecture diagram above, you may notice that there are two types of communication in Crawlab: one is synchronization messages (Sync via Msg), the other is task dispatching (Assign Tasks). These two kinds of communication can be called instant messaging and delayed communication respectively. They are described separately below.

Instant messaging

Instant messaging means that one node A sends information to another node B through some medium; depending on whether the communication is two-way, node B may reply to node A through the same medium.

Crawlab's instant messaging is implemented with Redis PubSub (see the figure below).

PubSub is simply a publish-subscribe model: a subscriber (Subscriber) subscribes (Subscribe) to a channel on Redis, and any other node can publish (Publish) messages to that channel as a publisher (Publisher).

In Crawlab, the master node subscribes to the nodes:master channel; if another node needs to send a message to the master node, it only needs to publish the message to nodes:master. Similarly, each worker node subscribes to its own channel nodes:<node_id> (node_id is the node's ID in MongoDB, a MongoDB ObjectId); to send a message to a worker node, you only need to publish the message to that channel.
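Below is a minimal Go sketch of this publish/subscribe pattern with the go-redis client; the channel name follows the nodes:master convention described above, but the message payload is an illustrative assumption.

```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redis/redis/v8"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Master side: subscribe to its own channel.
	sub := rdb.Subscribe(ctx, "nodes:master")
	defer sub.Close()
	// Wait until the subscription is confirmed before anyone publishes.
	if _, err := sub.Receive(ctx); err != nil {
		panic(err)
	}
	ch := sub.Channel()

	// Worker side: publish a message to the master's channel.
	if err := rdb.Publish(ctx, "nodes:master", `{"type":"register","node_id":"node-1"}`).Err(); err != nil {
		panic(err)
	}

	// Master receives the message from the subscription channel.
	msg := <-ch
	fmt.Printf("master received on %s: %s\n", msg.Channel, msg.Payload)
}
```

Publishing to a worker works the same way, only with the nodes:<node_id> channel as the target.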

The simplified process of a network request is as follows:

The client (the front-end application) sends a request to the master node (the API)

The master node publishes the message to the target worker node's channel (nodes:<node_id>) through Redis PubSub; the worker node, subscribed to its own channel, receives and handles the message, publishes the result back to nodes:master, and the master node then returns the result to the client
