This article mainly introduces how to use Docker Swarm to build a distributed crawler cluster. It has some reference value, and I hope you learn a lot from reading it.
During crawler development, you have surely run into situations where a crawler needs to be deployed on multiple servers. What do you do then? SSH into each server one by one, pull the code with git, and then run it? And when the code is modified, log in to every server again and update them in turn?
Sometimes the crawler only needs to run on one server, and sometimes it needs to run on 200 servers. How do you switch quickly? Log in to the servers one by one to switch them over? Or do you think it is clever to set a flag in Redis so that only the crawlers on the servers matching the flag actually run?
Crawler A has been deployed on all servers, and now you have written crawler B. Do you have to log in to each server and deploy it all over again?
If you really do things this way, you will regret not reading this article sooner. After reading it, you will be able to:
Deploy a new crawler to 50 servers in 2 minutes:
docker build -t localhost:8003/spider:0.01 .
docker push localhost:8003/spider:0.01
docker service create --name spider --replicas 50 --network host 45.77.138.242:8003/spider:0.01
Expand the crawler from 50 servers to 500 servers in 30 seconds:
docker service scale spider=500
Batch shutdown of crawlers on all servers within 30 seconds:
docker service scale spider=0
Batch update crawlers on all machines within 1 minute:
docker build -t localhost:8003/spider:0.02 .
docker push localhost:8003/spider:0.02
docker service update --image 45.77.138.242:8003/spider:0.02 spider
This article won't teach you how to use Docker, so make sure you have some Docker basics before you read this article.
What is Docker Swarm?
Docker Swarm is a cluster management module that comes with Docker. It can create and manage Docker clusters.
Environment building
This article will use 3 Ubuntu 18.04 servers for demonstration. The three servers are arranged as follows:
Master:45.77.138.242
Slave-1:199.247.30.74
Slave-2:95.179.143.21
Docker Swarm is a module based on Docker, so Docker must first be installed on all three servers. Once Docker is installed, all remaining operations are carried out through Docker.
Install Docker on Master
Install Docker on the Master server by executing the following commands in turn:
apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
apt-get update
apt-get install -y docker-ce
Create a Manager node
A Docker Swarm cluster requires Manager nodes. Now initialize the Master server as the Manager node of the cluster. Run the following command.
docker swarm init
After the run is complete, you can see the return result as shown in the following figure.
In this return result, a command is given:
docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv 45.77.138.242:2377
This command needs to be executed on each child node (Slave). Write this command down for now.
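If you lose this join command later, it is not a problem: Docker can print it again at any time. Running the following standard command on the Manager node (this tip is not from the original article) reprints the worker join command, token included:
# run on the Manager node; prints the join command for worker nodes again
docker swarm join-token worker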
After initialization, you get a Docker cluster with only one server. Execute the following command:
docker node ls
You can see the current status of the cluster, as shown in the following figure.
Create a private source (optional)
Creating a private source is not a required step. A private source is needed because the project's Docker image may involve company secrets and cannot be uploaded to a public platform such as DockerHub. If your image can be uploaded to DockerHub publicly, or if you already have a private image registry you can use, you can use it directly and skip this section and the next one.
The private source itself is also a Docker image. Pull it down first:
docker pull registry:latest
This is shown in the following figure.
Now start the private source:
docker run -d -p 8003:5000 --name registry -v /tmp/registry:/tmp/registry docker.io/registry:latest
This is shown in the following figure.
In the startup command, the exposed port is set to 8003, so the address of the private source is 45.77.138.242:8003.
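As a quick sanity check (not part of the original article), you can query the registry's catalog endpoint, which is part of the registry v2 HTTP API, to confirm the private source is responding. It should return a small JSON list of repositories:
# registry v2 API endpoint; returns {"repositories":[...]}
curl http://45.77.138.242:8003/v2/_catalog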
Tip:
The private source built this way uses HTTP and has no authentication mechanism, so if it is exposed to the public network, you need to use a firewall to set up an IP whitelist to keep the image data secure.
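A minimal sketch of such a whitelist, assuming ufw is the firewall on these Ubuntu 18.04 servers and that only the two Slave IPs need to pull images (this is my own illustration, not from the original article; adapt the addresses and rules to your own environment before enabling anything):
# hedged sketch: assumes ufw; allow SSH first so enabling the firewall does not lock you out
ufw allow ssh
# allow only the two slave servers to reach the registry port
ufw allow from 199.247.30.74 to any port 8003
ufw allow from 95.179.143.21 to any port 8003
# block the registry port for everyone else
ufw deny 8003
ufw enable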
Allow Docker to use a trusted HTTP private source (optional)
If you set up your own private source with the commands in the previous section, you need to configure Docker to trust it, because Docker does not allow HTTP private sources by default.
Use the following command to configure Docker:
Echo'{"insecure-registries": ["45.77.138.242 8003"]}'> > / etc/docker/daemon.json
Then restart docker using the following command.
systemctl restart docker
This is shown in the following figure.
After the restart is complete, the Manager node is configured.
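If you want to double-check that the setting took effect (an optional step, not in the original article), docker info lists the insecure registries it trusts; the exact output format can differ slightly between Docker versions:
# look for the "Insecure Registries" section in docker info
docker info | grep -A 3 "Insecure Registries"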
Create a child node initialization script
For Slave servers, there are only three things you need to do:
Install Docker
Join the cluster
Trust source
From then on, the rest is handled by Docker Swarm itself, and you no longer need to log in to the servers over SSH.
To simplify the operation, you can write a shell script to run in batches. Create an init.sh file under the Slave-1 and Slave-2 servers with the following contents.
apt-get update
apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"
apt-get update
apt-get install -y docker-ce
echo '{"insecure-registries": ["45.77.138.242:8003"]}' >> /etc/docker/daemon.json
systemctl restart docker
docker swarm join --token SWMTKN-1-0hqsajb64iynkg8ocp8uruktii5esuo4qiaxmqw2pddnkls9av-dfj7nf1x3vr5qcj4cqiusu4pv 45.77.138.242:2377
Make this file executable and then run it:
chmod +x init.sh
./init.sh
This is shown in the following figure.
After the script finishes running, you can log out of SSH on Slave-1 and Slave-2, and you will not need to log in to them again.
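If you prefer not to create and run init.sh by hand on each machine, a hedged alternative (my own addition) is to push and execute it over SSH in a loop, assuming you have root SSH access to both slaves from the machine holding the script:
# assumes root SSH access to both slave servers
for host in 199.247.30.74 95.179.143.21; do
    scp init.sh root@"$host":/root/init.sh
    ssh root@"$host" "bash /root/init.sh"
done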
Go back to the Master server and execute the following command to confirm that the cluster now has three nodes:
docker node ls
You can see that there are already three nodes in the cluster. This is shown in the following figure.
By now, the most complex and troublesome process is over. All that's left is to experience the convenience of Docker Swarm.
Create a test program
Build and test Redis
Since we need to simulate the running effect of a distributed crawler, we first use Docker to build a temporary Redis service:
Execute the following command on the Master server:
docker run -d --name redis -p 7891:6379 redis --requirepass "KingnameISHandSome8877"
This Redis uses port 7891, its password is KingnameISHandSome8877, and its IP is the IP address of the Master server.
Write a test program
Write a simple Python program:
import time
import redis

client = redis.Redis(host='45.77.138.242', port='7891', password='KingnameISHandSome8877')

while True:
    data = client.lpop('example:swarm:spider')
    if not data:
        break
    print(f'I now get the data: {data.decode()}')
    time.sleep(10)
This Python program reads one piece of data from Redis every 10 seconds and prints it out, and it exits once the list is empty.
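For the crawler to have anything to print, the list example:swarm:spider needs some test data. One way to seed it (my own addition, assuming the Redis container from the previous step is still named redis) is to use redis-cli inside that container:
# assumes the Redis container is named redis, as in the run command above
docker exec -it redis redis-cli -a "KingnameISHandSome8877" rpush example:swarm:spider 1 2 3 4 5 6 7 8 9 10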
Write Dockerfile
Write Dockerfile and create our own image based on the Python3.6 image:
FROM python:3.6
LABEL maintainer='[email protected]'
USER root
ENV PYTHONUNBUFFERED=0
ENV PYTHONIOENCODING=utf-8
RUN python3 -m pip install redis
COPY spider.py spider.py
CMD python3 spider.py
Build an image
After writing the Dockerfile, execute the following command to start building our own image:
docker build -t localhost:8003/spider:0.01 .
It should be noted here that since we want to upload this image to the private source so that it can be pulled down on the Slave servers, the image name needs to follow the format localhost:8003/<custom name>:<version number>. The custom name and version number can be changed to suit your situation. In the example in this article, I named it spider because it simulates a crawler program, and since this is the first build, the version number is 0.01.
The whole process is shown in the following figure.
Upload image to private source
After the image is built, you need to upload it to a private source. At this point, you need to execute the command:
docker push localhost:8003/spider:0.01
This is shown in the following figure.
Remember this build and upload command, and you need to use these two commands every time you update the code.
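Since these two commands are repeated for every release, you may want to wrap them in a tiny helper script. The following is only a convenience sketch; the script name and layout are my own, not from the original article:
#!/bin/bash
# hypothetical helper script, not from the original article
# usage: ./release.sh 0.02
VERSION="$1"
docker build -t localhost:8003/spider:"$VERSION" .
docker push localhost:8003/spider:"$VERSION"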
Create a service
What runs on Docker Swarm are services, so you need to use the docker service command to create a service.
docker service create --name spider --network host 45.77.138.242:8003/spider:0.01
This command creates a service named spider, which runs one container by default. The operation is shown in the following figure.
Of course, you can also start the service with many containers right away; you only need to add the --replicas parameter. For example, to run 50 containers as soon as the service is created:
docker service create --name spider --replicas 50 --network host 45.77.138.242:8003/spider:0.01
However, generally speaking, the initial code may have quite a few bugs, so it is recommended to run a single container first, observe the logs, and only scale up after confirming there are no problems.
Going back to the default case of one container, this container may currently be on any of the three machines. Check how the default container is doing by executing the following command:
docker service ps spider
This is shown in the following figure.
View Node Log
According to the execution result of the above figure, you can see that the ID of this running container is rusps0ofwids, so execute the following command to view the Log dynamically:
docker service logs -f <container ID>
At this point, the Log of this container is continuously tracked. This is shown in the following figure.
Horizontal expansion
Now only one server is running a container. I want to run the crawler on all three servers, so I only need to execute one command:
docker service scale spider=3
The running effect is shown in the following figure.
At this point, if you look at the operation of the crawler again, you can see that each of the three machines will run a container. This is shown in the following figure.
Now, let's log in to the slave-1 machine and see if there is really a task running. This is shown in the following figure.
You can see that there is indeed a container running on it. This is automatically assigned by Docker Swarm.
Now let's use the following command to forcibly stop Docker on Slave-1 and see what happens.
systemctl stop docker
Go back to the master server and see how the crawler works again, as shown in the following figure.
As you can see, after Docker Swarm detects that Slave-1 has gone offline, it automatically finds a new machine to start the task on, making sure there are always three tasks running. In this example, Docker Swarm automatically started two spider containers on the Master machine.
If the performance of the machine is good, you can even run a few more containers on each machine:
docker service scale spider=10
At this point, 10 containers will be launched to run the crawlers, and the 10 crawlers are isolated from each other.
What if you want all the crawlers to stop? Very simple, one command:
docker service scale spider=0
Then all the crawlers will stop.
View logs of multiple containers at the same time
What if you want to watch the logs of all the containers at the same time? You can view the latest 20 lines of logs for every container using the following command:
docker service ps spider | grep Running | awk '{print $1}' | xargs -i docker service logs --tail 20 {}
In this way, the logs will be displayed in order. This is shown in the following figure.
Update crawler
If your code has been modified, then you need to update the crawler.
First modify the code, rebuild, and resubmit the new image to the private source. This is shown in the following figure.
Next, you need to update the image in the service. There are two ways to update the image. One is to shut down all the crawlers first and then update:
docker service scale spider=0
docker service update --image 45.77.138.242:8003/spider:0.02 spider
docker service scale spider=3
The second is to execute the update command directly.
docker service update --image 45.77.138.242:8003/spider:0.02 spider
The difference between them is that when the update command is executed directly, the running containers are updated one by one.
The running effect is shown in the following figure.
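When there are many replicas, docker service update can also roll the update out in controlled batches. The flags below are standard Docker options; the batch size and delay are illustrative values of my own, not recommendations from the original article:
# update 5 containers at a time, pausing 10 seconds between batches (illustrative values)
docker service update --update-parallelism 5 --update-delay 10s --image 45.77.138.242:8003/spider:0.02 spider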
You can do more with Docker Swarm.
This article uses a simulated crawler as an example, but obviously any program that can run in batches can be run with Docker Swarm, whether it uses Redis or Celery to communicate, and whether it needs to communicate at all. As long as it can run in batches, Docker Swarm can run it.
In the same Swarm cluster, you can run multiple different services that do not affect each other. You really can build a Docker Swarm cluster once and then stop worrying about it: all subsequent operations only need to be run on the server where the Manager node is located.
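For example, a second, completely independent crawler could run in the same cluster alongside spider. The service and image names below (spider-b) are hypothetical and assume you have built and pushed that image to the same private source:
# spider-b is a hypothetical second service; build and push its image the same way as spider first
docker service create --name spider-b --replicas 10 --network host 45.77.138.242:8003/spider-b:0.01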
Thank you for reading this article carefully. I hope this article on how to use Docker Swarm to build a distributed crawler cluster has been helpful to you.