
Zookeeper cross-region high availability scheme

2025-04-04 Update From: SLTechnology News&Howtos

Recently, due to business requirements, we have been testing the high availability of various components. Our environment is deployed in AWS Beijing, which has only two Availability Zones (AZs).

Note: the same approach applies when two data centers need to act as disaster-recovery sites for each other, which is the scenario tested in this article.

A ZooKeeper ensemble needs an odd number of nodes (at least 3), and more than half of them must be alive for the service to function normally.

With only two AZs, no matter how you distribute the nodes, one AZ will always hold more than half of them; if that AZ fails, the whole ZooKeeper ensemble becomes unavailable.
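The quorum arithmetic behind this can be sketched in a few lines of shell (the 3/2 split below is just one illustrative placement; any split across two AZs has the same problem):

```shell
# Quorum math for a ZooKeeper ensemble of N nodes: a majority of
# floor(N/2)+1 nodes must be alive for the service to keep working.
N=5
QUORUM=$((N / 2 + 1))
echo "ensemble=$N quorum=$QUORUM"

# With only two AZs, one AZ necessarily holds at least ceil(N/2) nodes.
# Example 3/2 split: losing the 3-node AZ leaves 2 survivors < quorum.
AZ1=3
AZ2=$((N - AZ1))
SURVIVORS=$AZ2
if [ "$SURVIVORS" -lt "$QUORUM" ]; then
    echo "AZ1 failure leaves $SURVIVORS nodes: quorum lost"
fi
```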

So the question becomes: after an AZ goes down, how do we intervene as quickly as possible to restore the service?

The plan: prepare two standby machines with the software installed and parameters configured. If availability zone 1 goes down completely, manually start the two standby nodes so that availability zone 2 holds more than half of the ZooKeeper nodes. The ZooKeeper service can then be restored in availability zone 2.

Refer to the following figure:

Can the above ideas be realized?

Then let's test it today.

1. A total of five machines were prepared for testing.

2. Download and install Zookeeper.

2.1 Official ZooKeeper download address:

https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/

2.2 Download the software:

wget https://mirrors.tuna.tsinghua.edu.cn/apache/zookeeper/zookeeper-3.4.14/zookeeper-3.4.14.tar.gz

2.3 For detailed ZooKeeper installation steps, please refer to:

https://blog.51cto.com/hsbxxl/1971241

2.4 Configuration of zoo.cfg:

# cat zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.1=172.31.9.73:2888:3888
server.2=172.31.20.233:2888:3888
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
server.5=172.31.16.33:2888:3888

2.5 Create the data and log folders according to zoo.cfg:

mkdir -p /data/zookeeper/data
mkdir -p /data/zookeeper/log

2.6 Set the myid file according to the node number:

echo 1 > /data/zookeeper/data/myid
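Steps 2.5 and 2.6 can be combined into one small per-node script. This is a sketch: MYID is the only value that changes from machine to machine, and it must match that machine's server.N entry in zoo.cfg.

```shell
# Create the data/log directories from zoo.cfg and write this node's myid.
# MYID must match this machine's server.N line in zoo.cfg (1..5 here).
MYID=1
DATA_DIR=/data/zookeeper/data
LOG_DIR=/data/zookeeper/log

mkdir -p "$DATA_DIR" "$LOG_DIR"
echo "$MYID" > "$DATA_DIR/myid"
echo "myid on this node: $(cat "$DATA_DIR/myid")"
```

Run it once on each node with MYID adjusted (1 on server.1, 2 on server.2, and so on).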

3. A total of 5 EC2 instances have been prepared for testing, with ZooKeeper installed on each.

But only three machines are started; the other two are kept as standby.

As you can see in the following figure, three machines have already started ZooKeeper.

Note: during startup, at least three ZooKeeper nodes must be running for the cluster to work properly.

4. Next, I shut down the machines one by one to observe ZooKeeper's status.

Currently the leader is on zk3. Let's shut down zk1, then zk3, and see whether the Leader floats over to zk2.

4.1 execute kill on zk1 to kill the process

[root@ip-172-31-9-73]# jps
12438 Jps
7545 QuorumPeerMain
[root@ip-172-31-9-73]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /root/zookeeper-3.4.14/bin/../conf/zoo.cfg
Mode: follower
[root@ip-172-31-9-73]# kill -9 7545

Connect to zk3 via zkCli from zk5 and query the data.

After killing the process on zk1, zk2 and zk3 should theoretically still be alive, but the zkCli connection starts reporting errors.

[root@ip-172-31-16-33 bin]# ./zkCli.sh -server 172.31.26.111:2181
Connecting to 172.31.26.111:2181
...
[zk: 172.31.26.111:2181(CONNECTED) 0] ls /
[zk-permanent, zookeeper, test]
[zk: 172.31.26.111:2181(CONNECTED) 1] 2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1158] - Unable to read additional data from server sessionid 0x30000c504530000, likely server has closed socket, closing socket connection and attempting reconnect
2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1025] - Opening socket connection to server ip-172-31-26-111.cn-north-1.compute.internal/172.31.26.111:2181. Will not attempt to authenticate using SASL (unknown error)
2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@879] - Socket connection established to ip-172-31-26-111.cn-north-1.compute.internal/172.31.26.111:2181, initiating session
2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1158] - Unable to read additional data from server sessionid 0x30000c504530000, likely server has closed socket, closing socket connection and attempting reconnect

We then kill the process on zk3 as well, leaving only the process on zk2. But we cannot confirm whether zk2 is Leader or Follower, or whether it still has its data.

[root@ip-172-31-26-111 bin]# jps
4183 QuorumPeerMain
4648 Jps
[root@ip-172-31-26-111 bin]# kill -9 4183
[root@ip-172-31-26-111 bin]# jps
4658 Jps

4.4 After the process on zk3 is killed, the client no longer just reports the error above; the connection is refused outright.

[root@ip-172-31-16-33 bin]# ./zkCli.sh -server 172.31.26.111:2181
Connecting to 172.31.26.111:2181
Welcome to ZooKeeper!
2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1025] - Opening socket connection to server ip-172-31-26-111.cn-north-1.compute.internal/172.31.26.111:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1162] - Socket error occurred: ip-172-31-26-111.cn-north-1.compute.internal/172.31.26.111:2181: Connection refused
[zk: 172.31.26.111:2181(CONNECTING) 0] 2019-06-23 ... [myid:] - INFO [main-SendThread(ip-172-31-26-111.cn-north-1.compute.internal:2181):ClientCnxn$SendThread@1162] - Socket error occurred: ip-172-31-26-111.cn-north-1.compute.internal/172.31.26.111:2181: Connection refused

4.5 You can see that the process on zk2 is still there.

# jps
5155 QuorumPeerMain
5211 Jps

4.6 You can also verify that port 2181 on zk2 is still serving with the following command:

# echo ruok | nc localhost 2181
imok

However, no other command produces normal output; only echo ruok | nc localhost 2181 returns imok.

# echo ruok | nc 172.31.16.33 2181
imok
[root@ip-172-31-16-33 bin]# echo conf | nc 172.31.16.33 2181
This ZooKeeper instance is not currently serving requests
# echo dump | nc 172.31.16.33 2181
This ZooKeeper instance is not currently serving requests

4.8 ZooKeeper four-letter commands

conf: Outputs details of the server configuration.

cons: Lists full connection/session details for all clients connected to the server, including packets received/sent, session id, operation latency, last operation performed, and so on.

dump: Lists outstanding sessions and ephemeral nodes.

envi: Outputs details of the serving environment (as distinct from the conf command).

reqs: Lists outstanding requests.

ruok: Tests whether the server is running in a non-error state. If so, it returns "imok"; otherwise there is no response at all.

stat: Outputs statistics about performance and connected clients.

wchs: Lists brief details of the server's watches.

wchc: Lists the server's watches by session; the output is a list of sessions with their associated watches.

wchp: Lists the server's watches by path; the output is a list of paths with their associated sessions.
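All of these four-letter words are issued the same way, so a small helper function makes ad-hoc health checks less error-prone. This is a sketch: the host/port defaults are assumptions, and the live nc call in the loop is left commented out since it needs a reachable ensemble.

```shell
# zk4lw: send a ZooKeeper four-letter word to a server via nc.
zk4lw() {
    local word="$1" host="${2:-localhost}" port="${3:-2181}"
    echo "$word" | nc "$host" "$port"
}

# Typical health-check sweep; uncomment the zk4lw line against a live server.
words=""
for w in ruok conf stat; do
    words="$words $w"
    echo "--- $w ---"
    # zk4lw "$w" 172.31.20.233
done
```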

4.9 Under normal circumstances, the commands above produce output:

# echo dump | nc 172.31.20.233 2181

SessionTracker dump:
org.apache.zookeeper.server.quorum.LearnerSessionTracker@77714302
ephemeral nodes dump:
Sessions with Ephemerals (0):

# echo conf | nc 172.31.20.233 2181

clientPort=2181
dataDir=/data/zookeeper/data/version-2
dataLogDir=/data/zookeeper/log/version-2
tickTime=2000
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
serverId=2
initLimit=10
syncLimit=5
electionAlg=3
electionPort=3888
quorumPort=2888
peerType=0

# echo envi | nc 172.31.20.233 2181

Environment:
zookeeper.version=3.4.14-4c25d480e66aadd371de8bd2fd8da255ac140bcf, built on ... 2019 16:18 GMT
host.name=ip-172-31-20-233.cn-north-1.compute.internal
java.version=1.8.0_212
java.vendor=Oracle Corporation
java.home=/usr/java/jdk1.8.0_212-amd64/jre
java.class.path=/root/zookeeper-3.4.14/bin/../zookeeper-server/target/classes:/root/zookeeper-3.4.14/bin/../build/classes:... (classpath truncated)
java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.io.tmpdir=/tmp
java.compiler=<NA>
os.name=Linux
os.arch=amd64
os.version=4.14.123-86.109.amzn1.x86_64
user.name=root
user.home=/root
user.dir=/root/zookeeper-3.4.14/bin

5. At this point, I start the other two standby nodes, zk4 and zk5. Both nodes are being started for the first time.

6. When you connect to the zookeeper again, you can see that at least the data is not lost.

[root@ip-172-31-16-33 bin]# ./zkCli.sh -server 172.31.16.33:2181
Connecting to 172.31.16.33:2181
...
[zk: 172.31.16.33:2181(CONNECTED) 0] ls /
[zk-permanent, zookeeper, test]

7. The tests above achieve the expected result. One small puzzle remains: we started three nodes, so why did killing just one of them break the cluster, leaving the remaining two unable to serve?

In fact, a "take it for granted" assumption was hiding here.

We assumed the cluster consisted of only the three started nodes; ZooKeeper actually counts five members. Why?

How does ZooKeeper know how many nodes are in the cluster? Certainly not by assumption; a configuration file must tell it. ZooKeeper has only two configuration files, zoo.cfg and myid.

So only zoo.cfg can affect the member count.

8. After making the following change to zoo.cfg, starting only 3 nodes and then shutting one of them down still leaves the cluster operating normally.

Comment out server.2 and server.5:

# cat zoo.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.1=172.31.9.73:2888:3888
#server.2=172.31.20.233:2888:3888
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
#server.5=172.31.16.33:2888:3888
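This edit can be applied mechanically with sed. A sketch, working on a throwaway copy of zoo.cfg (the /tmp path and the trimmed file contents are for illustration only):

```shell
# Comment out server.2 and server.5 in a copy of zoo.cfg, shrinking the
# ensemble that the surviving nodes believe they belong to.
CFG=/tmp/zoo.cfg
cat > "$CFG" <<'EOF'
tickTime=2000
server.1=172.31.9.73:2888:3888
server.2=172.31.20.233:2888:3888
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
server.5=172.31.16.33:2888:3888
EOF

# '&' in the replacement stands for the whole matched prefix.
sed -i -e 's/^server\.2=/#&/' -e 's/^server\.5=/#&/' "$CFG"
grep '^#server' "$CFG"
```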

9. After shutting down server4, server1 and server3 are still alive.

[root@ip-172-31-26-111]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /root/zookeeper-3.4.14/bin/../conf/zoo.cfg
Mode: leader
[root@ip-172-31-9-73]# zkServer.sh status
ZooKeeper JMX enabled by default
Using config: /root/zookeeper-3.4.14/bin/../conf/zoo.cfg
Mode: follower

10. To sum up: with two AZs, how do we recover quickly when the AZ holding the majority of ZooKeeper nodes suffers a disaster?

(Suppose Server1/Server2 are in AZ 1, and Server3/Server4/Server5 are in AZ 2.)

10.1. In the AZ with the smaller number of ZooKeeper nodes, prepare 2 more EC2 instances with ZooKeeper configured, kept shut down until needed. The zoo.cfg on Server4/Server5 is as follows:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
server.5=172.31.16.33:2888:3888

10.2. The running nodes, Server1/Server2/Server3, are configured as follows:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.1=172.31.9.73:2888:3888
server.2=172.31.20.233:2888:3888
server.3=172.31.26.111:2888:3888

10.3. In the event of a disaster, when AZ 1 (where Server1/Server2 live) goes down, you must intervene manually: change Server3's configuration to the following, restart the ZooKeeper service on Server3, and then start Server4/Server5. Be sure to start Server3 first; the order matters.

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
server.5=172.31.16.33:2888:3888
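The manual intervention in 10.3 can be packaged as a single script so on-call staff do less under pressure. A sketch under this article's IPs: the /tmp target path is illustrative (on Server3 it would be the real conf/zoo.cfg), and the zkServer.sh calls are commented out since they require the live hosts.

```shell
# Disaster failover for AZ 2: rewrite Server3's zoo.cfg to the 3-node
# recovery ensemble (server.3/4/5), then restart zk3 and start zk4/zk5.
CFG=/tmp/zoo.cfg.recovery   # on Server3 this would be the real conf/zoo.cfg
cat > "$CFG" <<'EOF'
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
clientPort=2181
autopurge.snapRetainCount=3
autopurge.purgeInterval=6
server.3=172.31.26.111:2888:3888
server.4=172.31.17.68:2888:3888
server.5=172.31.16.33:2888:3888
EOF

# Order matters: restart Server3 first, then bring up the standbys.
# zkServer.sh restart        # on Server3
# zkServer.sh start          # on Server4, then on Server5
grep -c '^server\.' "$CFG"
```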

10.4 Daily operating status

10.5 check the znode information that has been created

./zkCli.sh -server 172.31.16.33:2181 ls /
Connecting to 172.31.16.33:2181
[zk-permanent, zookeeper, test]

10.6 Shut down Server1/Server2, paying attention to the order: stop the followers first. If you stop the leader first, a leader switch will occur. What we expect is that Server3 ends up surviving as a follower.

11. In the end, the test results show everything going in the direction we "took for granted".

12. Finally, verify that the znode data in ZooKeeper still exists.

./zkCli.sh -server 172.31.16.33:2181 ls /
Connecting to 172.31.16.33:2181
[zk-permanent, zookeeper, test]

13. In fact, the data has been in this path all along; as long as one node survives, the data is preserved.

# ls /data/zookeeper/data/
myid  version-2  zookeeper_server.pid

Note: be sure the following two paths on Server4/Server5 are empty before starting them; otherwise Server4/Server5 will pick up stale state from before.

/data/zookeeper/data/version-2
/data/zookeeper/log/version-2

14. At this point we can see that all of ZooKeeper's data lives under the following two paths. If you need a backup, you can simply copy them at the OS level.

dataDir=/data/zookeeper/data
dataLogDir=/data/zookeeper/log
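Since everything lives under these two directories, an OS-level backup is just an archive of both. A minimal sketch; the /backup target path and timestamp format are assumptions, and on a busy ensemble you would normally snapshot a stopped or quiesced node:

```shell
# Archive ZooKeeper's dataDir and dataLogDir into a dated tarball.
DATA_DIR=/data/zookeeper/data
LOG_DIR=/data/zookeeper/log
BACKUP_DIR=/backup/zookeeper          # illustrative target path
STAMP=$(date +%Y%m%d-%H%M%S)

mkdir -p "$DATA_DIR" "$LOG_DIR" "$BACKUP_DIR"
tar -czf "$BACKUP_DIR/zk-$STAMP.tar.gz" "$DATA_DIR" "$LOG_DIR"
ls "$BACKUP_DIR"
```

Restoring on the disaster-recovery side is the reverse: stop ZooKeeper, empty the two directories, and unpack the archive.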

This suggests a further idea: what about cross-Region high availability for ZooKeeper, from Beijing (primary environment) to Ningxia (disaster-recovery environment)?

We can consider backing up the ZooKeeper data files in Beijing regularly and importing them into the Ningxia environment.

Specific steps:
