This article walks through the essential knowledge points of the distributed coordination service component Zookeeper. The content is concise and easy to follow.
Introduction to Zookeeper
Zookeeper is an open-source distributed coordination service component. It acts as the monitor of a distributed cluster and gives timely feedback on state changes of the nodes in the cluster.
Zookeeper provides the following distributed consistency guarantees:
Sequential consistency: operations from a client are applied on the servers in the order in which the client sent them.
Atomicity: each client operation either succeeds everywhere or fails everywhere.
Single system image: no matter which server the client connects to, it sees a consistent view.
Reliability: once a client operation succeeds, its result persists until it is changed by a later operation.
Eventual consistency: the view seen by the client is guaranteed to become up to date within a bounded time (within that window the client may read stale data).
By default, Zookeeper only guarantees that the client receives the sequentially consistent view of the server it is connected to. By issuing a sync (SyncCommand) before reading, the client can ensure that it sees the latest, strongly consistent view of the whole cluster at that moment.
Note: eventual consistency implies a window of time during which the view seen by the client is not necessarily the latest data.
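For illustration, here is a minimal sketch of the sync-then-read pattern described above using the standard Java client; the connection string and the /app/config path are placeholders, not something from this article:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; the session watcher is ignored for brevity.
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> { });
        CountDownLatch synced = new CountDownLatch(1);

        // sync() asks the connected server to catch up with the leader for this path.
        zk.sync("/app/config", (rc, path, ctx) -> synced.countDown(), null);
        synced.await();

        // This read now reflects the latest state committed by the cluster.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```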
Zookeeper commonly used API (zkCli commands)

```
ZooKeeper -server host:port [-client-configuration properties-file] cmd args
	addWatch [-m mode] path        # mode is one of [PERSISTENT, PERSISTENT_RECURSIVE], default PERSISTENT_RECURSIVE
	addauth scheme auth
	close
	config [-c] [-w] [-s]
	connect host:port
	create [-s] [-c] [-t ttl] path [data] [acl]
	delete [-v version] path
	deleteall path [-b batch size]
	delquota [-n|-b] path
	get [-s] [-w] path
	getAcl [-s] path
	getAllChildrenNumber path
	getEphemerals path
	history
	listquota path
	ls [-s] [-w] [-R] path
	printwatches on|off
	quit
	reconfig [-s] [-v version] [[-file path] | [-members serverID=host:port1:port2;port3[,...]*]] | [-add serverId=host:port1:port2;port3[,...]]* [-remove serverId[,...]*]
	redo cmdno
	removewatches path [-c|-d|-a] [-l]
	set [-s] [-v version] path data
	setAcl [-s] [-v version] [-R] path acl
	setquota -n|-b val path
	stat [-w] path
	sync path
	version
	whoami
```

Zookeeper data model
Zookeeper data is a hierarchical namespace stored in memory, similar to a multi-level file directory structure, except that each node in the namespace can also carry data. Node names must not use the following characters: \u0000, \u0001-\u001F, \u007F, \ud800-\uF8FF, \uFFF0-\uFFFF, nor the reserved word "zookeeper". Paths are always absolute (there are no relative paths) and are hierarchical, with the slash as the path separator.
```
/
|-- /meta1
|   |-- /meta1/node1
|   |-- /meta1/node2
|   |-- /meta1/node3/child
|   |-- /meta1/node4
|-- /meta2
    |-- /meta2/node1
    |-- /meta2/node2/child
    |-- /meta2/node3
    |-- /meta2/node4
```
Zookeeper is written in Java. Each node is called a Znode; the tree storage structure is the DataTree object, and the nodes themselves are stored in a ConcurrentHashMap (path to DataNode) named nodes.
Zookeeper data node types
Zookeeper supports four types of nodes, and nodes can only be created level by level (a child cannot be created before its parent exists). v3.5.3+ supports setting a TTL on persistent nodes.
Ephemeral node (EPHEMERAL): the node's life cycle is bound to the client session; it is deleted automatically when the session ends. Ephemeral nodes cannot have child nodes.
Persistent node (PERSISTENT): persisted to disk and not lost on restart; it must be deleted explicitly with the delete command.
Container node (CONTAINER): deleted automatically once all of its child nodes have been deleted.
TTL node (TTL): adds a time limit to a persistent node. This feature is disabled by default and is enabled with the zookeeper.extendedTypesEnabled property.
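As a hedged illustration of the four node types, the following Java-client sketch creates one node of each kind; the paths and connection string are placeholders, and the TTL node only works if zookeeper.extendedTypesEnabled=true is set on the servers:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class NodeTypesExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> { });
        byte[] data = "demo".getBytes();

        // Persistent node: survives restarts until explicitly deleted.
        zk.create("/demo-persistent", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral node: removed when this session ends; cannot have children.
        zk.create("/demo-ephemeral", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Container node: removed automatically once all of its children are gone.
        zk.create("/demo-container", data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.CONTAINER);

        // TTL node: requires zookeeper.extendedTypesEnabled=true on the servers.
        zk.create("/demo-ttl", data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.PERSISTENT_WITH_TTL, null, 60_000L); // TTL in milliseconds

        zk.close();
    }
}
```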
Number of Zookeeper cluster nodes
Zookeeper's distributed coordination follows the ZAB protocol, in which a decision takes effect only with more than half of the votes, so the cluster size should follow 2N+1. A minimal cluster therefore has 3 nodes and tolerates at most 1 node down; otherwise the cluster becomes unavailable (a majority can never be reached again, and Zookeeper will not fall back to standalone operation, it simply reports errors). In terms of the CAP theorem, this means Zookeeper guarantees CP: consistency and partition tolerance. Note: a Zookeeper cluster can technically run with only 2 machines, but then the cluster becomes unavailable as soon as either of them goes down.
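The majority rule can be summarized in a tiny sketch (illustrative only, not ZooKeeper source):

```java
public class QuorumCheck {
    // With 2N+1 servers, a quorum needs strictly more than half of the cluster.
    static boolean hasQuorum(int aliveOrAcked, int clusterSize) {
        return aliveOrAcked > clusterSize / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasQuorum(2, 3)); // true: a 3-node cluster survives 1 failure
        System.out.println(hasQuorum(1, 2)); // false: a 2-node cluster cannot survive any failure
        System.out.println(hasQuorum(3, 5)); // true: a 5-node cluster survives 2 failures
    }
}
```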
Roles and states of Zookeeper cluster nodes
When Zookeeper starts, a thread is launched to poll the server state, and each cluster node is always in exactly one of the following states:
LOOKING: there is no Leader in the current cluster, and all nodes allowed to participate in the election are electing a Leader.
LEADING: the node is the Leader of the cluster.
The Leader role serves both read and write views to the outside. Only the Leader can execute transactions (create, delete, update); the other nodes must forward transaction requests to the Leader.
FOLLOWING: the node is a Follower, which means some other node in the cluster is already the Leader.
The Follower role synchronizes view data, forwards transaction requests to the Leader and serves reads; in the Zookeeper configuration this corresponds to the role participant.
OBSERVING: the node is an Observer in the cluster.
The Observer role mainly extends the cluster's non-transactional (read) capacity: it serves reads and forwards transaction requests to the Leader, but has no voting rights.
Observers and Followers are collectively called Learners.
Two-phase commit (2PC) in Zookeeper
Two-phase commit splits a transactional operation into two phases to keep data consistent across the distributed system.
Phase 1: the Leader node sends a proposal asking all Follower nodes in the cluster whether the transaction request can be executed, and each Follower node acknowledges the proposal with an ACK.
If the decision fails (no majority of ACKs), the transaction is rolled back.
Transactions are generally implemented with log files, and Zookeeper is no exception: in the first phase the transaction request is written to the log and then the proposal is sent out; in the second phase, after receiving the Followers' confirmations, the in-memory database is updated according to the previously written log, which completes the transaction.
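Below is a minimal, self-contained sketch of the two-phase flow described above. It is not ZooKeeper source code (ZAB itself does not send explicit aborts); the Follower interface and method names are purely illustrative:

```java
import java.util.List;

public class TwoPhaseCommitSketch {
    interface Follower {
        boolean ack(long zxid, byte[] proposal); // phase 1: log the proposal, then acknowledge
        void commit(long zxid);                  // phase 2: apply the logged proposal
        void abort(long zxid);                   // roll back if no quorum was reached
    }

    static boolean propose(long zxid, byte[] txn, List<Follower> followers) {
        int acks = 1; // the leader counts its own (already logged) proposal
        for (Follower f : followers) {
            if (f.ack(zxid, txn)) {
                acks++;
            }
        }
        // The decision succeeds only if more than half of all servers acknowledged.
        boolean quorum = acks > (followers.size() + 1) / 2;
        for (Follower f : followers) {
            if (quorum) {
                f.commit(zxid);
            } else {
                f.abort(zxid);
            }
        }
        return quorum;
    }
}
```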
Zookeeper's ZAB (ZooKeeper Atomic Broadcast) protocol
ZAB has two main running states: when the Leader is lost, the cluster enters crash recovery and performs a Leader election; after the election it enters atomic broadcast, in which the Leader node broadcasts transaction requests to the Follower nodes.
Leader Election of Zookeeper
A Zookeeper cluster needs one node with the Leader role. When there is no Leader, one must be elected within the cluster. This happens in two situations: (1) when the cluster starts for the first time, and (2) when the current Leader becomes unreachable at runtime. The election process is the same in both cases; the only difference is that in the first case no Leader has ever existed, while in the second the Leader is discovered to be missing at runtime.
When a Zookeeper node starts, its default state is LOOKING, meaning there is no Leader in the cluster and a Leader election is required. State monitoring is an endless loop that starts with the service and ends when the service stops, so the first startup immediately triggers a Leader election according to the election algorithm. In v3.5+ there is only one election algorithm: FastLeaderElection.
A Zookeeper ballot carries four very important values: myid (server ID), zxid (transaction ID), peerEpoch (Leader epoch) and electionEpoch (election round).
myid is the unique server ID of the cluster node, configured in the configuration file; it identifies the node.
zxid is the ID of the transactions logged on the node. It is a 64-bit long in which the high 32 bits are the epoch and the low 32 bits are an auto-incrementing counter.
epoch (peerEpoch) indicates the current Leader generation in the cluster; it increases by 1 with every Leader change and forms the high 32 bits of the zxid.
electionEpoch indicates which election round the node has reached since it started; a single Leader election can go through multiple rounds.
The zxid composition and the ballot comparison are roughly as follows (reconstructed from ZxidUtils and FastLeaderElection):

```java
// ZxidUtils: the high 32 bits of a zxid are the epoch, the low 32 bits are the counter.
public static long makeZxid(long epoch, long counter) {
    return (epoch << 32L) | (counter & 0xffffffffL);
}

// FastLeaderElection#totalOrderPredicate: a received vote beats the current one if it has
// a higher epoch, or the same epoch and a higher zxid, or the same zxid and a higher server id.
return (newEpoch > curEpoch)
        || ((newEpoch == curEpoch)
            && ((newZxid > curZxid)
                || ((newZxid == curZxid) && (newId > curId))));
```
If another node's electionEpoch is smaller than the current node's electionEpoch, that node has missed a voting round: its earlier ballot is out of date, so the outdated ballot is discarded and not counted (the current node has already voted in a newer round, so there is no need to vote again).
If another node's electionEpoch equals the current node's electionEpoch, both nodes are voting in the same round, and the zxid and myid in the current ballot are compared with those in the received ballot. The node with the larger values has the more complete data, so the ballot is updated to vote for the node with the more complete data (meaning the node voted for previously does not hold the most complete data in the cluster, and a new vote is required).
The Leader election selects the node with the most complete data, so the election loop does not run forever: it terminates once the vote tally shows more than half of the cluster agreeing. Concretely, after voting for the node believed to have the most complete data, the votes received from the cluster are recorded per server ID; once more than half of the nodes have voted for the same node, that node is deemed qualified to become the Leader, and the election ends.
Data synchronization of Zookeeper
After the Leader is elected in a Zookeeper cluster, the Leader node enters the LEADING state, the Follower nodes enter the FOLLOWING state and start the Leader-confirmation and synchronization process, and the Observer nodes enter the OBSERVING state, also preparing for synchronization.
Leader node
The newly elected Leader node first opens a network listener for Learner connections and waits for the Follower and Observer nodes to connect.

```java
cnxAcceptor = new LearnerCnxAcceptor(); // connection-accepting thread
cnxAcceptor.start();
```

This network connection is essential: the Leader builds a LearnerCnxAcceptorHandler processor for incoming connections, whose core flow is as follows.

```java
BufferedInputStream is = new BufferedInputStream(socket.getInputStream());
LearnerHandler fh = new LearnerHandler(socket, is, Leader.this);
fh.start();
```

This LearnerHandler handles all network I/O between the Leader and that Learner.

The message packet exchanged between them:

```java
public class QuorumPacket implements Record {
    private int type;
    private long zxid;
    private byte[] data;
    private java.util.List<org.apache.zookeeper.data.Id> authinfo;
}
```

1. Learner node
After entering the FOLLOWING state, the Follower node first looks up the Leader's address and then establishes a network connection to it.

```java
QuorumServer leaderServer = findLeader();
connectToLeader(leaderServer.addr, leaderServer.hostname);
```

Once the connection is established, it sends a packet with type=FOLLOWERINFO, zxid=ZxidUtils.makeZxid(self.getAcceptedEpoch(), 0) and data=LearnerInfo.

The main purpose is to send the current node's epoch: every connecting Learner sends this INFO packet so that the Leader learns the largest epoch in the cluster and can confirm its new term as lastAcceptedEpoch + 1. The Leader blocks waiting for Follower connections until more than half of them have reported, at which point the maximum epoch is known.

```java
long newEpochZxid = registerWithLeader(Leader.FOLLOWERINFO);
```

After the packet is sent, the Follower waits for the Leader's instructions.
2. Leader node

The FOLLOWERINFO packet sent by the Follower is received in LearnerHandler and deserialized into LearnerInfo data, from which the epoch carried in the Follower's zxid is extracted:

```java
long lastAcceptedEpoch = ZxidUtils.getEpochFromZxid(qp.getZxid());
```

This epoch then takes part in electing the new Leader epoch:

```java
long newEpoch = learnerMaster.getEpochToPropose(this.getSid(), lastAcceptedEpoch);
```

After receiving epochs from more than half of the Followers, the Leader sends the final epoch in a confirmation packet:

```java
QuorumPacket newEpochPacket = new QuorumPacket(Leader.LEADERINFO, newLeaderZxid, ver, null);
```

and then waits for the Followers to acknowledge the final epoch:

```java
learnerMaster.waitForEpochAck(this.getSid(), ss);
```

3. Learner node
The Follower, which was blocked in registerWithLeader after sending its INFO packet, now receives the Leader's instruction containing the Leader's epoch and zxid, and sends a confirmation packet:

```java
QuorumPacket ackNewEpoch = new QuorumPacket(Leader.ACKEPOCH, lastLoggedZxid, epochBytes, null);
```

Having received the zxid from the Leader, the Follower sets its own state to synchronization, starts the data synchronization process and waits for the Leader's sync data.

```java
self.setZabState(QuorumPeer.ZabState.SYNCHRONIZATION);
syncWithLeader(newEpochZxid);
```

4. Leader node
After receiving enough epoch acknowledgements from the Followers, the Leader sets its own state to data synchronization, starts synchronizing data with each Learner, blocks until the Learners have finished synchronizing, and then proceeds with its own Leading phase.

```java
waitForNewLeaderAck(self.getId(), zk.getZxid());
```

Data synchronization happens inside the LearnerHandler created for each Learner connection, and starts as soon as the Learner is confirmed to have received the latest epoch message.

```java
boolean needSnap = syncFollower(peerLastZxid, learnerMaster);
```

The synchronization process uses a reentrant read-write lock (ReentrantReadWriteLock) and reads the Leader's lastProcessedZxid, minCommittedLog and maxCommittedLog.

Synchronization types: DIFF (differential synchronization), TRUNC (rollback synchronization) and SNAP (full synchronization). The common case is TRUNC+DIFF: transactions that exist in the Learner's log but not in the Leader's are rolled back, and transactions the Learner is missing are received from the Leader and replayed.

System property: when zookeeper.forceSnapshotSync is configured to force full synchronization, full synchronization using snapshot files is always used.

Whether the Learner or the Leader has more transaction log entries is determined by comparing the zxids of the two nodes. Transaction logs are flushed to disk and rolled into snapshots, and each node also caches a certain number of committed transactions, bounded by minCommittedLog and maxCommittedLog. Data within this range can be sent directly; otherwise the snapshot file has to be read from disk and transferred.

System property: zookeeper.snapshotSizeFactor configures the fraction of snapshot data read from disk relative to the total snapshot size; the default is 0.33, i.e. about 1/3.

System property: zookeeper.commitLogCount configures how many committed transactions a node caches; the default is 500.
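A simplified sketch of how the synchronization type could be chosen from these zxid ranges (illustrative only; the real logic lives in LearnerHandler#syncFollower and handles more cases):

```java
public class SyncTypeSketch {
    enum SyncType { DIFF, TRUNC, SNAP }

    static SyncType chooseSyncType(long peerLastZxid, long lastProcessedZxid,
                                   long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid == lastProcessedZxid) {
            return SyncType.DIFF;   // already up to date: an empty diff is enough
        } else if (peerLastZxid > lastProcessedZxid) {
            return SyncType.TRUNC;  // learner is ahead of the leader: truncate its extra log
        } else if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
            return SyncType.DIFF;   // within the cached commit log: replay the missing proposals
        } else {
            return SyncType.SNAP;   // too far behind: transfer a full snapshot
        }
    }
}
```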
5. Learner node

The Follower node blocks in syncWithLeader waiting for the Leader's synchronization data. The Leader sends one of three synchronization types: DIFF, TRUNC or SNAP. With DIFF, the missing proposals are sent one by one and the Follower applies each one as it is received. With TRUNC, the Follower truncates its log files directly to the specified position. With SNAP, the Follower clears its dataTree and rebuilds it by deserializing the transferred snapshot data. After processing, the Follower sends a confirmation and formally begins serving clients. Once the Leader receives the Followers' end-of-sync acknowledgements, it leaves waitForNewLeaderAck and also formally begins serving clients.
Watch Mechanism of Zookeeper
Zookeeper allows a client to register a Watch on a data node (Znode) and reports back to the client promptly when the node changes. The Watch mechanism is the foundation of Zookeeper-based distributed locks and of Leader election for external systems.
Watch types
Before v3.6.0: a Watch is registered once and removed after it fires; repeated monitoring requires repeated registration.
v3.6.0+: the listening mode can be configured as STANDARD, PERSISTENT or PERSISTENT_RECURSIVE. STANDARD is the one-shot registration of earlier versions; PERSISTENT is automatically re-registered; PERSISTENT_RECURSIVE is an enhanced persistent mode that watches not only the specified node but also all of its descendants. The latter two modes keep firing until the Watch is explicitly removed.
API
APIs that can register a Watch: exists, getData, getChildren, addWatch
API that removes a Watch: removeWatches
APIs that trigger a Watch: setData, delete, create
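A hedged example of the 3.6.0+ addWatch API with a persistent recursive watch (connection string and path are placeholders):

```java
import org.apache.zookeeper.AddWatchMode;
import org.apache.zookeeper.ZooKeeper;

public class PersistentWatchExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> { });

        // A persistent recursive watch fires for /app and every descendant,
        // and stays registered until it is explicitly removed (e.g. via removewatches).
        zk.addWatch("/app", event ->
                System.out.println(event.getType() + " on " + event.getPath()),
                AddWatchMode.PERSISTENT_RECURSIVE);

        Thread.sleep(60_000); // keep the session alive while notifications arrive
        zk.close();
    }
}
```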
Guarantees
Only clients that have registered a Watch receive the corresponding notifications.
The order in which a client receives notifications is consistent with the order in which the server processed the changes.
Notes
Watch notifications rely on one-time delivery over the network connection, so a Watch may be lost while the client is reconnecting.
A standard Watch is one-shot while the node may change many times; if the client has not re-registered in time, the later notifications are lost.
A standard Watch is one-shot; a persistent Watch is re-registered automatically by the server, so under the hood it is still essentially a one-shot notification.
A persistent Watch keeps notifying until it is explicitly removed.
Steps
The client registers a Watch for the specified node.
The server saves the Watch and sends a notification once the corresponding event is triggered.
The client calls back the Watch handler to process the event further.
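A minimal sketch of this register/notify/callback cycle with a standard (one-shot) watch; the path and connection string are placeholders, and the re-registration inside process() is what the one-shot caveat above requires:

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StandardWatchExample implements Watcher {
    private final ZooKeeper zk;

    StandardWatchExample(ZooKeeper zk) throws Exception {
        this.zk = zk;
        zk.getData("/app/config", this, new Stat()); // register a one-shot watch
    }

    @Override
    public void process(WatchedEvent event) {
        System.out.println("Event: " + event.getType() + " on " + event.getPath());
        try {
            // A standard watch fires only once, so it must be registered again here,
            // otherwise changes that happen in between go unnoticed.
            zk.getData("/app/config", this, new Stat());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> { });
        new StandardWatchExample(zk);
        Thread.sleep(60_000); // keep the session alive while notifications arrive
        zk.close();
    }
}
```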
Principle
In Zookeeper, the object used for Watch registration is WatchRegistration, and different APIs correspond to different registration objects:
```java
// getData, getConfig --> DataWatchRegistration
public class GetDataRequest implements Record {
    private String path;
    private boolean watch;
}

// exists --> ExistsWatchRegistration
public class ExistsRequest implements Record {
    private String path;
    private boolean watch;
}

// getChildren --> ChildWatchRegistration
public class GetChildrenRequest implements Record {
    private String path;
    private boolean watch;
}

// @since 3.6.0  addWatch --> AddWatchRegistration
public class AddWatchRequest implements Record {
    private String path;
    private int mode;
}
```
One obvious feature of these requests is that watch is a boolean: the request only carries a flag that says whether a Watch is being registered, not the Watcher object itself.
Watch registration
1. Client: when an API request is initiated, a Packet object is built via queuePacket; it mainly contains the RequestHeader, ReplyHeader, Record (Request), Record (Response), WatchRegistration and so on. The RequestHeader carries the operation type and the transaction zxid of the request, and the server parses the request differently depending on the operation type. Inside queuePacket the packet is put into the blocking queue outgoingQueue, from which a dedicated thread (SendThread, via clientCnxnSocket#doTransport) takes elements and processes them. If it is a transaction request, the zxid is incremented; then only the RequestHeader and the Request are serialized and sent. The Packet itself is placed into pendingQueue to wait for the server's reply. In other words, the WatchRegistration is not sent over the wire; only the operation type, zxid, path and watch flag are sent.

2. Server: when Zookeeper starts, the server connection layer is opened through startServerCnxnFactory. An AcceptThread is started to handle client connections, and incoming client connections are accepted there. For each connection, a SelectorThread handles the client's I/O requests through handleIO, and the I/O interaction is wrapped as an IOWorkRequest and handed to a thread pool, which processes it asynchronously via doWork --> doIO. The network packet is then parsed by readPayload and processPacket. After parsing, the request is put into the blocking queue submittedRequests; RequestThrottler processes the queued requests and submitRequestNow hands them to the ZkServer processing chain. When the chain is initialized, FinalRequestProcessor is designated to handle interaction with the database. According to the operation type, the data is deserialized into the corresponding Request object. If the watch flag is set in the Request, a record is added to the DataTree through IWatchManager, containing the watched path and the client's connection (ServerCnxn). This is the server-side registration. IWatchManager comes in two kinds, one for the node itself and one for its children:

```java
private IWatchManager dataWatches;
private IWatchManager childWatches;
```

When the Request has been processed, the result is written to the Response, returning the ReplyHeader, operation type, response status (Stat) and so on.

3. Client: the server's response is received in SendThread#readResponse. The Packet object previously saved in outgoingQueue is taken from the blocking queue pendingQueue, and its response content is completed via finishPacket; at this point the client finishes both the Watch registration and the handling of the API call's response. finishPacket takes the Packet, processes the WatchRegistration inside it and calls its register method. register first obtains the storage map of the corresponding type from the client's ZKWatchManager, which keeps the following maps:

```java
private final Map<String, Set<Watcher>> dataWatches = new HashMap<>();
private final Map<String, Set<Watcher>> existWatches = new HashMap<>();
private final Map<String, Set<Watcher>> childWatches = new HashMap<>();
private final Map<String, Set<Watcher>> persistentWatches = new HashMap<>();
private final Map<String, Set<Watcher>> persistentRecursiveWatches = new HashMap<>();
```

Different maps store different types of Watcher; the register method simply stores the Watcher from the WatchRegistration in the corresponding HashMap. This is the client-side registration.
Watch trigger
1. Client API: NodeCreated, NodeDeleted, NodeDataChanged and NodeChildrenChanged events on a node trigger Watch feedback; they correspond to the create, delete and setData operations.

2. Server (IWatchManager#triggerWatch): when the above APIs modify DataTree data, IWatchManager#triggerWatch is invoked; different APIs trigger different event types. ServerCnxn then builds a notification-type response and sends it to the client. If the WatchMode is the default (standard) type, the Watch entry is removed once it has fired; otherwise it keeps firing.

```java
public static final WatcherMode DEFAULT_WATCHER_MODE = WatcherMode.STANDARD;
```

3. Client (SendThread#readResponse): the client receives the response in SendThread#readResponse. If parsing shows it is a notification-type response, a WatcherEvent is built; eventThread.queueEvent(we) wraps the event together with the matching Watch list into a WatcherSetEventPair and puts it into the blocking queue waitingEvents. When the WatcherSetEventPair is built, all client Watches registered in ZKWatchManager that are interested in the event are looked up. EventThread keeps taking events from waitingEvents; if the element is a WatcherSetEventPair, it iterates over the Watches and calls each one's process method, which is the callback the client implements to handle the event. This completes the whole Watch trigger mechanism.
Chroot feature of Zookeeper
The chroot feature was added in v3.2.0+. It allows each client to set a namespace root for itself; all of that client's operations then take place under this root, which provides data isolation between applications.
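A hedged example: the chroot is simply appended to the connection string of the Java client (hosts and paths below are placeholders):

```java
import org.apache.zookeeper.ZooKeeper;

public class ChrootExample {
    public static void main(String[] args) throws Exception {
        // The "/apps/app1" suffix is the chroot: this client sees that subtree as its root.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181/apps/app1", 15000, event -> { });

        // This reads the node that other clients know as /apps/app1/config.
        byte[] data = zk.getData("/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```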
Permission control (ACL) of Zookeeper
Zookeeper supports configuring an ACL for each node. The command for adding an authenticated user is addauth scheme auth; for the digest scheme, auth takes the form username:password.
ACL permission schemes:
world: a single built-in id, anyone, representing everyone (the default). Command: setAcl path world:anyone:acl
ip: authentication by IP address. Command: setAcl path ip:addr:acl
auth: authentication as a user previously added with addauth. Command: setAcl path auth:user:acl
digest: authentication with "username:password". Command: setAcl path digest:user:digest:acl
ACL permission flags:
CREATE (c): can create child nodes (only level by level, not across levels)
DELETE (d): can delete child nodes (direct children only)
READ (r): can read the node's data and list its children
WRITE (w): can set the node's data
ADMIN (a): can set the node's access control list (ACL) permissions
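A hedged Java-client sketch combining the digest/auth schemes and the permission flags above (user name, password, path and connection string are placeholders):

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class AclExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 15000, event -> { });

        // Equivalent of the CLI "addauth digest user1:pass1": authenticate this session.
        zk.addAuthInfo("digest", "user1:pass1".getBytes());

        // Grant read, write and admin (rwa) to the authenticated user (the "auth" scheme).
        List<ACL> acl = Collections.singletonList(new ACL(
                ZooDefs.Perms.READ | ZooDefs.Perms.WRITE | ZooDefs.Perms.ADMIN,
                new Id("auth", "")));

        zk.create("/secure-node", "secret".getBytes(), acl, CreateMode.PERSISTENT);
        System.out.println(zk.getACL("/secure-node", null));
        zk.close();
    }
}
```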
Downtime and dynamic expansion of Zookeeper Cluster
The cluster size follows the 2N+1 principle: as long as fewer than half of the nodes are down, the cluster can still serve clients.
v3.5.0+ supports dynamic cluster configuration. Enable it with reconfigEnabled=true in the main configuration file and reference the dynamic file with dynamicConfigFile=basePath/fileName; the dynamic configuration file name has the format configFilename.dynamic[.version]. Its content may only contain the server, group and weight properties, where a server entry has the following format:
```
server.<positive id>=<address1>:<port1>:<port2>[:role];[<client port address>:]<client port>
```
Available values for role are participant (the default, running as follower or leader) and observer. After modifying the configuration, apply the change with the client command reconfig.
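A hedged sketch of triggering the same operation as the reconfig CLI command from the Java client via ZooKeeperAdmin (3.5.3+); the server line, credentials and addresses are placeholders, and reconfigEnabled=true plus an appropriately authorized user are required:

```java
import org.apache.zookeeper.admin.ZooKeeperAdmin;

public class ReconfigExample {
    public static void main(String[] args) throws Exception {
        ZooKeeperAdmin admin = new ZooKeeperAdmin("zk1:2181", 15000, event -> { });

        // Reconfiguration is usually restricted, so authenticate as an authorized user.
        admin.addAuthInfo("digest", "super:adminpass".getBytes());

        // Add server.4 as a participant; the entry uses the dynamic-config server format.
        String joining = "server.4=zk4:2888:3888:participant;2181";
        byte[] newConfig = admin.reconfigure(joining, null, null, -1, null);
        System.out.println(new String(newConfig));
        admin.close();
    }
}
```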
Zookeeper cluster monitoring
v3.6.0+ supports metrics monitoring, which can be exposed for Prometheus.io through the embedded Jetty server (configured with the metricsProvider.className and metricsProvider.httpPort properties); the endpoint is http://hostname:httpPort/metrics. The data can then be visualized with Grafana.