From Zeppelin: An Overview of a Distributed KV Storage Platform, we know that the meta-information node, Meta, runs as a cluster and maintains and serves the meta-information of the whole Zeppelin deployment. The Meta cluster can be seen as the brain of Zeppelin and the initiator of all meta-information changes. Each Meta node embeds a Floyd instance and is therefore also a Floyd node, and the Meta cluster relies on Floyd for consistent reads and writes of its content. This article describes the Meta node in detail from the perspectives of its role, thread model, data structures, leader election and distributed locking, cluster expansion and shrinking, and membership change, and finally summarizes the lessons we learned from designing and implementing it.
Role
The figure above shows the central position of the Meta cluster. Its responsibilities are to:
Provide the current meta-information to Clients and Node Servers, including shard replica placement, Meta cluster membership, and so on.
Maintain heartbeat detection with the Node Servers and switch the master replica when an anomaly is detected.
Accept and execute operations commands and apply the corresponding meta-information changes, including expansion, shrinking, Table creation, Table deletion, and so on.
Thread model
Compared with the storage nodes, the thread model of the meta-information node is relatively simple:
Dispatch and Worker threads that process requests.
An Update thread that modifies Floyd. The Update thread is the only writer of Floyd; all meta-information modifications are handed to it through a task queue, and to reduce the write pressure on Floyd it uses delayed batch commits (sketched after this list).
A Condition thread that waits on Offset conditions. Some meta-information changes, such as SetMaster, expansion, and shrinking, can only be executed once the master and slave Binlog offsets of the affected shard replicas have caught up. Meta obtains the offsets from Node heartbeats, and the Condition thread keeps checking the gap between master and slave; only when the condition is met does it notify the Update thread to apply the corresponding change.
A Cron thread that performs scheduled tasks, including checking for and completing Meta leader switchover, checking Node liveness, loading the current meta-information on Follower Metas, driving data migration tasks, and so on.
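To make the delayed batch commit concrete, here is a minimal sketch of an Update thread that merges queued modifications into a single Floyd write per interval. It is only an illustration under assumed names (MetaTask, ApplyToFloyd, the 100 ms delay); it is not Zeppelin's actual code.

```cpp
// Minimal sketch of an Update thread with delayed batch commit.
// Hypothetical types and names; not Zeppelin's actual implementation.
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <vector>

using MetaTask = std::function<void()>;  // one pending meta-info modification

class UpdateThread {
 public:
  // Producers (Dispatch/Worker/Condition threads) enqueue modifications here.
  void Submit(MetaTask task) {
    std::lock_guard<std::mutex> lk(mu_);
    queue_.push_back(std::move(task));
    cv_.notify_one();
  }

  // The single Floyd writer: wake up when work arrives, wait a short
  // interval so concurrent modifications pile up, then commit one batch.
  void Run() {
    while (true) {
      std::unique_lock<std::mutex> lk(mu_);
      cv_.wait(lk, [this] { return !queue_.empty(); });
      cv_.wait_for(lk, std::chrono::milliseconds(100));  // let tasks accumulate
      std::vector<MetaTask> batch;
      batch.swap(queue_);
      lk.unlock();
      ApplyToFloyd(batch);  // one merged write to Floyd per batch
    }
  }

 private:
  void ApplyToFloyd(const std::vector<MetaTask>& batch) {
    for (const auto& task : batch) task();  // build the merged change
    // ... write the merged meta-information to Floyd once ...
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<MetaTask> queue_;
};
```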
Data structure
To accomplish these tasks, the Meta node needs to maintain a full set of data, including Node heartbeat information, Node offset information, shard information, Meta membership information, and expansion and migration information. Because of the cost inherent in the consensus algorithm, we want to minimize the access pressure on Floyd, so not all of this data needs to be kept in Floyd directly. Zeppelin classifies the data by importance, access frequency, and recoverability, and records in Floyd only the data that is accessed infrequently and is hard to recover.
The figure above shows the data maintained by the Meta node and how it is stored. Besides the data recorded in the consistency library Floyd, Meta also maintains corresponding in-memory data structures. The in-memory structures are built from the data in Floyd, reorganized to provide more convenient access interfaces. By the tasks they serve, they fall into three parts:
1. Maintain and provide cluster meta-information (Zeppelin Meta Info)
The corresponding in-memory data structure is InfoStore (sketched below). InfoStore relies on the data in Floyd, which includes:
Epoch: the version number of the current meta-information; every meta-information change increments the Epoch by one.
Tables: the distribution of shard replicas and their master-slave relationships.
Nodes: storage node addresses and liveness information.
Members: the Meta cluster's own membership information.
InfoStore reorganizes this data to provide convenient query and modification interfaces. In addition, InfoStore maintains some frequently modified but recoverable data:
The last heartbeat time of each storage node: this is lost on a crash and can be rebuilt from the Nodes information in Floyd plus the time of recovery; note that using the recovery time effectively extends the node's perceived liveness.
The per-shard Binlog offsets of each storage node: Meta needs these to decide master-slave switchover of a replica. After a crash they can be recovered from Node heartbeats, which requires each Node to carry its full Binlog offset information in the first packet after re-establishing the heartbeat connection.
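As a rough illustration of the split described above, the in-memory layout might look like the following; the field and type names are assumptions made for this sketch, not Zeppelin's actual definitions.

```cpp
// Rough sketch of the InfoStore contents; names are illustrative only.
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Partition {
  int id;
  std::string master;               // "ip:port" of the master replica
  std::vector<std::string> slaves;  // slave replicas
};

struct InfoStore {
  // Backed by Floyd (low access frequency, hard to recover):
  int64_t epoch;                                          // bumped on every change
  std::map<std::string, std::vector<Partition>> tables;   // table -> shards
  std::vector<std::string> nodes;                         // storage node addresses
  std::vector<std::string> members;                       // Meta cluster members

  // Kept only in memory (frequently modified, recoverable):
  std::map<std::string, int64_t> last_heartbeat;          // node -> last alive time
  std::map<std::string, std::map<int, uint64_t>> binlog_offsets;  // node -> shard -> offset
};
```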
2. Migration information for expansion and shrinking (Expand / Shrink)
The corresponding in-memory data structure is MigrateRegister, which registers and drives the migration process. It is described in detail in the cluster expansion and shrinking section below.
3. High availability of the Meta cluster (Meta High Availability)
Meta provides its service as a cluster: the Leader node handles write operations, the Followers share the read load, and the nodes rely on Floyd for consistency, which together make the Meta cluster highly available. The in-memory data structure Election is responsible for electing a new leader when a node fails; it relies on the Lock interface provided by Floyd and the election-related data stored there. This is described in detail in the leader election and distributed lock section below.
Leader election and distributed locks
The nodes in the Meta cluster are divided into two roles: Leader and Follower:
All write operations and heartbeats are redirected to the Leader. The Leader wraps every request that needs to modify Floyd into a Task, appends it to the waiting queue, writes Floyd with delayed batching, and then updates the local in-memory data structures.
Followers periodically check the meta-information in Floyd, reload and update their local in-memory data structures when it changes, and serve meta-information queries.
Therefore we need a leader election mechanism, and each Leader needs a fixed lease during which the cluster will not elect another Meta node as the new Leader; this sacrifices a certain amount of availability to optimize read performance. Leader election is a classic application of distributed locks: whichever node acquires the lock becomes the leader. We view a distributed lock as three independent layers of problems, shown on the left side of the figure below; from bottom to top they are consistency (Consistency), lock implementation (Lock), and lock usage (Usage of Lock). Consistency provides high availability, Lock provides the mutual exclusion mechanism, and the Usage of Lock layer usually has two kinds of implementation:
Rigorous implementation: a monotonically increasing Sequence is returned when the lock is acquired, and the receivers of all operations issued by the lock holder must check this Sequence to make sure the operation is still protected by the lock.
Simple implementation: the node supplies a duration when it tries to acquire the lock, and the lock service guarantees not to give the lock to any other node during that time; the user must ensure that all of its operations complete within the duration. This approach is not rigorous but is very easy to use, and Zeppelin's Meta cluster adopts it.
The right part of the figure above shows how these three layers map onto Meta:
Consistency: we rely on the Raft implementation in Floyd, which provides a fine-grained lock interface as well as Set and Get interfaces for storing data.
Lock: on top of the interfaces provided by Floyd, Meta implements its own coarse-grained lock, Coarse-Lock. Put simply, it stores and queries the current Leader's address and last refresh time through the Set/Get interfaces, uses Floyd's fine-grained lock to make these accesses mutually exclusive, lets the Leader refresh its timestamp periodically, and lets a Follower take over when it finds the Leader has timed out. The Coarse-Lock layer implements the Election mechanism that the Meta cluster needs (a sketch follows this list).
Usage of Lock: using Coarse-Lock, Meta achieves its own high availability. The Cron thread keeps triggering the node check and attempts a leader election when necessary.
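The following is a minimal sketch of the Coarse-Lock idea described above, written against a Floyd-like interface. The ConsistentStore interface, key layout, and lease length are assumptions for this example, not Floyd's real API.

```cpp
// Sketch of the Coarse-Lock election built on a Floyd-like interface.
// Interface and key names are assumptions for illustration.
#include <cstdint>
#include <string>

struct ConsistentStore {                 // stand-in for the Floyd client
  virtual bool Lock(const std::string& name) = 0;    // fine-grained lock
  virtual bool UnLock(const std::string& name) = 0;
  virtual bool Write(const std::string& key, const std::string& val) = 0;
  virtual bool Read(const std::string& key, std::string* val) = 0;
  virtual ~ConsistentStore() = default;
};

constexpr char kElectionLock[] = "#election";
constexpr char kLeaderKey[]    = "#leader";          // "addr|last_refresh_ms"
constexpr int64_t kLeaseMs     = 10 * 1000;          // assumed Leader lease

// Called periodically from the Cron thread by every Meta node.
// Returns true if this node is (still or newly) the Leader.
bool CheckOrElect(ConsistentStore* floyd, const std::string& my_addr,
                  int64_t now_ms) {
  if (!floyd->Lock(kElectionLock)) return false;     // mutual exclusion
  std::string record;
  std::string leader;
  int64_t last_refresh = 0;
  if (floyd->Read(kLeaderKey, &record)) {
    size_t sep = record.find('|');
    leader = record.substr(0, sep);
    last_refresh = std::stoll(record.substr(sep + 1));
  }
  bool is_leader = false;
  if (leader == my_addr) {
    // We are the Leader: refresh our lease timestamp.
    floyd->Write(kLeaderKey, my_addr + "|" + std::to_string(now_ms));
    is_leader = true;
  } else if (leader.empty() || now_ms - last_refresh > kLeaseMs) {
    // No Leader recorded, or its lease has expired: take over.
    floyd->Write(kLeaderKey, my_addr + "|" + std::to_string(now_ms));
    is_leader = true;
  }
  floyd->UnLock(kElectionLock);
  return is_leader;
}
```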
Note that Coarse-Lock is held much longer than Fine-Lock, and moving the lock is correspondingly more expensive: master-slave links must be re-established, task queues discarded and emptied, Meta worker threads switched, and so on. We therefore want jitter in the lower-level lock to affect the upper-level master-slave relationship as little as possible. To that end, two mechanisms are designed into Meta:
The Meta master-slave relationship is decoupled from the Floyd master-slave relationship: even if Floyd's leadership changes, the change can remain transparent to the upper Meta cluster.
A Jeopardy phase is introduced. During normal operation Meta records the current Leader information; when Floyd cannot serve because of network or node failures, the Meta layer enters the Jeopardy phase, in which the Meta nodes keep the existing master-slave relationship for a certain period of time. That period is the Leader lease mentioned above for optimizing reads. This is acceptable because Zeppelin is designed to minimize its dependence on the Meta cluster as a central node and can therefore tolerate the Meta cluster being briefly unavailable.
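A hedged sketch of the Jeopardy idea follows: when Floyd is temporarily unreachable, the last known leader is kept for at most one lease period instead of being dropped immediately. The structure and names are assumptions, not Zeppelin's actual code.

```cpp
// Sketch of the Jeopardy phase; illustrative only.
#include <cstdint>
#include <string>

struct Election {
  std::string cached_leader;       // recorded while Floyd was healthy
  int64_t jeopardy_since_ms = 0;
  static constexpr int64_t kLeaseMs = 10 * 1000;  // assumed lease length

  // Called from the Cron thread on every check.
  std::string CurrentLeader(bool floyd_available,
                            const std::string& floyd_leader, int64_t now_ms) {
    if (floyd_available) {
      cached_leader = floyd_leader;   // normal operation: trust Floyd
      jeopardy_since_ms = 0;
      return cached_leader;
    }
    if (jeopardy_since_ms == 0) jeopardy_since_ms = now_ms;
    if (now_ms - jeopardy_since_ms <= kLeaseMs) {
      return cached_leader;           // Jeopardy: keep the old relationship
    }
    return "";                        // lease exhausted: no known leader
  }
};
```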
Cluster expansion and shrinking
As a storage cluster, Zeppelin inevitably needs to be expanded, shrunk, or rebalanced as resource utilization and business needs change. The figure below is a simple example of an expansion: three storage nodes, Node1, Node2, and Node3, hold the nine master and slave replicas of shards P1, P2, and P3. When Node4 joins, the master replica of P1 and a slave replica of P3 need to be migrated to it.
For this kind of problem, the main requirements are as follows:
The process may take a long time and must not require manual intervention.
Data correctness must be guaranteed.
The impact perceived by online services should be minimized.
The burden on Meta, in terms of both resource usage and code complexity, must not grow significantly.
The process must resume from where it left off after a Meta node failure or a network failure.
Node state changes must be tolerated.
Pausing and cancelling should be easy, and the status and current progress should be observable.
Load balancing
Sub-problems
To solve this problem well, we first derived and split it into sub-problems:
Expansion, shrinking, and balancing all amount to moving a number of shards from source nodes to destination nodes.
Migrating one shard can be split into three steps: add a new slave replica, wait for data synchronization, then switch roles and delete the original replica.
Scheme
The figure above shows Zeppelin's expansion, shrinking, and balancing mechanism:
The client command-line tool uses Zeppelin's data distribution algorithm DPRD to turn an Expand, Shrink, or Balance operation into an initial state (Init State) and a set of DIFF items. Each DIFF item specifies a shard replica to be migrated together with its source and destination nodes.
The Init State and the DIFF set are passed to the Migrate Register module of the Meta Leader node, which checks the Init State and writes the DIFF set into Floyd.
The Cron thread periodically fetches a batch of DIFF items and executes them one by one (a sketch of this flow follows the list):
It generates UpdateTask1, which adds the new slave replica, and hands it to the Update thread for execution as soon as possible, setting the shard's state to restrict or block writes.
It generates a ConditionTask and hands it to the Condition thread. The ConditionTask contains an Offset condition and an UpdateTask2 that switches the replicas; the Offset condition usually requires the offsets on the source and destination nodes to be equal.
Once the Offset condition is met, the Condition thread hands the corresponding UpdateTask2 to the Update thread for execution as soon as possible.
After the necessary state changes are completed, the corresponding DIFF item is deleted from the Register, and the next DIFF item is fetched and executed until all are done. This way, whenever a Meta node crashes, the new Leader can read the remaining DIFF items from Floyd and continue the operation.
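A simplified sketch of driving one DIFF item is shown below. The DiffItem layout and all helper functions are hypothetical stand-ins for the UpdateTask1, ConditionTask, and UpdateTask2 work described above; in Zeppelin the steps are split across the Cron, Update, and Condition threads rather than one function.

```cpp
// Sketch of processing one migration DIFF item; illustrative only.
#include <string>

struct DiffItem {
  std::string table;     // table the shard belongs to
  int partition;         // shard (partition) id to migrate
  std::string src_node;  // node currently holding the replica
  std::string dst_node;  // node the replica should move to
};

// Hypothetical helpers (stubs here), standing in for the real tasks.
void AddSlaveReplica(const DiffItem&) { /* UpdateTask1: add new slave */ }
void WaitUntilOffsetsEqual(const DiffItem&) { /* Offset condition */ }
void SwitchAndRemoveReplica(const DiffItem&) { /* UpdateTask2 */ }
void RemoveDiffFromRegister(const DiffItem&) { /* persist progress in Floyd */ }

void ProcessDiff(const DiffItem& diff) {
  // 1. Add a new slave replica on the destination node and adjust the
  //    shard state (e.g. restrict writes).
  AddSlaveReplica(diff);
  // 2. Wait until the destination replica's Binlog offset catches up with
  //    the source replica's.
  WaitUntilOffsetsEqual(diff);
  // 3. Switch roles and remove the replica from the source node.
  SwitchAndRemoveReplica(diff);
  // 4. Delete the DIFF item from the Register in Floyd, so that after a
  //    Meta crash the new Leader resumes from the remaining items.
  RemoveDiffFromRegister(diff);
}
```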
Membership change
A Meta cluster is usually a consistency cluster of 3 or 5 nodes, and sometimes its membership needs to change: nodes are added, removed, or replaced. Meta cluster membership change relies on the Membership Change mechanism provided by the underlying Floyd, which changes one node at a time. For details, see CONSENSUS: BRIDGING THEORY AND PRACTICE.
After Floyd's membership changes, the Meta cluster updates its in-memory Membership data structure and increments the meta-information Epoch by one, so that all Nodes and Clients in the cluster learn the new Membership from the meta-information they pull next.
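Conceptually, the step that follows a Floyd membership change might look like the sketch below; the names are assumptions for illustration, and Floyd's actual membership API may differ.

```cpp
// Sketch: after Floyd has added or removed one member, update the in-memory
// Membership and bump the Epoch so Nodes and Clients pull the new view.
// Names are illustrative only.
#include <cstdint>
#include <string>
#include <vector>

struct MetaInfoStore {
  int64_t epoch = 0;
  std::vector<std::string> members;

  void OnMembershipChanged(const std::vector<std::string>& new_members) {
    members = new_members;  // reload Membership from Floyd
    ++epoch;                // every meta-information change bumps the Epoch,
                            // so Nodes/Clients notice and re-pull meta-info
  }
};
```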
Lessons We Learned
1. Decentralize responsibility
Decentralize the responsibilities of the central node onto the many Nodes and Clients, thereby reducing dependence on the central node:
Heartbeats: initiated by the Nodes, which also check the link status.
Meta-information delivery: Nodes and Clients actively pull the meta-information instead of having Meta push it. In addition, when a Node receives a request that belongs to another node, it returns a kMove message to the Client, helping the Client obtain part of the meta-information it needs without going through Meta.
2. When considering scalability, reducing the number of communications is sometimes more effective than optimizing the processing speed of a single request.
3. Limit the linear growth of resource usage.
For example, by batching and delaying commits to Floyd, we merge the modification operations within a time window into a single access to Floyd, turning a per-request modification frequency into a constant number of writes per second.
4. Data recovery during failure recovery
Likewise, to reduce the pressure on Floyd we do not store all data in Floyd, so some data maintained only in memory has to be rebuilt when service is restored. Possible sources for recovery include:
Persisted data (as little as possible)
Information carried in external requests, for example the offsets that Nodes report in their heartbeats
Estimates, for example Node liveness in Meta, which is taken directly from the time of recovery
5. No retries; operations are reentrant
No operation in Meta is retried: when any step fails, it is not retried directly but discarded after some state is restored, and is completed later by an external operation, which in turn requires all operations to be reentrant. The benefits are:
Handling is clear and simple; a task can be discarded directly at the point where something goes wrong.
The upper layer can better estimate how long an operation will take, and can therefore give a time guarantee to the distributed lock or to the Nodes.
References
https://github.com/Qihoo360/zeppelin
https://github.com/PikaLabs/floyd
https://ramcloud.stanford.edu/~ongaro/thesis.pdf
http://baotiao.github.io/2017/09/12/distributed-lock/
http://catkang.github.io/2018/01/07/zeppelin-overview.html
https://whoiami.github.io/DPRD
Original link: https://mp.weixin.qq.com/s/eCDWqgRG-FmWFkK6cgzm4w