Over the past year, most of my work has revolved around Zeppelin, following it from nothing to gradual refinement and stability. While I watched Zeppelin grow, Zeppelin also witnessed my own accumulation and progress. To me, Zeppelin is like a friend I grew up with: we exchanged perceptions of an unknown world, collided over visions of the future, and through countless trials and conversations came to understand each other better. In this post I would like to introduce you to this old friend.
Positioning
Zeppelin is a distributed KV storage platform. When we began designing it, we had three main expectations:
High performance;
Large clusters, which require good scalability as well as the necessary business isolation and quotas;
As an underlying platform, the ability to support richer protocols on top;
Zeppelin's entire design and implementation revolve around these three goals. This article introduces the system in terms of its API, data distribution, meta-information management, consistency, replica policy, data storage, and fault detection.
API
To give readers an overall impression of Zeppelin, let's start with the interfaces it provides:
Basic KV storage-related interfaces: Set, Get, Delete;
TTL support;
HashTags, and Batch operations over keys sharing the same HashTag; a Batch is guaranteed to be atomic. This support exists mainly to enable richer protocols in the upper layer, as sketched below.
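To make the interface concrete, here is a minimal, self-contained sketch of what such a KV interface could look like. The class and method names (ToyKv, AtomicBatchSet, and so on) and the in-memory implementation are illustrative assumptions, not Zeppelin's actual client API; the point is only to show Set/Get/Delete, TTL, and an atomic Batch over keys sharing a HashTag.

```cpp
// Hypothetical client-side view of a Zeppelin-like KV interface. A toy
// in-memory implementation stands in for the real client so the example
// runs on its own; names and signatures are illustrative only.
#include <chrono>
#include <iostream>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

class ToyKv {
 public:
  // Set with an optional TTL (0 means the key never expires).
  void Set(const std::string& key, const std::string& value, int ttl_sec = 0) {
    Entry e;
    e.value = value;
    if (ttl_sec > 0) e.expire = Clock::now() + std::chrono::seconds(ttl_sec);
    store_[key] = e;
  }

  std::optional<std::string> Get(const std::string& key) {
    auto it = store_.find(key);
    if (it == store_.end()) return std::nullopt;
    if (it->second.expire && Clock::now() > *it->second.expire) {
      store_.erase(it);  // lazily drop an expired key
      return std::nullopt;
    }
    return it->second.value;
  }

  void Delete(const std::string& key) { store_.erase(key); }

  // Batch over keys that share the same HashTag: because they all live in
  // one Partition, the writes can be applied atomically (all or nothing),
  // which is what richer upper-layer protocols build on.
  void AtomicBatchSet(
      const std::vector<std::pair<std::string, std::string>>& kvs) {
    for (const auto& kv : kvs) Set(kv.first, kv.second);
  }

 private:
  struct Entry {
    std::string value;
    std::optional<Clock::time_point> expire;
  };
  std::map<std::string, Entry> store_;
};

int main() {
  ToyKv kv;
  kv.Set("{user:1}:name", "alice", /*ttl_sec=*/60);
  // Keys carrying the same {user:1} tag map to the same Partition,
  // so this batch can be applied atomically.
  kv.AtomicBatchSet({{"{user:1}:age", "30"}, {"{user:1}:city", "Beijing"}});
  if (auto v = kv.Get("{user:1}:name")) std::cout << *v << "\n";
  kv.Delete("{user:1}:name");
  return 0;
}
```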
Data Distribution
As a distributed store, the first problem to solve is data distribution. A separate post, Discussion on Data Distribution Method of Distributed Storage System, surveys the possible schemes; Zeppelin chose the more flexible partitioned (sharded) approach, as shown in the figure below:
The logical concept of a Table distinguishes different services, and a Table's entire key space is divided into Partitions of equal size. Each Partition's replicas are stored on different storage nodes (Node Servers), so each Node Server carries replicas of many different Partitions. The number of Partitions is fixed when the Table is created: more Partitions give better data balance and make it possible to scale to larger clusters, but they also inflate the meta-information. In the implementation, the Partition is the smallest unit of data backup, migration, and synchronization, so more Partitions may also mean more resource pressure; Zeppelin's design tries to minimize this effect.
This partitioning scheme splits the data distribution problem into two layers of mapping. The mapping from Key to Partition can be implemented with a simple hash. The mapping from Partition replicas to storage nodes is more complex, since it must consider stability, balance, node heterogeneity, and fault-domain isolation (for more discussion, see Discussion on Data Distribution Method of Distributed Storage System). For this second mapping, Zeppelin's implementation borrows CRUSH's hierarchical treatment of replica fault domains but abandons CRUSH's somewhat obsessive pursuit of minimizing meta-information.
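As a rough illustration of the first layer, here is a sketch of a Key-to-Partition mapping. It assumes a Redis-style HashTag (the substring between '{' and '}') and a fixed partition count; the actual hash function and tag syntax in Zeppelin may differ.

```cpp
// Sketch of the Key -> Partition mapping. Assumes a Redis-style HashTag:
// if the key contains "{...}", only the tagged substring is hashed, so all
// keys sharing a tag fall into the same Partition.
#include <functional>
#include <iostream>
#include <string>

// Extract the HashTag if present, otherwise hash the whole key.
std::string HashTagOf(const std::string& key) {
  auto l = key.find('{');
  if (l == std::string::npos) return key;
  auto r = key.find('}', l + 1);
  if (r == std::string::npos || r == l + 1) return key;
  return key.substr(l + 1, r - l - 1);
}

// The partition count is fixed when the Table is created.
int PartitionOf(const std::string& key, int partition_num) {
  return static_cast<int>(std::hash<std::string>{}(HashTagOf(key)) %
                          partition_num);
}

int main() {
  const int kPartitionNum = 1024;
  // Both keys share the {user:1} tag and therefore the same Partition,
  // which is what makes atomic Batch operations on them possible.
  std::cout << PartitionOf("{user:1}:age", kPartitionNum) << "\n";
  std::cout << PartitionOf("{user:1}:city", kPartitionNum) << "\n";
  std::cout << PartitionOf("plain_key", kPartitionNum) << "\n";
  return 0;
}
```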
When performing cluster-change operations such as creating a Table, scaling out, or scaling in, the operator needs to provide:
The hierarchical deployment topology of the cluster (which rack and machine each node belongs to);
Weights of the storage nodes;
Distribution rules for each fault-domain level;
From this information and the current data distribution, Zeppelin directly computes the complete target distribution, trying to keep data balanced while satisfying the required replica fault domains. The figure below shows the placement rules and the resulting distribution with replicas isolated at the cabinet level; see Decentralized Placement of Replicated Data for the algorithm.
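The following is a deliberately simplified sketch of weight- and fault-domain-aware placement: it greedily picks replicas on the least loaded nodes (relative to their weights) while forbidding two replicas in the same rack. It is not the DPRD algorithm itself, which additionally balances whole tables and minimizes movement on expansion and shrinkage.

```cpp
// Greatly simplified placement sketch: pick replicas for one Partition so
// that no two land in the same rack, preferring nodes with the lowest
// load-to-weight ratio. Only meant to illustrate "weights + fault domains".
#include <iostream>
#include <set>
#include <string>
#include <vector>

struct Node {
  std::string name;
  std::string rack;   // fault domain at cabinet level
  double weight;      // capacity weight provided by the operator
  int load = 0;       // partitions already placed on this node
};

std::vector<int> PlaceReplicas(std::vector<Node>& nodes, int replica_num) {
  std::vector<int> chosen;
  std::set<std::string> used_racks;
  for (int r = 0; r < replica_num; ++r) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(nodes.size()); ++i) {
      if (used_racks.count(nodes[i].rack)) continue;  // rack-level isolation
      if (best == -1 || nodes[i].load / nodes[i].weight <
                            nodes[best].load / nodes[best].weight)
        best = i;
    }
    if (best == -1) break;  // not enough racks to satisfy the rule
    chosen.push_back(best);
    used_racks.insert(nodes[best].rack);
    nodes[best].load++;
  }
  return chosen;
}

int main() {
  std::vector<Node> nodes = {
      {"n1", "rack-a", 1.0}, {"n2", "rack-a", 2.0},
      {"n3", "rack-b", 1.0}, {"n4", "rack-c", 1.0}};
  for (int p = 0; p < 4; ++p) {
    auto replicas = PlaceReplicas(nodes, 3);
    std::cout << "partition " << p << ":";
    for (int i : replicas) std::cout << " " << nodes[i].name;
    std::cout << "\n";
  }
  return 0;
}
```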
Meta-Information Management
As described above, meta-information, including the distribution of every Partition replica, must be shared across the whole cluster and propagated promptly when it changes. This is the meta-information management problem, for which there are generally two approaches:
Central meta-information management: a central node is responsible for detecting, updating, and maintaining the cluster's meta-information. The advantages are a simple, clear design that is easy to implement, a relatively small volume of meta-information to disseminate, and good timeliness. The biggest drawback is the single point of failure at the central node. BigTable and Ceph are examples.
Peer-to-peer meta-information management: the burden of handling cluster meta-information is spread across all nodes, which have equal status. When meta-information changes, it must be propagated with protocols such as Gossip, which limits cluster size. The main advantages are the absence of a single point of failure and better horizontal scalability. Dynamo and Redis Cluster take this approach.
Given the goal of large clusters, Zeppelin adopts centralized meta-information management. Its overall structure is shown in the figure below:
Zeppelin has three main roles: Meta Server, Node Server, and Client. Meta is responsible for maintaining meta-information, detecting node liveness, and distributing meta-information; Node is responsible for the actual data storage; on its first access, a Client obtains the complete data-distribution information of the current cluster from Meta, then computes the correct Node for each user request and sends the request to it directly.
To alleviate the single-point problem of the central node, we adopted the following strategies:
The Meta Servers serve as a cluster, using a consensus algorithm to guarantee data correctness;
Careful Meta design: deferred commit of consensus data; offloading read requests to Followers through Leases; a coarse-grained distributed lock; and a sensible split between persistent and ephemeral data. For details, see Zeppelin is not an airship meta-info node;
Smart Client: the Client takes on more responsibility, such as caching meta-information, maintaining connections to Node Servers, and computing the initial and updated data distribution;
Node Servers share more of the load: for example, meta-information updates are initiated by the storage nodes, and client redirection after meta-information changes is handled with MOVE, WAIT, and similar replies, reducing the pressure on Meta. For details, see Zeppelin is not a storage node of airship.
Through these strategies, dependence on the central node is reduced as much as possible: even if the Meta cluster becomes entirely unavailable, requests from already-initialized clients can still proceed normally, as sketched below.
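A sketch of that client-side flow, under assumed names and an assumed MOVE reply format: the client caches meta-information tagged with an epoch, sends requests directly to the Node Server it computes, and refreshes its cache and retries when the node answers MOVE.

```cpp
// Sketch of the "smart client" request path. The Meta cluster and Node
// Server are stand-ins; the MOVE/WAIT encoding here is illustrative, not
// the real wire protocol.
#include <iostream>
#include <map>
#include <string>

struct MetaInfo {
  int epoch = 0;
  std::map<int, std::string> partition_master;  // partition id -> node addr
};

// Stand-in for the Meta cluster's current view.
MetaInfo g_meta = {2, {{0, "node-b:9221"}}};

MetaInfo FetchMetaFromMetaCluster() { return g_meta; }

// A Node Server answers "OK" if the client's view is current, otherwise
// replies MOVE so the client knows to refresh its cached meta-information.
std::string NodeServe(const std::string& node, int client_epoch,
                      const std::string& key) {
  if (client_epoch < g_meta.epoch) return "MOVE";
  return "OK from " + node + " for " + key;
}

class SmartClient {
 public:
  std::string Get(const std::string& key) {
    if (cache_.epoch == 0) cache_ = FetchMetaFromMetaCluster();
    int partition = 0;  // in reality: hash(HashTag(key)) % partition_num
    for (int attempt = 0; attempt < 2; ++attempt) {
      std::string node = cache_.partition_master[partition];
      std::string reply = NodeServe(node, cache_.epoch, key);
      if (reply != "MOVE") return reply;
      cache_ = FetchMetaFromMetaCluster();  // stale view: refresh and retry
    }
    return "FAILED";
  }

 private:
  MetaInfo cache_;
};

int main() {
  SmartClient client;
  std::cout << client.Get("{user:1}:age") << "\n";
  // Simulate the Partition moving to another master; the next request gets
  // a MOVE, refreshes meta-information, and is retried against the new node.
  g_meta.epoch = 3;
  g_meta.partition_master[0] = "node-c:9221";
  std::cout << client.Get("{user:1}:age") << "\n";
  return 0;
}
```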
Consistency
As mentioned above, the central Meta nodes provide service as a cluster, which requires a consensus algorithm to guarantee that:
Even if a network partition or node failure occurs, the whole cluster still provides a consistent service, as if it were a single node: every subsequently successful operation observes all previously successful operations, completed in order.
Zeppelin uses our consistency library Floyd, a C++ implementation of Raft, to accomplish this goal. For more information, see Raft and its three subproblems.
The Meta cluster must handle node liveness detection, meta-information updates, and meta-information diffusion. Note that because a consensus algorithm has relatively low performance, we restrict what is written through the consensus library to data that is important, hard to recover, and infrequently modified, as sketched below.
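A small sketch of that split, using a stand-in ConsensusStore rather than Floyd's real API: table layout and the meta epoch go through the Raft-backed store because they are important and hard to recover, while per-node heartbeat timestamps stay in local memory because they are cheap to rebuild and change constantly.

```cpp
// Sketch of the split between Raft-replicated state and local-only state on
// a Meta Server. ConsensusStore is a stand-in for Floyd, not its real API.
#include <ctime>
#include <iostream>
#include <map>
#include <string>

struct ConsensusStore {  // stand-in for a Raft-replicated KV store
  std::map<std::string, std::string> kv;
  void Write(const std::string& k, const std::string& v) { kv[k] = v; }
};

class MetaServer {
 public:
  explicit MetaServer(ConsensusStore* store) : store_(store) {}

  // Table layout changes are rare and must survive Meta failover:
  // they go through consensus.
  void UpdateTableLayout(const std::string& table, const std::string& layout) {
    store_->Write("table/" + table, layout);
    store_->Write("epoch", std::to_string(++epoch_));
  }

  // Heartbeats arrive constantly from every node; losing them only delays
  // failure detection briefly, so they are kept in local memory.
  void OnHeartbeat(const std::string& node) {
    last_seen_[node] = std::time(nullptr);
  }

 private:
  ConsensusStore* store_;
  int epoch_ = 0;
  std::map<std::string, std::time_t> last_seen_;
};

int main() {
  ConsensusStore store;
  MetaServer meta(&store);
  meta.OnHeartbeat("node-a:9221");  // cheap, local only
  meta.UpdateTableLayout("user_table", "p0 -> node-a,node-b,node-c");
  std::cout << store.kv["epoch"] << "\n";  // prints 1
  return 0;
}
```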
Replica Policy
For fault tolerance we usually keep three copies of the data, and given the high-performance goal we chose a Master/Slave replica strategy. Each Partition has at least three replicas, one Master and the rest Slaves. All user requests are served by the Master replica; in read/write-splitting scenarios, Slaves can also serve reads. For a write request, the Master modifies the DB, writes a Binlog, and asynchronously ships the Binlog to the Slaves.
The figure above shows how the Master and the Slave (on the right) establish their master-slave relationship. When meta-information changes, the Node pulls the latest meta-information from Meta; on discovering that it has become a new Slave of some Partition, it hands a TrySync task through a Buffer to the TrySync Module; the TrySync Module issues TrySync to the Master's Command Module; the Master generates a Binlog Send task in the Send Task Pool; and the Binlog Send Module ships Binlog to the Slave, completing asynchronous data replication. For more details, see Zeppelin is not a storage node of airship. In the future we also plan to support Quorum and EC replication to cover different usage scenarios.
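A simplified, single-threaded sketch of the write path and asynchronous Binlog replication described above; the TrySync handshake, threading, and failure handling are omitted, and all names are illustrative.

```cpp
// Sketch of the Master write path and asynchronous Binlog replication: the
// Master applies the write to its local DB, appends a Binlog record, and a
// separate send step ships records past each Slave's applied offset.
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct BinlogRecord { std::string key, value; };

struct Replica {
  std::map<std::string, std::string> db;
  std::vector<BinlogRecord> binlog;
  size_t applied_offset = 0;  // how much of the master Binlog this slave applied
};

// The Master handles the user write: apply to DB first, then append Binlog.
void MasterWrite(Replica& master, const std::string& k, const std::string& v) {
  master.db[k] = v;
  master.binlog.push_back({k, v});
}

// Asynchronous Binlog sender: pushes everything beyond the slave's offset.
void SyncSlave(const Replica& master, Replica& slave) {
  while (slave.applied_offset < master.binlog.size()) {
    const BinlogRecord& rec = master.binlog[slave.applied_offset++];
    slave.db[rec.key] = rec.value;
  }
}

int main() {
  Replica master, slave;
  MasterWrite(master, "{user:1}:age", "30");
  MasterWrite(master, "{user:1}:city", "Beijing");
  // The user request has already returned; replication happens asynchronously.
  SyncSlave(master, slave);
  std::cout << slave.db["{user:1}:city"] << "\n";  // Beijing
  return 0;
}
```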
Data Storage
The Node Server ultimately performs the data storage and query operations. Zeppelin currently uses RocksDB as its storage engine, with a separate RocksDB instance for each Partition replica. The LSM scheme was likewise chosen in pursuit of high performance: compared with a B+Tree, an LSM tree greatly improves write performance by turning random writes into sequential writes, while maintaining reasonably good read performance through in-memory caching. Overview of LevelDB Solution uses LevelDB as an example to introduce the design and implementation of LSM.
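A sketch of the "one RocksDB instance per Partition replica" arrangement, using the standard RocksDB C++ API; the directory layout and options shown here are assumptions, not Zeppelin's actual configuration.

```cpp
// Each Partition replica gets its own directory and RocksDB handle, so
// backup, migration, and synchronization can operate on one Partition at a
// time. Paths and options are illustrative.
#include <cassert>
#include <memory>
#include <string>

#include "rocksdb/db.h"

class PartitionStore {
 public:
  // Open (or create) the RocksDB instance backing one Partition replica.
  static std::unique_ptr<PartitionStore> Open(int partition_id,
                                              const std::string& base_dir) {
    rocksdb::Options options;
    options.create_if_missing = true;
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(
        options, base_dir + "/partition_" + std::to_string(partition_id), &db);
    if (!s.ok()) return nullptr;
    return std::unique_ptr<PartitionStore>(new PartitionStore(db));
  }

  bool Set(const std::string& k, const std::string& v) {
    return db_->Put(rocksdb::WriteOptions(), k, v).ok();
  }

  bool Get(const std::string& k, std::string* v) {
    return db_->Get(rocksdb::ReadOptions(), k, v).ok();
  }

 private:
  explicit PartitionStore(rocksdb::DB* db) : db_(db) {}
  std::unique_ptr<rocksdb::DB> db_;
};

int main() {
  auto p0 = PartitionStore::Open(0, "/tmp/zeppelin_demo");
  assert(p0 != nullptr);
  p0->Set("{user:1}:age", "30");
  std::string value;
  p0->Get("{user:1}:age", &value);
  return 0;
}
```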
However, in scenarios with large values, LSM write amplification becomes severe. For high performance, Zeppelin mostly runs on SSDs, where the gap between random and sequential writes is far smaller than on mechanical disks; at the same time, SSDs have limited erase cycles, so it is no longer a good trade for an LSM tree to buy write performance with repeated rewriting of data. And since Zeppelin must support richer protocols on top, large values are unavoidable. LSM upon SSD discusses this in more depth, including other storage engines for different scenarios and a pluggable engine design.
Fault Detection
A good fault detection mechanism should be able to do the following:
Timeliness: when a node crashes or is cut off from the network, the cluster notices within an acceptable time;
Appropriate pressure: both on the nodes and on the network;
Tolerance of network jitter;
A diffusion mechanism: meta-information changes caused by node liveness changes must spread to the whole cluster;
Failures in Zeppelin can occur either in the Meta cluster or in the storage-node cluster. Failure detection within the Meta cluster relies on Floyd's Raft implementation at the lower level, with jitter tolerance added at the upper level through a Jeopardy stage. For more details, see Zeppelin is not an airship meta-info node.
Failure detection of the storage nodes is the Meta cluster's responsibility. After sensing an anomaly, the Meta cluster modifies the meta-information, bumps the meta-information version, and notifies all storage nodes through heartbeats; when a storage node sees that the meta-information has changed, it actively pulls the latest meta-information and reacts accordingly, as sketched below.
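A sketch of that propagation path, with illustrative names: the Meta cluster bumps an epoch after handling a failure, the heartbeat reply carries only the epoch, and a storage node that sees a newer epoch pulls the full meta-information.

```cpp
// Sketch of meta-information propagation via heartbeats after a failure.
#include <iostream>
#include <string>

struct MetaCluster {
  int epoch = 1;
  std::string layout = "p0 -> node-a(master), node-b, node-c";

  // A node failure is handled by re-assigning its partitions and bumping
  // the epoch so that every other node can notice the change.
  void OnNodeDown(const std::string& /*node*/) {
    layout = "p0 -> node-b(master), node-c, node-d";
    ++epoch;
  }

  // The heartbeat reply only carries the epoch, which keeps it cheap.
  int HeartbeatReplyEpoch() const { return epoch; }
  std::string PullFullMeta() const { return layout; }
};

struct NodeServer {
  int local_epoch = 1;
  std::string local_layout;

  void Heartbeat(const MetaCluster& meta) {
    if (meta.HeartbeatReplyEpoch() > local_epoch) {
      // The version changed: actively pull the latest meta-information.
      local_layout = meta.PullFullMeta();
      local_epoch = meta.HeartbeatReplyEpoch();
    }
  }
};

int main() {
  MetaCluster meta;
  NodeServer node_b;
  node_b.Heartbeat(meta);     // nothing changed, cheap reply only
  meta.OnNodeDown("node-a");  // failure detected, epoch bumped
  node_b.Heartbeat(meta);     // node-b notices and pulls the new layout
  std::cout << node_b.local_epoch << ": " << node_b.local_layout << "\n";
  return 0;
}
```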
Finally, Zeppelin provides a wealth of operational and monitoring data and tools, making it easy to build monitoring dashboards with systems such as Prometheus.
Related
[Zeppelin](https://github.com/Qihoo360/zeppelin)
[Floyd](https://github.com/Qihoo360/floyd)
[Raft](https://raft.github.io/)
[Discussion on Data Distribution Method of Distributed Storage System](http://catkang.github.io/2017/12/17/data-placement.html)
[Decentralized Placement of Replicated Data](https://whoiami.github.io/DPRD)
[Zeppelin is not an airship meta-info node](http://catkang.github.io/2018/01/19/zeppelin-meta.html)
[Zeppelin is not a storage node of airship](http://catkang.github.io/2018/01/07/zeppelin-overview.html)
[Raft and its three subproblems](http://catkang.github.io/2017/06/30/raft-subproblem.html)
[Overview of LevelDB Solution](http://catkang.github.io/2017/01/07/leveldb-summary.html)
[LSM upon SSD](http://catkang.github.io/2017/04/30/lsm-upon-ssd.html)