Building an enterprise-grade distributed storage system is a challenging task for any team: it requires not only a solid theoretical foundation but also strong engineering capability. SmartX has been working in the field of distributed storage for five years and has accumulated a great deal of valuable practical experience. Through a series of articles, we will give a detailed introduction to how SmartX builds its distributed block storage product. This article is the first part, compiled from a talk given by Zhang Kai, co-founder and CTO of SmartX, at QCon 2018, and focuses on the relevant background and the metadata service.
Good afternoon, everyone. The topic I am sharing with you today is "ZBS: SmartX's Self-Developed Distributed Block Storage System". SmartX is the name of our company, and I am its co-founder and CTO. ZBS is the name of the distributed block storage product developed by SmartX.
I graduated from the Department of Computer Science at Tsinghua University. After graduation, I worked in Baidu's Infrastructure Department for two years, mainly on distributed systems and big data. I am also a contributor to the open source community and have been involved in projects such as Sheepdog and InfluxDB. Sheepdog is an open source distributed block storage project, and InfluxDB is a time series database (Time Series Database, TSDB) project. I left Baidu in 2013 and founded SmartX with two fellow Tsinghua alumni.
Founded in 2013, SmartX is a technology-driven company that currently focuses on distributed storage and virtualization. Our products are entirely self-developed; they now run on thousands of physical servers and store dozens of PB of data. SmartX works with mainstream hardware vendors and cloud service providers in China, and our products already serve key businesses in areas such as public cloud, private cloud, finance, and manufacturing, supporting core applications and core databases. Today I will focus on distributed block storage.
Generally speaking, based on the storage access interface and the application scenario, we divide distributed storage into three types: distributed block storage, distributed file storage, and distributed object storage.
Among them, the main application scenarios of distributed block storage include:
Virtualization: for example, hypervisors such as KVM, VMware, and XenServer, and cloud platforms such as OpenStack and AWS. Here, block storage serves as the backing storage for virtual disks in virtual machines.
Databases: such as MySQL, Oracle, and so on. Many DBAs place database data disks on a shared block storage service such as distributed block storage. In addition, many customers run databases directly inside virtual machines.
Containers: containers have been adopted more and more widely in enterprises in recent years. Generally speaking, applications running in containers are stateless, but in many scenarios applications still need to persist data. An application can choose to persist its data to a database or to a shared virtual disk. In Kubernetes, this requirement corresponds to the Persistent Volume feature.
Today I will focus on how SmartX builds distributed block storage. Since its founding in 2013, SmartX has accumulated about five years of R&D experience in distributed block storage, so today we will not only share how SmartX implements our own distributed block storage system ZBS, but also describe in detail some of the thinking and trade-offs behind the design. In addition, we will introduce the future plans for our product.
Broadly speaking, a distributed storage system has to solve three problems: the metadata service, the data storage engine, and the consistency protocol.
The metadata service typically provides cluster membership management, data addressing, replica placement, load balancing, heartbeats, garbage collection, and so on. The data storage engine is responsible for storing data on a single machine, including local disk management and disk failure handling. Each storage engine is isolated from the others, so a consistency protocol must run across these isolated engines to ensure that data access meets the desired consistency level, such as strong consistency, weak consistency, sequential consistency, or linearizability. We choose a consistency protocol suited to the application scenario, and that protocol is responsible for synchronizing data between nodes.
With these three parts in place, we have the core of a distributed storage system. The differences between distributed storage systems largely come from the different choices made in these three areas.
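To make this decomposition concrete, here is a minimal interface sketch in Python. It is purely illustrative and not ZBS code; all names (MetadataService, StorageEngine, locate, replicate_write, and so on) are hypothetical.

```python
# Illustrative-only sketch of the three-way decomposition described above.
from abc import ABC, abstractmethod

class MetadataService(ABC):
    """Cluster membership, data addressing, replica placement, heartbeats, GC."""
    @abstractmethod
    def locate(self, volume_id: str, offset: int) -> list[str]:
        """Return the replica locations (node addresses) for a logical extent."""

class StorageEngine(ABC):
    """Per-node engine: local disk management, on-disk layout, failure handling."""
    @abstractmethod
    def read(self, extent_id: str, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, extent_id: str, offset: int, data: bytes) -> None: ...

class ConsistencyProtocol(ABC):
    """Keeps replicas on isolated engines in the desired consistency state."""
    @abstractmethod
    def replicate_write(self, replicas: list[str], extent_id: str,
                        offset: int, data: bytes) -> None: ...
```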
Next, I will walk through how we thought about the ZBS system design in each of these three areas, and how we ultimately decided which technologies and implementation approaches to adopt.
First, let's look at the metadata service, starting with our requirements for it.
Metadata is "data about data": for example, where a piece of data is placed, which servers are in the cluster, and so on. If the metadata is lost, or the metadata service cannot work properly, the data of the entire cluster becomes inaccessible.
Because metadata is so important, our first requirement for the metadata service is reliability. Metadata must be stored in multiple copies, and the metadata service must also provide Failover capability.
The second requirement is high performance. Although we can optimize the IO path so that most IO requests do not need to touch the metadata service, some IO requests always need to modify metadata, for example for data allocation. To prevent metadata operations from becoming a performance bottleneck, their response time must be short enough. At the same time, as the cluster scale of a distributed system keeps growing, the metadata service must also handle a certain level of concurrency.
The last requirement is that it be lightweight. Most of our products are deployed privately, that is, in the customer's data center, and are operated and maintained by the customer rather than by our own operations staff. This is completely different from Internet companies operating their own products. So for ZBS we put extra emphasis on keeping the whole system, and especially the metadata service, lightweight and easy to operate. We expect the metadata service to be light enough to be deployed together with the data service, and we want most operational tasks to be completed automatically by the software, or with only simple actions by the user in the UI. If you know HDFS, its metadata service module is called the NameNode, which is a very heavyweight component: it has to be deployed on a dedicated physical server, has high hardware requirements, and is hard to operate. Both upgrades and master/slave switchovers are heavy operations that can easily cause failures through operational mistakes.
These are our requirements for metadata services. Next, let's take a look at specific ways to construct a metadata service.
When it comes to storing data, especially structured data, the first things that come to mind are relational databases such as MySQL, and mature KV storage engines such as LevelDB and RocksDB. The biggest problem with this type of storage is that it cannot, by itself, provide reliable data protection and Failover. LevelDB and RocksDB are very lightweight, but they only store data on a single machine. MySQL does offer primary/standby solutions, but we felt that MySQL's primary/standby scheme is too cumbersome and lacks a simple, automated operations story, so it is not a very good choice.
Second, let's look at distributed databases such as MongoDB and Cassandra. Both can solve the data protection problem and provide a Failover mechanism. However, neither provides ACID transactions, so implementing equivalent guarantees in the upper layer would be troublesome and require extra work. In addition, these distributed databases are relatively complex to operate, and automating their operations is not easy.
A third option is to implement our own framework based on the Paxos or Raft protocol. But the cost of doing so is very high, and it is not a cost-effective choice for a startup. Moreover, we started the company in 2013, when Raft had only just been proposed.
The fourth option is Zookeeper. Zookeeper is based on the ZAB protocol and can provide a stable and reliable distributed storage service. But Zookeeper's biggest problem is that the amount of data it can store is very limited. To improve access speed, Zookeeper caches all of its data in memory, so the scale of metadata the service can support is severely limited by the server's memory capacity, which makes it impossible to keep the metadata service lightweight or to co-locate it with the data service.
Finally, there is the approach based on a Distributed Hash Table (DHT). Its advantage is that replica locations do not need to be saved in the metadata but are computed via consistent hashing, which greatly reduces the storage and access pressure on the metadata service. However, with DHT we lose control over where data replicas are placed, which in real production environments easily leads to data imbalance across the cluster. Moreover, during operations such as adding or removing nodes and adding or removing disks, changes to the hash ring force some data to be redistributed, causing unnecessary data migration in the cluster, and the amount of data involved is often very large. In a reasonably large environment, such operations happen almost every day, and large-scale data migration can easily affect the performance of online business, so DHT makes day-to-day operations very troublesome.
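To illustrate the rebalancing concern described above, here is a small, self-contained consistent-hashing sketch. It is illustrative only, not ZBS or any particular DHT implementation, and simply shows that adding one node to the ring remaps a fraction of the keys, which is exactly the data that would have to migrate.

```python
# Minimal consistent-hash ring (illustrative only): adding a node remaps some keys.
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points on the ring to smooth distribution.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def locate(self, key: str) -> str:
        # A key is owned by the first virtual point clockwise from its hash.
        idx = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.ring[idx][1]

before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
moved = sum(before.locate(f"extent-{i}") != after.locate(f"extent-{i}")
            for i in range(10000))
print(f"{moved / 100:.1f}% of extents remapped after adding one node")
```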
Each of the approaches described above has problems and cannot be used directly. In the end, ZBS chose to solve the metadata service problem with a combination of LevelDB (which could also be replaced by RocksDB) and Zookeeper. First, both services are relatively lightweight; second, both have proven very stable in production.
We adopted a mechanism called Log Replication, which lets us take advantage of both LevelDB and Zookeeper while avoiding their respective problems.
Here is a brief introduction to Log Replication. Simply put, we can view data, or state, as the result of a sequence of historical operations, where each operation can be serialized and recorded as a Log entry. If we can obtain all the Log entries and replay the operations recorded in them, we can fully restore the state of the data. Any program that has the Log can recover the data by replaying it; therefore, if we replicate the Log, we have effectively replicated the data. This is the basic idea behind Log Replication.
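As a toy illustration of this idea (not ZBS code), the following sketch treats state as a dictionary rebuilt by replaying an ordered log of serialized operations; the operation format here is made up for the example.

```python
# Toy Log Replication idea: state = replay of an ordered log of serialized operations.
import json

def apply(state: dict, op: dict) -> None:
    if op["type"] == "put":
        state[op["key"]] = op["value"]
    elif op["type"] == "delete":
        state.pop(op["key"], None)

log = [json.dumps(op) for op in (
    {"type": "put", "key": "vol-1/extent-7", "value": ["node-a", "node-c"]},
    {"type": "put", "key": "vol-1/extent-8", "value": ["node-b", "node-c"]},
    {"type": "delete", "key": "vol-1/extent-7"},
)]

# Any replica that receives the same log in the same order rebuilds the same state.
replica_state: dict = {}
for entry in log:
    apply(replica_state, json.loads(entry))
print(replica_state)  # {'vol-1/extent-8': ['node-b', 'node-c']}
```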
Let's look at how ZBS uses Zookeeper + LevelDB to implement Log Replication. First of all, there are multiple Meta Servers in the cluster, and each one runs a local LevelDB database. The Meta Servers elect a master through Zookeeper: one Leader node is chosen to respond to metadata requests, while the other Meta Servers enter Standby state.
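For illustration, the sketch below shows what leader election over Zookeeper can look like using the kazoo Python client; the znode path, the identifier, and the callback are hypothetical, and this is not the actual ZBS implementation.

```python
# Sketch of leader election via Zookeeper using the kazoo client (assumed library).
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def serve_metadata_requests():
    # Called only after this node wins the election; leadership is held until
    # this function returns or the Zookeeper session is lost.
    print("I am the Leader; serving metadata requests")

# Standby nodes block inside run() until the current Leader goes away.
election = zk.Election("/zbs/meta/election", identifier="meta-server-1")
election.run(serve_metadata_requests)
```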
When the Leader node receives a metadata update, it serializes the operation into a set of log entries and writes them to Zookeeper. Because Zookeeper keeps multiple replicas, once the Log data is written to Zookeeper it is safe, and this step also completes the replication of the Log.
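One plausible way to persist such a log in Zookeeper is with sequential znodes, as in the sketch below (kazoo assumed; the /zbs/meta/log path and operation format are hypothetical). Each create returns a path with a monotonically increasing sequence number, which gives the log its order.

```python
# Sketch: a Leader appends serialized log entries as sequential znodes in Zookeeper.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

def append_log(op: dict) -> str:
    # create() returns a path like /zbs/meta/log/entry-0000000042; once it
    # returns, the entry is replicated across the Zookeeper quorum.
    data = json.dumps(op).encode()
    return zk.create("/zbs/meta/log/entry-", data, sequence=True, makepath=True)

path = append_log({"type": "put", "key": "vol-1/extent-8", "value": ["node-b", "node-c"]})
print("committed log entry at", path)
```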
Once the Log has been committed successfully, the Meta Server applies the corresponding metadata change to its local LevelDB. What LevelDB stores here is the full materialized state, not the data in Log form.
The non-Leader Meta Server nodes asynchronously pull the Log from Zookeeper, deserialize it back into metadata operations, and apply those changes to their local LevelDB. This ensures that every Meta Server holds a complete copy of the metadata.
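A Standby's replay loop might look roughly like the following sketch, which pulls log entries from Zookeeper in sequence order and applies them to the local LevelDB via plyvel. All paths, znode naming, and the operation format are assumptions carried over from the earlier sketches, not the actual ZBS code.

```python
# Sketch: pull log entries from Zookeeper in order and apply them to local LevelDB.
import json
import plyvel
from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()
db = plyvel.DB("/var/lib/zbs/meta", create_if_missing=True)

applied = 0  # highest log sequence number already applied locally

def replay_once() -> None:
    global applied
    for name in sorted(zk.get_children("/zbs/meta/log")):  # e.g. entry-0000000042
        seq = int(name.rsplit("-", 1)[1])
        if seq <= applied:
            continue
        data, _stat = zk.get(f"/zbs/meta/log/{name}")
        op = json.loads(data)
        if op["type"] == "put":
            db.put(op["key"].encode(), json.dumps(op["value"]).encode())
        elif op["type"] == "delete":
            db.delete(op["key"].encode())
        applied = seq

replay_once()  # in practice this would run periodically or be watch-driven
```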
As mentioned earlier, the amount of data Zookeeper can store is limited by memory capacity. To prevent Zookeeper from consuming too much memory, we periodically clean up the Log in Zookeeper: as long as a Log entry has been synchronized by all Meta Servers, it can be deleted from Zookeeper to save space. We usually keep only about 1 GB of Log in Zookeeper, which is enough to support the metadata service.
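Under the same assumptions as the sketches above, log trimming can be sketched as deleting every entry whose sequence number has already been applied by all Meta Servers.

```python
# Sketch: trim log entries that every Meta Server has already applied.
def trim_log(zk, min_applied_by_all: int) -> None:
    # min_applied_by_all is the smallest "applied" sequence reported by any Meta Server.
    for name in zk.get_children("/zbs/meta/log"):
        seq = int(name.rsplit("-", 1)[1])
        if seq <= min_applied_by_all:
            zk.delete(f"/zbs/meta/log/{name}")
```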
The Failover logic is also very simple. If the Leader node fails, the surviving Meta Servers re-run the election through Zookeeper to choose a new Leader. The new Leader first synchronizes all the Log entries it has not yet consumed from Zookeeper and applies them to its local LevelDB, and only then starts serving metadata requests.
Now let's summarize the characteristics of the metadata service implementation in ZBS.
First of all, the principle is very easy to understand and very easy to implement: Zookeeper handles leader election and Log Replication, and LevelDB handles local metadata storage. The underlying philosophy is to decompose the logic as much as possible and reuse the implementations of existing projects wherever we can.
Second, it is fast enough. Both Zookeeper and LevelDB perform well on their own, and in production we run them on SSDs. In our tests, a single metadata modification completes within milliseconds. Under concurrency, we can batch metadata modification logs to improve throughput.
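As an example of what such batching can look like at the LevelDB level (plyvel assumed, operation format hypothetical), multiple metadata modifications can be committed in one atomic write.

```python
# Sketch: batch several metadata modifications into one atomic LevelDB write.
import json
import plyvel

db = plyvel.DB("/var/lib/zbs/meta", create_if_missing=True)

def apply_batch(ops: list[dict]) -> None:
    with db.write_batch() as wb:  # all ops in the batch commit atomically
        for op in ops:
            if op["type"] == "put":
                wb.put(op["key"].encode(), json.dumps(op["value"]).encode())
            elif op["type"] == "delete":
                wb.delete(op["key"].encode())
```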
In addition, this approach supports Failover, and Failover is very fast: the Failover time is the leader-election time plus the Log catch-up time, so the metadata service can be restored within seconds.
Finally, a word about deployment. In production we usually deploy 3 or 5 Zookeeper instances and at least 3 Meta Server instances to meet the reliability requirements for metadata. The metadata service consumes very few resources and can be co-located with other services.
These are some basic principles. Let's take a look at the specific implementation of metadata services within ZBS.
We encapsulate the Log Replication logic described above in a Log Replication Engine, which includes operations such as leader election, committing the Log to Zookeeper, and synchronizing data into LevelDB. This further reduces development complexity.
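The sketch below is a hypothetical skeleton of such an engine's interface, just to make the division of responsibilities concrete; it is not the actual ZBS implementation, and all method names are invented for illustration.

```python
# Hypothetical skeleton of a Log Replication Engine interface (not ZBS code).
class LogReplicationEngine:
    """Wraps leader election, log commit to Zookeeper, and replay into LevelDB."""

    def __init__(self, zk_hosts: str, db_path: str): ...

    def run_for_leader(self, on_elected) -> None:
        """Block until this node is elected Leader, then invoke on_elected()."""

    def commit(self, op: dict) -> None:
        """Serialize op, write it to the Zookeeper log, then apply it to local LevelDB."""

    def replay(self) -> None:
        """Pull unapplied log entries and apply them to local LevelDB (Standby / failover)."""
```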
On top of the Log Replication Engine, we implement the logic of the whole Meta Server, which contains many management modules such as the Chunk Manager, NFS Manager, iSCSI Manager, and Extent Manager, each managing its own part of the metadata through the Log Replication Engine. The RPC module is the interface exposed by the Meta Server; it receives external commands, such as creating/deleting files or creating/deleting virtual volumes, and forwards them to the corresponding Manager. In addition, the Meta Server contains a fairly complex Scheduler module, which implements a variety of allocation strategies, recovery strategies, and load-balancing strategies, as well as heartbeat, garbage collection, and other functions.
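Purely for illustration, the following sketch shows one way an RPC layer could dispatch external commands to manager modules built on top of such an engine; the manager and method names here are hypothetical and not taken from ZBS.

```python
# Illustrative dispatch from an RPC layer to manager modules (hypothetical names).
class ChunkManager:
    def __init__(self, engine):
        self.engine = engine

    def create_volume(self, name: str, size: int) -> None:
        # Persist the metadata change through the Log Replication Engine.
        self.engine.commit({"type": "put", "key": f"volume/{name}", "value": {"size": size}})

class MetaServerRPC:
    def __init__(self, managers: dict):
        self.managers = managers

    def handle(self, target: str, method: str, **kwargs):
        # e.g. handle("chunk", "create_volume", name="vol-1", size=10 << 30)
        return getattr(self.managers[target], method)(**kwargs)
```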
The above is an introduction to the metadata service of SMTX distributed block storage in the SmartX hyper-converged system.
For more information, you can visit the SmartX website: https://www.smartx.com