Practical Information | How to Evaluate Kubernetes Persistent Storage Solutions


In Gartner's 2018 Hype Cycle, container storage appeared in the Technology Trigger phase and has begun to enter public view. I believe that over the next two years, as Kubernetes matures and is commercialized, container storage will become more and more important. How to choose a suitable product from the wide variety of storage offerings will be a problem IT decision-makers have to face. This sharing analyzes how to evaluate container storage solutions from the perspective of usage scenarios.

A variety of storage concepts

From the user's point of view, storage is just a disk or a directory; users do not care how the disk or directory is implemented. Their requirements are very "simple": stability and good performance. To deliver stable and reliable storage products, vendors have introduced a variety of storage technologies and concepts. To give you an overall picture, this article first introduces these storage concepts.

From the point of view of storage media, storage can be divided into mechanical hard disks and solid state drives (SSD). A mechanical hard disk is a disk device addressed by a magnetic head, including SATA and SAS disks. Because of head addressing, mechanical disk performance is modest: random IOPS is generally around 200, and sequential bandwidth around 150 MB/s. A solid state drive is a device composed of Flash/DRAM chips and a controller, and is divided by protocol into SATA SSD, SAS SSD, PCIe SSD, and NVMe SSD.

From the point of view of product definition, storage is divided into four categories: direct-attached storage (DAS), network-attached storage (NAS), storage area network (SAN), and software-defined storage (SDS).

DAS is local disk, plugged directly into the server.

NAS refers to devices that expose the NFS protocol, usually implemented as a disk array plus a protocol gateway.

SAN is similar to NAS but exposes the SCSI/iSCSI protocol, with a disk array as the back end.

SDS is a general term covering distributed NAS (parallel file systems), ServerSAN, and so on.

From the perspective of application scenarios, storage can be divided into three categories: file storage (POSIX/MPI), block storage (iSCSI/Qemu), and object storage (S3/Swift).

How does Kubernetes define and classify storage? The storage-related concepts in Kubernetes are PersistentVolume (PV) and PersistentVolumeClaim (PVC), and PVs can be provisioned statically or dynamically. A static PV is used as follows:
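(The original sharing showed the manifest as an image, which is not reproduced here. The following is a minimal sketch of a static PV and a PVC that binds to it; the NFS server address and paths are placeholders, not from the original sharing.)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:                         # any supported volume source can be used here
    server: 192.168.0.100      # placeholder NFS server
    path: /exports/pv-demo
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  storageClassName: ""         # bind to a pre-created PV instead of dynamic provisioning
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi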

Dynamic PV provisioning introduces the concept of a StorageClass, which is used as follows:
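(Again, the original figure is omitted; this is a minimal sketch with a placeholder provisioner name. The administrator defines a StorageClass, applications request storage through PVCs that reference it, and the PV is created automatically.)

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-file
provisioner: example.com/yrfs        # placeholder provisioner / CSI driver name
reclaimPolicy: Delete
allowVolumeExpansion: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: fast-file
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi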

The community enumerates the in-tree PersistentVolume plugins (the original figure is not reproduced here). Kubernetes divides storage into three categories by access mode: RWO, ROX, and RWX. This classification blurs the traditional storage taxonomy, lumping together storage protocols, open source storage products, commercial storage products, public cloud storage products, and so on.

How do you map the Kubernetes taxonomy to familiar storage concepts? This article compares them by application scenario.

Block storage usually supports only RWO, for example AWSElasticBlockStore and AzureDisk; some products, such as GCEPersistentDisk, RBD, and ScaleIO, can also support ROX.

File storage (distributed file systems) supports all three modes, RWO/ROX/RWX, for example CephFS, GlusterFS, and AzureFile.

Object storage does not need PV/PVC to abstract resources, and applications can access and use it directly.

Here we have to complain about the Kubernetes community's early abstraction of the storage layer: in a word, chaotic, mixing open source and commercial projects alike. The community has since recognized the problem and designed a unified storage interface layer, Flexvolume/CSI. Going forward, CSI will be the mainstream in Kubernetes, providing a complete storage abstraction layer.

A variety of application scenarios

Having introduced the storage concepts, the question of which storage to choose is still open. At this point, ask yourself: what kind of workload is it? To choose the right storage, you must understand your workload's storage requirements. This article summarizes the common scenarios for container storage and their characteristics.

Configuration

Whether it is cluster configuration or application configuration, the defining characteristic is concurrent access, i.e. the ROX/RWX modes mentioned earlier: the same configuration file must be accessible from different clusters or different nodes. Distributed file storage is the best choice.

Logs

In container scenarios, logs are a very important part of the workload. They are characterized by high throughput and possibly a large number of small files. If there is a log analysis scenario, there will also be a large number of concurrent reads. Distributed file storage is the best choice.

Applications (databases / message queues / big data)

Applications such as Kafka, MySQL, Cassandra, PostgreSQL, ElasticSearch, and HDFS manage data themselves, and their requirements on the underlying storage are high IOPS and low latency. It is also better for the underlying storage to provide its own data redundancy, so that the upper-layer application can avoid complex failure handling and recovery. Take HDFS as an example: when a DataNode goes offline, the original logic starts a new DataNode and triggers recovery logic to re-replicate the data, which takes a long time and has a large impact on the business. If the underlying storage has its own replica mechanism, the HDFS cluster can be configured with a single replica; when a DataNode goes offline, a new DataNode is started and mounts the original PV, the cluster returns to normal, and the business impact is reduced to seconds. High-performance distributed file storage and high-performance distributed block storage are the best choices.
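(As an illustration of how "the new DataNode mounts the original PV" can be arranged in Kubernetes, here is a hedged sketch; the image name and StorageClass are placeholders, not from the original sharing. Running DataNodes as a StatefulSet with volumeClaimTemplates means a replacement pod reattaches the same PVC.)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: hdfs-datanode
  replicas: 3
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: example/hadoop:3.1      # placeholder image
          volumeMounts:
            - name: data
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:                  # one PVC per pod; a rescheduled pod
    - metadata:                          # re-mounts the same volume
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: fast-file      # placeholder StorageClass
        resources:
          requests:
            storage: 1Ti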

Backup

Backups of application or database data are characterized by high throughput, large data volume, and low cost requirements. File storage and object storage are best.

Taking all of these scenarios together, high-performance distributed file storage is the best overall choice.

All kinds of storage products

There are many kinds of storage products on the market, but for container scenarios the focus is on four solutions: distributed file storage, distributed block storage, local disk, and traditional NAS.

Distributed block storage includes open source projects such as Ceph and Sheepdog, and commercial products such as EMC's ScaleIO and VMware's vSAN. Distributed block storage is not suitable for container scenarios; the key problem is the lack of RWX support.

Distributed file storage includes open source projects such as GlusterFS, CephFS, Lustre, MooseFS, and LizardFS, and commercial products such as EMC's Isilon and IBM's GPFS. Distributed file storage is suitable for container scenarios, but its performance problems are prominent; the discussion here focuses on GlusterFS, CephFS, and MooseFS/LizardFS.

Here is a simple comparison of the advantages and disadvantages of these open source projects, for reference only (the original comparison table is not reproduced here).

The local disk (Local-Disk) scheme has obvious shortcomings, especially for database and big data applications: after a node failure, data recovery takes a long time and the business impact is broad.

Traditional NAS is also a kind of file storage, but its protocol gateway (the NAS head) is a performance bottleneck; traditional NAS can no longer keep up with the times.

Evaluation strategy by category

The core requirements of storage are stability, reliability, and availability. Whether for open source storage projects or commercial storage products, the evaluation method is the same. This article introduces the common evaluation items and methods.

Data reliability

Data reliability refers to the probability that data is not lost. Storage products usually specify reliability as a number of nines, or as the maximum number of disks/nodes that can fail at once. The way to evaluate this is brute-force disk pulling: for example, if the storage uses a 3-replica policy, unplug any two disks; as long as the data is not damaged, reliability is fine. Different data redundancy strategies provide different levels of reliability.

Data availability

Data availability is easily confused with data reliability. Availability refers to whether the data is online. For example, if the storage cluster loses power, the data is offline during that period but not lost; after the cluster recovers, the data can be accessed normally. The main way to evaluate availability is to power off servers and check whether any of the storage's deployed components are a single point of failure.

Data consistency

Data consistency is the hardest to evaluate, because in most scenarios users do not know what data the application wrote, or where. How do you evaluate data consistency? As an ordinary test tool, fio can be used with its CRC verification option enabled; the best test tool, however, is a database: if data is inconsistent, the database either fails to start or its table data is wrong. Specific test cases need to be designed carefully.
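(As an illustration only; the mount path and sizes below are placeholders, not from the original sharing. fio's built-in verification can be enabled like this:)

fio --name=consistencyTest --filename=/mnt/yrfs/fio.verify --size=10G --bs=4k --rw=randwrite --direct=1 --ioengine=libaio --iodepth=16 --verify=crc32c --do_verify=1 --verify_fatal=1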

Storage performance

Storage performance testing requires care, and the emphasis differs between block storage and file storage.

Block storage

fio and iozone are two typical test tools, focusing on IOPS, latency, and bandwidth. Taking fio as an example, the test command is as follows:

fio --filename=/dev/sdc --iodepth=${iodepth} --direct=1 --bs=${bs} --size=100% --rw=${iotype} --thread --time_based --runtime=600 --ioengine=${ioengine} --group_reporting --name=fioTest

Focus on a few key parameters: iodepth, bs, rw, and ioengine.

To test IOPS: iodepth=32/64/128, bs=4k/8k, rw=randread/randwrite, ioengine=libaio

To test latency: iodepth=1, bs=4k/8k, rw=randread/randwrite, ioengine=sync

To test bandwidth: iodepth=32/64/128, bs=512k/1m, rw=read/write, ioengine=libaio

File storage

fio, vdbench, and mdtest are common tools for testing file systems. fio and vdbench are used to evaluate IOPS, latency, and bandwidth, while mdtest evaluates file system metadata performance. Taking fio and mdtest as examples, the test commands are as follows:

fio --filename=/mnt/yrfs/fio.test --iodepth=1 --direct=1 --bs=${bs} --size=500G --rw=${iotype} --numjobs=${numjobs} --time_based --runtime=600 --ioengine=sync --group_reporting --name=fioTest

The main differences from the block storage test parameters are that ioengine is sync and iodepth is replaced by numjobs.

To test IOPS: bs=4k/8k, rw=randread/randwrite, numjobs=32/64

To test latency: bs=4k/8k, rw=randread/randwrite, numjobs=1

To test bandwidth: bs=512k/1m, rw=read/write, numjobs=32/64

mdtest is a tool dedicated to testing file system metadata performance. The main metrics are file creation and stat, and mpirun is used to run it concurrently:

mpirun --allow-run-as-root --mca btl_openib_allow_ib 1 --host yanrong-node0:${slots},yanrong-node1:${slots},yanrong-node2:${slots} -np ${num_procs} mdtest -C -T -d /mnt/yrfs/mdtest -i 1 -I ${files_per_dir} -z 2 -b 8 -L -F -r -u

Storage performance testing should cover not only metrics under normal cluster conditions, but also these scenarios:

Performance when capacity utilization exceeds 70%, or with hundreds of millions of files

Performance after a node or disk failure

Performance during cluster expansion

Container storage features

In addition to the core capabilities of storage (high reliability / high availability / high performance), container storage needs several additional features to keep a production environment stable and available.

Flexvolume/CSI interface support, dynamic / static PV support

Storage quotas. For Kubernetes administrators, storage quotas are essential; otherwise storage consumption gets out of control (see the example quota manifest after this list).

Quality of service (QoS). Without QoS, storage administrators can only hope that the storage exposes enough monitoring metrics to identify the culprit when the cluster is overloaded.
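(A minimal sketch of a per-namespace storage quota; the namespace and the StorageClass name "fast-file" are placeholders, not from the original sharing.)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: dev-team                    # placeholder namespace
spec:
  hard:
    persistentvolumeclaims: "20"         # max number of PVCs in the namespace
    requests.storage: 500Gi              # total storage requested by all PVCs
    fast-file.storageclass.storage.k8s.io/requests.storage: 100Gi   # cap for one StorageClass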

An ever-changing choice

A Kubernetes persistent storage solution comes down to the storage itself and its container support, so the storage's core capabilities and its support for container scenarios should be considered first. To summarize this article, the selection criteria in priority order are:

The three cores of storage: high reliability, high availability, and high performance

Business scenario: choose distributed file storage

Scalability: the storage can scale out to meet business growth

Operability: operating storage is as hard as developing it, so choose a product that is convenient to operate and maintain

Cost

Q&A

Q: Hello, our company uses GlusterFS storage with three disks mounted. We now face highly concurrent writes of small files (4 KB), and throughput will not go above 5 MB/s. Is there any better monitoring tool or method? Thank you!

A: GlusterFS itself handles small files very poorly; it was designed for backup scenarios and is not recommended for small file workloads. If possible, either optimize the application to merge small files, or choose a high-performance distributed file system. Lustre and YRCloudFile are worth looking at.

Q: We are using Ceph distributed storage. We currently have a customer video storage scenario with continuous writes of small files, and we are seeing frame drops. After OS-level and underlying file system tuning, plus Ceph parameter tuning, performance can barely be improved, and we do not know how performance will hold up as data volume grows (the customer tested on bare disks; performance under the previous soft RAID setup was acceptable). Do you have any suggestions? Thank you! Our customer's situation is special: a specific hardware model with 5400 RPM SATA disks, using RBD block storage. Another phenomenon is uneven disk utilization, which also hurts performance, even though we keep adjusting the number of PGs. As an extra question, can bcache use memory for caching? And which is better, bcache or flashcache?

A: Are you using CephFS or RBD? Ceph's performance falls short because its I/O path has many queues, which makes latency unstable; at this point you can only live with it. That said, using bcache as a caching layer is recommended and can effectively relieve the performance problem. Although the CRUSH algorithm is much better than consistent hashing, it has no metadata, so hot-disk problems are still hard to control. flashcache is no longer maintained, while bcache still has a team maintaining it, so unless you can maintain the code yourself, choose bcache.

Q: You recommend distributed file storage. Can a file system meet the needs of database applications? Wouldn't block storage be better?

A: First of all, I recommend a high-performance distributed file system. Databases are generally latency sensitive: an ordinary 10 GbE network with HDDs is definitely not enough; with SSDs, latency can usually be kept stable within a millisecond, which meets most requirements. For stricter latency requirements, an NVMe + RoCE setup can keep latency within 300 microseconds even under heavy load.

Q: Why does block storage not support RWX? Does RWX mean that multiple nodes mount the same device and read and write at the same time? A lot of FC storage can do that.

A: A traditional SAN needs the ALUA mechanism to support RWX, and that is multi-writer access at the block level. Once a file system is layered on top it no longer works; a distributed file system is required to synchronize the file metadata.

Q: Traditional SANs support parallel reads and writes to the same data blocks. Many active-active (AA) arrays do not use ALUA; multiple paths carry I/O at the same time, using multipath software of course. On the contrary, it is the non-AA arrays that use ALUA.

A: AA arrays solve the high availability problem. Concurrent reads and writes to the same LUN require chunk-level locks to guarantee data consistency, which results in poor performance.

Q: Many traditional commercial storage products, including block storage, have also built CSI plugins. For high-performance workloads running in containers, is such commercial block storage more suitable than file storage?

A: In a production environment, I strongly recommend commercial storage. Whether block or file storage depends on your business scenario; the first choice is commercial file storage.

Q: In the current Kubernetes ecosystem, is there a relatively good open source distributed storage solution for RWX scenarios with a large number of small files?

A: None of the open source distributed file storage projects handles a large number of small files well. As analyzed in this article, the mainstream open source file systems were all designed for backup scenarios or HPC from the start.

Q: Excuse me, is there any evidence for Ceph's poor performance?

A: To speak directly with data: the latency we measured with NVMe + Ceph + BlueStore is above a millisecond and very unstable, whereas with YRCloudFile + NVMe + RoCE the latency is around 50 microseconds, a difference of dozens of times.

Q: I have not used Lustre. Is its performance good? Has it been compared with Ceph?

A: There are plenty of Lustre performance numbers on the Internet; under the same configuration its performance is definitely better than Ceph's. However, Lustre lives entirely in the kernel, cannot be used in container scenarios, and is very difficult to deploy and operate. Lustre is widely used in supercomputing.

Q: Can Lustre only rely on local disk arrays to ensure data redundancy?

A: Lustre itself does not provide redundancy; it relies entirely on the local array, although erasure coding (EC) appears to be on the development roadmap.

Q: (For a small company) if commercial storage is not chosen, which open source implementation would you recommend for production storage (reliable, high performance)? We tried NFS before and found the speed unstable.

A: There are many storage startups in China, and they are not expensive. Storage is unlike other projects: it cannot tolerate trouble and must be stable and reliable. Even though Ceph and GlusterFS have been around for a long time, buyers still tend to rely on a commercial company behind them; running open source projects on your own in production is too risky.

The above content is compiled from a WeChat group sharing session held on the evening of January 10, 2019. The sharer, Zhang Wentao, is a storage architect at Beijing YanRong Technology, responsible for the architecture design and development of its container storage products.

About YanRong Cloud

YanRong Cloud is a high-tech enterprise whose core competence is software-defined storage technology. It holds independent intellectual property in key technologies such as distributed storage and is an industry leader in high-performance distributed storage solutions. Based on the business characteristics of different industries, it builds tailored industry solutions and provides one-stop products and services. YanRong Cloud products have served customers in finance, government, manufacturing, the Internet, and other industries. For more information, please visit the official website http://www.yanrongyun.com.
