This article introduces the performance and advantages of UCloud's cloud disks and, drawing on experience accumulated in the industry, walks through the recent architecture upgrade. We hope it is helpful in practical applications.
Key points of architecture upgrade
After analyzing the problems and requirements at this stage, we sorted out the specific goals of this architecture upgrade:
1. Remove the limitations of the original software architecture that prevent it from fully exploiting the hardware.
2. Support SSD cloud disks with a QoS guarantee, fully exploiting the IOPS and bandwidth of the back-end NVMe physical disks; a single cloud disk can reach 24,000 IOPS.
3. Support larger cloud disks of 32 TB or more.
4. Fully scatter IO traffic to reduce hotspot problems.
5. Support creating thousands of cloud disks and mounting thousands of cloud disks concurrently.
6. Support online migration from old cloud disks to new ones, including migrating ordinary cloud disks to SSD cloud disks online.
Putting the new architecture into practice
Modification 1: IO path optimization
In the old architecture, the IO path has three layers: the Client on the host, the Proxy in the middle, and the storage Chunk layer. The Proxy is responsible for obtaining and caching IO routes, forwarding reads and writes to the storage layer, and writing the three replicas of each IO.
In the new architecture, route acquisition is handed over to the Client, which reads and writes directly against the storage Chunk layer, and the three-replica write is handled by the Chunk layer itself. The IO path therefore has only two layers: the Client on the host and the storage Chunk layer.
After the upgrade, a read IO takes a single network request straight to the back-end storage node, compared with two hops in the old architecture. For a write IO, the primary replica is reached with one direct network request and the other two replicas take one extra hop forwarded by the primary, whereas in the old architecture all three replicas took two hops. Average read latency drops by 0.2-1 ms, write tail latency is reduced, and overall latency improves noticeably.
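The following is a minimal sketch of this write path, assuming the client sends a write only to the primary chunk node, which forwards it to two secondaries and acknowledges after all three copies are applied. ChunkNode, WriteReq and the in-memory store are illustrative names, not UCloud's actual interfaces.

```go
// Sketch of the new-architecture write path: client -> primary chunk node,
// primary -> two secondary replicas, ack after all three copies are written.
package main

import (
	"fmt"
	"sync"
)

type WriteReq struct {
	ShardID uint64
	Offset  int64
	Data    []byte
}

type ChunkNode struct {
	name  string
	store map[uint64][]byte
	mu    sync.Mutex
}

func (c *ChunkNode) applyLocal(req WriteReq) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.store[req.ShardID] = append([]byte(nil), req.Data...) // stand-in for a durable write
}

// WritePrimary is called by the client; replication happens inside the primary.
func (c *ChunkNode) WritePrimary(req WriteReq, secondaries []*ChunkNode) error {
	c.applyLocal(req)
	var wg sync.WaitGroup
	for _, s := range secondaries {
		wg.Add(1)
		go func(s *ChunkNode) { // one forwarding hop per secondary replica
			defer wg.Done()
			s.applyLocal(req)
		}(s)
	}
	wg.Wait() // acknowledge the client only after all replicas are written
	return nil
}

func main() {
	newNode := func(n string) *ChunkNode { return &ChunkNode{name: n, store: map[uint64][]byte{}} }
	p, s1, s2 := newNode("chunk-a"), newNode("chunk-b"), newNode("chunk-c")
	// The client talks to the primary directly; there is no Proxy layer in between.
	_ = p.WritePrimary(WriteReq{ShardID: 7, Offset: 0, Data: []byte("hello")}, []*ChunkNode{s1, s2})
	fmt.Println("replicas written:", len(p.store), len(s1.store), len(s2.store))
}
```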
Modification 2: metadata sharding
In distributed storage, data is split into shards, and each shard is stored in the cluster as multiple scattered replicas. For example, a 200 GB cloud disk has 200 shards when the shard size is 1 GB, which is what the old architecture used. In practice we found that some services concentrate their IO hotspots in a small range: with 1 GB shards, performance on ordinary SATA disks becomes very poor, and on SSD cloud disks the IO traffic cannot be spread evenly across the storage nodes.
In the new architecture we support 1 MB shards. With 1 MB shards the capability of the whole cluster can be fully utilized, and for high-performance storage, where solid-state disks perform far better and business IO hotspots concentrate in a small range, this yields better performance.
However, UCloud's metadata is pre-allocated and pre-mounted: when a cloud disk is requested, the system allocates all of its metadata and loads it into memory. With very small shards, the amount of metadata that must be allocated or mounted at once becomes huge, so requests easily time out and fail.
For example, if 100 cloud disks of 300 GB are requested at the same time, 1 GB shards require allocating 30,000 pieces of metadata at once, while 1 MB shards require allocating 30,000,000 pieces at once.
To solve the allocation and mount failures caused by the shard size, we first tried changing the allocation strategy: when the cloud disk is mounted, only the metadata already allocated is loaded into memory. On IO, if the range hit already has an assigned route, the IO proceeds using the route in memory; if not, the metadata module is asked to assign a route on the fly, and that route is then cached in memory.
But if 100 cloud disks of 300 GB are requested, mounted and hit with IO at the same time, each generates roughly 1,000 IOPS under load, and the worst case triggers 1,000 × 100 = 100,000 metadata allocations. That is still a large cost on the IO path.
Finally, in the new architecture, we abandoned storing shard metadata in a central node and instead compute routes with a unified set of rules.
In this scheme, the Client and the cluster back end use the same computing rule R (shard size, number of PGs, mapping method, conflict rules). When a cloud disk is requested, the metadata node uses this rule quadruple only to check whether capacity is sufficient; when the cloud disk is mounted, the Client obtains the rule quadruple from the metadata node; on IO, the route is computed locally from rule R and the IO is issued directly. With this scheme, metadata allocation and mounting remain unimpeded even with 1 MB shards, and the metadata cost is removed from the IO path.
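Below is a minimal sketch of rule-based routing, assuming a shard's route can be derived purely from a shared rule R = (shard size, PG count, mapping method, conflict rule) plus the disk ID and IO offset, with no metadata lookup on the IO path. The hash and the probe-forward conflict rule are illustrative, not UCloud's actual mapping.

```go
// Deterministic route computation shared by the Client and the storage side.
package main

import (
	"fmt"
	"hash/fnv"
)

type Rule struct {
	ShardSize int64  // e.g. 1 MiB shards
	PGCount   uint32 // number of placement groups
	Replicas  int    // copies per shard
	NodeCount int    // storage nodes in the cluster
}

// pg maps (disk, offset) to a placement group deterministically.
func (r Rule) pg(diskID string, offset int64) uint32 {
	shard := offset / r.ShardSize
	h := fnv.New32a()
	fmt.Fprintf(h, "%s/%d", diskID, shard)
	return h.Sum32() % r.PGCount
}

// Route returns the replica node indices for an IO. Because both sides run the
// same function with the same rule R, they agree without asking a metadata node.
func (r Rule) Route(diskID string, offset int64) []int {
	pg := r.pg(diskID, offset)
	nodes := make([]int, 0, r.Replicas)
	for i := 0; i < r.Replicas; i++ {
		// simple conflict rule: place replicas on consecutive nodes after the PG
		nodes = append(nodes, int((pg+uint32(i))%uint32(r.NodeCount)))
	}
	return nodes
}

func main() {
	r := Rule{ShardSize: 1 << 20, PGCount: 4096, Replicas: 3, NodeCount: 60}
	fmt.Println(r.Route("disk-001", 5<<20)) // route of the sixth 1 MiB shard
}
```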
Modification 3: supporting high-performance SSD cloud disks
The comparison above shows that NVMe solid-state disks perform roughly a hundred times better than mechanical disks, but matching software design is needed to exploit that capability.
SSD cloud disks provide a QoS guarantee, with single-disk IOPS = min{1200 + 30 × capacity(GB), 24000}. The traditional single-threaded model becomes the bottleneck here: it cannot drive the hundreds of thousands of IOPS and 1-2 GB/s of bandwidth offered by the back-end NVMe disks, so we adopted a multi-threaded model.
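A small sketch reading off the QoS formula above, assuming the per-disk quota is capped at min{1200 + 30 × capacity_GB, 24000} for SSD cloud disks:

```go
package main

import "fmt"

// ssdDiskIOPS returns the guaranteed IOPS quota for an SSD cloud disk.
func ssdDiskIOPS(capacityGB int) int {
	iops := 1200 + 30*capacityGB
	if iops > 24000 {
		iops = 24000 // single-disk ceiling
	}
	return iops
}

func main() {
	for _, gb := range []int{100, 500, 760, 2000} {
		fmt.Printf("%4d GB -> %5d IOPS\n", gb, ssdDiskIOPS(gb))
	}
}
```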
To launch SSD cloud disks quickly, we kept the traditional TCP network programming model rather than moving to kernel bypass, and reduced CPU consumption through a number of detailed software optimizations.
At present, a single thread can deliver up to 60,000 write IOPS and 80,000 read IOPS, and five threads are basically enough to exploit the capability of an NVMe solid-state drive.
Modification 4: overload protection
For ordinary cloud disks, the software of the new architecture is no longer the bottleneck, but a typical mechanical hard disk can only sustain a concurrent queue depth of roughly 32-128. With a hundred cloud disks on a host, when many IOs hit the same physical HDD at once, io_submit becomes slow or fails because the HDD queue drains slowly. The Client then judges the IOs as timed out and retries them, more and more IO packets pile up in the TCP buffer on the Chunk side, more requests time out in turn, and the system becomes overloaded.
For ordinary cloud disks we therefore control the size of the concurrent submission queue: within the queue budget, all cloud disks are traversed and each disk's IO is issued in turn, as in steps 1, 2 and 3 of the figure above. In the actual code, each cloud disk's share is also weighted by its size.
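A minimal sketch of this submission control, assuming a physical HDD allows only a small in-flight queue and that the cloud disks sharing it are drained round-robin, with larger disks given more slots per pass. The structures are illustrative, not the actual implementation.

```go
package main

import "fmt"

type pendingIO struct{ disk, op string }

type cloudDisk struct {
	name    string
	sizeGB  int
	pending []pendingIO
}

// submitRound issues at most queueDepth IOs in one pass, traversing all cloud
// disks and weighting each disk's share by its capacity.
func submitRound(disks []*cloudDisk, queueDepth int) []pendingIO {
	totalGB := 0
	for _, d := range disks {
		totalGB += d.sizeGB
	}
	issued := make([]pendingIO, 0, queueDepth)
	for _, d := range disks {
		// capacity-weighted share of the queue, at least one slot if the disk has work
		share := queueDepth * d.sizeGB / totalGB
		if share == 0 && len(d.pending) > 0 {
			share = 1
		}
		for i := 0; i < share && len(d.pending) > 0 && len(issued) < queueDepth; i++ {
			issued = append(issued, d.pending[0])
			d.pending = d.pending[1:]
		}
	}
	return issued
}

func main() {
	disks := []*cloudDisk{
		{name: "vol-a", sizeGB: 100, pending: make([]pendingIO, 50)},
		{name: "vol-b", sizeGB: 300, pending: make([]pendingIO, 50)},
	}
	fmt.Println("issued this round:", len(submitRound(disks, 64)))
}
```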
For SSD cloud disks, a traditional single thread would again be the bottleneck, unable to sustain hundreds of thousands of IOPS and 1-2 GB/s of bandwidth.
In stress tests we simulated a scenario where hotspots concentrate on one thread and found that thread running at essentially 99-100% CPU while the other threads stayed idle. We therefore have each thread periodically report its CPU and disk load; when one thread is continuously busy while another is idle, the IO of some disk shards is switched to the idle thread to avoid overloading individual threads.
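A rough sketch of the rebalancing idea: worker threads periodically report CPU utilisation, and shards served by a persistently busy worker are moved to an idle one. The thresholds and structures here are illustrative assumptions.

```go
package main

import "fmt"

type worker struct {
	id     int
	cpuPct float64 // most recently reported utilisation
	shards []uint64
}

// rebalance moves one shard from the busiest worker to the idlest one when the
// busiest is near saturation and the idlest has clear headroom.
func rebalance(workers []*worker) {
	busiest, idlest := workers[0], workers[0]
	for _, w := range workers {
		if w.cpuPct > busiest.cpuPct {
			busiest = w
		}
		if w.cpuPct < idlest.cpuPct {
			idlest = w
		}
	}
	if busiest.cpuPct >= 99 && idlest.cpuPct <= 50 && len(busiest.shards) > 0 {
		moved := busiest.shards[len(busiest.shards)-1]
		busiest.shards = busiest.shards[:len(busiest.shards)-1]
		idlest.shards = append(idlest.shards, moved)
		fmt.Printf("moved shard %d from worker %d to worker %d\n", moved, busiest.id, idlest.id)
	}
}

func main() {
	ws := []*worker{
		{id: 0, cpuPct: 100, shards: []uint64{11, 12, 13}},
		{id: 1, cpuPct: 20, shards: []uint64{21}},
	}
	rebalance(ws) // in practice this would run on a timer driven by reported stats
}
```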
Modification 5: online migration
Ordinary cloud disks on the old architecture perform poorly, and some of their users' businesses are growing quickly; they want to migrate from ordinary cloud disks to SSD cloud disks to meet higher demands. There are currently two old-architecture deployments online. To deliver online migration quickly, in the first phase we decided to support it from the periphery of the system.
The migration process is as follows:
1. The back end sets a migration flag.
2. The Qemu connection is reset to the Trans Client.
3. Write IO flows through the Trans Client to the Trans module, which double-writes it: one copy to the old architecture and one to the new architecture.
4. A background task traverses the disk in 1 MB units, issuing commands that relocate the data to the new architecture through Trans; until a range has been relocated, read IO goes from Trans to the old-architecture Proxy.
5. When all relocation is complete, the Qemu connection is reset to the new-architecture Client, completing the online migration.
Adding the Trans layer and double-writing costs some performance during the migration, which is acceptable for ordinary cloud disks. We are also building Journal-based online migration for the new architecture itself, with the goal of keeping the performance impact below 5% while a migration is in progress.
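A minimal sketch of the Trans double-write step above, assuming every incoming write during migration is applied to both the old-architecture back end and the new one, while reads fall back to the old path for ranges not yet relocated. Backend and the in-memory stores are illustrative stand-ins, not the real modules.

```go
package main

import "fmt"

type Backend interface {
	Write(off int64, data []byte) error
	Read(off int64, n int) ([]byte, error)
}

type memBackend struct {
	name string
	data map[int64][]byte
}

func (m *memBackend) Write(off int64, d []byte) error {
	m.data[off] = append([]byte(nil), d...)
	return nil
}

func (m *memBackend) Read(off int64, n int) ([]byte, error) { return m.data[off], nil }

type trans struct {
	oldBE, newBE Backend
	relocated    map[int64]bool // which 1 MB-aligned ranges have been moved already
}

func (t *trans) Write(off int64, d []byte) error {
	if err := t.oldBE.Write(off, d); err != nil { // double write: old architecture first...
		return err
	}
	return t.newBE.Write(off, d) // ...then the new architecture
}

func (t *trans) Read(off int64, n int) ([]byte, error) {
	if t.relocated[off] {
		return t.newBE.Read(off, n)
	}
	return t.oldBE.Read(off, n) // not relocated yet: read via the old Proxy path
}

func main() {
	t := &trans{
		oldBE:     &memBackend{name: "old", data: map[int64][]byte{}},
		newBE:     &memBackend{name: "new", data: map[int64][]byte{}},
		relocated: map[int64]bool{},
	}
	_ = t.Write(0, []byte("block"))
	b, _ := t.Read(0, 5)
	fmt.Printf("read %q during migration\n", b)
}
```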
After the series of modifications above, the new cloud disk architecture has basically met its initial upgrade goals. It has been officially launched and is serving production workloads. Finally, here is some of the work we are still pursuing.
1. Unlimited capacity expansion
Each availability zone contains multiple storage clusters (Sets), each providing about 1 PB of storage (we deliberately do not let a single cluster grow without bound). When a cloud disk in Set1 needs to expand from 1 TB to 32 TB or 100 TB, Set1 may not have enough capacity.
We therefore plan to split the logical disk requested by the user into Parts, each of which can be allocated in any Set that still has free capacity, so that capacity can expand without limit.
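A short sketch of this idea, assuming the logical disk is carved into fixed-size Parts and each Part is placed in the first Set with enough free space; the sizes and the first-fit policy are illustrative assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

type Set struct {
	Name   string
	FreeGB int64
}

type Part struct {
	Index  int
	SizeGB int64
	Set    string
}

// allocateParts carves diskGB into Parts of at most partGB and places each in
// the first Set that still has enough free capacity.
func allocateParts(diskGB, partGB int64, sets []*Set) ([]Part, error) {
	var parts []Part
	for idx, remaining := 0, diskGB; remaining > 0; idx++ {
		size := partGB
		if remaining < partGB {
			size = remaining
		}
		placed := false
		for _, s := range sets {
			if s.FreeGB >= size {
				s.FreeGB -= size
				parts = append(parts, Part{Index: idx, SizeGB: size, Set: s.Name})
				placed = true
				break
			}
		}
		if !placed {
			return nil, errors.New("no Set has enough free capacity")
		}
		remaining -= size
	}
	return parts, nil
}

func main() {
	sets := []*Set{{Name: "Set1", FreeGB: 2048}, {Name: "Set2", FreeGB: 102400}}
	parts, err := allocateParts(32*1024, 1024, sets) // a 32 TB disk in 1 TB Parts
	fmt.Println(len(parts), "parts placed,", err)
}
```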
2. Ultra-high performance storage
Over the past 10 years, hard disks have evolved from HDD to SATA SSD to NVMe SSD, and network interfaces have leapt from 10G to 25G to 100G. CPU clock speeds, however, have barely changed, averaging 2-3 GHz. A physical machine we use can host 6-8 NVMe disks, which means a single machine can provide 3 to 5 million IOPS.
Under the traditional cloud server software model, a TCP-based epoll loop, the network card's packet handling and the disk reads and writes go through multiple copies and switches between user mode and kernel mode and must be woken by kernel interrupts, so the software struggles to squeeze the full capability out of the hardware. For IOPS and latency this means IOPS can only be raised by adding threads, yet it does not grow linearly with the thread count, and latency jitter is high.
We hope to optimize the three IO paths in the figure above by introducing zero copy, user-mode drivers and polling, reducing the copies and switches across user mode, kernel mode and the protocol stack, and combining this with polling to squeeze out the full capability of the hardware.
In the end we chose three technologies: RDMA, VHOST and SPDK.
Option 1: ultra-high performance storage - VHOST
In the traditional model, an IO is driven from the virtual machine through Qemu and then over a Unix Domain Socket to the Client, crossing between user mode and kernel mode many times and incurring copies along the IO path.
With VHOST User mode, data can be transferred from the VM to the Client over shared memory entirely in user mode. In practice we build on SPDK VHOST.
In the R&D environment, we simulated this by having the Client acknowledge each IO request immediately after receiving it, without sending data to the storage back end; the results are shown in the figure above: single-queue latency can be reduced by 90 µs and IOPS improved by tens of times.
Option 2: ultra-high performance storage - RDMA+SPDK
RDMA provides a messaging service through which applications can directly access virtual memory on remote machines. It reduces CPU usage and the memory-bandwidth bottleneck, delivers high bandwidth, and uses stack bypass and zero-copy techniques to provide low latency.
SPDK accesses NVMe solid-state disks directly from user mode, with high concurrency and zero copy, and uses polling to avoid the overhead of kernel context switches and interrupt handling.
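The following is a conceptual sketch (not SPDK code) of the polling model mentioned above: a dedicated core spins on a completion queue instead of sleeping and being woken by an interrupt, trading CPU for lower and more predictable latency. The queue layout is an assumption for illustration only.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

type completion struct{ id uint64 }

type completionQueue struct {
	slots [256]atomic.Pointer[completion]
	head  uint64
}

// poll busy-checks the next slot; it never blocks or waits for an interrupt.
func (q *completionQueue) poll() *completion {
	c := q.slots[q.head%256].Swap(nil)
	if c != nil {
		q.head++
	}
	return c
}

func main() {
	q := &completionQueue{}
	go func() { // stand-in for the device: posts a completion a little later
		time.Sleep(time.Millisecond)
		q.slots[0].Store(&completion{id: 42})
	}()
	for { // the polling loop that would own one CPU core
		if c := q.poll(); c != nil {
			fmt.Println("completed IO", c.id)
			return
		}
	}
}
```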
The team is currently developing a storage engine framework based on RDMA and SPDK. In a test environment with a single NVMe solid-state disk on the back end, the framework already shows clear improvements in single-queue latency and IOPS.
The full solution, with SPDK VHOST USER on the Client side and RDMA+SPDK on the storage side, is expected to reach public beta in December.