Practical information | How does Alibaba make the dream of "scale up before the peak, scale down after the peak" come true?

2026-02-10 Update. From: SLTechnology News&Howtos

Double 11 has come to a successful conclusion, but the exploration of technology has not stopped.

"Alibaba Database Technology" is launching a series of in-depth articles, "Decisive Battle of Double 11 2018: Alibaba Database & Storage Technology Revealed", which tells the technical story behind the record figure of 213.5 billion. Stay tuned!

Guide: Faced with continuous growth in database instances and storage scale, and with pressure on big-promotion cost and scheduling efficiency, Alibaba began applying a new architecture, "storage-compute separation", on a small scale in 2017 to support elasticity at lower cost.

In 2018, we deployed storage-compute separation at scale across full units. Before that could happen, each business had to do a great deal of optimization for its own characteristics. This is especially true for databases, where the requirements are stricter and the work is harder. Lu Jian, a senior technical expert in Alibaba's Storage Technology Division, explains in detail how Alibaba broke down the technical barrier between storage and compute for the 2018 Double 11 promotion, achieving elastic scale-up and scale-down without moving data.

I. What did we do in 2017?

I recall that as early as 2017, Dr. Wang Jian prompted a heated discussion about whether "IDC as a Computer" could be achieved. Achieving it requires separating storage from compute, so that computing and storage resources can be scheduled independently and freely. Of all the businesses that adopt storage-compute separation, the database is the hardest, because it places very high demands on I/O latency and stability. Still, from an industry point of view, storage-compute separation is where the technology is heading: systems such as Google Spanner and Aurora have already implemented it.

So in 2017 we were determined to bring storage-compute separation to the database. And we did: in 2017, running on Pangu and AliDFS (a Ceph branch), storage-compute-separated databases in the Zhangbei unit carried 10% of the transaction volume. 2017 was the first year of storage-compute separation for the database, and it laid a solid foundation for the large-scale rollout in 2018.

II. 2018 technological breakthroughs

If 2017 was the year databases made the breakthrough to storage-compute separation, then 2018 was the year of pursuing extreme performance and of moving from experiment to large-scale deployment, and the technical challenges were correspondingly greater. Building on 2017, we had to make storage-compute separation faster, more general, and simpler.

In 2018, to get the best database performance and throughput under storage-compute separation, we developed DADI DBFS, a user-mode cluster file system. By consolidating our technology in DADI DBFS, we enabled the group's transaction databases to run storage-compute separation at scale across full units. So what technical innovations did DBFS make as it became a storage middle-platform product?

2.1 User-mode technology

2.1.1 "Zero" copy

We implement a "zero-copy" I/O path directly in user mode, bypassing the kernel. This avoids copying data into and out of the kernel, which greatly improves throughput and latency.

Previously, on the kernel-mode path, there were two data copies: one from the business process in user mode into the kernel, and one from the kernel to the user-mode network forwarding process. These two copies hurt overall throughput and latency.

After switching to pure user mode, we use a polling model to issue I/O requests. To limit the CPU cost of polling, we use adaptive sleep so that cores are not wasted when the system is idle.
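The article does not say how the adaptive sleep decides when to back off, so the following is only a minimal sketch of one common approach: spin for a fixed budget of empty polls, then sleep for an exponentially growing interval until work arrives. The names `poller_t` and `poller_next_sleep` and all three constants are hypothetical, not DBFS identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning knobs; the article gives no real values. */
#define SPIN_BUDGET   1000u      /* empty polls tolerated before sleeping */
#define MIN_SLEEP_NS  1000u      /* first sleep: 1 microsecond */
#define MAX_SLEEP_NS  1000000u   /* cap: 1 millisecond */

typedef struct {
    uint32_t idle_polls;  /* consecutive polls that found no work */
    uint32_t sleep_ns;    /* current sleep length; 0 while spinning */
} poller_t;

/* Call after every poll. Returns how long the caller should sleep
 * (in nanoseconds) before polling again; 0 means "keep spinning". */
uint32_t poller_next_sleep(poller_t *p, int found_work)
{
    assert(p);
    if (found_work) {                     /* busy: back to pure polling */
        p->idle_polls = 0;
        p->sleep_ns = 0;
        return 0;
    }
    if (p->idle_polls < SPIN_BUDGET) {    /* briefly idle: keep spinning */
        p->idle_polls++;
        return 0;
    }
    if (p->sleep_ns == 0) {               /* idle for a while: start sleeping */
        p->sleep_ns = MIN_SLEEP_NS;
    } else {                              /* still idle: double, up to the cap */
        p->sleep_ns *= 2;
        if (p->sleep_ns > MAX_SLEEP_NS)
            p->sleep_ns = MAX_SLEEP_NS;
    }
    return p->sleep_ns;
}
```

A polling thread would call `poller_next_sleep` after each pass over its queues and hand any non-zero result to something like `nanosleep`. Under load the function always returns 0, so the fast path stays a pure spin.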

2.1.2 RDMA

In addition, DBFS uses RDMA to exchange data directly with Pangu storage, achieving latency close to that of a local SSD along with higher throughput. This made extremely low latency across the network possible this year and laid a strong foundation for large-scale storage-compute separation. The RDMA cluster the group used for this year's promotion can be said to be the largest in the industry.

2.2 Page cache

To support buffered I/O, we implemented our own page cache. The page cache uses a touch-count-based LRU algorithm. Touch count was introduced to better match database I/O patterns: databases often perform operations such as large table scans, and we do not want these rarely reused pages to flush the LRU and hurt its efficiency. Pages are moved between the hot end and the cold end based on their touch count.

The page size of the page cache is configurable; matching it to the database's page size gives better cache efficiency. Overall, DBFS's page cache has the following capabilities:

Migration of pages between the hot and cold ends based on touch count

Configurable hot/cold proportion; the current ratio of hot to cold is 2:8

Configurable page size, so it can be tuned to match the database page size

Multiple shards to improve concurrency; overall capacity is configurable
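To make the scan-resistance argument concrete, here is a small sketch of a touch-count LRU with a 2:8 hot/cold split. It is an illustration under stated assumptions, not the DBFS implementation: the array layout, the promotion threshold `PROMOTE_AT`, inserting new pages at the head of the cold end, and every name in the code are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define CAPACITY    10   /* total cached pages (hypothetical) */
#define HOT_SLOTS    2   /* 2:8 hot:cold split, as in the article */
#define PROMOTE_AT   2   /* touches needed to reach the hot end (assumed) */

typedef struct { long page; unsigned touches; } slot_t;

typedef struct {
    slot_t slots[CAPACITY];  /* index 0 = hottest, used-1 = coldest */
    int used;
} cache_t;

static int cache_find(const cache_t *c, long page)
{
    for (int i = 0; i < c->used; i++)
        if (c->slots[i].page == page)
            return i;
    return -1;
}

/* Move slots[from] up to index `to` (to <= from), shifting the rest down. */
static void move_up(cache_t *c, int from, int to)
{
    slot_t s = c->slots[from];
    memmove(&c->slots[to + 1], &c->slots[to],
            (size_t)(from - to) * sizeof(slot_t));
    c->slots[to] = s;
}

/* Access a page. Returns 1 on a hit, 0 on a miss (the page is then cached). */
int cache_access(cache_t *c, long page)
{
    assert(c->used >= 0 && c->used <= CAPACITY);
    int i = cache_find(c, page);
    if (i >= 0) {
        c->slots[i].touches++;
        if (i >= HOT_SLOTS && c->slots[i].touches >= PROMOTE_AT) {
            c->slots[i].touches = 0;  /* must earn its place again */
            move_up(c, i, 0);         /* cold end -> head of hot end */
        }
        return 1;
    }
    if (c->used == CAPACITY)
        c->used--;                    /* evict the coldest page (the tail) */
    /* New pages enter at the head of the cold end, never the hot end. */
    int pos = c->used < HOT_SLOTS ? c->used : HOT_SLOTS;
    c->slots[c->used] = (slot_t){ page, 1 };
    c->used++;
    move_up(c, c->used - 1, pos);
    return 0;
}
```

Because a page touched only once never leaves the cold end, a long table scan recycles the cold slots and leaves the hot pages in place. Promoting a page to index 0 pushes the last hot page down into the cold end, which is how demotion happens in this sketch.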

2.3 Asynchronous I/O

To improve I/O throughput, most databases use asynchronous I/O. To be compatible with the I/O behavior of the databases above it, we implemented asynchronous I/O. Its features:

Lock-free queue implementation

Configurable I/O depth, which enables precise latency control for different database types

Adaptive polling to reduce CPU consumption
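The article names a lock-free queue with a configurable I/O depth but gives no design details. As one possible reading, the sketch below is a single-producer, single-consumer ring buffer built on C11 atomics, where the ring capacity plays the role of the I/O depth. The request layout and all names (`io_req_t`, `ioq_submit`, `ioq_poll`) are hypothetical, and a real implementation may well support multiple producers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { long offset; size_t len; } io_req_t;  /* hypothetical request */

typedef struct {
    io_req_t *ring;
    size_t depth;          /* configurable I/O depth (ring capacity) */
    _Atomic size_t head;   /* next slot to consume; written by the consumer only */
    _Atomic size_t tail;   /* next slot to fill; written by the producer only */
} io_queue_t;

int ioq_init(io_queue_t *q, size_t depth)
{
    assert(depth > 0);
    q->ring = calloc(depth, sizeof *q->ring);
    q->depth = depth;
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    return q->ring ? 0 : -1;
}

/* Producer side: returns 0 on success, -1 if `depth` requests are in flight. */
int ioq_submit(io_queue_t *q, io_req_t req)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == q->depth)
        return -1;                         /* full: caller must back off */
    q->ring[tail % q->depth] = req;
    /* release: the slot contents become visible before the new tail does */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 and fills *out, or -1 if the queue is empty. */
int ioq_poll(io_queue_t *q, io_req_t *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return -1;
    *out = q->ring[head % q->depth];
    /* release: the slot is handed back only after it has been copied out */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return 0;
}
```

Capping the number of in-flight requests is what ties depth to latency: a smaller depth bounds how long a request can wait behind others, while a larger depth favors throughput. Neither side takes a lock, since each index has exactly one writer.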

2.4 Atomic write

The atomic write function guarantees that no partial write occurs when a database page is written out. InnoDB running on DBFS can therefore safely turn off the doublewrite buffer, saving all (100%) of the bandwidth that doublewrite would otherwise consume under storage-compute separation.

In addition, for databases such as PostgreSQL that use buffered I/O, atomic write also avoids the occasional page-failure problem that PG encounters when flushing dirty pages.

2.5 Online resize

To avoid the data migration that expansion would otherwise cause, DBFS works with the underlying Pangu to resize volumes online. DBFS has its own bitmap allocator to manage the underlying storage space. We optimized the bitmap allocator to make resize lock-free at the file-system level, so upper-layer businesses can expand efficiently and without disruption at any time, which is clearly better than the traditional ext4 file system.

Online resize also avoids wasted storage space: because capacity can be added on demand, there is no need to reserve, for example, 20% of storage as headroom.
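The reason a bitmap allocator can grow without moving data is that expansion only appends free bits, while the bits describing existing blocks stay where they are. The sketch below shows that idea in its simplest form. It is single-threaded and does not attempt to reproduce the lock-free resize described above. The names `bitmap_t`, `bm_alloc`, and `bm_grow` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t *bits;     /* one bit per block: 1 = allocated, 0 = free */
    size_t   nblocks;  /* number of blocks currently managed */
} bitmap_t;

int bm_init(bitmap_t *bm, size_t nblocks)
{
    assert(nblocks > 0);
    bm->bits = calloc((nblocks + 7) / 8, 1);
    bm->nblocks = nblocks;
    return bm->bits ? 0 : -1;
}

/* Allocate one free block; returns its index, or -1 if the volume is full. */
long bm_alloc(bitmap_t *bm)
{
    for (size_t i = 0; i < bm->nblocks; i++) {
        if (!(bm->bits[i / 8] & (1u << (i % 8)))) {
            bm->bits[i / 8] |= (uint8_t)(1u << (i % 8));
            return (long)i;
        }
    }
    return -1;
}

/* Grow the volume in place. Existing bits, and therefore existing data
 * placement, are untouched; only new all-free bits are appended. */
int bm_grow(bitmap_t *bm, size_t new_nblocks)
{
    assert(new_nblocks >= bm->nblocks);
    size_t old_bytes = (bm->nblocks + 7) / 8;
    size_t new_bytes = (new_nblocks + 7) / 8;
    uint8_t *p = realloc(bm->bits, new_bytes);
    if (!p)
        return -1;
    memset(p + old_bytes, 0, new_bytes - old_bytes);  /* new blocks are free */
    bm->bits = p;
    bm->nblocks = new_nblocks;
    return 0;
}
```

After `bm_grow`, allocations that previously failed simply continue into the new range, which is why a full volume can be extended at the moment it is needed instead of being over-provisioned in advance.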

[Figure: bitmap changes during capacity expansion]

2.6 Switching between TCP and RDMA

Introducing RDMA at scale in the group's databases was also a significant risk. Together with Pangu, DBFS implemented switching between RDMA and TCP and rehearsed the switch during full-link stress testing, which keeps the RDMA risk under control and strengthens the stability guarantees.

In addition, DBFS, Pangu, and the network team ran extensive capacity and water-level stress tests and fault drills for RDMA, preparing thoroughly for the launch of the industry's largest RDMA deployment.

2.7 Big-promotion deployment in 2018

With these technical breakthroughs in place, DBFS passed the full-link stress tests for the big promotion and then the Double 11 exam itself, completing a demanding task and once again validating both the feasibility of storage-compute separation and the overall technical direction.

III. DBFS as a storage middle platform

In addition to the functions above, which any file system must implement, DBFS also provides many features that make it more general, easier to use, more stable, and more secure for businesses.

3.1 Technical consolidation and enablement

We consolidate all of these technical innovations and functions into DBFS as a product, so that DBFS can let more businesses access different underlying storage media from user mode and help more databases achieve storage-compute separation.

3.1.1 POSIX compatibility

At present, to support database workloads, we are compatible with most commonly used POSIX file interfaces, which makes it easy for upper-layer databases to integrate. We also implement page cache, async I/O, and atomic write, giving databases rich I/O capabilities. In addition, we implement the glibc stream interface to support file-stream operations. Supporting both sets of interfaces greatly reduces the complexity of database integration, makes DBFS easier to use, and lets DBFS support more database services.

The POSIX interfaces are familiar and are not listed here. The following glibc interfaces are for reference only:

// glibc interface

FILE *fopen(const char *path, const char *mode);

FILE *fdopen(int fildes, const char *mode);

size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);

size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);

int fflush(FILE *stream);

int fclose(FILE *stream);

int fileno(FILE *stream);

int feof(FILE *stream);

int ferror(FILE *stream);

void clearerr(FILE *stream);

int fseeko(FILE *stream, off_t offset, int whence);

int fseek(FILE *stream, long offset, int whence);

off_t ftello(FILE *stream);

long ftell(FILE *stream);

void rewind(FILE *stream);
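As a reminder of how a database-style caller typically combines these stream calls, here is a short sketch that writes one page at a given page index, reads it back, and compares. It uses the standard C library on an ordinary file: the article does not show DBFS's own entry points or how paths are routed to it, so nothing here is DBFS-specific. The function name `page_roundtrip` and the 16 KB `DB_PAGE_SIZE` are assumptions for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes fseeko, ftello and off_t */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define DB_PAGE_SIZE 16384  /* assumed 16 KB database page */

/* Write one page at index `pageno`, read it back, and compare.
 * Returns 0 on success, -1 on any failure. */
int page_roundtrip(const char *path, off_t pageno)
{
    static char out[DB_PAGE_SIZE], in[DB_PAGE_SIZE];
    int rc = -1;

    assert(pageno >= 0);
    memset(out, 0xAB, sizeof out);

    FILE *f = fopen(path, "w+b");
    if (!f)
        return -1;
    if (fseeko(f, pageno * DB_PAGE_SIZE, SEEK_SET) != 0) goto done;
    if (fwrite(out, DB_PAGE_SIZE, 1, f) != 1) goto done;
    if (fflush(f) != 0) goto done;            /* push the stream buffer down */
    if (ftello(f) != (pageno + 1) * DB_PAGE_SIZE) goto done;
    if (fseeko(f, pageno * DB_PAGE_SIZE, SEEK_SET) != 0) goto done;
    if (fread(in, DB_PAGE_SIZE, 1, f) != 1) goto done;
    rc = memcmp(in, out, DB_PAGE_SIZE) == 0 ? 0 : -1;
done:
    fclose(f);
    return rc;
}
```

The `fseeko`/`ftello` pair is used here because `off_t` offsets are what large database files need, whereas the `long` used by `fseek`/`ftell` can be too narrow on some platforms.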

3.1.2 FUSE implementation

In addition, for compatibility with the Linux ecosystem, we implemented a FUSE layer to connect with VFS. With FUSE, users who do not need peak performance can access DBFS without any code changes, which greatly improves the product's ease of use. It also makes traditional operations and maintenance work much easier.

3.1.3 Service-oriented capability

DBFS developed the shmQ component, which communicates over shared-memory IPC, extending support to both PostgreSQL's process-based architecture and MySQL's thread-based architecture. This makes DBFS more general and more secure and provides a solid foundation for future online upgrades.

shmQ is lock-free and delivers excellent latency and throughput. In current tests, access latency stays within a few microseconds (µs) for 16 KB and other large database pages. With service-oriented deployment and multi-process support, current performance and stability are in line with expectations.

3.1.4 Cluster file system

The cluster capability is another distinguishing feature of DBFS. It enables databases to run in shared-disk mode, which allows computing resources to scale linearly and saves storage cost for the business. Shared-disk mode also gives the database fast elasticity and greatly improves the SLA for fast primary/standby switchover. The cluster file system supports both one-writer-multiple-reader and multi-writer modes, laying a solid foundation for shared-disk and shared-nothing database architectures. Compared with the traditional OCFS, our user-mode implementation performs better and gives us more autonomous control. OCFS relies heavily on the Linux VFS; for example, it has no independent page cache.

In one-writer-multiple-reader mode, DBFS nodes can take different roles: there is one M (master) node and multiple S (slave) nodes sharing the same data, and the M and S nodes all access the data in Pangu. The upper-layer database constrains the M/S roles: the M node has read-write access, while S nodes have read-only access. If the primary fails, a switchover takes place. The master-slave switchover steps are:

Business monitoring detects that the M node is inaccessible or abnormal and decides whether to switch.

If a switch is needed, the control platform issues the switch command. When the command completes, both DBFS and the upper-layer database have completed the role switch.

During the DBFS switch, the most important action is the I/O fence, which revokes the original M node's I/O capability to prevent double writes.
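The article does not describe how the I/O fence is implemented. One common way to get the same guarantee is an epoch (generation) check on the storage side, sketched below: the fence raises the volume's epoch, and any write stamped with an older epoch is rejected. All names (`volume_t`, `node_t`, `volume_write`, `switchover`) are hypothetical, and the in-memory `volume_t` merely stands in for the shared storage.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t fence_epoch;     /* writes stamped with an older epoch are rejected */
    uint64_t writes_applied;
} volume_t;                   /* stands in for the shared storage */

typedef struct {
    uint64_t epoch;           /* epoch granted when this node became M */
} node_t;

/* Storage-side check: only a holder of the current epoch may write. */
int volume_write(volume_t *v, const node_t *n)
{
    assert(v && n);
    if (n->epoch < v->fence_epoch)
        return -1;            /* fenced: stale master or read-only S node */
    v->writes_applied++;
    return 0;
}

/* Switchover: fence first, then promote. After the fence the old M can no
 * longer write, even if it is still running and unaware of the switch. */
void switchover(volume_t *v, node_t *new_master)
{
    assert(v && new_master);
    v->fence_epoch++;                     /* I/O fence */
    new_master->epoch = v->fence_epoch;   /* role switch */
}
```

The ordering is the point of the sketch: because the fence takes effect before the new master receives its epoch, there is no window in which both nodes hold write access.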

In multi-writer mode, DBFS applies global metalock control across all nodes and optimizes blockgroup allocation. It also involves a disk-based quorum algorithm, which is more complex and is not described in detail here.

3.2 Combining hardware and software

As new storage media emerge, databases will need to exploit them for better performance or lower cost, and to gain autonomous control over the underlying storage media.

Looking at Intel's storage-media roadmap, from performance to capacity there are three products: AEP, Optane, and SSD, with QLC for the high-capacity direction. Weighing overall performance against cost, we consider Optane a good cache product, and we chose it to implement DBFS's persistent file cache on the compute node.

3.2.1 Persistent file cache

DBFS implements a local persistent cache based on Optane, which further improves database read and write performance under storage-compute separation. A lot of work went into making the file cache production-ready, such as:

Stable and reliable fault handling

Support for dynamic enable and disable

Support load balancing

Support performance metrics collection and display

Support data correctness scrub

These functions lay a solid foundation for online stability. I/O to Optane uses SPDK's pure user-mode technology, and DBFS implements it together with Fusion Engine's vhost. The page size of the file cache can be tuned to the block size of the upper-layer database for best results.

[Figure: architecture of the file cache]

[Figure: read and write performance gains measured in testing]

The results labeled "cache" were measured with the file cache enabled. With the cache, overall performance improves and read latency drops. We also monitor many performance metrics for the file cache.

3.2.2 Open Channel SSD

X-Engine, DBFS, and the Fusion Engine team are working together to build an autonomous, controllable storage system based on Object SSD. We have explored and practiced in depth in reducing SSD wear, improving SSD throughput, and reducing interference between reads and writes, with very good results. So far, the read and write paths have been connected with X-Engine's tiered storage strategy, and we look forward to deeper research and development of intelligent storage as the next step.

IV. Summary and outlook

In 2018, DBFS supported X-DB at scale for the "11.11" promotion in storage-compute-separated form. It also enabled ADS, giving it one-writer-multiple-reader capability, as well as Tair.

While supporting these businesses, DBFS itself added support for PG's process architecture and MySQL's thread architecture, connected to the VFS interface, and achieved compatibility with the Linux ecosystem, becoming a true storage middle-platform product: a cluster user-mode file system. In the future, it will combine more hardware-software co-design, tiered storage, NVMe-oF, and other technologies to enable more databases and deliver greater value.
