1. What did we do in 2017?
I remember that as early as 2017, Dr. Wang Jian gathered everyone for a heated discussion on whether "IDC as a Computer" could be achieved. Doing so requires separating storage from computing, so that computing and storage resources can each be scheduled independently and freely. Of all the businesses that could adopt storage-compute separation, the database is the hardest, because it places very high demands on I/O latency and stability. From the industry's point of view, however, storage-compute separation is a clear technology trend: systems such as Google Spanner and Amazon Aurora have already implemented it.
So in 2017 we were firmly committed to achieving storage-compute separation for the database. And we did it that year: running on Pangu and AliDFS (a Ceph branch), the storage-compute-separated deployment in the Zhangbei unit carried 10% of the transaction volume. 2017 was thus the first year of storage-compute separation for the database, and it laid a solid foundation for large-scale rollout in 2018.
2. Technological breakthroughs in 2018
If 2017 was the year the database achieved a breakthrough in storage-compute separation, then 2018 was the year of pursuing extreme performance and of moving from experiment to large-scale deployment, with all the technical challenges that implies. Building on 2017, the bar for 2018 was higher: storage-compute separation had to become higher-performing, more broadly applicable, more general, and simpler.
In 2018, to get the best performance and throughput out of the database under storage-compute separation, we developed DADI DBFS, a user-mode cluster file system. By consolidating our technology into DADI DBFS, we enabled the group's transaction databases to run with storage and computing separated at large scale across all units. So what technical innovations went into DBFS as a storage platform product?
2.1 User-mode technology
2.1.1 "Zero" copy
By running in user mode and bypassing the kernel, we implement a "zero"-copy I/O path directly. This avoids copying data into and out of the kernel, which greatly improves throughput and performance.
Previously, in kernel mode, the data was copied twice: once from the business process in user mode into the kernel, and once from the kernel to the user-mode network forwarding process. Both copies hurt overall throughput and latency.
After switching to pure user mode, we issue I/O requests using a polling model. To limit the CPU cost of polling, we use adaptive sleep so that cores are not wasted while idle.
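The idea of polling with adaptive sleep can be sketched as below. This is only an illustration under assumptions: the article does not describe the actual DBFS policy, so the thresholds (SPIN_LIMIT, SLEEP_MIN_NS, SLEEP_MAX_NS), the exponential backoff, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tuning knobs; the real DBFS values are not given in the article. */
#define SPIN_LIMIT    1000u      /* empty polls tolerated before sleeping */
#define SLEEP_MIN_NS  1000L      /* first sleep: 1 microsecond            */
#define SLEEP_MAX_NS  1000000L   /* upper bound: 1 millisecond            */

typedef struct {
    unsigned idle_polls;  /* consecutive polls that found no work   */
    long     sleep_ns;    /* current sleep length, 0 while spinning */
} adaptive_poller;

/* Called after every poll of the completion queue.
 * found_work: whether the poll completed at least one request.
 * Returns how many nanoseconds the caller should sleep (0 = keep spinning). */
long poller_next_sleep(adaptive_poller *p, bool found_work)
{
    assert(p != NULL);
    if (found_work) {
        /* Busy again: return to pure spinning for the lowest latency. */
        p->idle_polls = 0;
        p->sleep_ns = 0;
        return 0;
    }
    if (++p->idle_polls <= SPIN_LIMIT)
        return 0;                       /* still within the spin budget */
    if (p->sleep_ns == 0) {
        p->sleep_ns = SLEEP_MIN_NS;     /* first sleep after the spin budget */
    } else {
        p->sleep_ns *= 2;               /* back off while the queue stays idle */
        if (p->sleep_ns > SLEEP_MAX_NS)
            p->sleep_ns = SLEEP_MAX_NS;
    }
    return p->sleep_ns;
}
```

A polling thread would call `poller_next_sleep` after each poll and sleep for the returned duration (for example with `nanosleep`), so a busy queue is served by spinning while an idle one costs almost no CPU.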
2.1.2 RDMA
In addition, DBFS uses RDMA to exchange data directly with Pangu storage, achieving latency close to that of a local SSD together with higher throughput. This is what made extremely low cross-network latency possible this year, and it laid a strong foundation for large-scale storage-compute separation. The RDMA cluster the group ran this year can be said to be the largest in the industry.
2.2 Page cache
To support buffered I/O, we implemented our own page cache. It uses a touch-count-based LRU algorithm. Touch count was introduced to fit the I/O behavior of databases: operations such as large table scans are common, and we do not want those rarely reused data pages to flush out the LRU and ruin its efficiency. Pages are moved between the hot end and the cold end according to their touch count.
The page size of the page cache is configurable; matching it to the database page size gives better cache efficiency. Overall, the DBFS page cache has the following capabilities:
● Migration of pages between the hot and cold ends based on touch count
● Configurable proportion of hot and cold ends; the current hot:cold ratio is 2:8
● Configurable page size, for optimal configuration together with the database page size
● Multiple shards to improve concurrency; configurable overall capacity
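The hot/cold migration by touch count can be sketched roughly as follows. This is a minimal illustration, not the DBFS implementation: the slot array with linear scans, the constants CACHE_SLOTS, HOT_SLOTS and TOUCH_PROMOTE, and the function names are all assumptions made for the example. Only the 2:8 hot-to-cold ratio comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS    10   /* total pages cached (hypothetical)             */
#define HOT_SLOTS       2   /* hot:cold = 2:8, the ratio quoted in the text  */
#define TOUCH_PROMOTE   2   /* touches needed to enter the hot end (assumed) */

typedef struct {
    bool     used, hot;
    uint64_t page_id;
    unsigned touch;      /* touch count while the page sits in the cold end */
    uint64_t last_use;   /* logical clock of the latest access              */
} slot;

typedef struct {
    slot     slots[CACHE_SLOTS];
    uint64_t clock;
} page_cache;

/* Index of the least recently used slot in the hot or cold end, -1 if none. */
static int lru_of(const page_cache *c, bool hot)
{
    int best = -1;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        const slot *s = &c->slots[i];
        if (s->used && s->hot == hot &&
            (best < 0 || s->last_use < c->slots[best].last_use))
            best = i;
    }
    return best;
}

static int hot_count(const page_cache *c)
{
    int n = 0;
    for (int i = 0; i < CACHE_SLOTS; i++)
        n += c->slots[i].used && c->slots[i].hot;
    return n;
}

/* Access one page; returns true on a cache hit. */
bool cache_access(page_cache *c, uint64_t page_id)
{
    assert(c != NULL);
    c->clock++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        slot *s = &c->slots[i];
        if (!s->used || s->page_id != page_id)
            continue;
        s->last_use = c->clock;
        if (!s->hot && ++s->touch >= TOUCH_PROMOTE) {
            if (hot_count(c) >= HOT_SLOTS) {    /* hot end full: demote its LRU */
                slot *d = &c->slots[lru_of(c, true)];
                d->hot = false;
                d->touch = 0;
            }
            s->hot = true;                      /* migrate to the hot end */
        }
        return true;
    }
    /* Miss: use a free slot, else evict the cold LRU (hot LRU as a last resort). */
    int victim = -1;
    for (int i = 0; i < CACHE_SLOTS && victim < 0; i++)
        if (!c->slots[i].used)
            victim = i;
    if (victim < 0) victim = lru_of(c, false);
    if (victim < 0) victim = lru_of(c, true);
    c->slots[victim] = (slot){ .used = true, .hot = false, .page_id = page_id,
                               .touch = 1, .last_use = c->clock };
    return false;
}

bool cache_is_hot(const page_cache *c, uint64_t page_id)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (c->slots[i].used && c->slots[i].page_id == page_id)
            return c->slots[i].hot;
    return false;
}
```

In this sketch a page read once by a table scan never leaves the cold end, so a long scan can only evict other cold pages, while a page touched twice survives in the hot end.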
2.3 Asynchronous I/O
To increase I/O throughput, most databases use asynchronous I/O. To be compatible with the I/O behavior of the database above it, we implemented asynchronous I/O with the following features:
● Lock-free queue implementation
● Configurable I/O depth, which allows precise latency control for the different types of I/O a database issues
● Adaptive polling to reduce CPU consumption
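A lock-free request queue with a bounded depth could look like the following sketch, written with C11 atomics. It assumes exactly one submitting thread and one polling thread (a single-producer, single-consumer ring); the article does not say which kind of lock-free queue DBFS uses, and the `io_request` fields, QUEUE_DEPTH, and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 8u   /* the I/O depth; must be a power of two in this sketch */

typedef struct {
    uint64_t offset;     /* hypothetical request fields */
    uint32_t length;
} io_request;

typedef struct {
    io_request    ring[QUEUE_DEPTH];
    atomic_size_t head;  /* next slot to pop; written only by the consumer  */
    atomic_size_t tail;  /* next slot to push; written only by the producer */
} io_queue;

void io_queue_init(io_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side. Returns false when QUEUE_DEPTH requests are already queued,
 * which is how the depth limit pushes back on the submitter. */
bool io_queue_push(io_queue *q, io_request req)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QUEUE_DEPTH)
        return false;
    q->ring[tail & (QUEUE_DEPTH - 1)] = req;
    /* Release: the slot contents become visible before the new tail does. */
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the queue is empty. */
bool io_queue_pop(io_queue *q, io_request *out)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = q->ring[head & (QUEUE_DEPTH - 1)];
    /* Release: the slot is handed back to the producer only after the copy. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

Giving each class of I/O its own queue with its own depth is one way a configurable depth could bound how much of that class is in flight at once.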
2.4 Atomic write
DBFS implements atomic writes to guarantee that no partial write occurs when a database page is written out. InnoDB running on DBFS can therefore safely turn off the doublewrite buffer, which saves the database 100% of bandwidth under storage-compute separation.
PostgreSQL, for its part, uses buffered I/O, and atomic writes also avoid the occasional page failure problem that PG encounters when flushing dirty pages.
2.5 Online resize
To avoid the data migration that expansion would otherwise cause, DBFS works with the underlying Pangu to resize volumes online. DBFS has its own bitmap allocator for managing the underlying storage space. We optimized the bitmap allocator to make resizing lock-free at the file system level, so the upper-layer business can expand efficiently and without disruption at any time, which is a clear advantage over the traditional ext4 file system.
Online resize also avoids wasted storage space: capacity can be added when it is actually needed, so there is no need to reserve, say, 20% of storage up front.
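To illustrate why growing a bitmap-managed volume needs no data migration, here is a small single-threaded sketch. It is not the DBFS allocator (which the article describes as lock-free); the struct layout and function names are assumptions for the example. The point it shows is that existing bits keep their positions when the bitmap grows, so no allocated block has to move.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint8_t *bits;     /* one bit per block; 1 = allocated */
    size_t   nblocks;  /* number of blocks currently managed */
} bitmap_alloc;

static size_t bytes_for(size_t nblocks) { return (nblocks + 7) / 8; }

/* Create an allocator for nblocks free blocks. Returns 0 on success. */
int bitmap_init(bitmap_alloc *b, size_t nblocks)
{
    assert(b != NULL && nblocks > 0);
    b->bits = calloc(bytes_for(nblocks), 1);
    b->nblocks = b->bits ? nblocks : 0;
    return b->bits ? 0 : -1;
}

/* Allocate the first free block; returns its index, or -1 if the volume is full. */
long bitmap_alloc_block(bitmap_alloc *b)
{
    for (size_t i = 0; i < b->nblocks; i++) {
        uint8_t mask = (uint8_t)(1u << (i % 8));
        if (!(b->bits[i / 8] & mask)) {
            b->bits[i / 8] |= mask;
            return (long)i;
        }
    }
    return -1;
}

/* Grow the managed range to new_nblocks. Bits for existing blocks are kept
 * as they are; only the newly added tail is initialised as free. */
int bitmap_resize(bitmap_alloc *b, size_t new_nblocks)
{
    assert(new_nblocks >= b->nblocks);
    size_t old_bytes = bytes_for(b->nblocks);
    size_t new_bytes = bytes_for(new_nblocks);
    uint8_t *p = realloc(b->bits, new_bytes);
    if (p == NULL)
        return -1;                       /* old bitmap is still valid */
    memset(p + old_bytes, 0, new_bytes - old_bytes);
    b->bits = p;
    b->nblocks = new_nblocks;
    return 0;
}
```

After `bitmap_resize`, allocation simply continues into the new range, which matches the idea of expanding on demand rather than reserving space in advance.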
The following is the process of bitmap change during capacity expansion:
2.6 Switching between TCP and RDMA
Introducing RDMA at large scale in the group's databases is also a major risk. Together with Pangu, DBFS implemented switching between RDMA and TCP and rehearsed the switch during full-link testing, which keeps the RDMA risk under control and strengthens the stability guarantee.
In addition, the DBFS, Pangu, and network teams ran extensive capacity (water-level) stress tests and fault drills for RDMA, preparing thoroughly for the launch of the industry's largest RDMA deployment.
2.7 Deployment for the 2018 big promotion
With the technical breakthroughs made and the key problems solved, DBFS completed this arduous task, passing both the full-link tests for the promotion and the Double 11 exam itself, which once again verified the feasibility of storage-compute separation and the overall technical direction.
3. DBFS as a storage middle platform
Beyond the functions above, which any file system must implement, DBFS also provides many features that make it more general, easier to use, more stable, and more secure for the business.
3.1 Technology consolidation and enablement
We consolidate all the technical innovations and functions of DBFS in product form, so that DBFS can give more businesses user-mode access to different underlying storage media and enable more databases to achieve storage-compute separation.
3.1.1 POSIX compatibility
To support database workloads, DBFS is compatible with most of the commonly used POSIX file interfaces, which makes it easy for the database above to integrate. It also implements page cache, async I/O, and atomic write, giving the database rich I/O capabilities. In addition, we implement the glibc stream interfaces to support file stream operations. Supporting these two sets of interfaces greatly reduces the complexity of database integration, makes DBFS easier to use, and lets DBFS support more database services.
The POSIX interfaces are familiar to most readers and are not listed here. The following glibc interfaces are for reference:
// glibc interface
FILE *fopen(const char *path, const char *mode);
FILE *fdopen(int fildes, const char *mode);
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
size_t fwrite(const void *ptr, size_t size, size_t nmemb, FILE *stream);
int fflush(FILE *stream);
int fclose(FILE *stream);
int fileno(FILE *stream);
int feof(FILE *stream);
int ferror(FILE *stream);
void clearerr(FILE *stream);
int fseeko(FILE *stream, off_t offset, int whence);
int fseek(FILE *stream, long offset, int whence);
off_t ftello(FILE *stream);
long ftell(FILE *stream);
void rewind(FILE *stream);
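As a reminder of how these stream calls fit together, the sketch below writes a buffer, rewinds, and reads it back. It uses the standard libc functions as stand-ins: the article does not say how DBFS exposes its own versions (prefix, library name, or mount path), so `stream_roundtrip` and the file path passed to it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Write len bytes from buf to path, read them back through the same stream
 * into scratch, and return 0 only if every step succeeds and the bytes match.
 * Standard libc calls stand in for the DBFS equivalents here. */
int stream_roundtrip(const char *path, const void *buf, size_t len, void *scratch)
{
    FILE *fp = fopen(path, "w+b");          /* create/truncate, read + write */
    if (fp == NULL)
        return -1;

    int rc = -1;
    if (fwrite(buf, 1, len, fp) == len &&   /* buffered write                */
        fflush(fp) == 0 &&                  /* push the stream buffer down   */
        fseek(fp, 0L, SEEK_SET) == 0 &&     /* back to the start of the file */
        fread(scratch, 1, len, fp) == len &&
        !ferror(fp) &&
        memcmp(buf, scratch, len) == 0)
        rc = 0;

    if (fclose(fp) != 0)                    /* fclose can also report errors */
        rc = -1;
    return rc;
}
```

A database that already performs its file access through calls like these would, in principle, need few changes to run on an implementation that offers the same signatures.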
3.1.2 FUSE implementation
To be compatible with the Linux ecosystem, we also implemented FUSE to bridge to the VFS. With FUSE, users who do not need extreme performance can access DBFS without any code changes, which greatly improves ease of use. It also makes traditional operations and maintenance work much more convenient.
3.1.3 Service capability
DBFS developed the shmQ component, which is based on shared-memory IPC, and with it extended support to both PostgreSQL's process-based architecture and MySQL's thread-based architecture. This makes DBFS more general and more secure, and provides a solid foundation for future online upgrades.
shmQ is lock-free, with excellent performance and throughput. In current tests, access latency stays within a few microseconds even for large database pages such as 16 KB. With service mode and the multi-process architecture supported, performance and stability are in line with expectations.
3.1.4 Cluster file system
Clustering is another distinguishing feature of DBFS. It enables databases to run in shared-disk mode, which scales computing resources linearly and saves storage cost for the business. Shared-disk mode also gives the database fast elasticity and greatly improves the SLA for fast primary/standby switchover. The cluster file system supports both one-write-multiple-read and multi-write, laying a solid foundation for database shared-disk and shared-nothing architectures. Compared with the traditional OCFS, DBFS is implemented in user mode, with better performance and more autonomous control. OCFS relies heavily on the Linux VFS; for example, it has no independent page cache.
In one-write-multiple-read mode, nodes can take different roles: there is one M node and multiple S nodes sharing the same data, and the M node and S nodes all access the data in Pangu. The database above enforces the M/S roles: the M node has read-write access and the S nodes have read-only access. If the primary fails, a switchover takes place. The primary/standby switchover steps are:
● When the service monitoring indicators detect that the M node is inaccessible or abnormal, a decision is made on whether to switch.
● If a switch is needed, the management and control platform issues the switch command. Completion of the command means that both DBFS and the database above have finished switching roles.
● During the DBFS switchover, the main action is the I/O fence, which disables I/O on the original M node to prevent double writes.
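One common way to build an I/O fence is to stamp every write with an epoch and have the storage side reject stale epochs. The sketch below shows that idea only. The article does not describe how DBFS and Pangu actually implement the fence, so the epoch scheme, `volume_state`, and both function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 4   /* hypothetical cluster size */

typedef struct {
    uint64_t epoch;        /* bumped on every role switch                  */
    int      writer_node;  /* node currently allowed to write (the M node) */
} volume_state;

/* Storage-side check: a write is accepted only if it comes from the current
 * M node and carries the current epoch. */
bool volume_accept_write(const volume_state *v, int node, uint64_t node_epoch)
{
    assert(v != NULL);
    return node == v->writer_node && node_epoch == v->epoch;
}

/* Switchover: fence the old M node, then hand the writer role to new_writer.
 * Returns the epoch the new M node must present on its writes. */
uint64_t volume_switch_writer(volume_state *v, int new_writer)
{
    assert(v != NULL && new_writer >= 0 && new_writer < MAX_NODES);
    v->epoch++;                  /* writes stamped with the old epoch now fail */
    v->writer_node = new_writer;
    return v->epoch;
}
```

With this scheme, even if the old M node is still alive and keeps issuing writes after the switch, they carry the old epoch and are refused, which is the double-write protection the fence is meant to provide.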
In multi-write mode, DBFS applies global metalock control across all nodes and optimizes blockgroup allocation. It also involves a disk-based quorum algorithm, which is more complex and is not described in detail here.
3.2 Combining hardware and software
As new storage media emerge, databases will inevitably want to exploit them for better performance or lower cost, and to gain autonomous control over the underlying media.
Intel's roadmap for storage media, ordered from performance to capacity, has three products: AEP, Optane, and SSD, with QLC in the large-capacity direction. Weighing overall performance against cost, we consider Optane a relatively good cache product, and we chose it to implement DBFS's host-side persistent file cache.
3.2.1 Persistent file cache
DBFS implements a local persistent cache based on Optane, which further improves the database's read and write performance under storage-compute separation. A lot of work went into making the file cache production-ready, for example:
● Stable and reliable fault handling
● Dynamic enable and disable
● Load balancing
● Collection and display of performance metrics
● Data correctness scrub
These functions lay a solid foundation for online stability. I/O to Optane goes through SPDK, a pure user-mode technology, and DBFS implements it together with Fusion Engine's vhost. The page size of the file cache can be tuned to the block size of the database above for the best results.
The following is an architectural diagram of file cache:
The following is the read and write performance benefit data from the test:
The results labeled "cache" use the file cache. As overall performance improves, read latency begins to fall. In addition, we monitor many performance metrics for the file cache.
3.2.2 Open Channel SSD
X-Engine is working with DBFS and the Fusion Engine team to build an autonomous and controllable storage system based on Object SSD. We have explored and practiced in depth in reducing SSD wear, improving SSD throughput, and reducing interference between reads and writes, with very good results. The read and write paths are already integrated with X-Engine's tiered storage strategy, and we look forward to deeper research and development on intelligent storage as the next step.
4. Summary and outlook
In 2018, DBFS supported X-DB at large scale through the "11.11" promotion in storage-compute-separated form; it also enabled ADS, which gained one-write-multiple-read capability, as well as Tair.
While supporting these businesses, DBFS itself gained support for the PG process architecture and the MySQL thread architecture, connected to the VFS interface for compatibility with the Linux ecosystem, and became a true storage platform-level product: a cluster user-mode file system. Going forward, it will combine further hardware-software co-design, tiered storage, NVMe-oF, and other technologies to enable more databases and deliver greater value.
The author of this article: Lu Jian