Original: https://www.modb.pro/db/6143
Preface
The annual flagship database conference VLDB 2019 was held in Los Angeles from August 26th to August 30th local time. At this conference, a number of papers from the Alibaba Cloud database team were accepted into the Research Track and the Industrial Track.
This article gives a detailed interpretation of the Research Track paper "iBTune: Individualized Buffer Tuning for Large-scale Cloud Databases" for the benefit of readers.
Background
About five or six years ago, the Alibaba database team began exploring how to turn DBA experience into products that provide more efficient and intelligent database services for the business. In 2014, CloudDBA began to provide users with a self-service intelligent diagnosis and optimization service. After four years of continuous exploration and effort, it evolved in 2018 into CloudDBA's next-generation product: the autonomous database platform SDDP (Self-Driving Database Platform).
SDDP is an intelligent platform that endows a variety of databases with self-driving capabilities, so that databases running on it can perceive, decide, recover and optimize by themselves, providing users with uninterrupted service that requires no attention from them. The platform covers many capabilities, including physical resource management, instance life-cycle management, diagnosis and optimization, security, and elastic scaling. Automatic anomaly diagnosis and recovery, together with automatic optimization, is one of its core capabilities.
At the end of 2017, SDDP began to carry out end-to-end automatic optimization of database instances across the entire fleet, covering not only the common automatic slow-SQL optimization and automatic space optimization, but also the automatic parameter optimization of large-scale databases highlighted in this article.
Database parameter optimization driven by data and machine-learning algorithms has been a hot direction in intelligent database optimization in recent years, but it also faces great technical challenges. The problem to be solved is how to automatically configure the millions of database instances running different workloads in a large-scale database scenario, trading off performance against cost so that resource cost is minimized while the SLA is still met. This technology is of great value to any CSP (Cloud Service Provider).
In the past year or two there has been some research in this direction (such as CMU's OtterTune), but those algorithms rely on manual prior experience and do not scale to large deployments. Other cloud vendors, including Azure SQL Database and AWS, are understood to be investing in this direction as well, but no related papers or product releases have appeared yet.
Starting in 2018, we explored intelligent database parameter optimization end to end, from problem definition, key algorithm design, and algorithm evaluation and improvement, to the final fully automated process in production. Several teams worked together to achieve a technological breakthrough and deploy it at scale.
The paper "iBTune: Individualized Buffer Tuning for Largescale Cloud Databases" co-written by Tan Jian, Tiewin, Feidao, Iao, Qixing, Chiyuan, Hong Lin, Shi Yue, Ming Song and Zhang Rui was accepted by VLDB 2019 Research Track, which is an important milestone for Alibaba in the direction of database intelligence.
This work not only puts forward innovative ideas in the theory of database intelligent parameter optimization, but also has achieved large-scale landing on Ali Group ~ 10000 instances, saving a total of ~ 12% of memory resources. at present, it is the only company in the industry that really realizes the large-scale landing of database intelligent parameter optimization.
Problem definition
Parameter optimization is an important means of database optimization, and the sheer number of database parameters increases the difficulty of tuning. For example, the latest version of MySQL has more than 500 parameters and PostgreSQL has about 290. Tuning usually focuses on the performance-related parameters, among which the Buffer Pool setting has the greatest impact on performance.
In the Group environment, multiple database instances are typically deployed on a shared host. Host memory is often severely short while plenty of CPU and storage remain, so machine resources are wasted, and the shortage of memory has become the key bottleneck limiting database instance deployment density.
The Buffer Pool is the largest consumer of memory, so configuring it optimally is key both to the machine cost of the whole fleet and to the performance of individual database instances. We therefore focused our intelligent parameter optimization on the Buffer Pool.
For large-scale database scenarios, the challenge is how to configure a reasonable Buffer Pool size for each database instance, so that the Buffer Pool is as small as possible without affecting the instance's performance. In traditional large-scale database operations, static configuration templates are usually used to set instance parameters for ease of unified management.
Take Alibaba Group's database scenario as an example: 10 Buffer Pool size specifications are offered for business teams to choose from. When applying for an instance, developers usually pick the default or an even larger specification because they are unsure of their business's actual BP requirements. This way of allocating resources causes serious waste.
In addition, the diversity and constant change of business workloads make traditional DBA-driven manual tuning infeasible at large scale, so dynamically adjusting the database Buffer Pool according to load and performance changes, driven by data and machine-learning algorithms, is an important research problem.
Problem analysis
Looking at the problem itself, there is a direct relationship between cache size (BP) and cache hit ratio. If we can find a formula BP = Function(hit_ratio), and then determine a cache hit ratio acceptable to the business side or the DBA, we can lower the BP without affecting the business.
After investigation, we found that researchers in the operating-system cache field have studied the relationship between buffer size and hit ratio extensively, and some of that work shows that in the long tail of the data this relationship follows a power-law distribution, that is:

$$ \mathrm{miss\_ratio}_i \propto BP_i^{-\alpha_i} $$

where $\alpha_i$ is the exponent of the power-law distribution and $\mathrm{miss\_ratio} = 1 - \mathrm{hit\_ratio}$.
With the theoretical relationship established, we next need to solve for the model's hyperparameter, namely the exponent $\alpha_i$ for the MySQL database cache.
With the help of the Frodo tool developed by the Group's DBAs, we ran stress tests with different BP configurations for several important OLTP workloads in the Group (such as the shopping-cart and payment scenarios). The experimental results confirm the theory: the long-tail behavior of the MySQL cache indeed fits the power-law hypothesis.
Therefore, given the current miss ratio $mr_{cur}$, the current BP size $BP_{cur}$ and the target miss ratio $mr_{target}$, we can derive the following formula for the adjusted BP $BP_{target}$:

$$ BP_{target} = \left(\frac{mr_{cur}}{mr_{target}}\right)^{1/\alpha_i} \times BP_{cur} $$
Through the theoretical analysis above and its verification in the actual MySQL scenario, the formula is shown to hold for MySQL and the key coefficient $\alpha_i$ can be estimated. Next, let's see how to determine $mr_{target}$.
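As a concrete illustration, here is a minimal Python sketch of this calculation. The exponent value used in the example is purely illustrative, not a coefficient measured in the paper.

```python
def target_buffer_pool_size(bp_cur_gb: float, mr_cur: float,
                            mr_target: float, alpha: float) -> float:
    """BP_target = BP_cur * (mr_cur / mr_target) ** (1 / alpha),
    derived from the power-law relationship miss_ratio ~ BP^(-alpha)."""
    if min(bp_cur_gb, mr_cur, mr_target, alpha) <= 0:
        raise ValueError("all inputs must be positive")
    return bp_cur_gb * (mr_cur / mr_target) ** (1.0 / alpha)

# Example: a 16 GB buffer pool at miss ratio 0.001, relaxed to a target of
# 0.002 with an illustrative exponent alpha = 1.2, shrinks to roughly 9 GB.
print(round(target_buffer_pool_size(16.0, 0.001, 0.002, alpha=1.2), 2))
```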
Looking for a suitable miss ratio
Alibaba Group runs tens of thousands of database master instances. We look among them for instances similar to the one to be tuned, and then use the miss ratios of these similar instances to derive the target miss ratio.
For feature selection, we chose performance metrics such as CPU usage, logical reads, I/O reads, miss ratio and response time to describe a business workload, and for each feature we take several statistics (such as the mean, median, 70th percentile and 90th percentile) as the concrete feature values.
To reduce the impact of the weekday/weekend cycle on the data, we compute the similarity over performance data spanning 4 weeks. The figure below shows two pairs of similar instances.
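As a sketch of how such a similarity lookup could work: the feature statistics below follow the description above, while the z-normalization and Euclidean distance are our assumptions rather than the paper's stated metric.

```python
import numpy as np

# Each instance is described by statistics (mean, median, 70th/90th percentile)
# of its performance metrics over 4 weeks of observations.
METRICS = ["cpu_usage", "logical_read", "io_read", "miss_ratio", "response_time"]

def feature_vector(samples: dict) -> np.ndarray:
    """samples maps each metric name to a 4-week array of observations."""
    feats = []
    for m in METRICS:
        x = np.asarray(samples[m], dtype=float)
        feats += [x.mean(), np.median(x), np.percentile(x, 70), np.percentile(x, 90)]
    return np.array(feats)

def most_similar(target: np.ndarray, candidates: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k candidate instances closest to the target
    (Euclidean distance on z-normalized features)."""
    mu, sigma = candidates.mean(axis=0), candidates.std(axis=0) + 1e-9
    d = np.linalg.norm((candidates - mu) / sigma - (target - mu) / sigma, axis=1)
    return np.argsort(d)[:k]
```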
Algorithm challenge
With the preceding parts in place, the formula, its parameters and the target miss ratio are all available and can be substituted in to compute the target BP. Next we need to solve the problems the algorithm faces on its way into production.
Because the hit ratio does not directly reflect the impact on the business, neither the business side nor the DBAs have an intuitive feel for it, and it cannot be used to directly measure the stability of the business running on the database. Given the stability requirements, the algorithm cannot be rolled out to a service while it is unable to quantify its impact on that service.
To solve this problem, after many discussions with the DBAs and the business side, we found that what they care about most is the database's Response Time (RT), especially the maximum RT the database instance imposes on the application.
If we can predict the worst-case RT of a database instance after its BP is adjusted, i.e. the upper bound RT_upperbound, then we can quantitatively describe the impact of the adjustment on the business, dispel the concerns of the business side and the DBAs, and give the algorithm the precondition it needs to run in production. We therefore built a model to predict RT_upperbound.
RT prediction model
To solve the RT prediction problem, we propose a pairwise DNN model. Its structure is shown in the figure.
In the training phase, the left instance is the instance whose RT we want to predict, and the right instance is a similar instance (the one selected when looking for $mr_{target}$); the RTs of both instances are known.
In the prediction phase, we use this model to predict the RT of the instance after its BP is adjusted. The left and right inputs are both the current instance, except that $mr_{cur}$ is replaced with $mr_{target}$.
The DNN is fully connected, uses ReLU as the activation function, and its hidden layers have 100, 50, 50 and 50 nodes respectively.
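A minimal PyTorch sketch of a network with these layer sizes follows. How the left and right instances' features are combined is not spelled out above, so the simple concatenation used here is an assumption.

```python
import torch
import torch.nn as nn

class PairwiseRTModel(nn.Module):
    """Fully connected network with ReLU activations and hidden layers of
    100, 50, 50 and 50 nodes. Combining the two instances' features by
    concatenation is our assumption, not the paper's stated design."""
    def __init__(self, n_features: int):
        super().__init__()
        layers, width = [], 2 * n_features  # left + right features concatenated
        for hidden in (100, 50, 50, 50):
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(width, 1))  # predicted RT of the left instance
        self.net = nn.Sequential(*layers)

    def forward(self, left_feats: torch.Tensor, right_feats: torch.Tensor):
        return self.net(torch.cat([left_feats, right_feats], dim=-1)).squeeze(-1)
```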
Loss function:

$$ \mathrm{Loss} = \big(\lambda\,\mathbb{I}(e > 0) + (1 - \lambda)\,\mathbb{I}(e \le 0)\big)\left(\frac{e}{y + \eta}\right)^2 $$

where $\mathbb{I}(\cdot)$ is an indicator function, $e = RT - RT_{predict}$, and $\lambda \in [0, 1]$ is a hyperparameter controlling how heavily underestimates are penalized. $y$ is the actual observed RT; because $y$ varies over a wide range across instances, we normalize $e$ by $y$, adding $\eta$ to avoid the optimization error that very small RT values would otherwise introduce.
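The loss as reconstructed above can be written as a few lines of PyTorch; the default values of λ and η here are illustrative.

```python
import torch

def asymmetric_rt_loss(rt_pred: torch.Tensor, rt_true: torch.Tensor,
                       lam: float = 0.8, eta: float = 1e-3) -> torch.Tensor:
    """Squared error normalized by (y + eta); underestimates (e > 0, i.e.
    predicted RT below the actual RT) are weighted by lambda, overestimates
    by (1 - lambda)."""
    e = rt_true - rt_pred
    weight = torch.where(e > 0, torch.full_like(e, lam),
                         torch.full_like(e, 1.0 - lam))
    return (weight * (e / (rt_true + eta)) ** 2).mean()
```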
Experiment
In the RT prediction experiments we compared a range of regression algorithms: linear regression (LR), XGBoost, RANSAC, decision tree (DTree), ENet, AdaBoost linear regression (Ada), GBDT, k-nearest-neighbor regression (KNR), bagging regressor (BR), extremely randomized trees regressor (ETR), random forest (RF) and sparse subspace clustering (SSC), along with deep-learning models: a plain DNN, an instance-to-vector DNN with an embedding layer (I2V-DNN), and the pairwise DNN model.
The structure of I2V-DNN is shown in the figure:
To demonstrate the generality of the algorithm, we selected 1,000 instances from several important business scenarios of the Group's databases, covering instances with different read/write ratios: read-only, write-only, read/write-balanced, and so on.
To evaluate the effectiveness of the algorithm, we mainly use the following three metrics:
Among them, AMRAE evaluates the relative error of the RT predictions, MAE measures the average error of the RT predictions, and UMAE measures how much the RT predictions underestimate.
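For reference, here is a sketch of these three metrics under plausible definitions matching the descriptions above; the paper's exact formulas may differ.

```python
import numpy as np

def rt_metrics(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-9) -> dict:
    """Assumed definitions: AMRAE = average relative absolute error,
    MAE = mean absolute error, UMAE = mean error over underestimated
    points only (prediction below the actual RT)."""
    e = y_true - y_pred
    amrae = np.mean(np.abs(e) / (y_true + eps))
    mae = np.mean(np.abs(e))
    under = e > 0
    umae = np.mean(e[under]) if under.any() else 0.0
    return {"AMRAE": amrae, "MAE": mae, "UMAE": umae}
```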
On the experimental data set, the comparison of RT prediction results is shown in the figure:
As can be seen from the figure, the PW-DNN model has an obvious advantage over the other algorithms on the AMRAE metric. Taking the other metrics into account as well, PW-DNN performs best overall, so it is the algorithm we finally chose for predicting RT.
Actual effect
To observe more intuitively how instances change before and after the BP adjustment, we randomly selected 10 instances and show their database metrics before and after the adjustment in the figure:
As the figure shows, for instances of different specifications the RT after adjusting the BP differs little from the RT before (except for instance 1). The QPS and CPU usage show that the business traffic before and after the adjustment is about the same and resource consumption is very similar, yet varying amounts of memory are saved.
For instance 1, the RT increases significantly after the adjustment. Careful investigation of this case showed that the business's daily QPS is very low and a single query dominates the cost; after the adjustment, that query ran with different parameter values, causing logical and physical reads to rise significantly, which in the end also pushed up the average RT. However, the absolute value of the adjusted RT is still small and no slow-SQL anomaly occurred, which the business found acceptable, so the rollback operation was not triggered.
Landing in production
We implemented an end-to-end process for landing the algorithm, from data collection to the execution of BP optimization commands. The system consists of four main modules: metric collection, data processing, decision-making and execution. The module design is shown in the figure.
Metric collection: the database management platform already collects metric data for all database instances in the Group, covering the metrics used by the algorithm.
Data processing: the collected metrics are aggregated by stream processing into statistics over windows of different sizes, and stored in ODPS for the algorithm to use.
Decision-making: the module that embodies the algorithm itself; it reads the statistics stored in ODPS and computes, through the algorithm model, the BP value to which each candidate instance should be adjusted.
Execution: the database management platform implements the dedicated BP optimization command, schedules a concrete execution time window for each optimization operation, and carries the operations out efficiently while respecting the release constraints.
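To make the modules' division of labor concrete, the sketch below shows one possible shape for the decision module's core step; every name and field in it is a hypothetical placeholder, not the management platform's actual API.

```python
def decide_bp_adjustment(stats: dict, predict_rt_upper_bound, rt_sla: float):
    """stats holds the aggregated metrics for one instance as produced by the
    data-processing module; returns a new BP size, or None to leave it alone."""
    rt_upper = predict_rt_upper_bound(stats)   # pairwise-DNN RT upper bound
    if rt_upper > rt_sla:                      # predicted worst-case RT too high:
        return None                            # skip the instance, stability first
    # apply the power-law formula from the earlier section
    return stats["bp_cur"] * (stats["mr_cur"] / stats["mr_target"]) ** (1.0 / stats["alpha"])
```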
Stability challenge
Because shrinking the Buffer Pool is an operation that reduces stability headroom, an improper adjustment creates extra work for the DBAs and may even cause a business failure. The project was therefore challenged and pressured by the DBAs and the stability owners in each BU.
We took a number of measures to ensure business stability, including:
1. Algorithm model: adjust the sensitivity coefficient α of the mapping between Buffer Pool size and cache hit ratio so that the adjustment results are more conservative.
2. Online adjustment: we only adjust parameters that can be changed online, to avoid a MySQL crash caused by kernel issues.
3. Grayscale strategy: fleet-wide parameter adjustment follows a strict grayscale (staged rollout) strategy. At first, the service DBAs adjusted a small number of instances by hand according to the BP sizes produced by the algorithm, to confirm business stability. Then a whitelist mechanism extended the rollout to more instances: the Buffer Pool size is adjusted automatically only for whitelisted instances, graying in instances within a specified scope. Finally, on non-core instances confirmed by the business DBAs, large-scale automatic operation is carried out strictly according to the release and control processes, and the number of instances touched per operation is strictly limited.
4. Closed-loop process: the whole process, from data collection, BP size decision and automatic BP adjustment to post-adjustment quantitative tracking and the rollback mechanism, forms a closed loop, and a statistical analysis report of the adjustments is issued every day.
Achievement
After the algorithm exploration and the construction of the end-to-end automatic Buffer Pool optimization process, in FY2019 the Group optimized ~10,000 instances across the fleet, reducing overall memory usage from 217 TB to 190 TB, a saving of 12.44% of memory resources (27 TB).
The future
On the business side, in FY2020 we will, on the one hand, continue to expand the scope of BP-optimized instances to save more memory; on the other hand, we will keep improving the algorithm model and export it to the public cloud through the HDM product, providing database instance specification recommendations for cloud users.
On the technical side, we will expand from Buffer Pool tuning to other database performance parameters, explore the relationships and interactions among multiple performance parameters, build a model of the relationship between database load and performance, and optimize database parameters holistically from the perspective of the whole database instance.