
A reprint of an older article: "Alibaba researcher Zhang Rui: putting a database into a container is no longer a myth."


Zhang Rui is the head of the database technology team at Alibaba Group, an Alibaba researcher, and an Oracle ACE. He has twice served as the overall head of database technical support for Singles' Day. Since joining Alibaba in 2005, he has led the continuous innovation of Alibaba's database technology.

Recently, Zhang Rui, a researcher at Alibaba Group, delivered a keynote speech entitled "Thinking about a Future-oriented Database Architecture" at the 2017 China Database Technology Conference held in Beijing. This article introduces the ideas and experience of Alibaba's database team in building its next-generation database technology system, in the hope that sharing Alibaba's achievements, pitfalls, and forward-looking thinking will contribute to the development of database technology in China.

Full text of speech:

Let me first introduce myself. I have been working on databases since I joined Alibaba in 2005. Today's topic is what I have recently been thinking about: Alibaba's next-generation database system, which I would like to share with you here in the hope of starting a discussion. If, after this talk, you can take away some experience and ideas and combine them with the actual scenarios you face, then the goal of my sharing today will have been achieved.

Today I will cover the following aspects: first, some innovations in the kernel; then how to achieve flexible database scheduling; then our thinking on intelligence; and finally, the pitfalls we have stepped into and our view of future directions.

The problems databases face in Alibaba's scenarios

Alibaba's earliest generation of database technology was Oracle. Later, as everyone knows, came the "de-IOE" movement (removing IBM minicomputers, Oracle databases, and EMC storage). In the process of de-IOE, we moved into the era of open source databases. Today that era has passed; the whole process lasted about five or six years. Alibaba has a well-known branch of open source MySQL, AliSQL, on which we have made many improvements. I have listed some of those improvements here, but that is not really what I want to talk about today. I want to talk about the future-oriented, next-generation database technology and architecture.

I think the reason is this: Alibaba is, after all, a technology company, so we often look at Google or other large Internet companies and ask where their technological innovation comes from. It comes from a problem. Everyone here today is in the same position as I am: the problems you face in your own scenarios, and how deeply you look into them, determine how much innovation you create.

So today let's take a fresh look at what problems Alibaba faces. I am sure you are all thinking the same thing: Alibaba's problems are not necessarily your problems. But by showing what Alibaba has done after seeing these problems, I hope to give you a reference, so that you can also examine the problems you face and how you will think about them.

We can see that Alibaba's applications are actually very different from those of Facebook and Google. We have talked with them and found that our business scenarios really do differ from theirs. First of all, our main applications are transactional. What do these applications require? You can see the points on the slide; let me focus on our thinking.

Today, high availability and strong consistency of data are extremely important. The problems caused by data inconsistency are enormous. We ourselves use Taobao and are users of Alibaba's services; every user, even my parents, cares about these things.

Second, storage cost is very high today. All our data centers already use SSDs, but data storage cost is still a very big problem for a large enterprise; it is a real-money problem.

In addition, as mentioned just now, data has a life cycle, and transaction data in particular is obviously divided into cold and hot. People rarely look at their Taobao purchase records from a year ago, but recent purchase records are checked, read, and updated frequently.

Another feature is that Alibaba's business is still relatively simple today; for example, we want to push OLTP performance to the extreme. There is also something unique to Alibaba: Singles' Day, which in essence creates a huge hot-spot effect in technology. What kind of demand does this place on us? The demand is for extreme elasticity, and the database is actually very weak in this respect; achieving elastic scaling of the database is very difficult.

Finally, I would like to talk about DBAs. Many people here today may be DBAs, and I would like to talk about the thinking Alibaba has developed in the direction of intelligence. We have huge amounts of data, and we also have many experienced DBAs. But how can these DBAs complete the next step of their transformation and avoid becoming a bottleneck for the business? How can the database achieve self-diagnosis and self-optimization? This is the problem we see, and at the end I will share my thoughts on it.

Alibaba's thinking on the direction of the database kernel

Let me first talk about our thinking on the database kernel. First of all, I have great respect for domestic database vendors. Anyone who has improved a kernel knows that it is not easy; every feature has to be written line by line of code, so I want to express my respect for domestic database vendors and their engineers. This is my first time sharing this at a domestic conference. First, I will talk about AliSQL X-Cluster. X-Cluster is a three-node cluster built on AliSQL. We introduced the Paxos consensus protocol to turn MySQL into a cluster, and this cluster has a series of characteristics such as strong data consistency, cross-region deployment, and tolerance of high network latency.

Today, many databases are associated with Paxos, such as Google's well-known Spanner. Before, we did not think much about the relationship between databases and Paxos; in fact, there was no relationship. But today's databases need the Paxos protocol in several places. First, we need Paxos for leader election: especially in high-availability scenarios, you need to uniquely elect one node as the primary, which requires Paxos. Second, the Paxos protocol is used to ensure strong data consistency in a database without shared storage, that is, to ensure both strong consistency and high availability among multiple nodes.

So Paxos is widely used in database architecture design. Many vendors today, as well as Google Spanner, combine the Paxos protocol with the database. The same is true of the AliSQL three-node cluster, which uses the Paxos protocol to build a cluster with strong data consistency. Let me briefly explain what the Paxos protocol does in the database.

In essence, Paxos is a common technology nowadays, and everyone here works on databases. To put it simply, the Paxos protocol is used in our database so that after a group of transactions is committed and persisted on one node, it is also persisted on multiple nodes at the same time. In other words, a write that originally only had to land on one node now has to be written to other nodes across the network. Those nodes may be remote, possibly in another city, so the write has to cross a very long network delay, and that is where some core technologies are needed.

What is our goal? First of all, there is no way to defeat physical delay. In the past, database operations were committed only locally, but now the database is deployed globally, across sites, and even across networks, and this latency characteristic cannot be overcome. What can we do in this situation? As latency grows, keep throughput from dropping as much as possible: the original QPS and TPS can be maintained as long as the engineering is done well, but the latency will certainly increase.

This is also the situation described in the Google Spanner paper as "my latency is very high." How to write an application that maintains availability and high throughput under such high latency is another topic. We have long been accustomed to the idea that the database must have very low latency and that high latency will cause problems for the application. In fact, this deserves a separate discussion: the application must adapt to such a high-latency database system. Of course, batching and pipelining techniques, which are essentially general engineering optimizations, make cross-network multi-replica synchronization efficient, but the latency will certainly increase.
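
To make the group-commit idea more concrete, here is a minimal, hypothetical Python sketch (not AliSQL's actual implementation; node names, latencies, and the batch format are all made up) of a leader that batches transaction log entries, ships the batch to its replicas in parallel, and treats the batch as committed once a majority of all nodes, counting the leader itself, have persisted it:

```python
import concurrent.futures

# Hypothetical illustration only: a leader batches log entries and considers
# a batch committed once a majority of nodes (including itself) acknowledge it.
# Real systems such as X-Cluster or Group Replication are far more involved.

class Replica:
    def __init__(self, name, rtt_ms):
        self.name = name
        self.rtt_ms = rtt_ms        # simulated round-trip time, possibly cross-city
        self.log = []

    def append(self, batch):
        # In a real deployment this would be a network call costing rtt_ms.
        self.log.extend(batch)
        return True                 # acknowledgment

def replicate_batch(leader_log, batch, replicas):
    """Persist locally, then wait until a majority of all nodes hold the batch."""
    leader_log.extend(batch)        # the leader's local write counts toward the quorum
    acks = 1
    majority = (len(replicas) + 1) // 2 + 1
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(r.append, batch) for r in replicas]
        for f in concurrent.futures.as_completed(futures):
            if f.result():
                acks += 1
            if acks >= majority:
                return True         # durable on a majority: safe to commit
    return acks >= majority

replicas = [Replica("hangzhou-2", 1), Replica("shanghai-1", 6), Replica("shenzhen-1", 30)]
leader_log = []
batch = [("tx-1001", "UPDATE orders ..."), ("tx-1002", "INSERT payment ...")]
print("committed:", replicate_batch(leader_log, batch, replicas))
```

Because acknowledgments are awaited in parallel and entries travel in batches, throughput can be held up even though each individual commit now pays at least one cross-region round trip.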

In fact, we all know that the database needs three replicas, or three nodes, essentially to achieve strong data consistency, and everyone is working in this direction. For example, Group Replication, which Oracle released for MySQL some time ago, is also a three-node technology. The difference between X-Cluster and it is that our goal from the start was to span cities. At the beginning of the design we assumed the nodes would be deployed very far apart, and that initial goal led to large differences in design, engineering practice, and final performance.

Here we have also compared X-Cluster with Oracle's Group Replication. We are better in a same-city environment, and the gap is even larger in cross-region scenarios, because we designed for cross-region deployment from the start. As you may know, Alibaba has long talked about multi-site active-active, that is, how to run active workloads across IDCs in different locations, so from the beginning we designed for cross-region scenarios.

This is a typical X-Cluster architecture diagram for a multi-active scenario: a typical architecture of three cities, four data replicas, and five log replicas. If you want to simplify and reduce data storage cost, you can actually run with three data replicas and five log replicas. In this way, any city-level, data-center-level, or single-machine failure can be tolerated without data loss. We can do this today and guarantee zero data loss with strong consistency: data at any point in time has been written to a database in another city's data center. This was the goal of the X-Cluster design from the beginning, and it is also a typical cross-region active-active architecture.
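
As a quick sanity check on why such a layout survives a city failure, here is a tiny sketch (the city-to-replica assignment below is illustrative, not necessarily how Alibaba actually lays it out) showing that losing any one of the three cities still leaves a Paxos majority among the five log replicas:

```python
# Illustrative layout only: 5 log replicas spread across 3 cities.
replica_city = {
    "log1": "cityA", "log2": "cityA",
    "log3": "cityB", "log4": "cityB",
    "log5": "cityC",
}
MAJORITY = len(replica_city) // 2 + 1   # 3 of 5

for city in ("cityA", "cityB", "cityC"):
    surviving = [r for r, c in replica_city.items() if c != city]
    print(f"lose {city}: {len(surviving)} replicas left, "
          f"majority kept: {len(surviving) >= MAJORITY}")
# Every single-city failure leaves at least 3 of 5 log replicas,
# so Paxos can still commit and no acknowledged data is lost.
```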

Let me talk about a small but very practical innovation that may interest all of you: X-KV. By the way, all of our next-generation technology components start with X. X-KV is an improvement on the original MySQL Memcached plugin and achieves very high performance. You may know the MySQL Memcached plugin: through the Memcached interface you can directly access data in the InnoDB buffer pool, so read performance can be very high. What does this mean for you, or for an architect, during design?

It means that caching is not needed in many scenarios. A database-plus-cache structure is a common pattern for almost all businesses, but the problem with caching is that the cache and the data in the database are never fully consistent, so a synchronization or invalidation mechanism is required. With X-KV, the read problem can basically be solved: as long as a piece of data is accessed through this interface, you get essentially the same capability as accessing a cache, so in most cases a cache is no longer needed.

The second benefit is that it reduces the application's response time; the response time of the original SQL path would be higher. We have made some improvements here: the original Memcached plugin has restrictions on supported data types and poor support for some index types, so we improved these. Everyone can use it, and if you do, many caching systems are basically no longer needed.
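
As a rough illustration of what the key-value path looks like to an application, here is a hedged sketch using a generic memcached client against the stock MySQL InnoDB memcached plugin; the host, port, container name, and key are hypothetical, and X-KV's actual interface and supported data types go beyond the stock plugin:

```python
from pymemcache.client.base import Client

# Minimal sketch of key-value reads through a memcached-protocol endpoint,
# as with MySQL's InnoDB memcached plugin (which X-KV improves on).
# Host, port, container name and key are all hypothetical.
client = Client(("127.0.0.1", 11211))

# With the stock plugin, "@@<container>" selects a table mapping defined in
# innodb_memcache.containers; subsequent gets read rows straight out of the
# InnoDB buffer pool, bypassing the SQL layer.
client.get(b"@@orders_kv")
row = client.get(b"user:10086:last_order")
print(row)   # mapped value column(s) of the row, or None if absent
```

The design point is that the application keeps one source of truth: reads that would previously have gone to a separate cache go to the database's own key-value interface, so there is no cache-invalidation problem to manage.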

The third thing I want to talk about is how to solve the separation of hot and cold data. We naturally use the MySQL framework; here I directly use the standard MySQL architecture diagram. You can see that MySQL essentially has a client on top, a server layer in the middle, and a storage layer below. The storage layer can host a variety of engines, so different characteristics can be achieved through different engines. The most commonly used engine today is InnoDB, and the characteristics of each storage engine essentially come from its structure. For example, InnoDB uses a B+ tree structure, which gives relatively balanced read and write performance, and after so many years of development it is truly mature.

We now also choose RocksDB, because we have some cooperation with Facebook on RocksDB, namely introducing it into MySQL. Its underlying structure is an LSM tree, whose benefits include being write-friendly and compressing well. Introducing it is not just introducing a data structure: today we use these two engines together to solve the hot/cold data separation problem. We have also communicated with Facebook; RocksDB is not yet as stable and polished, but it is very effective as a supplement to the InnoDB storage engine.

Especially with a stable database, users today have little sense of whether their data is hot or cold. As you may know, if you did hot/cold separation before, the application side had to pour data from one store into another and then delete it, or the DBA would go to the business developers and say: you don't have enough storage space, you are taking up too much; can you delete some data or move it into a lower-cost store? We do this all the time; to put it bluntly, I'm sure everyone here has done it.

However, with this dual-engine structure, the high compression ratio of RocksDB, especially for OLTP row storage, can bring us large benefits. So we can combine the two engines within MySQL and take advantage of cheaper architectures; in particular, the LSM tree is friendly to cheap storage media because it writes sequentially. These are some of our thoughts on the database kernel today.
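
To illustrate the idea of keeping hot rows in InnoDB and pushing cold rows into a RocksDB table, here is a hedged sketch of an external migration job; the table names, columns, and connection details are invented, it assumes a MyRocks-style ROCKSDB engine is available in the same instance, and Alibaba's production mechanism is built into the database rather than scripted like this:

```python
from datetime import datetime, timedelta

import mysql.connector

# Hypothetical hot/cold migration: hot rows stay in an InnoDB table,
# rows older than one year move into a RocksDB table that compresses
# far better. All names and credentials below are made up.
conn = mysql.connector.connect(host="127.0.0.1", user="app",
                               password="secret", database="trade")
cur = conn.cursor()

# One-time setup, run beforehand:
#   CREATE TABLE orders_cold LIKE orders;
#   ALTER TABLE orders_cold ENGINE=ROCKSDB;

cutoff = datetime.now() - timedelta(days=365)   # one fixed cutoff for both statements

# Copy the cold rows, then delete them from the hot table, in one transaction.
cur.execute("INSERT INTO orders_cold SELECT * FROM orders WHERE created_at < %s",
            (cutoff,))
cur.execute("DELETE FROM orders WHERE created_at < %s", (cutoff,))
conn.commit()

cur.close()
conn.close()
```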

Why the database should support flexible scheduling

In the second part, I would like to talk about flexible scheduling of the database. We all know that Singles' Day is the biggest challenge for Alibaba. The biggest challenge for us is that applications can already be scheduled flexibly, including moving to the cloud and elastically scaling out and in, but the database is really difficult. We have been exploring this for some time, and today I will share our thinking with you.

I have heard many people say that containerization of the database is a false proposition: why should we do containerization, why put the database in a container? Second, there are also new technologies; for example, as the previous speaker mentioned, it is possible to access storage remotely over the network. But we keep turning the question around: setting aside whether flexible scheduling of the database is possible, if the database is to achieve flexible scheduling, what are its prerequisites?

First, if we want scheduling the database to be as simple as scheduling an application, what does the database have to do? I think there are two prerequisites: first, it must run in a container; second, compute and storage must be separated. If compute and storage are not separated, the database basically has no way to be scheduled flexibly. We all know that compute resources are easy to move, but storage resources are basically impossible to move in a short time, so elasticity is very difficult. These are the two basic conditions.

If you also encounter this kind of problem in your scenario, then it is not a false proposition. I think whether something is reasonable often depends not on whether the technology is correct but on whether your scenario needs it. So today we did two things. The first is to put the database in a container. Our physical machines, VMs, and Docker are all supported, and one layer shields the complexity of the container. Applications are usually put in containers for deployment, but we put the database in containers for scheduling, because the database itself does not have many releases and does not need to be deployed as frequently as applications. After containerization, the database can be co-located with other containers on one physical machine.

We DBAs have some traditional views, for example: the database server must not run applications, and the database must never use containers. I don't know about everyone here; every time someone or your boss asks you about this, do you always reject it immediately and say, "the database must not do this"? But today you may be able to tell your boss that you can give it a try.

As for storage-compute separation: in the earliest days of databases, storage and compute were actually separate. With an Oracle database there was a SAN network and, behind it, a storage array; storage and compute were inherently separated, with the SAN in the middle. Then we evolved to local disks, SSDs, and PC servers. In the future, we have to return to a structure where storage and compute are separated. With the development of today's network technology, not to mention proprietary networks, even ordinary 25G networks, together with new technologies such as RDMA and SPDK, give us the ability to separate storage and compute; the conditions for separating database storage and compute are already in place.

Today we have seen a large number of optimizations in the database that reduce IO, turn random IO into sequential IO, and are friendly to the storage layer below. In terms of storage cost, shared storage greatly reduces cost because storage fragmentation is compressed away: each machine has 30% or 50% free space that other machines can hardly use, and turning those fragments into a pool brings a huge benefit.
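
A back-of-the-envelope calculation of the fragmentation argument, with entirely hypothetical numbers:

```python
# Hypothetical numbers: each machine keeps 30-50% local headroom that no
# other machine can use; pooling that space shrinks the total footprint.
machines = 1000
disk_per_machine_tb = 4
avg_free_ratio = 0.4              # stranded free space per machine

local_total = machines * disk_per_machine_tb
used = local_total * (1 - avg_free_ratio)

pool_headroom = 0.15              # a shared pool still keeps some reserve
pooled_total = used / (1 - pool_headroom)

print(f"local disks provisioned: {local_total:.0f} TB")
print(f"data actually stored:    {used:.0f} TB")
print(f"shared pool needed:      {pooled_total:.0f} TB")
# Roughly 4000 TB of local disks versus about 2800 TB in a shared pool
# for the same data, under these assumed ratios.
```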

In addition, if the database adopts storage-compute separation in the future, it will break the current mainstream one-primary-one-standby architecture, in which at least half of the compute resources are completely wasted. Whether or not your standby database is used for reporting or other applications, it is basically wasteful. If shared storage can be achieved, the benefit will be huge. This is our thinking on scheduling. Tomorrow a colleague from Alibaba will give a detailed introduction on containers and storage resources; today I have only given you the general idea.

What is the future work of the DBA?

Finally, let's talk about DBAs. As I mentioned just now, we are moving from automation to intelligence; internally, we are moving from self-service to intelligence. I don't know whether this troubles you: the speed of business growth is much faster than the growth in the number of DBAs. If you don't have this problem, you can ignore this part, but if you do, you can listen to our thinking in this area. We face the same problem. How will DBAs develop, and what should be done after automation? Many people ask whether DBAs will be eliminated. At least after thinking these issues through, Alibaba's DBAs are no longer preoccupied with that question, so I would like to share this thinking with you today.

First of all, we gave up our original approach. What was the original approach? In the earliest days, a DBA had to review every SQL statement before it went online. In the second stage, we built a system that estimated, before launch, whether each SQL statement's performance was good enough, and only let it go online if it was. What is the biggest change in our thinking today? Optimization based on a single statement does not make much sense, because only with large amounts of data and computation can this become something intelligent; otherwise it is all rule-based.

It is difficult for a rule-based system to stay alive for very long, because there are always rules you can never finish writing. We made such an attempt: when some SQL came in, the system had to make judgments about it, and in the end we found the rules were endless. So later we found another direction. I believe everyone here, whether your company is big or small, has a monitoring system. Let's start from that monitoring system: how do we turn a monitoring system into an intelligent optimization engine? Let's not call it a brain here, just an engine. What will this engine do?

First of all, we abandoned optimization based on a single SQL statement, because it doesn't make sense: the DBA no longer reviews a single SQL statement, and it doesn't make much sense for the system to look at a single statement either. Our first requirement is large amounts of data. What does that mean? Starting from our monitoring system, we set the first goal: collect every executed SQL statement, not a sample, but every single one. In a large system this puts huge pressure on storage, because it produces a large volume of by-products.

Just like the time-series database Facebook built for its monitoring products, the by-products we produce today also put pressure on our time-series database, which I will not expand on here. We collect the execution of every SQL statement because we have made improvements in the kernel to collect the source, the path, and all the information of each SQL statement in the database. And if monitoring is pushed down to second-level granularity, all monitoring metrics must be at least at one-second resolution, which our current technology can do.
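
To give a feel for what "collect every SQL statement, not a sample" implies, here is a hypothetical sketch of the kind of per-statement record such a collector might emit to a time-series store; the field names, the UDP transport, and the endpoint are invented, and Alibaba's collector lives inside the kernel rather than in application code:

```python
import json
import socket
import time

# Hypothetical full-collection sketch: every SQL execution becomes a small
# record (normalized fingerprint, source, latency, timestamp) shipped to a
# time-series/analytics pipeline. Endpoint and schema are illustrative.
COLLECTOR = ("127.0.0.1", 8094)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def report(sql_fingerprint, app, db_instance, latency_us, rows):
    record = {
        "ts": time.time(),               # second-level (or finer) timestamp
        "fingerprint": sql_fingerprint,  # normalized SQL text, not raw values
        "app": app,                      # which application issued the statement
        "instance": db_instance,         # which database instance served it
        "latency_us": latency_us,
        "rows": rows,
    }
    sock.sendto(json.dumps(record).encode(), COLLECTOR)

# Every execution is reported -- no sampling -- which is exactly what makes
# the downstream time-series storage and real-time analysis so expensive.
report("SELECT * FROM orders WHERE user_id = ?", "trade-app", "mysql-3721", 850, 12)
```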

In addition, we combine application-side logs with the database. In the past, when the application team shouted, "is there a problem with the database?", the DBA would say there was no problem. But from the application side you can actually see many database problems, including application errors and response times, so we combine application-side errors, especially database errors reported by the application, with the database, along with the whole call chain.

As for response time, only the response time seen by the application can really measure whether a database is good, not how the database itself looks, how low its load is, or how much CPU it uses. When all these data are collected, these large amounts of time-series data, which we call by-products, put great pressure on the whole pipeline. The engineers who build our monitoring platform felt they could not survive, because the original storage system could not support it, the analysis system could not support it, and the original platform could not compute over it. Therefore, starting from this goal, we made great improvements along the pipeline, including how to store cheaply and how to analyze in real time; these are the requirements on storage and computation.

Our goal has been clearly stated within Alibaba: we hope that within two to three years, most of the DBA's work will be replaced. I don't know whether we can do it in two to three years, but I hope we can. The DBA's work today is essentially divided into two categories. The first is operations and maintenance, but that is relatively easy to solve: small companies can use the cloud, and large companies basically have automated operations systems.

But the hardest thing to solve is the diagnosis and optimization I just mentioned. I also know many companies, such as Google and Facebook. I asked, why don't you have DBAs? They said: we don't have DBAs; we don't have the kind of traditional domestic DBA who does diagnostics and performance tuning, and those responsibilities are very limited. So this is what we hope to achieve.

Finally, with data and computation in place, we feel the future direction may be machine learning, which is quite popular now. There will also be a session tomorrow where an Alibaba colleague shares this topic, so I will not say much about machine learning here. We are only getting started and there is nothing worth bragging about yet, but we think this design is quite promising: as long as you accumulate enough data and computation, it is a promising direction.

Our other thoughts on the future of the database

On the last slide, I would like to talk in plain language about my understanding of the whole database system.

Today, no single storage system or database in a company can solve all problems. More and more, we see that diversity of data storage is inevitable: row stores have the advantages of row stores, column stores have the advantages of column stores, systems built for computation are good at computation, systems built for analysis are good at analysis, and systems built for OLTP are good at OLTP. Don't expect, or rather it's hard to expect, one system to do everything. It may not sound great for me to say this, but it really is difficult. What do we see instead? Every technology or product can do one thing best in production, and you can use it to solve that problem.

This goes back to the previous question: we have also taken some detours. There are more and more types of data storage, so what happens if we use this one today and that one tomorrow? Our operations staff cannot handle it, and supporting all of them is very painful.

So today we propose building two platforms: first, a supporting platform, which shields the complexity of the underlying storage as much as possible and provides a unified interface and services to the layers above; second, a service platform, clearly oriented toward R&D, through which developers can use database services directly. I see that many companies mix the operations platform with the platform built by DBAs, but Alibaba's idea is that the supporting platform and the service platform are two layered platforms: the supporting platform sits below, and the service platform above serves all developers. On this platform developers can see which databases they use, how they are performing, and what they can do, which saves a lot of DBA manpower.

There is a joke within our team: "platforms that do not save manpower and technologies that do not save cost are all hooligans." What does this mean? Our automation systems, especially in large companies, keep growing, and the end result is that people lose their abilities. I don't know whether you have this problem. This is my last point: the paradox of automation systems. Has something like this happened in your company while building automation systems? It is happening at Alibaba: human capability has been weakened.

We stumbled on this paradox of automation by accident, in discussions of aircraft autopilot: because the autopilot is good enough, when an emergency occurs the pilot no longer has the ability to handle it. This is the paradox of automation systems.

By comparison, we have built many automation systems, and as a result people only know how to click buttons in the system; as soon as the system gets stuck, they are doomed. Many secondary failures occur when the system is stuck and the people in front of it cannot handle it. What should we do? This is the question to think about. Everyone here who leads a team or builds such systems has to think about it, and we are facing this problem directly as well: how to combine the ability of people with the ability of the system. This is another topic, and I cannot give answers today, but we should pay special attention to these questions.

Do not believe those myths that have expired. Database storage and compute can be separated, and databases can also be placed in containers. But you really need to look at what the original myth, or the problem behind it, was; there may already be a solution. So, everyone in this room, when your boss, CTO, or anyone else asks you, "can you do this?", I hope you can tell them, "I can!"

There is a story within our team: one of our DBAs read an article somewhere asking, what is the concept of a DBA? I was particularly struck by a developer's reply at the bottom: "DBAs are a group of people who are always saying no" — we can't do this, we can't do that. Today, I believe we will become a group of people who can always say "yes" in the future. Thank you!

Transferred from: http://www.sohu.com/a/143113774_629652
