
Tencent Technology Engineering | The rise of future database systems, seen through storage technology in the new hardware environment (with PPT)


This article is based on Zhu Yuean's live talk at the Gdevops 2017 Global Agile Operations Summit, Guangzhou.

Reply "Database Technology" in the official account dialog box to get the complete PPT.

Lecturer introduction

Zhu Yuean, Ph.D. (Renmin University of China), senior engineer in Tencent's Infrastructure Department. His main research interests are database system theory and implementation, database systems on new hardware platforms, and hybrid TP+AP systems.

Outline of this talk:

Development of modern processors and new types of storage

Database Technology under Modern processor

Database system for New Storage

Summary

Many of you will have seen "Interstellar", which has many striking scenes. What stays with me is the poem the old professor recites when encouraging Cooper to explore space and find a habitable planet for humanity: "Do not go gentle into that good night … Though wise men at their end know dark is right …" For technical people the message fits: the hardware beneath the database system is being reinvented, and we should explore new technologies to adapt to it rather than stand still. So I open this talk with that poem.

Development of modern processors and new types of storage

1. Modern processor

First, some background on modern processors and new storage. Since around 2005, CPU manufacturers have stopped chasing ever-higher clock frequencies and turned to multi-core designs. The key reasons are power consumption and fabrication limits: it simply became impractical to keep pushing frequency.

In today's ordinary servers, processors with dozens of cores are quite common, and multi-core is a familiar idea. There is also a stronger notion, many-core, which refers to processors integrating hundreds of processing cores. Multi-core processors may be familiar, but have you noticed the memory-wall effect?

A CPU memory access used to take roughly one cycle; now it takes hundreds, so touching memory has become a relatively expensive operation. In today's large-memory, in-memory-computing environments the memory-wall effect is even more serious. Making programs exhibit locality, that is, overcoming the memory wall, has therefore become the most important concern.
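To make the memory wall concrete, here is a minimal C sketch (sizes and names are illustrative, not from any benchmark in the talk): both functions do the same arithmetic over the same elements, but the sequential walk streams through cache lines while the strided walk touches a new line on almost every read, so on real hardware it runs far slower.

```c
#include <stddef.h>

/* Cache-friendly: unit stride, each cache line is fully used. */
long sum_sequential(const int *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Cache-hostile: same elements, same arithmetic, but large jumps
 * defeat the cache and the hardware prefetcher. */
long sum_strided(const int *a, int n, int stride) {
    long s = 0;
    for (int start = 0; start < stride; start++)
        for (int i = start; i < n; i += stride)
            s += a[i];
    return s;
}
```

Both return the same sum; only the access pattern, and therefore the running time, differs.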

2. New storage equipment

Have you heard of non-volatile memory? Intel's recently launched 3D XPoint technology belongs to this category: when power is lost, the data is not. It has the characteristics of both disk and memory and combines the advantages of the two, the persistent storage of disk and the fast access of memory. Its main characteristics are non-volatility, low latency, large capacity, and read-write asymmetry.

You can imagine that with such hardware, we system designers no longer need to obsess over the so-called I/O problem and can focus on high-performance computing, and in particular on the scalability of the system.

Introduction of principle

The new storage just mentioned, non-volatile memory, has four main implementations; the most mature, and the one with the best market prospects, is phase-change memory:

Phase-change memory: the material switches between crystalline and amorphous states

Spin magnetic moment: flipping the magnetization direction of a two-layer magnetic material

Ferroelectric material: a binary state encoded by the level of charge the material forms

Memristor: a kind of nonlinear resistor with memory function

According to reverse engineering by engineers abroad, Intel's 3D XPoint uses this technology. Its key characteristic is a phase-change material with two states, crystalline and amorphous, which give low and high resistance, corresponding to 1 and 0. Compared with a DRAM cell, its cell consists mainly of a double-layer heat conductor, a heating insulator, and the phase-change material; when the heater warms the material, it settles into either the crystalline or the amorphous state. The other implementations can be discussed privately if you are interested, so I won't cover them here.

Related parameters

PCM is the technology that matters most at present, so let's look at its parameters. The following figures are taken from the relevant literature; the ones we care about are read/write latency, bandwidth, lifetime, and density (capacity).

From the table we can see that PCM's read and write latency is two orders of magnitude lower than Flash's, its lifetime two orders of magnitude higher, and its capacity similar to Flash's. Compared with DRAM, its read latency is very close, but there is still a gap in write latency and bandwidth, so for now PCM cannot replace memory; for some time the two kinds of storage will coexist in the computer architecture.

Another interesting point is PCM's density, 2 to 4 times that of DRAM, and its idle power consumption, about 1% of DRAM's. That is a striking feature, especially for data centers, because DRAM must be refreshed constantly just to retain the data in its cells.

The design of DBMS

We all know that the underlying hardware shapes the design of the software above it. The main contradiction in database systems today is between rapidly evolving hardware and design ideas that date from the 1970s. In that era, disk I/O was the main bottleneck of system performance, and designers focused on how best to structure the system to avoid it; that mindset is still visible everywhere in our database systems. Algorithms built for the disk era show quite serious performance problems under high concurrency.

A 2010 study by the Carnegie Mellon University database research group tested the performance of several open-source databases, and the performance and scalability of these systems on multi-core processors were not satisfactory. This paper opened the era of multi-core optimization for database systems; open-source projects in particular, such as MySQL and PG, began paying attention to multi-core scalability in this period, having seen how their systems behaved in a multi-core environment.

Where did all the time go?

So where does a database system's transaction execution time actually go? An MIT study concluded that database systems spend most of their time in the buffer pool manager and the logging subsystem, and only about 12% on genuinely useful work.

These modules contain a large number of critical sections, and many are poorly designed, as a glance at the code shows: a single big lock wrapped, without much thought, around hundreds of lines of code. As you just saw, under concurrency such a system performs quite badly.

Database Technology under Modern processor

Have you heard of Jim Gray? In today's database systems, essentially all of the core transaction techniques were proposed by him. Sadly, in 2007 he sailed out to sea alone and disappeared; a massive search was mounted, but he was never found. A legendary figure, he won the Turing Award for his outstanding contributions to database transactions.

On overcoming the memory wall, Jim Gray once said: RAM Locality Is King. The locality of data and program behavior is the ultimate weapon against the speed mismatch between CPU and memory.

Design principles of RAM-Locality

Several techniques exploit this principle in databases. One is column storage. Column storage is mainly used in OLAP; row storage is used in OLTP databases such as MySQL and PG. Why columns? Data analysis often runs over wide tables with hundreds of fields, yet a query usually touches only a few of them, for example reading the sales field and summing it in an aggregation. With column storage we make much better use of the cache, reduce cache misses, and push back the memory wall.
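A minimal C sketch of the layout difference behind the aggregation example above (the table and field names are made up for illustration): summing one field in the row layout drags every other field through the cache, while in the column layout that field is a single contiguous stream.

```c
#define NROWS 4

/* Row store: one whole tuple per slot. */
struct row { int id; int qty; double sales; };

/* Column store: each attribute stored contiguously. */
struct cols { int id[NROWS]; int qty[NROWS]; double sales[NROWS]; };

double sum_sales_rows(const struct row *t, int n) {
    double s = 0;
    for (int i = 0; i < n; i++)
        s += t[i].sales;      /* stride = sizeof(struct row) bytes */
    return s;
}

double sum_sales_cols(const struct cols *c, int n) {
    double s = 0;
    for (int i = 0; i < n; i++)
        s += c->sales[i];     /* stride = sizeof(double) bytes */
    return s;
}
```

With hundreds of columns instead of three, the row-store scan wastes almost the entire cache line on fields the query never reads.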

Another is to design cache-friendly data structures and algorithms. The tuple-at-a-time query processing used in current databases, for example, is very unfriendly to program locality.

What is tuple-at-a-time processing? A query is compiled into an operator tree; each operator pulls one tuple from its child through a get_next function, with leaf nodes returning data up the recursive calls. These frequent function calls cause serious cache misses, so newer OLAP systems use vectorized query execution engines: an upper operator processes a batch of tuples per call rather than one, reducing function-call and context-switch overhead and maximizing the locality of both data and program instructions. Likewise, hash joins partition the hash table according to the cache size, again improving data and instruction locality and reducing cache misses.
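Here is a toy C sketch of the two interfaces (a scan over an in-memory array stands in for a real operator; the names are illustrative). In the Volcano model every tuple costs a get_next call; the vectorized variant amortizes one call over a whole batch.

```c
typedef struct { const int *data; int n; int pos; } scan_t;

/* Volcano-style: one call, one tuple.  Returns 0 at end of input. */
int get_next(scan_t *s, int *out) {
    if (s->pos >= s->n) return 0;
    *out = s->data[s->pos++];
    return 1;
}

/* Vectorized: one call fills up to `cap` tuples, returns the count.
 * The tight inner loop keeps data and instructions in cache. */
int get_batch(scan_t *s, int *out, int cap) {
    int k = 0;
    while (k < cap && s->pos < s->n)
        out[k++] = s->data[s->pos++];
    return k;
}
```

With a batch size of, say, 1024, the per-tuple call overhead essentially disappears from the profile.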

An example

Here is an example of designing a cache-friendly structure. Before PG 9.5, when the system judged transaction activity or took a snapshot, it read fields such as the transaction start time and transaction ID from the PGPROC structure. PGPROC has some 25 members, but only a few are needed for visibility judgment, so this design also pulled irrelevant fields into the cache, polluting cache lines and wasting them on misses. The fix was to move the frequently accessed fields, such as those used for visibility judgment, into a separate structure called PGXACT. With this patch the performance gain under high concurrency is considerable, as shown in the red figures at the upper right of the slide.
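The idea behind that split can be sketched in C as follows. The field names here are illustrative, not PostgreSQL's actual definitions: the point is that the "hot" fields scanned on every snapshot live in a small, densely packed array, so a scan over all backends touches very few cache lines.

```c
#include <stddef.h>

/* Rarely scanned per-backend state (the big PGPROC-like struct). */
struct backend_cold {
    int  pid;
    char app_name[64];
    long wait_status;
    /* ... many more members in the real thing ... */
};

/* Scanned on every snapshot (the small PGXACT-like companion). */
struct backend_hot {
    unsigned int xid;    /* current transaction id, 0 if idle      */
    unsigned int xmin;   /* oldest xid this backend can still see  */
};

/* Snapshot-style scan: only the hot array is touched. */
unsigned int oldest_xmin(const struct backend_hot *h, int n) {
    unsigned int m = 0xffffffffu;
    for (int i = 0; i < n; i++)
        if (h[i].xmin != 0 && h[i].xmin < m)
            m = h[i].xmin;
    return m;
}
```

Packing many backend_hot entries per cache line is exactly what makes the snapshot scan cheap.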

Therefore, to solve the problem of Memory Wall, designing cache-friendly data structures and algorithms is a very effective method.

Avoid hot spots and simplify critical areas

For the multi-core problem we also need to avoid hotspots and keep critical sections short. As we often see, performance drops once concurrency gets large.

Here is an experimental result from Microsoft's in-memory database Hekaton, which takes a global timestamp when a transaction commits. That single global atomic operation becomes a performance problem. For MySQL and PG, however, the bottlenecks appear long before atomic operations of this kind matter.

This is the problem I mentioned earlier. Disk-era databases are designed to optimize disk I/O: a transaction need not flush its dirty pages at commit, avoiding random I/O. The special term is No-Force: at commit you don't flush the dirty pages, but the system does force the log to disk first. This centralized log design easily becomes a performance problem, and for update-intensive workloads it is even more pronounced.

The traditional write-ahead logging (WAL) insert path, whether in PG or MySQL, has three steps: acquire a big lock protecting the shared log buffer; copy the log record into the corresponding buffer; release the lock. That is the classic approach.
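A minimal C sketch of those three steps (buffer size and names are illustrative; real systems also handle wraparound and flushing): one big lock covers both the space accounting and the memcpy, so every concurrent transaction serializes on it.

```c
#include <pthread.h>
#include <string.h>

#define LOG_BUF_SIZE 4096

static char log_buf[LOG_BUF_SIZE];
static size_t log_tail;                        /* next free byte */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the record's log sequence number (its offset in the log). */
size_t wal_insert(const void *rec, size_t len) {
    pthread_mutex_lock(&log_lock);             /* 1. acquire the big lock */
    size_t lsn = log_tail;
    memcpy(log_buf + lsn, rec, len);           /* 2. copy under the lock  */
    log_tail += len;
    pthread_mutex_unlock(&log_lock);           /* 3. release              */
    return lsn;
}
```

The copy, the slowest part, happens while the lock is held, which is precisely what the optimization below removes.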

We have also discussed with the community whether the centralized design could be abandoned in favor of distributed logging: instead of a single log manager, use multiple log buffers concurrently, turning log sequence numbers into logical timestamps.

Before PG 9.4, the code took exactly this crude form: grab a big lock, then do everything inside the critical section, length calculation, copying the record, boundary checks, and so on, about 300 lines of code in all. Eventually the module's performance problem became too serious to ignore.

The solution is to abstract the log as a linear address space and reserve a position in it at write time. Once the position is determined the lock can be released, because the writer already knows where its bytes go; there is no need to hold the lock through the copy, and transactions can copy their log records in parallel. After this optimization, performance improved by roughly 20% to 30%; the PG community has the corresponding test reports.
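The reserve-then-copy idea can be sketched as follows (names are illustrative; the reservation is shown as a single atomic fetch-add, a simplification of the short lock-protected reservation the text describes). The critical section shrinks to claiming a byte range, and the copy proceeds with no lock held, so concurrent backends copy in parallel.

```c
#include <stdatomic.h>
#include <string.h>

#define FAST_LOG_SIZE 4096

static char fast_log[FAST_LOG_SIZE];
static atomic_size_t fast_tail;    /* next free byte in the linear log */

size_t wal_insert_fast(const void *rec, size_t len) {
    /* 1. Reserve [lsn, lsn+len): the only synchronized step. */
    size_t lsn = atomic_fetch_add(&fast_tail, len);
    /* 2. Copy outside any lock; other backends copy concurrently
     *    into their own reserved, non-overlapping ranges. */
    memcpy(fast_log + lsn, rec, len);
    return lsn;
}
```

Reserved ranges never overlap, so the unlocked copies cannot race with each other; a real implementation additionally tracks which prefixes are fully copied before flushing.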

Editor's note: for more on distributed logging, see the article "Optimizing stand-alone database systems with distributed logs: will it be standard in the future?"

Lock Manager (Lock request)

Another issue is logical locks in the database. Locking is implemented through a hash table that maintains all the lock information. A lock is really just a tag: to take a row lock, you use the table ID and row ID as the key, hash it into the lock table, and record which mode the lock has, shared, exclusive, and so on.
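A toy C sketch of that keying scheme (the hash function here is a simple illustrative mix, not any real system's):

```c
/* A row lock is just a tag hashed into a shared lock table. */
struct lock_tag { unsigned int table_id; unsigned int row_id; };

enum lock_mode { LOCK_SHARED, LOCK_EXCLUSIVE };

/* Map a tag to a bucket of the shared lock table. */
unsigned int lock_bucket(struct lock_tag t, unsigned int nbuckets) {
    unsigned int h = t.table_id * 2654435761u ^ (t.row_id * 40503u);
    return h % nbuckets;
}
```

The important point is not the hash itself but that every lock and unlock touches this one shared structure, which is why it becomes a hotspot under contention.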

Under high concurrency or heavy conflict, however, this lock table itself becomes a problem. It is a shared data structure that many transactions must touch, and the constant acquiring and releasing of locks turns it into a hotspot.

PG 9.2 adopted a lock-inheritance technique: a shared table-level lock is cached locally and passed between transactions rather than returned to the lock manager, reducing interaction with the shared data structure and improving the system's concurrency.

Database system for New Storage

Next, let's discuss database systems for new storage. The underlying storage has changed, and every layer of the database architecture has to change with it.

To summarize, NVRAM has six main features: byte addressability (it behaves like memory, addressed by byte rather than by disk block), low idle power consumption, long lifetime, non-volatility, large capacity, and fast random reads and writes.

There are currently three main ways to connect NVRAM to a DBMS. The leftmost in the figure is the traditional architecture, which maintains two buffers: the Log Buffer, the memory area where transaction log records are assembled, and the Data Buffer, which caches data pages so that a transaction looks there first for the data it needs. The Buffer Pool in MySQL is this Data Buffer.

The first approach simply uses NVRAM as a drop-in replacement for the disk, with no change to the database software. There is some benefit, because the underlying I/O gets faster, but it does not come close to maximizing the gain, and the software's complexity remains exactly as it was.

The second approach uses NVRAM as log storage. With today's large memories, almost all of our I/O happens in one place: writing the log, which must be forced to storage so that no data is lost. Using NVRAM as the log device yields a good return at relatively small cost, provided the corresponding algorithms and critical sections are redesigned accordingly.

The third approach is whole-system adoption: after a comprehensive redesign, the data itself lives on NVRAM. Compared with the second approach, the system no longer maintains a Log Buffer structure at all; it is abandoned entirely.

Write-behind logging

Research published by CMU in VLDB 2017, called write-behind logging, takes the whole-system approach to NVRAM. Their observation is that write-ahead logging is a disk-era algorithm, and there is no longer any need to log first. The drawback of logging first is recovery time: WAL writes the log sequentially precisely to avoid the random I/O of flushing data pages, so after a crash the system must find a checkpoint, scan the log forward from it, and replay log records one by one, which takes a long time when the data volume is large.

Their new algorithm for NVRAM, write-behind logging, writes dirty pages directly to NVRAM when a transaction commits (NVRAM's random I/O is fast enough for this), and only after the dirty pages are flushed does it write a log record.

The log record they designed carries no after-images at all; it simply records a commit-timestamp interval (Cp, Cd). Transactions with commit timestamps at or before Cp have committed; those falling inside (Cp, Cd) have not. On recovery, the system knows that transactions in this interval are uncommitted and invisible to others. No redo is needed, because the data is already persistent; crash recovery only scans the log to re-establish the interval in effect at the crash (checkpoints bound how much log must be scanned). Establishing this time window is effectively the undo operation.
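The recovery rule can be sketched as a single visibility check in C (a simplified reading of the scheme above; timestamps and the boundary convention are illustrative):

```c
typedef unsigned long ts_t;

/* After recovery rebuilds the gap (cp, cd): versions committed at or
 * before cp are visible; versions whose commit timestamp falls inside
 * the gap belong to transactions in flight at the crash and are hidden,
 * which is the whole "undo".  Timestamps after cd come from new
 * transactions started after recovery. */
int visible_after_recovery(ts_t commit_ts, ts_t cp, ts_t cd) {
    if (commit_ts <= cp) return 1;   /* committed before the gap   */
    if (commit_ts <= cd) return 0;   /* inside the gap: in doubt   */
    return 1;                        /* written after recovery     */
}
```

Because persistence already happened at commit, hiding the in-doubt window is all the work recovery has to do, which is why the restart is nearly instant.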

TPC-C benchmark

They compared recovery times across the algorithms, and write-behind logging achieves nearly instant recovery: the system can serve requests almost immediately. The protocol is designed specifically for non-volatile memory, though; on disk and SSD its performance is relatively poor, while on NVM, WBL improves performance by about 30%.

Summary

Application requirements, industry data, and computer hardware are the troika driving the development of database systems.

In this era of multi-core and in-memory computing, system designers should pay more attention to system scalability and to the locality of data access, to overcome the so-called memory wall.

In addition, the arrival of NVM may upend the system architecture altogether: it lets system designers take their attention off I/O entirely and focus on scalable system design.

Dickens wrote: "It was the worst of times; it was the best of times." This era gives database system practitioners challenges, but it also brings us opportunities!

Q&A

[Question 1]: We are in the financial industry. In which scenarios is this hardware most used at present? From the talk, its performance sounds quite good.

A: The most immediate application is simply improving I/O performance. For example, it can store logs, which quickly lifts system performance. There are also big-data analytics scenarios, where it is used to improve the system's I/O capability.

[Question 2]: Which manufacturers currently supply the hardware you just mentioned? Which specific vendors?

A: Tencent is working on optimizations for it. Intel launched the corresponding hardware at the beginning of the year; we have obtained it and are doing related optimization work, and we are also communicating with the community to improve system scalability.

Follow-up question: does Tencent have support and services on the database side?

A: We haven't put it on the cloud yet; we are doing some internal development, and related products will reach the cloud soon.

Follow-up question: Much of what you just covered concerns PG. Tencent must have made some optimizations too; will any of that be open-sourced?

A: yes, it's all open source.

Follow-up: where can I get it?

A: In the PG and MySQL communities; these optimizations are submitted upstream.

[Question 3]: I noticed the PPT includes a TPC-C test, but TPC-C is basically being phased out. Have you run TPC-E or TPC-DS tests?

A: There may be a misunderstanding here, because the scenarios differ. Databases have many benchmarks, such as TPC-C, TPC-H, and TPC-DS, aimed at different workloads, some OLAP and some OLTP. The familiar sysbench, for instance, targets OLTP; the ones you mention are OLAP scenarios.

Follow-up question: So TPC-H and TPC-DS test OLAP performance, and TPC-C targets TP?

A: Yes, their application scenarios are different. TPC-C remains a convincing measure of what performance a transactional database can achieve.
