How to understand the hot and cold data separation design of MySQL 04/15 Update SLTechnology News&Howtos

How to understand the hot and cold data separation design of MySQL

2025-04-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article mainly explains "how to understand MySQL hot and cold data separation design", the content of the article is simple and clear, easy to learn and understand, now please follow the editor's ideas slowly in depth, together to study and learn "how to understand MySQL hot and cold data separation design"!

Brief introduction to the Development of Database

In fact, the growth of the amount of data has been explosive with the development of the Internet, because all kinds of data are constantly being copied to every corner of the Internet after a small number of changes and additions. In order to adapt to the massive growth of Internet data, in the sense of back-end and architecture, the development of database has roughly experienced the process of "single database, single table-> master-slave read-write separation-> sub-table sub-library-> NoSQL-> NewSQL".

At first, we piled all the data in a data table; later, in order to improve performance and increase the ability of data expansion, we adopted the methods of "master-slave read-write separation" and "sub-table and sub-database". The former only needs to do data synchronization between master and slave instances without having a great impact on the existing business, while the latter needs to formulate a reasonable strategy of dividing tables and databases in a way that suits business logic. Later, the emergence of NoSQL broke some of the inherent limitations of traditional relational databases. They have different types, some for high-performance reading and writing, some for massive data storage, and some for scalability of the data structure itself.

Different types of NoSQL solve different problems in different aspects, but today's NewSQL tends to regard the database as a black box service. You can still use it in accordance with the traditional database protocol (such as the traditional MySQL), but the data storage service itself can not only have high read and write performance but also easily scale out. NewSQL is not a new thing, we can think of it as an integrated solution of previously accumulated database technology combined with distributed technology, which makes people who use data services hardly need to think about performance and scalability, and try their best to achieve high availability, high performance and scalability within data services.

"hot data" and "cold data"

After a brief understanding of the development of the database, we will introduce the problems we have encountered in data storage and some business background.

As a meteorological big data service provider, as we accumulate more and more data, we find that we urgently need a unified data path planning and specification at the global level. In many cases, the data we get from the data source not only needs to be distributed to online users immediately, but also needs to be used by internal projects. If it is simply implemented on demand, the data flow will be very confusing. Based on this consideration, we introduce the concepts of "hot data" ("online data") and "cold data" ("offline data"):

"Hot data" refers to the data that needs to be distributed to users immediately, that is, after being crawled from the data source, the data is cleaned and immediately stored to a storage medium that can be quickly distributed (such as Redis) for use by API or directly user-oriented systems. The "hot data" line needs to focus on ensuring the quality of service and stability. In order to ensure the timeliness of the data, it is also a high-priority data in data processing. "hot data" may be temporary or short-term storage, and later data may overwrite existing data.

"cold data" refers to data that does not need to be distributed to users immediately, and may never even be distributed to users as is, but it takes a long time to accumulate so that we can draw a higher level analysis based on it. The typical usage scenario of "cold data" is for the internal data evaluation system to evaluate and analyze the data accuracy, and it can also be used to model the algorithm team. The principle of setting up this data line is not to affect the service quality of "hot data", especially the timeliness and stability, but also to meet the data usage needs of some non-online projects.

In fact, this is not a new concept, many companies that do data services have similar designs, and we just borrowed such concepts according to the characteristics of our business. however, their meaning may be different from that of similar concepts you see elsewhere.

Combined with our specific business scenarios, the "hot data" line has been running effectively all the time, that is, we obtain data from the data source, store it in high-performance storage media as soon as possible, and then distribute it through the HTTP protocol. These data are up-to-date and up-to-date. There are some types of data, and we also need to see the historical changes in the visualization project and be able to do simple aggregation and calculation, which means that the data needs to be accumulated for a period of time. Then we also need some media that can be stored persistently.

Take the actual weather as an example. After collecting the data, we immediately store the latest data to Redis, and for the sake of data accumulation, we also write the new data into MySQL. This is what we did before, but with the rapid expansion of the amount of data, the problem will soon be on MySQL. For a single MySQL table with hundreds of millions of rows up, the operation will become more and more difficult, and a wide range of decimation or data insertion operations may make the whole MySQL unable to provide services, which is unacceptable for online business.

Implementation of offline data Center

After putting forward the concept of "cold data", we realized that those ancient historical data actually need to be stored in the data center pool of "cold data", while online MySQL only needs to retain the recent data. In addition, in order not to change the way existing projects use data and lower the threshold for database users, we need to be compatible with MySQL single table usage protocols, whether for online databases or "offline data" data centers.

Soon we began to consider NewSQL's solution, and TiDB naturally entered our field of vision, which is a perfect solution that can not only be compatible with existing data usage, but also achieve data scale-out, but the cost of building a minimum version of TiDB data cluster is still a bit high compared to our current role of using it as an "offline data" storage center. And our inventory services are basically based on Aliyun, so in the end, we chose PolarDB, a cloud database soon launched by Aliyun. In the meantime, we also studied many other database schemes, such as DRDS, OceanBase, Google Cloud Spanner, Amazon Aurora and so on.

Data synchronization and data expiration

With the offline data storage center, we began to think about how to convert "hot data" into "cold data". At the same time, the online database can automatically expire historical data beyond the time window. In addition, because the internal visualization project also wants to see real-time live data, it is best to get the latest live data soon.

Since it is a real-time data transfer between two MySQL (clusters), it is natural to think that we can do a data synchronization mechanism similar to that between master and slave nodes through binlog. This synchronization can achieve a second delay, which is completely acceptable in real-time. However, this cannot be a simple data synchronization, because offline data cannot synchronize the expired operation of online data. More specifically, we can sum it up as follows: MySQL synchronizes all data addition and data modification operations of the master node from the slave node, but does not synchronize the data deletion operations.

After investigation, we found that Syncer, a synchronization tool provided by TiDB, can achieve this. We only need to specify the DML statement that filters out DELETE in the configuration, as shown below:

[[skip-dmls]] db-name = "weather_data" tbl-name = "weather_now_history" type = "delete"

On the other hand, the data expiration scheme can be completed directly with the help of EVENT and PROCEDURE mechanism of MySQL itself. First, we can create a PROCEDURE to delete the data:

CREATE DEFINER= `weather` @ `% `PROCEDURE `weather_ data`.`del _ old_ data` (IN `weather` int) BEGIN delete from weather_data.weather_now_history where datetime < date_sub (curdate (), interval date_inter day); END

This PROCEDURE function is to delete the data before the date_ interday in the weather_now_ calendar table. Then we create another EVENT:

CREATE EVENT del_old_data ON SCHEDULE EVERY 1 DAY STARTS '2018-12-25 10 ON COMPLETION PRESERVE ENABLE DO call del_old_data 08purl 35.000' ON COMPLETION PRESERVE ENABLE DO call del_old_data (30)

This EVENT calls a PROCEDURE named del_old_data once a day and assigns date_inter to 30. This means that the database deletes data once a day, so that the online database only retains the data of the last 30 days, and the full amount of data is synchronized to the offline data center in real time when the data is written, which is perfect.

Continuous improvement

The above specific business scenarios are more likely to solve the problem of separation and conversion of "hot data" and "cold data", which means that the solution is not universal, and it may no longer be applicable when we encounter other databases or different data usage scenarios.

In addition, in many cases, the distinction between "hot data" and "cold data" is not so clear, and the demand for "cold data" may be transformed into "hot data". We need a mechanism that can be flexibly switched, so that the data source can be crawled only once ("hot data" and "cold data" do not grab separately), and the captured data can flow freely to the "hot data" or "cold data" line. This means that we should do another layer of isolation between data crawling and data storage.

To achieve the isolation between data crawling and data storage, we can adopt the "publish / subscribe mode": to put it simply, the data crawling service publishes the data to the message queue after obtaining the data, and the subsequent storage service arbitrarily subscribes to the message queue for storage, so that the data source only needs to be crawled once, and we can use it as either hot data or cold data. It can even be used as both hot and cold data, and it is very easy to switch. This is one area where the follow-up system architecture can be improved.

In addition, it is not enough for offline data centers to use PolarDB only for the magnitude of data we may generate. We also need lower-cost data storage solutions to store data that is older and rarely accessed that needs to be "archived". At this time, some column-based NoSQL databases may come in handy.

Data governance needs a long-term and continuous process, and we are still groping in combination with our own business scenarios.

Thank you for your reading, the above is the content of "how to understand MySQL hot and cold data separation design". After the study of this article, I believe you have a deeper understanding of how to understand MySQL hot and cold data separation design, and the specific use needs to be verified in practice. Here is, the editor will push for you more related knowledge points of the article, welcome to follow!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.