This article is based on Xu Haifeng's speech at the Ninth China Database Technology Conference on May 12, 2018.
About the speaker:
Xu Haifeng, nicknamed "Big Mouth". Ten years of Internet experience. Currently chief architect and technical expert at China Literature Group, mainly responsible for the architecture and implementation of the group's content-center distributed systems, and for distributed storage and computing over massive data. Also responsible for the company's patents, technology evangelism, and open-source outreach. Previously an architect of Ctrip's international air-ticket pricing engine and a distributed storage and computing architect at 5173. Over the years he has devoted himself to research and practice in distributed website architecture and middleware for massive-data storage and computing, forming a mature system of technical understanding and theory, with rich practical experience in large-scale website architecture and distributed systems.
Content summary:
Common caching systems, such as memcached, generally keep data in memory and do not support persistence. Although Redis later fixed the "hard wound" of data that could not be persisted, whether to enable a cache system's persistence has remained a vexing question, for three main reasons: 1. the performance of the cache system degrades significantly once persistence is enabled; 2. even with persistence enabled, a machine that goes down is still unavailable after recovery, or its data cannot be brought up to the latest version automatically; 3. the primary store is still memory, so the data set is limited by memory size and data larger than memory cannot be held; persistence amounts to no more than a backup. When persistence was never part of the design, bolting it on is very awkward. Our system, lest, solves these problems from the very start of its design and brings along some interesting and practical techniques, such as proprietary communication and storage protocols and a lock-free multithreading model.
Speech transcript:
What I mainly want to share with you today is the persistent cache we use, called lest. Although it is called a cache, because it is persistent I personally think it might be better to call it a storage system. It is a KV store and supports String, List, Map, and other structures. It is now in production, holding between 1 PB and 2 PB of data.
Speaking of caching, what does a cache look like in everyone's impression? In fact, caching and tonic medicine have a lot in common. First of all, both are meant to answer the question "does it work or not": after taking it, 99% of the time there is an obvious effect, 1% of the time there is none, and the effect usually wears off within a few minutes or hours. And both take the route of "treating the symptoms rather than the cause": multi-level caching, from the client all the way down to the database.
In addition, the starting point of both is the same: stable, fast, and long-lasting. Usually users will take it first, useful or not, and ask questions later; they develop a psychological dependence on it, and the leaders are quite satisfied with the results. I also felt myself promoted from a hard-pressed programmer to a glittering "framework master".
Although it is hard to refuse caching in real life, using too much of it brings plenty of problems, especially once the data volume grows and the machines multiply, when all kinds of trouble follows. For example, current caches are basically memory-based: as soon as the power goes off the data is gone, and recovery is very difficult.
After setting up master/standby, you will find that the standby is of little use. When the master goes down and you cut over to the standby, a lot of data is out of sync, and synchronizing takes time. Just two days ago we were discussing that master/standby does not seem very useful; multi-master works better.
The most critical problem is that it is difficult to manage. Once you start using a cache you cannot throw it away: you only dare to add capacity, never to shrink it. Cache servers keep multiplying, perhaps from one to two, to four, to eight. Not only does the management cost keep rising, but writing code also becomes very complicated, because many cache systems put the routing logic on the client for speed, so every added machine means every client must be reconfigured. Teams that do this well may have a configuration system that pushes the change automatically; teams that don't must re-release the program, and a novice will probably hard-code the machine list for you.
Therefore, in the final analysis, what really matters is to strengthen the body itself, and to put an end to these situations we implemented lest.
The first goal is cache synchronization, which makes capacity expansion invisible to applications. Second, recovery after a master goes down must be fast: the standby may take a little while to warm up, but the master must come back quickly, because we serve roughly 700 to 800 million requests a day. If the master goes down, the cache is penetrated, and all that traffic hits the database, it is almost over. So we adopted the four strategies shown above to solve this problem.
Our cached values are String, List, and Map. List and Map are harder to store because they are structured data. For example, if you want to query the second through the tenth elements of a List, Redis can do this easily because everything is in memory, but it is much harder when the data is stored on disk, as the sketch below illustrates.
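A minimal sketch, purely illustrative and not lest's actual on-disk format, of one way to serve such a range query from disk: a small offset index at the head of the list file lets the reader seek straight to elements i through j instead of scanning the whole list.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* per-element index entry: where the element's bytes live */
struct idx_entry { uint64_t offset; uint32_t len; };

int main(void) {
    /* build a toy list file: count | index[count] | element bytes */
    const char *elems[] = { "e0", "e1", "e2", "e3" };
    uint32_t count = 4;
    FILE *f = tmpfile();
    if (!f) return 1;
    uint64_t off = sizeof count + count * sizeof(struct idx_entry);
    fwrite(&count, sizeof count, 1, f);
    for (uint32_t i = 0; i < count; i++) {
        struct idx_entry e = { off, (uint32_t)strlen(elems[i]) };
        fwrite(&e, sizeof e, 1, f);
        off += e.len;
    }
    for (uint32_t i = 0; i < count; i++)
        fwrite(elems[i], 1, strlen(elems[i]), f);

    /* range query for elements 1..2: hop through the index and touch
     * only the requested elements, never the whole list */
    for (uint32_t i = 1; i <= 2; i++) {
        struct idx_entry e;
        char buf[16];
        fseek(f, (long)(sizeof count + i * sizeof e), SEEK_SET);
        fread(&e, sizeof e, 1, f);
        fseek(f, (long)e.offset, SEEK_SET);
        fread(buf, 1, e.len, f);
        buf[e.len] = '\0';
        printf("[%u] %s\n", i, buf);
    }
    fclose(f);
    return 0;
}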
Therefore we made some designs to implement it; the figure above is our overall architecture diagram. At the top right is the Tracker, similar to the cache proxy layer many big companies are building, and below it are the storage machines. Storage is divided into segments, for example 256 or 128 of them, and data is distributed across the segments. For communication and storage we use protocols we designed ourselves.
Load balancing is actually a cliché. One of the biggest constraints of a cache is that everything is keyed on the key itself; because the data is stored on disk, per-business customization in the style of metadata management is hard to do. Our choice is hashing, but the trouble with hashing is that the hash mapping changes as machines are added. Therefore, when we add machines, the trick is to grow in powers of two: one to two, two to four, four to eight. That way the amount of data moved is the smallest possible, 50%. If you go from one machine to three, 66.7% of the data moves. If you use hashing, I suggest expanding capacity in powers of two.
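To make the 50% versus 66.7% figures concrete, here is a small counting experiment of my own (the hash is FNV-1a, picked arbitrarily; the talk does not say which hash lest uses): count how many keys land in a different bucket when the machine count changes under modulo hashing.

#include <stdio.h>
#include <stdint.h>

/* FNV-1a over the 8 bytes of the key */
static uint64_t fnv1a(uint64_t key) {
    uint64_t h = 1469598103934665603ULL;
    const unsigned char *p = (const unsigned char *)&key;
    for (int i = 0; i < 8; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* fraction of keys whose bucket changes when going before -> after */
static double moved_fraction(int before, int after, int nkeys) {
    int moved = 0;
    for (uint64_t k = 0; k < (uint64_t)nkeys; k++) {
        uint64_t h = fnv1a(k);
        if (h % (uint64_t)before != h % (uint64_t)after) moved++;
    }
    return (double)moved / nkeys;
}

int main(void) {
    printf("1 -> 2 machines: %.1f%% moved\n",
           100.0 * moved_fraction(1, 2, 1000000));  /* ~50.0% */
    printf("1 -> 3 machines: %.1f%% moved\n",
           100.0 * moved_fraction(1, 3, 1000000));  /* ~66.7% */
    return 0;
}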
After the data is stored, it needs to be synchronized. We have the notion of a group: machines within the same group synchronize data with each other and serve as each other's backups. There is no Slave; every node is a master. This involves the question of versioning, which we will come to later.
The load-balancing algorithm evolved from two-level hashing to weighted two-level hashing. At the beginning we hashed twice: the first hash picks the segment, the second hash picks the machine. But in practice, for business reasons, the IDs produced by the ID generator are not balanced, which skewed the cache storage to 20% on one machine and 80% on the other.
This situation is easy to handle: weighting fixes it, which is equivalent to consistent hashing, with each machine given a prime-based weight, something like 7%. Why choose primes? Because hashing against primes comes out more uniform. I sketch my understanding of the scheme below.
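In this sketch the concrete weights and the weighted-pick loop are my assumptions; the talk only says that prime-based weights hash more uniformly. The first hash selects the segment, the second selects a machine in proportion to its weight.

#include <stdio.h>
#include <stdint.h>

#define SEGMENTS 256
static const int weight[] = { 7, 11, 13 };  /* primes, one per machine */
#define MACHINES (sizeof weight / sizeof weight[0])

static unsigned segment_of(uint64_t h1) {
    return (unsigned)(h1 % SEGMENTS);       /* first hash: the segment */
}

static unsigned machine_of(uint64_t h2) {   /* second hash: the machine */
    int total = 0;
    for (size_t i = 0; i < MACHINES; i++) total += weight[i];
    int slot = (int)(h2 % (uint64_t)total);
    size_t m = 0;
    while (slot >= weight[m]) { slot -= weight[m]; m++; }
    return (unsigned)m;                     /* heavier weight, more keys */
}

int main(void) {
    uint64_t h1 = 0xDEADBEEFCAFEULL, h2 = h1 * 2654435761ULL;
    printf("segment %u, machine %u\n", segment_of(h1), machine_of(h2));
    return 0;
}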
On disk we created a 256 × 256 directory tree and scattered the files across it at the disk level. We know that huge numbers of small files are scary for a disk; since the cost of SSD is still relatively high, we had also considered staying on ordinary disks and adding something like a B-tree on top.
At present, given how many small files there are, we chose SSD to avoid some of the problems a spinning disk would run into. With 100 million KV entries, after hashing each folder holds only a few hundred files, never exceeding about 3,000, and that pressure is acceptable. The path scheme might look like the sketch below.
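A minimal sketch of such a path scheme (the naming is my assumption; the talk only specifies the 256 × 256 directory tree): two bytes of the key's hash pick the two directory levels, so files spread evenly and no folder grows too large.

#include <stdio.h>
#include <stdint.h>

/* hypothetical path builder: two hash bytes choose the directories */
static void key_path(uint64_t key_hash, char *buf, size_t len) {
    unsigned dir1 = (unsigned)((key_hash >> 8) & 0xFF); /* level 1 */
    unsigned dir2 = (unsigned)(key_hash & 0xFF);        /* level 2 */
    snprintf(buf, len, "data/%02x/%02x/%016llx.rec",
             dir1, dir2, (unsigned long long)key_hash);
}

int main(void) {
    char path[64];
    key_path(0x1122334455667788ULL, path, sizeof path);
    printf("%s\n", path);  /* data/77/88/1122334455667788.rec */
    return 0;
}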
The figure above shows the data storage model, with a Head at the front carrying a good deal of metadata. For example, if the whole record is a string, then the yellow part is the actual content saved by the client. Len is the length and Version is the version. The whole system is written in C, so performance is probably improved by more than 10x. We also built a monotonically increasing ID generator, whose algorithm is a derivative of a time-vector algorithm, to solve the version-control problem; in other words, whichever number is largest must be the final version. Reserved stands for the type, which may be string, list, or map, and we also keep some space reserved for future expansion.
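Based on that description, the head might look something like the following sketch; the field widths are my guesses, and only the names Len, Version, and Reserved come from the slide.

#include <stdio.h>
#include <stdint.h>

/* on-disk record sketch: a fixed Head, then the raw value bytes
 * (the "yellow part" on the slide, i.e. what the client stored) */
struct lest_head {
    uint32_t len;         /* Len: byte length of the value that follows */
    uint64_t version;     /* Version: from the monotonic ID generator;  */
                          /* the largest version wins as the final one  */
    uint8_t  type;        /* Reserved/type: string, list, or map        */
    uint8_t  reserved[3]; /* spare room for future expansion            */
};

int main(void) {
    printf("head size: %zu bytes\n", sizeof(struct lest_head));
    return 0;
}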
List and Map are in fact similar to String; you can just look at the figures, so I won't walk through them one by one.
This storage would be difficult to operate on by hand, so we made an HMS object, a data protocol that supports all the usual types, including int, long, and so on. Its most classic use is on the server side: since the C server cannot reflect over objects the way Java can, we represent the whole content with an ordered structure, so that a lookup by key achieves O(log N) query efficiency.
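The talk does not name the ordered structure, so as one concrete possibility, keeping an object's fields sorted by key gives the stated O(log N) lookup via plain binary search:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct hms_field {        /* hypothetical field of an HMS object */
    const char *key;
    const char *value;
};

static int cmp_field(const void *a, const void *b) {
    return strcmp(((const struct hms_field *)a)->key,
                  ((const struct hms_field *)b)->key);
}

int main(void) {
    /* fields kept sorted by key so bsearch() can find them in log N */
    struct hms_field fields[] = {
        { "author", "haifeng" }, { "id", "42" }, { "title", "lest" },
    };
    struct hms_field probe = { "id", NULL };
    struct hms_field *hit = bsearch(&probe, fields, 3, sizeof fields[0],
                                    cmp_field);
    printf("id = %s\n", hit ? hit->value : "(missing)");  /* id = 42 */
    return 0;
}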
The figure above shows the API we provide. It supports almost all operations, and the usage is similar to Redis: whatever Redis can do, lest can do too.
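To give a flavor of that Redis-like surface, a client header might look like the sketch below; every name in it is invented for illustration, since the real API is only shown on the slide.

/* Hypothetical lest client API sketch: all names are invented here
 * purely to illustrate the Redis-like usage the speaker describes. */
#include <stddef.h>

typedef struct lest_client lest_client;            /* opaque handle */

int lest_set(lest_client *c, const char *key,
             const void *val, size_t len);         /* like SET     */
int lest_get(lest_client *c, const char *key,
             void *buf, size_t *len);              /* like GET     */
int lest_lpush(lest_client *c, const char *key,
               const void *val, size_t len);       /* like LPUSH   */
int lest_lrange(lest_client *c, const char *key,
                int from, int to);                 /* like LRANGE  */
int lest_hset(lest_client *c, const char *key,
              const char *field, const void *val,
              size_t len);                         /* like HSET    */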
The synchronization architecture is shown in the figure above. If two storage nodes are in the same group, they obtain what they need from the Tracker and then synchronize with each other; a high-speed IP protocol is involved here as well.
Every write generates a binlog record; the binlogs are written down by category, and the peer reads from them to replay the changes.
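A minimal sketch of appending such an entry (the record format here is assumed; the talk only says each write produces a binlog record, filed by category, that the peer reads back):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* assumed entry layout: version | key length | value length | bytes */
static int binlog_append(FILE *log, uint64_t version,
                         const char *key, const void *val, uint32_t vlen) {
    uint32_t klen = (uint32_t)strlen(key);
    if (fwrite(&version, sizeof version, 1, log) != 1) return -1;
    if (fwrite(&klen, sizeof klen, 1, log) != 1) return -1;
    if (fwrite(&vlen, sizeof vlen, 1, log) != 1) return -1;
    if (fwrite(key, 1, klen, log) != klen) return -1;
    if (fwrite(val, 1, vlen, log) != vlen) return -1;
    return fflush(log);  /* the peer tails the file and replays entries */
}

int main(void) {
    FILE *log = fopen("string.binlog", "ab");  /* one log per category */
    if (!log) return 1;
    binlog_append(log, 1001, "book:1", "v1", 2);
    fclose(log);
    return 0;
}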
For synchronizing replicas there are two approaches. State transfer turns one copy of the state into two or more by shipping the state itself; it is simple to control but transmits a large amount of data. A replicated state machine is more complex to control, but the amount of data transmitted is actually very small, because only the operations are shipped; for example, instead of copying a whole value, a peer replays the write that produced it.
The figure above shows the results of our first performance test of lest. Some time ago we were allocated new machines, so we ran the test again.
The figure above shows the number of correctly answered requests. The red line is a 10-gigabit server with SSD; the yellow line is a gigabit server with a SATA hard disk. Personally, I think there is still room for improvement in these results, because we did not have enough client machines, only ten; at least 20 or 30 would be needed to bring out its true performance. The current numbers are perhaps 60% of what it can really do.
From the figure we can see that once a record exceeds about 10 KB, performance keeps declining. If you compare with Redis you will find the same thing: at 10 KB, Redis performance also declines, by 45%. This shows that caches get along best with small data.
The figure above shows the maximum response time. I don't know where the 822 in the figure came from; it may be an abnormal outlier. Apart from that value, the others are relatively stable; maximum response times are in milliseconds, basically between one and two hundred milliseconds.
The figure above shows the minimum response time, which is basically within a few milliseconds.
The figure above shows the transfer volume, that is, what the network cards carried. Obviously the 10-gigabit card has the advantage: its traffic is already beyond what a gigabit card can carry, and it may only be reaching about 80% of capacity, so there is still perhaps 10% of room for growth.
The figure above shows the load on the SSD server; the CPU picture is similar. With ten clients hammering it flat out, CPU pressure is about 20%, and as records get larger and processing time grows, CPU pressure actually drops. The figure also shows network outbound and inbound traffic: inbound is relatively small while outbound is relatively large, and with Redis the outbound side would be larger still.
The advantages and disadvantages of lest are obvious: it is a thing that eats disk rather than memory. See the figure above for the specific pros and cons. By comparison, a workload that needs 20 or 30 Redis servers can be handled by three lest servers. In addition, compared with other caches, lest basically lets you write code without being aware of it. I also recommend SSDs, because SSDs are now quite cheap, cheaper than memory.