How to Implement a 10-Billion-Key Storage Scheme in Redis

2025-01-16 Update From: SLTechnology News&Howtos


This article explains how to implement a storage scheme for 10 billion Redis keys. The approach described here is simple, fast, and practical.

1. Requirements and background

This application scenario arises from DMP cache storage requirements. The DMP needs to manage a large amount of third-party ID data, including the mapping between each media cookie and its own cookie (hereinafter collectively referred to as supperid), population tags for each supperid, population tags for mobile IDs (mainly IDFA and IMEI), and data such as blacklisted IDs and IPs.

Storing hundreds of billions of records offline with the help of HDFS is not difficult, but the DMP also has to provide millisecond-level real-time queries. Because IDs such as cookies are unstable, the browsing behavior of many real users constantly generates new cookies. Only by synchronizing the mapping data in time can queries hit the DMP's population tags; a high hit rate cannot be achieved through pre-warming alone, which poses a great challenge for cache storage.

In actual testing, conventional storage of more than 5 billion of these kv records requires over 1 TB of memory, and highly available replicas would multiply that consumption. In addition, the varying lengths of keys and values produce a lot of memory fragmentation. A very large-scale storage solution is needed to address these problems.

2. What data is stored?

Population tags are mainly cookie, IMEI, and IDFA together with their corresponding gender, age, and geo (region) attributes; the mapping relationship is mainly from media cookies to supperid. Here is an example of the data layout:

1) ID on PC:

media ID + media cookie => supperid

supperid => {age => age code, gender => gender code, geo => geolocation code}

2) ID on Device:

imei or idfa => {age => age code, gender => gender code, geo => geolocation code}

Obviously, PC data requires two storage types, key => value and key => hashmap, while device data only needs key => hashmap.

3. Data characteristics

Short keys, short values: supperid is a 21-digit number, e.g. 1605242015141689522; imei is a lowercase md5, e.g. 2d131005dc0f37d362a5d97094103633; idfa is uppercase with "-", e.g. 51DFFC83-9541-4411-FA4F-356927E39D04.

Media cookies themselves vary in length.

All of this data must be served: supperid is on the order of 10 billion, media mappings are on the order of hundreds of billions, and mobile IDs are on the order of billions.

Around a billion new mapping relationships are generated every day.

Hot data can be predicted within a larger time window (some stable cookies persist).

Hot data cannot be predicted for current mapping data, since much of it consists of newly generated cookies.

4. Existing technical challenges

1) Varying key/value lengths easily cause memory fragmentation.

2) The large number of pointers causes high memory amplification, typically around 7x, a common problem of pure in-memory storage.

3) Although cookie popularity can be predicted from behavior, there are still many newly generated IDs every day (the percentage is sensitive and will not be disclosed for now).

4) Since the service must respond within 100 ms over the public network (domestic public-network latency is below 60 ms), in principle all mappings and population tags updated on the same day must be in memory, so that requests never fall through to cold backend data.

5) On the business side, in principle all data must be retained for at least 35 days.

6) Memory is still relatively expensive, so a storage solution for 10 billion or even hundreds of billions of keys is imperative!

5. Solution

5.1 Eviction strategy

One important reason storage is tight is the large amount of new data entering storage every day, so cleaning up data in a timely manner is particularly important. The main approach is to identify and retain hot data while evicting cold data.

The number of Internet users is far below tens of billions; IDs have a certain life cycle and change constantly, so most of the IDs we store are actually invalid. The front-end query logic is driven by ad exposure, which is tied to human behavior, so a given ID shows some repeatability in its access behavior within a certain time window (perhaps a campaign, half a month, a few months).

Before initializing the data, we first use HBase to aggregate and deduplicate the IDs from the logs and delimit a TTL range, usually 35 days, which cuts out IDs that have not appeared in the past 35 days. We also set the expiration time in Redis to 35 days; when an access hits a key, the key is renewed, extending its expiration, so only keys unseen for 35 days are naturally eliminated. This works well for stable cookies and IDs, and the renewal approach has proved especially practical for IDFA and IMEI: over time it accumulates a very good hit rate.
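The renew-on-hit expiry described above can be sketched in plain Java. This is an illustrative in-memory stand-in for Redis's EXPIRE/TTL mechanics, not the article's code; all class and method names are my own:

```java
import java.util.HashMap;
import java.util.Map;

// In-memory sketch of the renew-on-hit TTL policy; a real deployment would
// use Redis SET ... EX / EXPIRE. All names here are illustrative.
public class TtlCache {
    static final long TTL_MS = 35L * 24 * 60 * 60 * 1000; // 35-day TTL

    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> deadline = new HashMap<>();

    public void put(String key, String value, long nowMs) {
        values.put(key, value);
        deadline.put(key, nowMs + TTL_MS);
    }

    // A hit renews the key for another 35 days; an expired key is evicted.
    public String get(String key, long nowMs) {
        Long d = deadline.get(key);
        if (d == null || d < nowMs) {
            values.remove(key);
            deadline.remove(key);
            return null; // naturally eliminated: not seen within the window
        }
        deadline.put(key, nowMs + TTL_MS); // renew on hit
        return values.get(key);
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        TtlCache cache = new TtlCache();
        cache.put("idfa:51DFFC83", "tags", 0);
        // repeated hits inside the window keep extending the expiry
        System.out.println(cache.get("idfa:51DFFC83", 30 * day)); // tags
        System.out.println(cache.get("idfa:51DFFC83", 60 * day)); // tags (renewed at day 30)
        // a long silent stretch finally lets the key expire
        System.out.println(cache.get("idfa:51DFFC83", 100 * day)); // null
    }
}
```

The key point is that a stable ID that is queried at least once every 35 days lives indefinitely, while a one-off cookie falls out of the cache after a single TTL window.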

5.2 Reducing memory amplification

The hash table size and the number of keys determine the collision rate (or, equivalently, the load factor). Within a reasonable range, more keys mean a larger hash table and therefore more memory. Add the fact that the many pointers involved are themselves long integers, and the memory amplification of in-memory storage is considerable. Let us first look at how to reduce the number of keys.

First consider a storage structure. We want to store key1 => value1 in Redis. We hash the key with a fixed-length hash, taking (a truncation of) md5(key) as the Redis key, which we call the BucketId, and store key1 => value1 inside a hashmap under that BucketId. At query time the client computes the same hash and can thereby look up value1.

The read path simply changes from get(key1) to hget(md5(key1), key1) to obtain value1.

If we plan this in advance so that many keys collide within the BucketId space, then multiple keys hang under a single BucketId. For example, with an average of 10 keys per BucketId, we theoretically reduce the number of Redis keys by more than 90%.
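A minimal sketch of this bucketing scheme in plain Java, using nested maps to stand in for Redis hashes (HSET/HGET); the class and parameter names are mine, and the bucket width here is deliberately tiny for illustration:

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the outer map key is the BucketId (a truncated md5),
// and the inner map plays the role of a Redis hash holding the real keys.
public class BucketedStore {
    private final Map<String, Map<String, String>> buckets = new HashMap<>();
    private final int bucketHexChars; // truncation length controls bucket-space size

    public BucketedStore(int bucketHexChars) { this.bucketHexChars = bucketHexChars; }

    private String bucketId(String key) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest(key.getBytes("UTF-8"))) {
            hex.append(String.format("%02x", b));
        }
        return hex.substring(0, bucketHexChars); // deliberately collide keys
    }

    public void set(String key, String value) throws Exception { // hset(md5(key), key, value)
        buckets.computeIfAbsent(bucketId(key), k -> new HashMap<>()).put(key, value);
    }

    public String get(String key) throws Exception { // hget(md5(key), key)
        Map<String, String> bucket = buckets.get(bucketId(key));
        return bucket == null ? null : bucket.get(key);
    }

    public int redisKeyCount() { return buckets.size(); } // one top-level key per bucket

    public static void main(String[] args) throws Exception {
        BucketedStore store = new BucketedStore(2); // 16^2 = 256 possible buckets
        for (int i = 0; i < 1000; i++) {
            store.set("cookie-" + i, "supperid-" + i);
        }
        System.out.println(store.get("cookie-42"));       // supperid-42
        System.out.println(store.redisKeyCount() <= 256); // true: far fewer top-level keys
    }
}
```

Here 1000 logical keys collapse into at most 256 top-level keys; in the article's setting the BucketId width is chosen so roughly 10 keys share each bucket.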

Implementing this takes some care, and you have to think about the capacity scale before using this method. The md5 we usually use is a 32-character hexString with a 128-bit space, which is far too large; we need to store about 10 billion keys, which needs only about 33 bits, so we need a mechanism to compute an appropriate hash width. And to save memory, we fill the key with all single-byte character values (ASCII codes 0 to 127) instead of a hexString, which cuts the key length roughly in half.
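The arithmetic behind those widths can be checked quickly (my own back-of-the-envelope code, not part of the article):

```java
// Back-of-the-envelope check of the key-width claims above.
public class WidthMath {
    public static void main(String[] args) {
        long targetKeys = 10_000_000_000L; // 10 billion
        double bits = Math.log((double) targetKeys) / Math.log(2.0);
        System.out.printf("bits needed: %.1f%n", bits); // ~33.2 -> "about 33 bits"

        int bit = 33;
        int hexChars = (bit + 3) / 4;     // 4 bits per hexString character
        int rawChars = (bit - 1) / 7 + 1; // 7 usable bits per single-byte character
        System.out.println(hexChars);     // 9
        System.out.println(rawChars);     // 5 -- roughly half the key length
    }
}
```

A 33-bit BucketId needs 9 hexString characters but only 5 raw single-byte characters, which is where the "reduce the key length to half" claim comes from.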

Here is a concrete implementation:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static byte[] getBucketId(byte[] key, Integer bit) throws NoSuchAlgorithmException {
    MessageDigest mdInst = MessageDigest.getInstance("MD5");
    mdInst.update(key);
    byte[] md = mdInst.digest();
    // only 7 bits of a byte can be represented as a single character
    byte[] r = new byte[(bit - 1) / 7 + 1];
    int a = (int) Math.pow(2, bit % 7) - 2;
    md[r.length - 1] = (byte) (md[r.length - 1] & a);
    System.arraycopy(md, 0, r, 0, r.length);
    for (int i = 0; i < r.length; i++) {
        if (r[i] < 0) {
            // mask each byte into the 0-127 single-character range
            // (loop body reconstructed; it was garbled in the source)
            r[i] &= 127;
        }
    }
    return r;
}
```
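A quick self-contained demonstration of the method above (the class wrapper and sample key are mine): for bit = 33, the BucketId is (33-1)/7 + 1 = 5 bytes long, and every byte lands in the single-character range 0-127.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Demo wrapper around getBucketId; the class name and sample key are illustrative.
public class BucketIdDemo {
    public static byte[] getBucketId(byte[] key, Integer bit) throws NoSuchAlgorithmException {
        MessageDigest mdInst = MessageDigest.getInstance("MD5");
        mdInst.update(key);
        byte[] md = mdInst.digest();
        byte[] r = new byte[(bit - 1) / 7 + 1]; // 7 usable bits per character
        int a = (int) Math.pow(2, bit % 7) - 2;
        md[r.length - 1] = (byte) (md[r.length - 1] & a);
        System.arraycopy(md, 0, r, 0, r.length);
        for (int i = 0; i < r.length; i++) {
            if (r[i] < 0) {
                r[i] &= 127; // mask into the 0-127 single-character range
            }
        }
        return r;
    }

    public static void main(String[] args) throws Exception {
        byte[] id = getBucketId("media-cookie-abc123".getBytes("UTF-8"), 33);
        System.out.println(id.length); // 5
        boolean singleChar = true;
        for (byte b : id) {
            if (b < 0) singleChar = false;
        }
        System.out.println(singleChar); // true
    }
}
```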
