
ID Generation Strategy: Snowflake

Updated 2025-07-19. Source: SLTechnology News&Howtos (Shulou)


First, the problem encountered

A project uses the MySQL auto-increment ID as the primary key of its main business data. The auto-increment ID is easy to use: numbering is automatic and fast, and because values grow incrementally and are stored sequentially, retrieval is efficient.

In a single-database environment, the auto-increment ID causes few problems. In a distributed environment, or one with sharded databases and tables, it gradually exposes some weaknesses. For example, once data is sharded it becomes difficult to guarantee that IDs are unique; and if business data such as order numbers use the database auto-increment ID, outsiders can easily estimate the approximate business volume.

Second, common ID generation strategies

1. Database auto-increment ID (described above)

2. UUID

The core idea of the algorithm is to generate a UUID by combining the machine's network card (MAC address), the local time, and a random number.

Advantages: generated locally, simple to produce, good performance, and no high-availability risk.

Disadvantages: long, wasteful of storage, unordered and unreadable, and inefficient to query.
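To make the trade-off concrete, here is a minimal sketch using Python's standard `uuid` module. It only illustrates the size and shape of a UUID; it is not part of the original article's code.

```python
import uuid

# uuid1 combines the host's node ID (normally the MAC address), a timestamp,
# and a clock sequence; uuid4 is built from random bits only.
time_based = uuid.uuid1()
random_based = uuid.uuid4()

print(time_based)              # 8-4-4-4-12 hexadecimal groups
print(len(str(time_based)))    # length of the text form
print(random_based.version)    # UUID version number

# A UUID is a 128-bit value, twice the width of a 64-bit integer key.
print(time_based.int.bit_length() <= 128)
```

The 36-character text form (or 16 bytes in binary) is what makes a UUID costly as a primary key compared with a 64-bit integer.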

3. Redis-generated ID

Generating IDs with Redis can be regarded as an upgraded version of the database auto-increment ID. Redis executes commands on a single thread and provides atomic increment commands such as INCR and INCRBY, so the generated IDs are guaranteed to be unique and ordered.

Advantages: does not depend on the database, is flexible and convenient, and performs better than the database; numeric IDs sort naturally, which helps with paging and with results that need sorting.

Disadvantages: if the system does not already use Redis, a new component must be introduced, which increases system complexity; the coding and configuration workload is also relatively large.

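Because a single node becomes a performance bottleneck, a Redis cluster can be used to achieve higher throughput. Suppose there are five Redis nodes in the cluster. Initialize them with the values 1, 2, 3, 4, and 5 respectively, and use a step size of 5. The IDs generated by each node are then:

Node 1: 1, 6, 11, 16, 21
Node 2: 2, 7, 12, 17, 22
Node 3: 3, 8, 13, 18, 23
Node 4: 4, 9, 14, 19, 24
Node 5: 5, 10, 15, 20, 25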
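The stepped allocation can be sketched without a running Redis instance. In the sketch below, `SteppedCounter` is a hypothetical in-memory stand-in for one node's counter; against a real node, the body of `next_id` would be a single INCRBY call with the step as its increment.

```python
class SteppedCounter:
    """In-memory stand-in for one Redis node's counter (hypothetical helper)."""

    def __init__(self, start: int, step: int) -> None:
        self.step = step
        # Start one step below so that the first increment returns `start`.
        self.value = start - step

    def next_id(self) -> int:
        # With a real Redis node this would be one atomic INCRBY <key> <step>.
        self.value += self.step
        return self.value


NODE_COUNT = 5
nodes = [SteppedCounter(start=i, step=NODE_COUNT) for i in range(1, NODE_COUNT + 1)]

for number, node in enumerate(nodes, start=1):
    print(f"Node {number}:", [node.next_id() for _ in range(5)])
```

Because every node starts at a different offset and all use the same step, the sequences never overlap. The cost is that the node count is baked into the step, so adding a sixth node later requires re-planning the offsets.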

4. Twitter's Snowflake algorithm

Third, the Snowflake algorithm

The Snowflake algorithm uses a 64-bit binary integer. The bits are allocated as follows.

1 bit, unused. In binary, a highest bit of 1 means a negative number, but all the IDs we generate are positive, so the highest bit is always 0.

41 bits, used to record timestamps (milliseconds).

Treated as a non-negative integer, 41 bits can represent values from 0 to $2^{41} - 1$; the 1 is subtracted because the range starts at 0, not 1.

In other words, 41 bits can cover $2^{41} - 1$ milliseconds. Converted to years, that is $(2^{41} - 1) / (1000 \times 60 \times 60 \times 24 \times 365) \approx 69.7$, or roughly 69 years.

10 bits, used to record the worker machine ID.

This allows deployment on up to 1024 nodes; the 10 bits consist of a 5-bit datacenterId and a 5-bit workerId.

12 bits, the sequence number, used to distinguish IDs generated within the same millisecond.

The largest non-negative integer that 12 bits can represent is $2^{12} - 1 = 4095$, so the 4096 values 0, 1, 2, 3, ..., 4095 serve as the sequence numbers of up to 4096 IDs generated by the same machine in the same millisecond.
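The bit layout described above can be expressed as a few shifts and masks. The following is a minimal sketch, assuming the timestamp is given as milliseconds since some custom epoch; the names `pack` and `unpack` are illustrative and not taken from any particular implementation.

```python
TIMESTAMP_BITS = 41
DATACENTER_BITS = 5
WORKER_BITS = 5
SEQUENCE_BITS = 12

# Each field sits to the left of all the fields listed after it.
WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = WORKER_SHIFT + WORKER_BITS
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS

MAX_TIMESTAMP = (1 << TIMESTAMP_BITS) - 1
MAX_DATACENTER = (1 << DATACENTER_BITS) - 1
MAX_WORKER = (1 << WORKER_BITS) - 1
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1


def pack(timestamp_ms: int, datacenter_id: int, worker_id: int, sequence: int) -> int:
    """Combine the four fields into one non-negative 63-bit integer."""
    for value, limit in (
        (timestamp_ms, MAX_TIMESTAMP),
        (datacenter_id, MAX_DATACENTER),
        (worker_id, MAX_WORKER),
        (sequence, MAX_SEQUENCE),
    ):
        if not 0 <= value <= limit:
            raise ValueError(f"field value {value} is outside 0..{limit}")
    return (
        (timestamp_ms << TIMESTAMP_SHIFT)
        | (datacenter_id << DATACENTER_SHIFT)
        | (worker_id << WORKER_SHIFT)
        | sequence
    )


def unpack(snowflake_id: int) -> tuple:
    """Split an ID back into (timestamp_ms, datacenter_id, worker_id, sequence)."""
    return (
        snowflake_id >> TIMESTAMP_SHIFT,
        (snowflake_id >> DATACENTER_SHIFT) & MAX_DATACENTER,
        (snowflake_id >> WORKER_SHIFT) & MAX_WORKER,
        snowflake_id & MAX_SEQUENCE,
    )


MS_PER_YEAR = 1000 * 60 * 60 * 24 * 365
print(round(MAX_TIMESTAMP / MS_PER_YEAR, 1))  # 69.7 years of timestamp range
print(unpack(pack(123456, 3, 7, 42)))         # (123456, 3, 7, 42)
```

The four fields use 63 bits in total, which leaves the sign bit at 0 so the ID fits in a signed 64-bit integer.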

Most people know the algorithm itself, but Twitter also did a lot of engineering work around it using ZooKeeper. If you are interested, see https://github.com/twitter/snowflake.


The project's README.md contains this passage:

We have retired the initial release of Snowflake and working on open sourcing the next version based on Twitter-server, in a form that can run anywhere without requiring Twitter's own infrastructure services.

Twitter stopped maintaining the project several years ago, and the new version has not been released. Fortunately, the core algorithm of the existing version is enough to meet conventional requirements.

Of course, Snowflake has disadvantages as well as advantages.

Advantages:

The millisecond timestamp occupies the high bits and the auto-increment sequence the low bits, so the IDs as a whole are increasing.

It does not depend on a database or other third-party systems. Deployed as a service, it is more stable, and ID generation performance is very high.

The bits can be allocated according to your own business characteristics, which is very flexible.

Disadvantages:

It depends strongly on the machine clock. If the clock on the machine is set back, IDs may be duplicated or the service may become unavailable.

This strong dependence on the clock can be fatal in some cases. I have personally encountered a situation in which a server's clock was not yet synchronized for a short period just after a restart, which caused problems with ID generation!
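The clock dependence is easiest to see in a generator's core loop. The following is a minimal sketch rather than Twitter's implementation: it uses a single 10-bit `worker_id` instead of separate datacenter and worker fields, `EPOCH_MS` is an assumed custom epoch chosen for illustration, and the `clock` parameter is a hypothetical hook that makes the behavior testable.

```python
import threading
import time

EPOCH_MS = 1288834974657  # assumed custom epoch in milliseconds (illustrative)
SEQUENCE_MASK = (1 << 12) - 1


class ClockMovedBackwards(Exception):
    """Raised when the clock reads earlier than the last issued timestamp."""


class SnowflakeGenerator:
    def __init__(self, worker_id: int, clock=None) -> None:
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self.worker_id = worker_id
        # `clock` returns the current time in milliseconds; injectable for tests.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Reusing an earlier timestamp could repeat an ID, so refuse.
                raise ClockMovedBackwards(
                    f"clock went back by {self._last_ms - now} ms"
                )
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & SEQUENCE_MASK
                if self._sequence == 0:
                    # All 4096 sequence numbers are used; wait for the next ms.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self._sequence


generator = SnowflakeGenerator(worker_id=1)
first, second = generator.next_id(), generator.next_id()
print(first, second)
```

Refusing to issue IDs is the "service unavailable" outcome mentioned above; skipping the check instead would risk the "duplicated ID" outcome.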

Fourth, some improvement strategies

1. Meituan Leaf's well-developed scheme

Meituan Leaf solves these problems well. See "Leaf: Meituan Dianping's Distributed ID Generation System".

Meituan Leaf's scheme has two core points.

(1) Automatic leasing of workerId based on ZooKeeper.

(2) The clock rollback problem is handled algorithmically.

Meituan Leaf is currently open source and can be downloaded from https://github.com/weizhenyi/leaf-snowflake.
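One common way to handle clock rollback in code is to tolerate a small backward step by waiting and to fail fast on a large one. The sketch below illustrates that general idea only; it is not claimed to be Leaf's code, and `MAX_TOLERATED_ROLLBACK_MS`, `wait_out_rollback`, and the injectable `clock` and `sleep` parameters are all assumptions made for illustration.

```python
import time

MAX_TOLERATED_ROLLBACK_MS = 5  # hypothetical threshold for a "small" rollback


def wait_out_rollback(last_ms: int, clock, sleep=time.sleep) -> int:
    """Return a timestamp >= last_ms, or raise if the clock is too far behind.

    `clock` returns the current time in milliseconds; `sleep` takes seconds.
    """
    now = clock()
    if now >= last_ms:
        return now
    offset = last_ms - now
    if offset > MAX_TOLERATED_ROLLBACK_MS:
        raise RuntimeError(f"clock moved back {offset} ms; refusing to issue IDs")
    # Small rollback: pause for twice the offset, then read the clock again.
    sleep(offset * 2 / 1000.0)
    now = clock()
    if now < last_ms:
        raise RuntimeError("clock is still behind after waiting")
    return now


# Simulated clock: 2 ms behind on the first reading, caught up on the second.
readings = iter([1000, 1003])
print(wait_out_rollback(1002, lambda: next(readings), sleep=lambda s: None))  # 1003
```

A generator would call such a helper in place of its raw clock read, so a brief NTP correction costs a few milliseconds of latency instead of an outage.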

2. A candidate's non-rigorous but low-cost implementation

In an interview I conducted, a candidate proposed an approach that is also interesting, although it is not rigorous.

Store an integer variable workerNum in Redis with an initial value of 0. When a client starts, it reads the variable from Redis, uses workerNum % 1024 as its workerId, and then increments workerNum in Redis by 1.

With a small number of ID workers, this scheme generally does not produce duplicate workerIds: as the business iterates, each worker is restarted by routine deployments after a while and picks up a fresh value, so it is unlikely to still hold an old value when the counter wraps around to it. If R&D resources are very limited and you want to use Snowflake, you can consider this approach.
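The allocation step can be sketched as follows. `FakeRedis` is a hypothetical in-memory stand-in so the example runs without a server; a real client such as redis-py offers an `incr` method of the same shape. Note one deliberate change from the description above: the sketch uses a single atomic increment rather than a separate read and write, so two clients starting at the same moment cannot read the same value.

```python
class FakeRedis:
    """Minimal in-memory stand-in for a Redis client (hypothetical helper)."""

    def __init__(self) -> None:
        self._data = {}

    def incr(self, key: str) -> int:
        # Mirrors Redis INCR: a missing key counts as 0; returns the new value.
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


MAX_WORKERS = 1024  # the worker field is 10 bits wide


def allocate_worker_id(client, key: str = "workerNum") -> int:
    # INCR returns the value after incrementing, so subtract 1 to start at 0.
    return (client.incr(key) - 1) % MAX_WORKERS


client = FakeRedis()
print([allocate_worker_id(client) for _ in range(3)])  # [0, 1, 2]
```

The weakness the article points out remains: after 1024 allocations the values wrap, and nothing checks whether the worker that previously held a value is still running.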

3. A hash-based sharding solution from a personal project

In practice, IDs sometimes need to support database and table sharding, which the default Snowflake implementation does not handle well. When business volume is low, the sequence number in most generated IDs is 0, so the ID is always even in decimal (its low 12 bits are all zero). Sharding by taking the ID modulo the number of shards is then obviously uneven.

What if you have this requirement? Consider using the timestamp in the ID to achieve an even distribution.

(1) The sharding logic takes the modulus of the timestamp part of the ID. This requires extracting the timestamp from the ID by bit shifting before computing the modulus, which is cumbersome.

(2) When generating the ID, overwrite the low bits of the sequence number with the corresponding low bits of the timestamp. Doing so ensures that the remainder of the ID divided by 128 is evenly distributed.
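Approach (2) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the author's snippet: the low 7 bits of the 12-bit sequence field are filled from the timestamp, and the remaining 5 bits (`seq_high`, a hypothetical name) carry the per-millisecond counter. The function names and the combined 10-bit worker field are also assumptions.

```python
from collections import Counter

SHARD_COUNT = 128
SHARD_BITS = 7       # 2**7 == 128
SEQUENCE_BITS = 12


def plain_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Standard layout: timestamp, 10-bit worker field, 12-bit sequence."""
    return (timestamp_ms << 22) | (worker_id << 12) | sequence


def shard_friendly_id(timestamp_ms: int, worker_id: int, seq_high: int) -> int:
    """Fill the low 7 sequence bits from the timestamp's low 7 bits."""
    if not 0 <= seq_high < (1 << (SEQUENCE_BITS - SHARD_BITS)):
        raise ValueError("seq_high must fit in the remaining 5 sequence bits")
    low_bits = timestamp_ms & (SHARD_COUNT - 1)
    return plain_id(timestamp_ms, worker_id, (seq_high << SHARD_BITS) | low_bits)


# Low traffic: one ID per millisecond over 1280 consecutive milliseconds.
timestamps = range(1_000_000, 1_000_000 + 1280)
plain = Counter(plain_id(t, 1, 0) % SHARD_COUNT for t in timestamps)
friendly = Counter(shard_friendly_id(t, 1, 0) % SHARD_COUNT for t in timestamps)

print(len(plain), len(friendly))  # 1 128
```

With the standard layout every ID lands in shard 0, while the modified layout spreads the same IDs across all 128 shards. The price is capacity: only 5 bits remain for the counter, so each worker can issue at most 32 unique IDs per millisecond.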

