2025-03-29 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report
This article focuses on the practice and application of distributed IDs. It is clearly organized, well suited to beginners, and worth reading; I hope you get something out of it!
In many business scenarios, a system needs to generate non-duplicate IDs, for example order numbers, payment order numbers, and coupon codes. This article explains why distributed IDs are needed, surveys four distributed ID schemes commonly used in the industry, and describes two of them in detail along with their advantages and disadvantages.
Why use distributed IDs?
As business data grows, more and more data is stored in the database. Once an index no longer fits in available memory, lookups have to go through the on-disk index, which greatly slows down queries. How do we solve this? The usual first step is to shard the data across databases and tables. After sharding, a database auto-increment ID can no longer serve as the unique number of a row, so a distributed ID is needed instead.
Distributed ID implementation scheme
At present, there are four mainstream distributed ID schemes in the industry:
UUID: IDs generated with JDK's UUID#randomUUID()
Redis atomic increment: IDs generated with Jedis#incr(String key)
Snowflake algorithm: a 64-bit long ID composed of a timestamp, a machine number, and a per-millisecond sequence number
Segmented step size: read a usable range of IDs from the database, one step at a time
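As a quick illustration of the first scheme, a random UUID needs nothing beyond the JDK (the class name below is my own):

```java
import java.util.UUID;

public class UuidIdExample {
    // Generates a random (version 4) UUID string, e.g.
    // "3f2504e0-4f89-41d3-9a0c-0305e82c3301" -- 36 characters, unordered.
    public static String nextId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(nextId());
    }
}
```

Because the value is random, consecutive IDs have no ordering, which is exactly the weakness summarized below.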
Let's summarize the characteristics of these schemes:
UUID — ordering: none; repeatability: many random bits make the collision probability extremely low, though repeats are theoretically possible; deployment: in-process JDK call; available time: permanent.
Redis — ordering: monotonically increasing; repeatability: with RDB persistence, repeats can occur after a Redis crash; deployment: Jedis client call; available time: permanent.
Snowflake — ordering: trend-increasing; repeatability: none unless the clock rolls back, and when a rollback exceeds the waiting threshold the service becomes unavailable; deployment: embedded or clustered; available time: 69 years.
Segmented step size — ordering: trend-increasing; repeatability: none, and if the database goes down, IDs within the already-fetched step remain available; deployment: embedded or clustered; available time: permanent.
The first two schemes are widely understood, so we will not repeat them here. In this article we introduce the Snowflake algorithm and the segmented step size scheme in detail.
Once a machine number has been assigned, the Snowflake algorithm generates IDs locally without relying on any third-party service. The fewer third-party services you depend on, the higher the availability, so let's introduce the Snowflake algorithm first.
Snowflake algorithm
The decimal range of a signed 64-bit long integer is -2^63 to 2^63 - 1.
Snowflake treats the long as an unsigned number, reading its 64 bits from left to right, but leaves the first (sign) bit unused. The remaining 63 bits are divided into three parts: a timestamp, a machine number, and a per-millisecond sequence number:
Timestamp: the difference between the current millisecond timestamp and a custom epoch (the custom epoch is the time at which the application starts using Snowflake; if it is not set, the timestamp is counted from 1970 by default, and setting a custom epoch extends Snowflake's usable lifetime). The 41 timestamp bits give 2^41 milliseconds, which divided by (365 days * 24 hours * 3600 seconds * 1000 milliseconds) is about 69 years, so the scheme can be used for up to 69 years.
Machine number: 10 bits give 2^10 = 1024 values, so up to 1024 machine nodes can be supported.
Per-millisecond sequence number: 12 bits give 2^12 = 4096 values, so a single machine node can hand out up to 4096 IDs within one millisecond.
Let's take a look at the use of each segment:
Bit layout: bit [1] is the sign bit and is unused; bits [2, 42] (41 bits) hold the millisecond timestamp; bits [43, 52] (10 bits) hold the machine number; bits [53, 64] (12 bits) hold the per-millisecond sequence number. If the timestamp of the current request is the same as that of the previous request, the sequence number is incremented by one.
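The layout can be expressed directly with shifts and ORs. The field widths below come from the article; the class and method names are illustrative:

```java
public class SnowflakeLayout {
    static final int MACHINE_BITS = 10;
    static final int SEQUENCE_BITS = 12;

    // Compose a Snowflake ID: 41-bit timestamp | 10-bit machine | 12-bit sequence.
    static long compose(long timestampMs, long machineId, long sequence) {
        return (timestampMs << (MACHINE_BITS + SEQUENCE_BITS))
             | (machineId << SEQUENCE_BITS)
             | sequence;
    }

    public static void main(String[] args) {
        // 1 second past the epoch, machine 1, sequence 1 -> 4194308097,
        // the value quoted later in the article.
        System.out.println(compose(1000L, 1L, 1L));
    }
}
```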
So what does an ID generated by Snowflake look like? As an example, with a custom epoch of 2020-12-31 00:00:00, an ID generated on 2021-01-03 around 21:00 is the 16-digit decimal number 1409793654796289: the high 41 bits encode the roughly three days' worth of milliseconds elapsed since the epoch, so even early IDs are already very large.
Snowflake can be deployed in three ways: integrated distributed deployment, central cluster deployment, and direct-connection cluster deployment. Let's introduce each in turn.
Snowflake integrated distributed deployment
When few application nodes use IDs, say fewer than 200, integrated distributed deployment is a good fit. Each application node determines its machine number at startup; at runtime it relies on no third-party service and generates IDs locally from the timestamp, machine number, and per-millisecond sequence number.
The following figure shows how an application server obtains distributed IDs by embedding a jar package. Each application server node that uses distributed IDs is assigned a machine number that is unique within the topology; the machine number assignments are stored in MySQL or ZooKeeper.
When many machine nodes in the topology use distributed IDs, say more than 1000, integrated deployment is no longer appropriate, because the machine number field is only 10 bits, i.e. at most 1024 machine numbers. With more than 1000 machine nodes, you can use the central cluster deployment described next.
Snowflake central cluster deployment mode
Central cluster deployment adds an ID gateway for request forwarding, for example an nginx reverse proxy (the ID REST API Gateway in the figure below).
Once connected to the ID gateway, an application server requests distributed IDs from it over HTTP or RPC. Compared with the integrated distributed deployment above, this supports many more application nodes.
As the figure shows, machine numbers are assigned only to the ID Generator nodes; application nodes do not need machine numbers.
Central cluster deployment introduces a new nginx reverse proxy as a gateway, which increases system complexity and reduces service availability. So let's introduce a direct-connection cluster deployment that needs no nginx and can still support more than 1000 application nodes.
Snowflake direct-connection cluster deployment mode
Compared with central cluster deployment, direct-connection cluster deployment removes the ID gateway in the middle and improves service availability.
With an ID gateway, the service addresses of the ID generator nodes are configured in the gateway. With direct-connection deployment, those addresses are instead configured in the application server's local configuration file or in a configuration center. After the application server obtains the address list, it performs its own service routing and connects directly to an ID generator to obtain IDs.
Problems in Snowflake algorithm
The Snowflake algorithm depends strongly on the timestamp: once the clock rolls back, duplicate IDs can be generated. So how does a clock rollback arise, and how do we handle it?
Automatic calibration by the NTP (Network Time Protocol) service can cause a clock rollback. Every computer has a local clock driven by its CPU's crystal oscillator, and over time this clock drifts further and further from world time, so NTP is used to calibrate it.
In general, the probability of a rollback is small: when the local clock needs calibration and its deviation from world time is below the STEP threshold (128 ms by default), the computer synchronizes in SLEW mode, adjusting the clock rate by 0.5 ms per second so that the local clock keeps moving forward continuously, without any rollback, until it is aligned with the world clock.
However, if the difference between the local clock and the world clock exceeds the STEP threshold, a rollback occurs. The STEP threshold can be raised, but the larger it is, the longer SLEW calibration takes. For example, with the STEP threshold set to 10 minutes, deviations of up to 10 minutes are corrected by SLEW, and correcting a full 10 minutes (600,000 ms at 0.5 ms/s) takes about 1,200,000 seconds, i.e. up to about 14 days.
To avoid duplicate IDs caused by a rollback, we can keep the STEP threshold at 128 ms and, when obtaining a Snowflake ID, compare the current timestamp with the previous one. If the clock has rolled back by less than 1 second, wait 1 second and continue; otherwise mark the service unavailable. This tolerates rollbacks of up to 1 second.
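Putting the pieces together, here is a minimal, illustrative Snowflake generator with the 1-second rollback tolerance described above. The epoch constant, class name, and field names are my own; a production implementation would also need machine-number assignment and more careful error handling:

```java
public class SnowflakeGenerator {
    // Assumed custom epoch: 2020-12-31 00:00:00 UTC, matching the article's example.
    private static final long EPOCH = 1609372800000L;
    private static final long MACHINE_BITS = 10, SEQUENCE_BITS = 12;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;
    private static final long MAX_ROLLBACK_WAIT_MS = 1000; // tolerate rollbacks under 1 second

    private final long machineId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeGenerator(long machineId) { this.machineId = machineId; }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            long behind = lastTimestamp - now;
            if (behind > MAX_ROLLBACK_WAIT_MS) {
                // Rollback too large: refuse service rather than risk duplicate IDs.
                throw new IllegalStateException("clock rolled back " + behind + " ms");
            }
            try {
                Thread.sleep(behind + 1);   // small rollback: wait it out
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            now = System.currentTimeMillis();
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {            // 4096 IDs used this millisecond: spin to the next
                while (now <= lastTimestamp) now = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (MACHINE_BITS + SEQUENCE_BITS))
             | (machineId << SEQUENCE_BITS)
             | sequence;
    }
}
```

Successive calls produce strictly increasing IDs on one node, which is the "trend-increasing" property from the comparison above.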
Segmented step size scheme
Because Snowflake puts the timestamp in the high bits of the long, even the smallest generated numbers are very large. For example, at 1 second past the custom epoch, with machine number 1 and sequence number 1, the generated ID is already 4194308097. Is there a way to generate small IDs in the initial state? Yes: let's look at the segmented step size scheme.
The segmented step size scheme stores a step size and the current maximum ID in the database; each time a segment of IDs is fetched, the maximum ID in the database is advanced by one step.
The core database table looks like this:
CREATE TABLE `segment_id` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `biz_type` varchar(64) NOT NULL DEFAULT '' COMMENT 'business type',
  `max` bigint(20) DEFAULT '0' COMMENT 'current maximum ID value',
  `step` bigint(20) DEFAULT '10000' COMMENT 'ID step size',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
When fetching a segment, open a transaction and rely on the row lock to guarantee that the freshly updated maximum is the one read back:
START TRANSACTION;
UPDATE segment_id SET max = max + step WHERE biz_type = 'ORDER';
SELECT max FROM segment_id WHERE biz_type = 'ORDER';
COMMIT;
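As a rough illustration, the transaction above can be wrapped in an allocator that hands out IDs from the fetched range and only touches the database once per step. The sketch below simulates the `max` column with an in-memory counter (a real implementation would execute the UPDATE/SELECT against MySQL); all class and method names are my own:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SegmentAllocator {
    private final long step;
    private final AtomicLong dbMax;   // stands in for the `max` column in segment_id
    private long current;             // next ID to hand out
    private long segmentMax;          // inclusive upper bound of the fetched segment

    public SegmentAllocator(long step, long initialMax) {
        this.step = step;
        this.dbMax = new AtomicLong(initialMax);
        fetchSegment();
    }

    // Simulates: UPDATE segment_id SET max = max + step ...; SELECT max ...
    private void fetchSegment() {
        long max = dbMax.addAndGet(step);
        this.segmentMax = max;
        this.current = max - step + 1;
    }

    public synchronized long nextId() {
        if (current > segmentMax) fetchSegment();  // segment drained: fetch the next one
        return current++;
    }
}
```

With step 1000 and an initial max of 0, the first segment covers IDs 1 through 1000, so generation starts from small values, unlike Snowflake.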
Advantages and disadvantages of the segmented step size scheme:
Advantages: ID generation does not depend on timestamps, and IDs can grow gradually from an initial value of 0.
Disadvantages: each service restart advances the database maximum by a full step, so frequent restarts waste large parts of segments.
Optimizing the two schemes above
We introduced the Snowflake algorithm and the segmented step size scheme above; each has its own strengths and weaknesses, and below we give an optimization for each.
ID buffer ring
To improve the concurrency and availability of Snowflake IDs, an ID buffer ring (ID Buffer Ring) can be used. Concurrency improves because buffered IDs make full use of each millisecond's sequence space; availability improves because buffered IDs can partly ride out the unavailability caused by a clock rollback. The buffer ring is implemented as a fixed-length array plus cursors, which, unlike a linked list, requires no frequent memory allocation.
When the ID buffer ring is initialized, the ID generator is asked to fill it; when the business needs an ID, IDs are taken from the head of the ring in order. When the number of IDs remaining in the ring falls below a set threshold, for example 30% of its capacity, an asynchronous refill is triggered: newly generated IDs are appended at the tail of the ring's queue and mapped into ring slots by the hash algorithm. A separate timer thread also refills the ring periodically.
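A much-simplified sketch of such a ring, assuming a supplier that stands in for the ID generator. Unlike the article's design, the refill here runs inline rather than on an asynchronous thread, and all names are illustrative:

```java
import java.util.function.LongSupplier;

public class IdBufferRing {
    private final long[] ring;
    private final LongSupplier generator;   // stands in for the ID generator
    private final int refillThreshold;      // refill when remaining < 30% of capacity
    private long head = 0;                  // total IDs consumed
    private long tail = 0;                  // total IDs loaded

    public IdBufferRing(int capacity, LongSupplier generator) {
        this.ring = new long[capacity];
        this.generator = generator;
        this.refillThreshold = (int) (capacity * 0.3);
        refill();                           // initial load fills the ring
    }

    private void refill() {
        while (tail - head < ring.length) {
            ring[(int) (tail % ring.length)] = generator.getAsLong();
            tail++;
        }
    }

    public synchronized long nextId() {
        if (head == tail) throw new IllegalStateException("ring is empty");
        long id = ring[(int) (head % ring.length)];
        head++;
        // The article triggers this refill asynchronously; it is done inline here for brevity.
        if (tail - head < refillThreshold) refill();
        return id;
    }
}
```

Consumers always read from the head while new IDs land at the tail, so IDs come out in the order the generator produced them.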
The following animation shows the three stages of the ID buffer ring: initial load, consumption, and refill after consumption:
Buffer Ring initialize load: IDs are fetched from the ID generator and written into the ring until it is full
Buffer Ring consume: the business application takes IDs from the ring
Async reload: the timer thread asynchronously fetches IDs from the ID generator, appends them to the buffer queue, and maps them into the ring by the hash algorithm; the refill ends when the ring is full
The following flowchart shows the entire life cycle of the ID buffer ring, where:
IDConsumerServer: the business system that uses distributed IDs
IDBufferRing: the ID buffer ring
IDGenerator: the ID generator
IDBufferRingAsyncLoadThread: the thread that asynchronously loads IDs into the ring
Timer: periodically hands the asynchronous load thread tasks to load IDs
ID consumption flow: the Buffer Ring consume stage described above
Overall flow: a client request reaches the application server, which takes an ID from the buffer ring; if the ring is empty, a service-unavailable error is thrown, otherwise one ID is consumed. If, while consuming, the number of IDs left in the ring is found to be below 30% of its capacity, an asynchronous load is triggered to refill the ring.
ID double bucket Buffer
With segmented step size IDs, once the current segment is used up, the database maximum must be advanced before ID generation can continue. To keep the latency of that database update from stalling the ID service, a double-bucket cache can be used to improve the availability of ID generation.
The main idea: maintain two cache buckets, currentBufferBucket and nextBufferBucket, each holding one step's worth of IDs. When the current bucket is exhausted, the next bucket becomes the current bucket.
The following animation shows the whole process: initializing the double-bucket cache, asynchronously loading the reserve bucket, and switching the reserve bucket to the current bucket:
Current bucket initial load: initialize the current bucket by running update max = max + step and reading back the updated max. For example, with step 1000 and updated max 1000, the bucket holds one step of 1000 IDs, with min = max - step + 1 = 1 and max = 1000.
Current bucket remaining IDs down to 20%, next bucket starts to load: when fewer than 20% of the current bucket's IDs remain, load the next bucket, again running update max = max + step and reading back max. With updated max 2000, the next bucket has min = max - step + 1 = 1001 and max = 2000.
Current bucket is drained, switch current bucket to the next bucket: when all IDs in the current bucket are used up, the next bucket becomes the current bucket.
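The three stages above can be sketched as follows. The database update is simulated with an in-memory counter and the reserve bucket is loaded inline rather than asynchronously, so this only illustrates the bucket-switching logic; all names are my own:

```java
import java.util.concurrent.atomic.AtomicLong;

public class DoubleBufferBucket {
    static class Bucket {
        final long min, max;   // inclusive ID range of one segment
        long next;             // next ID to hand out from this bucket
        Bucket(long max, long step) { this.max = max; this.min = max - step + 1; this.next = min; }
        boolean drained() { return next > max; }
    }

    private final long step;
    private final AtomicLong dbMax = new AtomicLong(0); // stands in for the DB `max` column
    private Bucket current;
    private Bucket nextBucket;

    public DoubleBufferBucket(long step) {
        this.step = step;
        this.current = loadBucket();    // current bucket initial load
    }

    // Simulates UPDATE max = max + step; SELECT max (asynchronous in the article).
    private Bucket loadBucket() { return new Bucket(dbMax.addAndGet(step), step); }

    public synchronized long nextId() {
        long remaining = current.max - current.next + 1;
        if (nextBucket == null && remaining * 5 < step) {  // below 20%: preload reserve bucket
            nextBucket = loadBucket();
        }
        if (current.drained()) {        // switch the reserve bucket to current
            current = (nextBucket != null) ? nextBucket : loadBucket();
            nextBucket = null;
        }
        return current.next++;
    }
}
```

Because the reserve bucket is fetched before the current one runs dry, the handover from ID 1000 to ID 1001 needs no blocking database round trip.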
The following is the flowchart of the double-bucket buffer:
Thank you for reading. I hope you now have some understanding of the practice and application of distributed IDs; go ahead and put it into practice!