This article is excerpted from the Meituan tech blog: https://tech.meituan.com/2017/04/21/mt-leaf.html (April 21, 2017, author: Zhaodong).
What are the requirements of the business system for generating globally unique ID?
Global uniqueness: no duplicate IDs may ever be issued; this is the most basic requirement.
Trend-increasing: primary keys should be as ordered as possible to guarantee write performance.
Monotonically increasing: the next ID must be strictly larger than the previous one, which is needed for special requirements such as transaction version numbers, incremental IM messages, and sorting.
Information security: if IDs are continuous, they are easy for malicious users to crawl in order, so some scenarios require IDs to be irregular and hard to guess.
The ID generation system should do the following:
Average latency and TP999 latency should be as low as possible
Availability of five nines (99.999%).
High QPS.
An overview of common ID generation schemes:
I. UUID
The standard form of a UUID (Universally Unique Identifier) contains 32 hexadecimal digits, divided by hyphens into five groups in the pattern 8-4-4-4-12, for 36 characters in total.
Advantages:
Very high performance: locally generated, no network consumption.
Disadvantages:
Hard to store: a UUID is 16 bytes (128 bits) long and is usually represented as a 36-character string, which makes it unsuitable for many scenarios.
Information insecurity: UUID generation algorithms based on MAC addresses may leak the MAC address; this vulnerability was once used to locate the author of the Melissa virus.
When a UUID is used as a primary key, problems arise in certain environments; as a database primary key in particular, UUID is a very poor fit:
① The MySQL documentation explicitly recommends that primary keys be as short as possible [4]; a 36-character UUID does not meet that requirement.
② It is bad for MySQL indexes: when used as the database primary key, the randomness of UUIDs under the InnoDB engine causes frequent changes in data location and seriously hurts performance.
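For illustration, here is a minimal sketch of generating the 36-character form described above using the standard java.util.UUID class (this example is not part of the original article):
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // A version-4 (random) UUID: 32 hex digits grouped as 8-4-4-4-12
        UUID id = UUID.randomUUID();
        String text = id.toString();         // e.g. "3f2504e0-4f89-41d3-9a0c-0305e82c3301"
        System.out.println(text);
        System.out.println(text.length());   // 36 characters, including the 4 hyphens
    }
}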
II. Snowflake-like scheme
Broadly speaking, this is a family of algorithms that generate IDs by partitioning a namespace (UUID also falls into this category, but it is common enough that it was analyzed separately). These schemes divide the 64 bits of the ID into several segments that encode the machine, the timestamp, and so on.
The advantages and disadvantages of this approach are:
Advantages:
The millisecond timestamp occupies the high bits and the auto-increment sequence occupies the low bits, so the whole ID trends upward.
It does not depend on a database or other third-party systems and is deployed as a service, which gives higher stability; the performance of generating IDs is also very high.
Bits can be allocated according to the characteristics of your own business, which is very flexible.
Disadvantages:
Strongly dependent on the machine clock: if the clock on the machine is rolled back, duplicate IDs may be issued or the service may become unavailable.
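As a rough sketch of how such an ID is assembled (assuming the 1 + 41 + 10 + 12 bit split that Leaf-snowflake also uses later in this article; the custom epoch, class name and the omission of clock-rollback handling are choices made for this illustration, not a reference implementation):
// Minimal snowflake-style sketch: 41-bit timestamp, 10-bit worker ID, 12-bit sequence.
public class SnowflakeSketch {
    private static final long EPOCH = 1288834974657L; // illustrative custom epoch in milliseconds
    private final long workerId;                      // 0..1023 (10 bits)
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) { this.workerId = workerId; }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        // Note: a real implementation must also handle ts < lastTimestamp (clock rollback).
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;        // 12-bit sequence within one millisecond
            if (sequence == 0) {                      // sequence exhausted: spin until the next millisecond
                while ((ts = System.currentTimeMillis()) <= lastTimestamp) { }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << 22) | (workerId << 12) | sequence;   // 41 + 10 + 12 bits
    }
}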
III. Database generation
Take MySQL as an example: use auto_increment_increment and auto_increment_offset on the field to guarantee that the ID auto-increments, and have each business read and write MySQL with the following SQL to obtain ID numbers.
begin;
REPLACE INTO Tickets64 (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
commit;
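For illustration, a minimal sketch of running this transaction from Java via JDBC is shown below (the connection URL, credentials and the Tickets64 schema with an auto-increment id column and a unique stub column are assumptions for this example, not details from the original article; the MySQL JDBC driver is assumed to be on the classpath):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TicketClient {
    // Fetches one ID by replacing the single stub row and reading LAST_INSERT_ID().
    public static long nextId(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            st.executeUpdate("REPLACE INTO Tickets64 (stub) VALUES ('a')");
            try (ResultSet rs = st.executeQuery("SELECT LAST_INSERT_ID()")) {
                rs.next();
                long id = rs.getLong(1);
                conn.commit();
                return id;
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/ticketdb", "user", "password")) {
            System.out.println(nextId(conn));
        }
    }
}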
The advantages and disadvantages of this scheme are as follows:
Advantages:
Very simple: it is implemented with the features of an existing database system, the cost is low, and DBAs can maintain it professionally.
The ID numbers increase monotonically, which can support services with special ordering requirements on IDs.
Disadvantages:
Strongly dependent on the DB: when the DB is abnormal the whole system is unavailable, which is a fatal problem. Configuring master-slave replication increases availability as much as possible, but data consistency is hard to guarantee in special cases, and inconsistency during a master-slave switch may lead to duplicate numbers being issued.
ID issuing performance is bottlenecked by the read/write performance of a single MySQL instance.
For the MySQL performance problem, the following scheme can be used: in a distributed system, deploy several machines, give each machine a different initial value, and set the step size equal to the number of machines.
For example, with two machines, set the step size to 2. TicketServer1 has an initial value of 1 and issues 1, 3, 5, 7, 9, 11, ...; TicketServer2 has an initial value of 2 and issues 2, 4, 6, 8, 10, ....
This is the primary-key generation strategy described by the Flickr team in 2010 (Ticket Servers: Distributed Unique Primary Keys on the Cheap).
As shown below, set the corresponding parameters on the two machines so that TicketServer1 starts at 1, TicketServer2 starts at 2, and both machines increment by 2 each time.
[Figure: parameter settings for TicketServer1 and TicketServer2]
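For illustration, a small simulation of how the two servers' sequences interleave under this configuration (on MySQL this corresponds to setting auto_increment_offset per server and auto_increment_increment to the number of servers; the class name below exists only for this example):
public class TicketServerSim {
    public static void main(String[] args) {
        int step = 2;                                    // step size = number of machines
        for (int server = 1; server <= step; server++) {
            StringBuilder line = new StringBuilder("TicketServer" + server + ": ");
            for (long id = server, n = 0; n < 6; n++, id += step) {
                line.append(id).append(' ');             // server 1 -> 1 3 5 7 9 11, server 2 -> 2 4 6 8 10 12
            }
            System.out.println(line);
        }
    }
}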
This architecture seems to meet the performance requirements, but has the following disadvantages:
It is difficult to scale horizontally at the system level. Once the step size and number of machines are defined, what do you do when you want to add machines? Suppose there is currently only one machine, issuing 1, 2, 3, 4, 5 (step size 1), and a second machine needs to be added. You can do it like this: set the second machine's initial value far ahead of the first, say 14 (assuming the first machine will not reach 14 during the expansion window), and set its step size to 2, so it issues even numbers from 14 onward. Then take the first machine offline while its ID value is odd, say 7, change its step size to 2, and bring it back so that from then on it only issues odd numbers, keeping both machines within the defined numbering rule. Does the expansion plan already look complicated? With two machines it is barely manageable; now imagine 100 machines online and having to expand capacity. It is a nightmare. Therefore, horizontally scaling this scheme is complex and difficult to implement.
The IDs are not monotonically increasing, only trend-increasing. For most business requirements this drawback is unimportant and can be tolerated.
The database is still under heavy pressure: every ID fetch reads and writes the database, and performance can only be improved by piling on machines.
The name Leaf comes from a saying of the German philosopher and mathematician Leibniz: "There are no two identical leaves in the world."
Comparing the schemes above, none of them fully meets our requirements. Leaf therefore optimizes the second and third schemes and implements two variants: Leaf-segment and Leaf-snowflake.
Leaf-segment database scheme
Compared with the original database scheme, the Leaf-segment scheme makes the following changes:
The original scheme reads and writes the database on every ID fetch, which puts great pressure on the database. Instead, the proxy server fetches a whole segment of values from the database at a time (the segment size is determined by step) and only goes back for a new segment after the current one is used up, which greatly reduces the pressure on the database.
The numbering requirements of different businesses are distinguished by the biz_tag field. ID allocation for each biz_tag is isolated and businesses do not affect each other. If a database later needs to be expanded for performance, the complicated expansion procedure described above is unnecessary; you only need to shard the table by biz_tag.
The database table is designed as follows:
+-------------+--------------+------+-----+-------------------+-----------------------------+
| Field       | Type         | Null | Key | Default           | Extra                       |
+-------------+--------------+------+-----+-------------------+-----------------------------+
| biz_tag     | varchar      | NO   | PRI |                   |                             |
| max_id      | bigint(20)   | NO   |     | 1                 |                             |
| step        | int(11)      | NO   |     | NULL              |                             |
| desc        | varchar(256) | YES  |     | NULL              |                             |
| update_time | timestamp    | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+-------------+--------------+------+-----+-------------------+-----------------------------+
Important fields: biz_tag distinguishes businesses, max_id is the maximum ID of the segment currently allocated to that biz_tag, and step is the length of each allocated segment. Previously the database had to be written on every ID fetch; now you only need to set step large enough, say 1000, and the database is read and written again only after 1000 numbers have been consumed. The frequency of database reads and writes drops from once per ID to 1/step, and the approximate structure is shown in the following figure:
[Figure: approximate structure of the Leaf-segment scheme]
In the figure, test_tag on the first Leaf machine holds the segment 1~1000. When this segment is used up, it loads a new segment of length step=1000. Assuming the other two machines have not updated their segments in the meantime, the new segment loaded by the first machine will be 3001~4000, and at the same time the max_id of the corresponding biz_tag row in the database is updated from 3000 to 4000. The SQL statement for updating the number segment is as follows:
begin;
UPDATE table SET max_id = max_id + step WHERE biz_tag = xxx;
SELECT tag, max_id, step FROM table WHERE biz_tag = xxx;
commit;
This model has the following advantages and disadvantages:
Advantages:
Leaf services can be scaled out linearly with ease, and their performance fully supports most business scenarios.
The ID is an 8-byte, 64-bit number that increases in trend, which satisfies the primary-key requirements for database storage described above.
High disaster tolerance: the Leaf service caches number segments internally, so even if the DB goes down, Leaf can still serve normally for a short period of time.
The size of max_id can be customized, which makes it very convenient for a business to migrate from its original ID scheme.
Disadvantages:
The ID numbers are not random enough and leak information about how many numbers have been issued, which is not secure.
TP999 latency fluctuates greatly: when a number segment is used up, requests still hang on the database I/O that fetches the next segment, so TP999 shows occasional spikes.
DB downtime may make the entire system unavailable.
Double buffer optimization
To address the second disadvantage, Leaf-segment makes an optimization that, in brief, works as follows:
A new number segment is fetched only when the current one is exhausted, so the latency of IDs issued around that critical point depends on the next fetch from the DB, and requests arriving during this window block because the new segment has not yet been returned. If the network to the DB and the DB itself perform stably, this has little impact on the system; but if the network jitters or the DB hits a slow query at that moment, the response time of the whole system slows down.
For this reason, we want fetching a number segment from the DB to be non-blocking: request threads should not have to wait while the segment is fetched. When the current segment has been consumed to a certain point, the next segment is loaded into memory asynchronously, rather than waiting until the current segment is completely exhausted. Doing so greatly reduces the system's TP999 latency. The detailed implementation is shown below:
[Figure: double-buffer segment loading in the Leaf service]
With the double-buffer approach, the Leaf service keeps two segment buffers in memory. When 10% of the current segment has been issued and the next segment has not yet been fetched, a separate update thread is started to fetch it. Once the current segment is fully consumed, if the next segment is ready, it is switched in as the current segment and issuing continues, repeating the cycle.
Consumption speed is monitored for each biz_tag. It is recommended to set the segment length to 600 times the service's peak QPS (that is, 10 minutes' worth of IDs), so that even if the DB goes down, Leaf can keep issuing numbers for 10-20 minutes unaffected.
Every incoming request checks the status of the next segment and triggers its update when needed, so occasional network jitter does not affect the update of the next segment.
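A minimal sketch of the double-buffer idea is given below; the class and field names are invented for this example and the database call is stubbed out in memory, so this is only an illustration of the technique, not the actual Leaf implementation:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class SegmentBufferSketch {
    static class Segment {
        final AtomicLong value = new AtomicLong();  // next ID to hand out
        volatile long max;                          // last ID of this segment
    }

    private final long step = 1000;
    private final ExecutorService loader = Executors.newSingleThreadExecutor();
    private volatile Segment current;
    private volatile Segment next;
    private volatile boolean nextReady = false;
    private volatile boolean nextLoading = false;
    private long dbMaxId = 0;                       // stands in for the max_id column of the leaf table

    public SegmentBufferSketch() { current = fetchSegmentFromDb(); }

    // Stub for "UPDATE ... SET max_id = max_id + step; SELECT max_id, step ..." against the database.
    private synchronized Segment fetchSegmentFromDb() {
        dbMaxId += step;
        Segment s = new Segment();
        s.value.set(dbMaxId - step + 1);
        s.max = dbMaxId;
        return s;
    }

    public synchronized long getId() {
        Segment seg = current;
        long remaining = seg.max - seg.value.get() + 1;
        // Once roughly 10% of the current segment is consumed, preload the next segment asynchronously.
        if (!nextReady && !nextLoading && remaining < 0.9 * step) {
            nextLoading = true;
            loader.submit(() -> { next = fetchSegmentFromDb(); nextReady = true; nextLoading = false; });
        }
        long id = seg.value.getAndIncrement();
        if (id <= seg.max) {
            return id;
        }
        // Current segment exhausted: switch to the next one (fetch synchronously if it is not ready yet;
        // in the worst case a preloaded segment is skipped, which wastes IDs but never duplicates them).
        if (!nextReady) {
            next = fetchSegmentFromDb();
        }
        current = next;
        nextReady = false;
        return current.value.getAndIncrement();
    }
}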
Leaf high-availability disaster recovery
For the third point, "DB availability", we currently use a "one master, two slaves" setup deployed across server rooms, with semi-synchronous replication between master and slaves. We use the company's Atlas database middleware (open-sourced and since renamed DBProxy) to handle master-slave switching. This scheme can degrade to asynchronous mode in some cases, and in very extreme cases can still cause data inconsistency, but the probability is very small. If your system must guarantee 100% strong data consistency, you can choose a strongly consistent MySQL scheme implemented with a Paxos-like algorithm, such as MySQL Group Replication, which recently went GA in MySQL 5.7. The operations cost and effort increase accordingly, so choose according to your actual situation.
Leaf services are also deployed per IDC, and calls go through the internal "MTthrift RPC" service framework. According to the load-balancing algorithm, the Leaf service in the same machine room is called first; only when no Leaf service is available in the local IDC is a Leaf service in another data center selected. Meanwhile, the service-governance platform OCTO provides protection measures such as overload protection, one-click shutdown, dynamic traffic allocation, and so on.
The Leaf-segment scheme generates trend-increasing IDs, and the ID numbers are predictable, so it is not suitable for scenarios such as order ID generation. For example, a competitor could place an order at 12:00 noon on two consecutive days and roughly calculate the company's daily order volume by subtracting the two order IDs, which is unacceptable. To address this problem, we provide the Leaf-snowflake scheme.
The Leaf-snowflake scheme follows the bit design of the snowflake scheme exactly, assembling the ID number as "1 + 41 + 10 + 12" bits. For workerID allocation, manual configuration works when the service cluster is small, but the Leaf service runs at a large scale and the cost of manual configuration is too high, so the workerID of each snowflake node is configured automatically using Zookeeper persistent sequential nodes. Leaf-snowflake starts up with the following steps:
On startup, the Leaf-snowflake service connects to Zookeeper and checks whether it has already registered under the leaf_forever parent node (that is, whether a sequential child node of its own exists). If it has registered, it reads back its own workerID (the int generated by the zk sequential node) and starts the service. If it has not registered, it creates a persistent sequential node under the parent node, and after the node is created successfully it takes the sequence number as its workerID and starts the service.
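For illustration, here is a minimal sketch of this workerID allocation written with Apache Curator; the node path, the ip:port naming convention and the class name are assumptions for this example, not Leaf's actual code:
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;
import java.util.List;

public class WorkerIdAllocator {
    private static final String PARENT = "/leaf_forever";

    public static int allocate(String zkAddr, String ipPort) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                zkAddr, new ExponentialBackoffRetry(1000, 3));
        client.start();

        // 1. Already registered? Reuse the sequence number embedded in our own child node.
        if (client.checkExists().forPath(PARENT) != null) {
            List<String> children = client.getChildren().forPath(PARENT);
            for (String child : children) {               // e.g. "10.4.5.6:8080-0000000003"
                if (child.startsWith(ipPort + "-")) {
                    return Integer.parseInt(child.substring(child.lastIndexOf('-') + 1));
                }
            }
        }
        // 2. Not registered: create a persistent sequential node and take its sequence as the workerID.
        String created = client.create()
                .creatingParentsIfNeeded()
                .withMode(CreateMode.PERSISTENT_SEQUENTIAL)
                .forPath(PARENT + "/" + ipPort + "-");
        return Integer.parseInt(created.substring(created.lastIndexOf('-') + 1));
    }
}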
Weakly dependent on ZooKeeper
Besides fetching data from ZK, the workerID is also cached in a file on the local file system. If ZooKeeper has problems at the same time that the machine itself has a problem and needs to restart, the service can still start normally. This achieves a weak dependency on the third-party component and improves the SLA to some extent.
Solve the clock problem
Because this scheme depends on time, a rollback of the machine's clock could produce duplicate ID numbers, so the clock-rollback problem needs to be solved.
[Figure: Leaf-snowflake startup flow chart]
Referring to the startup flow chart above, when the service starts it checks whether it has ever written to the ZooKeeper leaf_forever node:
If it has written before, compare the local system time with the time recorded in the leaf_forever/${self} node. If the local time is less than the leaf_forever/${self} time, the machine's clock is considered to have been rolled back by a large step: startup fails and an alarm is raised.
If it has not written before, this is a new service node: create the persistent node leaf_forever/${self}, write the local system time into it, and then compare the local system time against the other Leaf nodes to judge whether it is accurate. Concretely, take the service IP:Port of all temporary nodes under leaf_temporary (all the running Leaf-snowflake nodes), obtain every node's system time via RPC requests, and compute sum(time)/nodeSize. If abs(local system time - sum(time)/nodeSize) < threshold, the local system time is considered accurate: start the service normally and write the temporary node leaf_temporary/${self} to maintain the lease. Otherwise the local system time is considered to have drifted by a large step: startup fails and an alarm is raised.
Every 3 seconds, write the local system time to leaf_forever/${self} to report it.
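A minimal sketch of the time-sanity check described above (the RPC that collects peer times is assumed to exist elsewhere and is passed in as a list; the class name and threshold value are illustrative):
import java.util.List;

public class ClockCheck {
    private static final long THRESHOLD_MS = 10_000;      // illustrative threshold

    // peerTimes would come from RPC calls to every node listed under leaf_temporary.
    public static boolean localClockLooksAccurate(long localTimeMs, List<Long> peerTimes) {
        if (peerTimes.isEmpty()) {
            return true;                                   // no peers to compare against
        }
        long sum = 0;
        for (long t : peerTimes) {
            sum += t;
        }
        long average = sum / peerTimes.size();             // sum(time) / nodeSize
        return Math.abs(localTimeMs - average) < THRESHOLD_MS;
    }
}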
Because the scheme depends strongly on the clock and is sensitive to time, NTP synchronization can also cause a second-level rollback while the machine is running, so it is recommended to simply turn NTP synchronization off. Alternatively, when a clock rollback is detected, stop issuing numbers and return ERROR_CODE until the clock catches up; or add a layer of retries and report to the alarm system; or automatically remove the node and raise an alarm when a rollback is detected, as follows:
// A rollback occurred: the current time is less than the last issuing time
if (timestamp < lastTimestamp) {
    long offset = lastTimestamp - timestamp;
    if (offset <= 5) {
        try {
            // The offset is small (<= 5 ms): wait about twice the offset, then read the clock again
            wait(offset << 1);
            timestamp = timeGen();
            if (timestamp < lastTimestamp) {
                // Still behind: throw an exception and report
                throwClockBackwardsEx(timestamp);
            }
        } catch (InterruptedException e) {
            throw e;
        }
    } else {
        // The offset is large: refuse to issue numbers and report
        throwClockBackwardsEx(timestamp);
    }
}