How to understand snowflake algorithm and Baidu Meituan


2025-04-25 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report

Many newcomers are not clear about how to understand the snowflake algorithm and the Baidu and Meituan schemes. To help with this, the following explains the topic in detail.

When it comes to automatically generating distributed IDs, most readers will be familiar with the topic and can immediately name several schemes they know well. As the key identifier of system data, the ID is obviously important, and the schemes have been refined over many generations. Allow me to classify the distributed ID generation schemes as follows:

Implementation modes

Fully dependent on a data source

The ID generation rules and read control are handled entirely by the data source, for example database auto-increment IDs and sequences, or sequence numbers generated with the atomic INCR/INCRBY operations of Redis.

Semi-dependent on a data source

Some of the factors in the ID generation rule have to be controlled by a data source (or by configuration information), as in the snowflake algorithm.

Independent of any data source

The ID is computed independently from machine information and does not rely on any configuration information or data records, as with the common UUID, GUID, and so on.

Practical schemes

The practical schemes apply to all three implementation modes above and complement them. They aim to improve system throughput, but the limitations of the underlying implementation mode remain.

Real-time acquisition scheme

As the name implies, an ID is generated in real time each time one is requested. This is simple and fast, and the IDs are continuous, but the throughput may not be the highest.

Pre-generated scheme

Generate a batch of IDs in advance and put them in a pool. The IDs can simply auto-increment, or you can set a step size and generate them in batches. The pre-generated IDs have to be kept in a storage container (JVM memory, Redis, a database table). Throughput can be greatly improved, but temporary storage space has to be set aside, and the pre-generated IDs may be lost after a power outage.
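As a rough illustration of the pre-generated scheme, the sketch below reserves a whole range of IDs in one step and then hands them out from memory. The class name `SegmentIdAllocator` and its parameters are hypothetical, and an `AtomicLong` stands in for the shared data source (in practice a database row or a Redis key).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hands out IDs from a locally cached segment and refills it from a shared counter. */
class SegmentIdAllocator {
    // Assumption: stand-in for the shared data source (database row, Redis key, ...).
    private final AtomicLong centralCounter;
    private final int step;   // number of IDs reserved per trip to the data source
    private long next;        // next ID to hand out
    private long limit;       // exclusive upper bound of the current segment

    SegmentIdAllocator(AtomicLong centralCounter, int step) {
        this.centralCounter = centralCounter;
        this.step = step;
    }

    synchronized long nextId() {
        if (next >= limit) {
            // One atomic update reserves the whole range [limit - step, limit).
            limit = centralCounter.addAndGet(step);
            next = limit - step;
        }
        return next++;
    }

    public static void main(String[] args) {
        AtomicLong shared = new AtomicLong(0);
        SegmentIdAllocator a = new SegmentIdAllocator(shared, 1000);
        SegmentIdAllocator b = new SegmentIdAllocator(shared, 1000);
        System.out.println(a.nextId()); // prints 0
        System.out.println(b.nextId()); // prints 1000
    }
}
```

Only one call in every `step` touches the shared counter, which is where the throughput gain comes from. If a process stops, the unused remainder of its segment is simply skipped, which matches the ID loss mentioned above.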

Brief introduction of the schemes

The following is a brief introduction to the popular distributed ID schemes.

Database auto-increment ID

This mode is fully dependent on the data source: all IDs are stored in the database. It is the most commonly used way to generate IDs and was the most widely used one in the era of monolithic applications. You use the database's auto_increment as the primary key when creating a table, or use a sequence to cover auto-increment ID needs in other scenarios.

Advantages: very simple, monotonically increasing, convenient for paging and sorting.

Disadvantages: after splitting into multiple databases and tables, the auto-increment IDs of the same logical table easily collide and cannot be used directly (a step size can be set, but the limitation is obvious). Overall throughput is low. If a separate database is dedicated to producing unique IDs for distributed applications, then even with a pre-generated scheme, transaction locks make it a single-point bottleneck under high concurrency.

Applicable scenarios: table IDs within a single database instance (including master-slave replication), serial numbers counted per day, and so on. It is not suitable for sharded databases and tables or for system-wide unique IDs.
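The step-size workaround mentioned under the disadvantages can be sketched as follows: each shard starts at its own offset and advances by the total number of shards, so the sequences never overlap. MySQL exposes the same idea through the `auto_increment_offset` and `auto_increment_increment` variables. The class `SteppedSequence` is a hypothetical in-memory model, not database code.

```java
/** In-memory model of an auto-increment column with a per-shard offset and a shared step. */
class SteppedSequence {
    private long current;
    private final long step;

    /**
     * @param offset first value produced by this shard (1..shardCount)
     * @param step   total number of shards
     */
    SteppedSequence(long offset, long step) {
        this.current = offset - step;
        this.step = step;
    }

    synchronized long next() {
        current += step;
        return current;
    }

    public static void main(String[] args) {
        SteppedSequence shard1 = new SteppedSequence(1, 3);
        SteppedSequence shard2 = new SteppedSequence(2, 3);
        // shard1 yields 1, 4, 7, ... and shard2 yields 2, 5, 8, ...
        System.out.println(shard1.next() + " " + shard1.next() + " " + shard2.next());
    }
}
```

The limitation is visible in the constructor: the step is fixed to the shard count, so adding a shard later means re-planning every sequence.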

Redis generates ID

This also belongs to the fully dependent mode. The atomic INCR/INCRBY commands of Redis guarantee that the generated IDs are unique and ordered, which is essentially the same approach as the database.

Pros: the overall throughput is higher than that of the database.

Cons: after the Redis instance or cluster goes down, it is somewhat difficult to recover the latest ID value.

Applicable scenarios: counting scenarios, such as user visits, order serial numbers (date + serial number), and so on.
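To make the "date + serial number" format concrete, here is a small sketch. It assumes one counter per day; an in-memory map of `AtomicLong` values stands in for what INCR on a per-day Redis key would return, so no Redis client is involved. The class name and the 8-digit padding are illustrative choices.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Builds order numbers of the form yyyyMMdd + zero-padded daily counter. */
class DailySerialNumber {
    // Assumption: stand-in for Redis; each map entry plays the role of one per-day key.
    private final ConcurrentHashMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    String next(LocalDate day) {
        String prefix = day.format(DateTimeFormatter.BASIC_ISO_DATE); // e.g. 20191023
        long seq = counters.computeIfAbsent(prefix, k -> new AtomicLong()).incrementAndGet();
        return prefix + String.format("%08d", seq);
    }

    public static void main(String[] args) {
        DailySerialNumber orders = new DailySerialNumber();
        System.out.println(orders.next(LocalDate.of(2019, 10, 23))); // prints 2019102300000001
    }
}
```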

UUID, GUID generate ID

UUID: computed according to the standard established by the OSF, using the Ethernet card address, nanosecond-level time, chip ID code, and many other possible numbers. It is a combination of the following parts: the current date and time (the first part of a UUID is time-related; if you generate one UUID and another a few seconds later, the first part differs and the rest is the same), a clock sequence, and a globally unique IEEE machine identification number (taken from the network card if there is one, otherwise obtained in another way).

GUID: Microsoft's implementation of the UUID standard. There are various other implementations of UUID besides GUID, which are not listed here one by one.

Both methods are independent of any data source and produce truly globally unique IDs.

Advantages: no dependence on any data source, computed locally, no network I/O, extremely fast, and globally unique.

Disadvantages: unordered and relatively long (128 bits); when used as a database primary key or index, they reduce index efficiency and take up more space.

Applicable scenarios: anywhere without stringent storage-space requirements, such as link tracing, log storage, and so on.
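In Java, the simplest way to get such an ID is `java.util.UUID`. Note that `UUID.randomUUID()` produces a random (version 4) UUID rather than the time-and-MAC-based form described above, so this sketch only illustrates the size and the "no data source" property. The helper class `UuidDemo` is hypothetical.

```java
import java.util.UUID;

class UuidDemo {
    /** Returns a random UUID in its canonical 36-character text form. */
    static String newId() {
        return UUID.randomUUID().toString();
    }

    /** Returns the same 128 bits as 32 hex characters, without the dashes. */
    static String compact(UUID id) {
        return id.toString().replace("-", "");
    }

    public static void main(String[] args) {
        UUID id = UUID.randomUUID();
        // The 128 bits are exposed as two 64-bit halves.
        System.out.println(Long.toHexString(id.getMostSignificantBits()));
        System.out.println(Long.toHexString(id.getLeastSignificantBits()));
        System.out.println(compact(id));
    }
}
```

Because consecutive random values are not ordered, inserting them as primary keys scatters writes across the index, which is the efficiency concern listed above.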

Snowflake algorithm generates ID

This belongs to the semi-dependent mode. The principle is to take a Long (64 bits) and fill it according to fixed rules: time (milliseconds) + cluster ID + machine ID + sequence number. The number of bits occupied by each part can be allocated according to actual needs. In practice, the cluster ID and machine ID depend on external configuration parameters or database records.

Advantages: high performance, low latency, decentralized, ordered by time.

Disadvantages: machine clocks must be kept synchronized (to within seconds).

Applicable scenario: primary keys for data in a distributed application environment.

Does the snowflake ID algorithm sound particularly well suited to distributed architectures? For now, yes. Let's focus on its principles and best practices.

Implementation principle of snowflake algorithm

The snowflake algorithm comes from Twitter. It is implemented in Scala and exposes an RPC interface through the Thrift framework. The project originated when Twitter migrated its database from MySQL to Cassandra: Cassandra had no ready-made ID generation mechanism, which gave birth to this project. The source code is on GitHub if you are interested.

The IDs have to be ordered and unique, generated with high performance and low latency (each machine produces at least 10k IDs per second, with a response time under 2 ms), and usable in a distributed environment (multiple clusters, across data centers). The ID produced by the snowflake algorithm is therefore composed of these segments:

Time difference from the specified date (milliseconds), 41 bits, enough for 69 years

Cluster ID + machine ID, 10 bits, supporting up to 1024 machines

Sequence, 12 bits, each machine generates up to 4096 serial numbers per millisecond
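The figures in this list follow directly from the bit widths. A quick check, using a hypothetical helper class and a 365.25-day year as the only assumption:

```java
class SnowflakeCapacity {
    /** Number of distinct values that fit in the given number of bits. */
    static long valueCount(int bits) {
        return 1L << bits;
    }

    /** Years that a millisecond counter of the given width can cover. */
    static double yearsCovered(int timestampBits) {
        double millisPerYear = 365.25 * 24 * 60 * 60 * 1000;
        return valueCount(timestampBits) / millisPerYear;
    }

    public static void main(String[] args) {
        System.out.println(yearsCovered(41)); // a little under 70 years
        System.out.println(valueCount(10));   // prints 1024
        System.out.println(valueCount(12));   // prints 4096
    }
}
```

With 4096 sequence values per millisecond, a single machine can in theory issue about 4 million IDs per second, far above the 10k requirement.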

The bit layout is as follows:

1 bit: sign bit, which is always 0, so all IDs are positive integers.

41 bits: the time difference in milliseconds from a specified date, enough for about 69 years. The timestamp represented by a Long normally starts at 00:00:00 on 1970-01-01; here we can specify our own start date, such as 2019-10-23 00:00:00.

10 bits: machine ID. Machines can be deployed in different locations, and multiple clusters can be configured. You need to plan offline the numbering of each data center, each cluster, and each instance.

12 bits: sequence ID, which allows up to 4096 IDs when all the preceding parts are the same.
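The layout can be expressed as a few shifts and masks. The sketch below packs and unpacks an ID with the 41/10/12 split; the class `SnowflakeLayout` is hypothetical, and the start date 2019-10-23 00:00:00 is interpreted as UTC, which is an assumption since no time zone is given.

```java
import java.time.Instant;

class SnowflakeLayout {
    static final int WORKER_BITS = 10;
    static final int SEQUENCE_BITS = 12;
    // Assumption: the custom start date is taken as UTC.
    static final long EPOCH = Instant.parse("2019-10-23T00:00:00Z").toEpochMilli();

    /** Combines the three fields into one positive 64-bit ID. */
    static long pack(long timestampMillis, long workerId, long sequence) {
        return ((timestampMillis - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    static long timestampOf(long id) {
        return (id >> (WORKER_BITS + SEQUENCE_BITS)) + EPOCH;
    }

    static long workerOf(long id) {
        return (id >> SEQUENCE_BITS) & ((1L << WORKER_BITS) - 1);
    }

    static long sequenceOf(long id) {
        return id & ((1L << SEQUENCE_BITS) - 1);
    }

    public static void main(String[] args) {
        long id = pack(EPOCH + 1, 3, 7);
        System.out.println(id); // prints 4206599
        System.out.println(workerOf(id) + " " + sequenceOf(id)); // prints 3 7
    }
}
```

Because the timestamp sits in the highest bits, IDs generated later always compare greater, which is what makes them sortable by time.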

The bit allocation above is only the official recommendation, and we can allocate the bits according to our actual needs. For example, if we have at most a few dozen application machines but very high concurrency, we can reduce the machine part from 10 bits to 8 bits and increase the sequence from 12 bits to 14 bits.
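A configurable layout could look like the sketch below. The class `BitAllocation` is hypothetical; it keeps the sign bit at 0, gives the timestamp whatever is left of the remaining 63 bits, and uses the `-1L ^ (-1L << bits)` idiom to compute each maximum.

```java
class BitAllocation {
    final int workerBits;
    final int sequenceBits;
    final int timestampBits;

    BitAllocation(int workerBits, int sequenceBits) {
        if (workerBits < 0 || sequenceBits < 0 || workerBits + sequenceBits >= 63) {
            throw new IllegalArgumentException("worker and sequence bits must leave room for the timestamp");
        }
        this.workerBits = workerBits;
        this.sequenceBits = sequenceBits;
        this.timestampBits = 63 - workerBits - sequenceBits; // 1 bit is reserved for the sign
    }

    /** Largest machine ID that fits, e.g. 255 for 8 bits. */
    long maxWorkerId() {
        return -1L ^ (-1L << workerBits);
    }

    /** Largest sequence value that fits, e.g. 16383 for 14 bits. */
    long maxSequence() {
        return -1L ^ (-1L << sequenceBits);
    }

    public static void main(String[] args) {
        BitAllocation custom = new BitAllocation(8, 14);
        System.out.println(custom.maxWorkerId());   // prints 255
        System.out.println(custom.maxSequence());   // prints 16383
        System.out.println(custom.timestampBits);   // prints 41
    }
}
```

Moving 2 bits from the machine part to the sequence keeps the timestamp at 41 bits, so the usable time range is unchanged while the per-millisecond capacity quadruples.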

Of course, the meaning of each part can also be changed freely, for example the machine ID in the middle. In a cloud or containerized deployment, where instances are scaled out and in at any time, it is not realistic to configure instance IDs through offline planning. Instead, each time an instance restarts it can take the next value of a self-incrementing ID and use it as this part, which will be explained below.
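One possible reading of this idea, sketched under stated assumptions: every instance start takes the next value from a shared counter and keeps only the low 10 bits as its machine part. The class `WorkerIdAssigner` is hypothetical, and the `AtomicLong` stands in for a shared store such as a database sequence or a Redis counter.

```java
import java.util.concurrent.atomic.AtomicLong;

class WorkerIdAssigner {
    private static final int WORKER_BITS = 10;
    private static final long MAX_WORKER_ID = (1L << WORKER_BITS) - 1;

    // Assumption: stand-in for a counter kept in a shared data source.
    private final AtomicLong startCounter;

    WorkerIdAssigner(AtomicLong startCounter) {
        this.startCounter = startCounter;
    }

    /** Called once per instance start; wraps around after 1024 starts. */
    long assignOnStartup() {
        return startCounter.incrementAndGet() & MAX_WORKER_ID;
    }

    public static void main(String[] args) {
        WorkerIdAssigner assigner = new WorkerIdAssigner(new AtomicLong(0));
        System.out.println(assigner.assignOnStartup()); // prints 1
        System.out.println(assigner.assignOnStartup()); // prints 2
    }
}
```

The wrap-around is the weak point of this sketch: if an instance that started 1024 starts ago is still running, two live instances share the same value, so a real deployment would need enough bits or an explicit liveness check.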

There is also a basic Java implementation of snowflake on GitHub. See the source code of the snowflake Java version:

/**
 * twitter's snowflake algorithm -- java implementation
 *
 * @author beyond
 * @date 2016-11-26
 */
public class SnowFlake {

    /** starting timestamp */
    private final static long START_STMP = 1480166465631L;

    /** the number of bits occupied by each part */
    private final static long SEQUENCE_BIT = 12; // number of bits occupied by the sequence number
    private final static long MACHINE_BIT = 5; // number of bits occupied by the machine ID
    private final static long DATACENTER_BIT = 5; // number of bits occupied by the data center

    /** maximum value of each part */
    private final static long MAX_DATACENTER_NUM = -1L ^ (-1L
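The listing above is cut off in this copy. As a separate, minimal sketch of how such a generator could work (not the referenced author's code), the class below reuses the start timestamp and the 5/5/12 bit split from the listing. The class name, the error handling, and the choice to busy-wait when a millisecond is exhausted are assumptions.

```java
class SimpleSnowflake {
    private static final long START_STAMP = 1480166465631L; // value taken from the listing above
    private static final int SEQUENCE_BITS = 12;
    private static final int MACHINE_BITS = 5;
    private static final int DATACENTER_BITS = 5;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;
    private static final long MAX_MACHINE = (1L << MACHINE_BITS) - 1;
    private static final long MAX_DATACENTER = (1L << DATACENTER_BITS) - 1;

    private final long datacenterId;
    private final long machineId;
    private long sequence = 0L;
    private long lastStamp = -1L;

    SimpleSnowflake(long datacenterId, long machineId) {
        if (datacenterId < 0 || datacenterId > MAX_DATACENTER) {
            throw new IllegalArgumentException("datacenterId out of range");
        }
        if (machineId < 0 || machineId > MAX_MACHINE) {
            throw new IllegalArgumentException("machineId out of range");
        }
        this.datacenterId = datacenterId;
        this.machineId = machineId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastStamp) {
            // The clock went backwards; generating now could repeat an earlier ID.
            throw new IllegalStateException("clock moved backwards, refusing to generate an ID");
        }
        if (now == lastStamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0L) {
                // All 4096 values of this millisecond are used: wait for the next one.
                while (now <= lastStamp) {
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0L;
        }
        lastStamp = now;
        return ((now - START_STAMP) << (SEQUENCE_BITS + MACHINE_BITS + DATACENTER_BITS))
                | (datacenterId << (SEQUENCE_BITS + MACHINE_BITS))
                | (machineId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SimpleSnowflake generator = new SimpleSnowflake(1, 1);
        for (int i = 0; i < 3; i++) {
            System.out.println(generator.nextId());
        }
    }
}
```

The backwards-clock check is where the clock synchronization requirement shows up in code: the generator has no way to stay unique if the system time is set back, so this sketch simply refuses to continue.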
