This article explains what a high-concurrency system looks like. The explanation is kept simple and clear, walking through the architecture piece by piece.
Evolution of microservice architecture
In the early days of the Internet, a monolithic architecture was enough to support day-to-day business: all of the services lived in one project deployed on a single physical machine.
Every piece of business, the trading system, membership information, inventory, goods and so on, was mixed together. Once traffic grew, the problems of the monolith were exposed: if the machine died, none of the business was available.
So the clustered architecture appeared. When a single machine cannot withstand the pressure, the easiest answer is to scale horizontally and spread the traffic across different machines through load balancing, which at least removes the single point that made the service unavailable.
However, as the business developed, keeping every business scenario in one project made the code harder and harder to develop and maintain. A simple requirement change meant releasing the whole service, merge conflicts became more frequent, and the chance of an online failure grew. The microservice architecture was born.
Each independent business is split out and deployed separately: development and maintenance costs drop, the cluster's capacity to absorb load grows, and a small change no longer affects the whole system.
From the perspective of high concurrency, the points above boil down to improving the system's overall ability to withstand load through service splitting and cluster scale-out; the problems introduced by the splitting are exactly the problems a high-concurrency system has to solve.
RPC
The benefits and convenience of splitting into microservices are obvious, but communication between microservices now has to be considered.
Traditional HTTP communication wastes a lot of performance, so an RPC framework such as Dubbo, built on TCP long connections, is introduced to improve the communication efficiency of the whole cluster.
Assume the QPS from clients is 9,000; the load-balancing policy distributes 3,000 to each of three machines. After switching from HTTP to RPC, interface latency drops, so both the single-machine and the overall QPS improve.
The RPC framework itself generally comes with its own load balancing and circuit-breaking/degradation mechanisms, which helps maintain the high availability of the whole system.
Since Dubbo is basically the common choice in China, let's go over some of its basic principles.
How Dubbo works:
When the services start, the provider and consumer connect to the registry according to their configuration, and register and subscribe to services respectively.
Based on the subscription relationship, the registry returns provider information to the consumer, and the consumer caches it locally. If the information changes, the consumer receives a push from the registry.
The consumer generates a proxy object, selects a provider according to the load-balancing strategy, and periodically reports call counts and timing information for each interface to the monitor.
Having obtained the proxy object, the consumer initiates the interface call through it.
After receiving the request, the provider deserializes the data and invokes the concrete implementation through its own proxy.
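To make the provider/consumer/registry roles above concrete, here is a minimal sketch using Dubbo's programmatic API (assuming Apache Dubbo 2.7.x and a local ZooKeeper registry at 127.0.0.1:2181; the OrderService interface and its implementation are hypothetical):

```java
// Minimal sketch: expose a service and reference it through the registry.
// OrderService/OrderServiceImpl are hypothetical; the registry address is an assumption.
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.ReferenceConfig;
import org.apache.dubbo.config.RegistryConfig;
import org.apache.dubbo.config.ServiceConfig;

public class DubboSketch {
    interface OrderService {
        String createOrder(String userId);
    }

    static class OrderServiceImpl implements OrderService {
        public String createOrder(String userId) {
            return "order-for-" + userId;
        }
    }

    public static void main(String[] args) {
        ApplicationConfig application = new ApplicationConfig("dubbo-sketch");
        RegistryConfig registry = new RegistryConfig("zookeeper://127.0.0.1:2181");

        // Provider side: register the implementation with the registry.
        ServiceConfig<OrderService> service = new ServiceConfig<>();
        service.setApplication(application);
        service.setRegistry(registry);
        service.setInterface(OrderService.class);
        service.setRef(new OrderServiceImpl());
        service.export();

        // Consumer side: subscribe and obtain a proxy; calls go through the RPC channel.
        ReferenceConfig<OrderService> reference = new ReferenceConfig<>();
        reference.setApplication(application);
        reference.setRegistry(registry);
        reference.setInterface(OrderService.class);
        OrderService proxy = reference.get();
        System.out.println(proxy.createOrder("100"));
    }
}
```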
Dubbo load balancing strategy:
Weighted random: suppose we have servers = [A, B, C] with weights = [5, 3, 2], for a total weight of 10.
Tile these weights onto a one-dimensional axis: the interval [0, 5) belongs to server A, [5, 8) to server B, and [8, 10) to server C.
Then generate a random number in [0, 10) and work out which interval it falls into (a standalone sketch of this idea follows this list).
Least active: each service provider has an active count, initially 0. Each incoming request increments it by 1, and it is decremented by 1 when the request completes.
After the service has been running for a while, higher-performing providers process requests faster, so their active counts drop faster and they receive new requests first.
Consistent hash: the provider invokers and their virtual nodes are hashed and projected onto a ring of [0, 2^32 - 1]. On a query, the key is run through md5 and then hashed, and the invoker at the first node whose value is greater than or equal to the current hash is selected.
Weighted round-robin: for example, with a weight ratio of 5:2:1 for servers A, B and C, out of eight requests server A receives five, server B two and server C one.
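As an illustration of the weighted-random strategy described in the first item, here is a standalone sketch in plain Java (not Dubbo's actual RandomLoadBalance implementation): the weights are tiled onto a line and a random offset selects the interval it falls into.

```java
// Weighted random selection sketch: weights [5, 3, 2] tile the interval [0, 10),
// and a random offset decides which server receives the request.
import java.util.concurrent.ThreadLocalRandom;

public class WeightedRandomSketch {
    static String select(String[] servers, int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        int offset = ThreadLocalRandom.current().nextInt(total); // random point in [0, total)
        for (int i = 0; i < servers.length; i++) {
            offset -= weights[i];           // walk through the tiled intervals
            if (offset < 0) {
                return servers[i];          // the offset fell inside interval i
            }
        }
        return servers[servers.length - 1]; // unreachable when all weights are positive
    }

    public static void main(String[] args) {
        String[] servers = {"A", "B", "C"};
        int[] weights = {5, 3, 2};
        for (int i = 0; i < 5; i++) {
            System.out.println(select(servers, weights));
        }
    }
}
```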
Cluster fault tolerance:
Failover Cluster (failover, automatic switching): Dubbo's default fault-tolerance scheme. When a call fails it automatically switches to another available node. The retry count and interval can be configured on the service reference; the default is 2 retries, which suits read operations.
Failback Cluster (failure auto-recovery): after a call fails, the failure and the call information are logged and an empty result is returned to the consumer; a scheduled task retries the failed call every 5 seconds.
Failfast Cluster (fail fast): the call is made only once, and an exception is thrown immediately on failure.
Failsafe Cluster (fail safe): if an exception occurs during the call, it is logged but not thrown, and an empty result is returned.
Forking Cluster (parallel invocation): multiple threads from a thread pool call several providers in parallel and put the results into a blocking queue; as soon as one provider returns successfully, that result is returned.
Broadcast Cluster (broadcast): every provider is called one by one; if any of them fails, an exception is thrown once the loop completes.
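As a sketch of how a consumer might select one of these strategies, the snippet below assumes the programmatic ReferenceConfig setters of Dubbo 2.7.x (setCluster, setRetries, setLoadbalance, setTimeout); the service interface, registry address and the concrete values are illustrative assumptions, not a recommended configuration.

```java
// Sketch: choosing a fault-tolerance strategy and retry count on the consumer side.
import org.apache.dubbo.config.ApplicationConfig;
import org.apache.dubbo.config.ReferenceConfig;
import org.apache.dubbo.config.RegistryConfig;

public class ClusterConfigSketch {
    // Same hypothetical service contract as in the earlier sketch.
    interface OrderService {
        String createOrder(String userId);
    }

    public static void main(String[] args) {
        ReferenceConfig<OrderService> reference = new ReferenceConfig<>();
        reference.setApplication(new ApplicationConfig("order-consumer"));
        reference.setRegistry(new RegistryConfig("zookeeper://127.0.0.1:2181"));
        reference.setInterface(OrderService.class);
        reference.setCluster("failover");        // or failfast / failback / failsafe / forking / broadcast
        reference.setRetries(2);                 // retries in addition to the first attempt
        reference.setLoadbalance("leastactive"); // pair with one of the load-balancing strategies above
        reference.setTimeout(1000);              // per-call timeout in milliseconds
        OrderService orderService = reference.get();
        System.out.println(orderService.createOrder("100"));
    }
}
```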
Message queue
Everyone knows the role of MQ: peak shaving (cutting peaks and filling valleys) and decoupling. Relying on a message queue to turn synchronous calls into asynchronous ones reduces the coupling between microservices.
Interfaces that do not need to execute synchronously can be made asynchronous by introducing a message queue, which improves response time.
After a transaction completes, you need to deduct inventory, and you may then need to issue points to the member. Issuing points is essentially not part of the core trading flow and its real-time requirements are low; as long as eventual consistency is guaranteed, the order is still fulfilled correctly.
Requests of this nature can be made asynchronous through MQ, which improves the system's ability to withstand load.
When using a message queue, how do we ensure that messages are reliable and not lost?
Message reliability
Messages can be lost in three places: the producer sending the message, MQ itself, and the consumer.
① Producer loss
The producer can lose a message when a send fails with an exception and there is no retry, or when the send succeeds at the client but a network flash during transmission means MQ never receives it.
Synchronous sending usually does not suffer from this, so we set it aside and discuss the asynchronous-send scenario.
Asynchronous sending comes in two flavors: with a callback and without. Without a callback, the producer sends and does not care about the result, so the message may be lost. The solution is asynchronous sending + callback notification + a local message table.
Take an ordering scenario as an example:
When the order is placed, save the order data and an MQ message record in the local database; the message status is "sending". If the local transaction fails, the order fails and everything is rolled back.
If the order succeeds, return success to the client and send the MQ message asynchronously.
The MQ callback reports the send result, and the message's delivery status in the database is updated accordingly.
A scheduled JOB polls for messages that have not been sent successfully within a configured time window (depending on the business) and retries them.
For messages that still fail after a certain number of retries, raise an alarm through the monitoring platform or the JOB itself and handle them manually.
Generally speaking, the plain asynchronous-callback form is fine for most scenarios; the full solution is only needed where messages absolutely must not be lost.
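Below is a minimal sketch of the "async send + callback notification + local message table" pattern with the RocketMQ Java client; the LocalMessageDao, topic name and name-server address are hypothetical placeholders for the persistence and configuration described above.

```java
// Sketch: async send with a callback that updates the local message table.
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.client.producer.SendCallback;
import org.apache.rocketmq.client.producer.SendResult;
import org.apache.rocketmq.common.message.Message;

public class AsyncSendSketch {

    // Hypothetical local message-table DAO: the record was inserted as "SENDING"
    // in the same transaction that created the order.
    interface LocalMessageDao {
        void markSuccess(String orderNo);
        void markFailed(String orderNo);
    }

    public static void sendOrderMessage(LocalMessageDao messageDao, String orderNo, byte[] payload) throws Exception {
        DefaultMQProducer producer = new DefaultMQProducer("order-producer-group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // assumed local name server
        producer.start();

        Message msg = new Message("ORDER_CREATED", "create", orderNo, payload);
        producer.send(msg, new SendCallback() {
            @Override
            public void onSuccess(SendResult sendResult) {
                // Callback confirms delivery to the broker: flip the local record to SUCCESS.
                messageDao.markSuccess(orderNo);
            }

            @Override
            public void onException(Throwable e) {
                // Mark the record as FAILED; the scheduled JOB will retry it later.
                messageDao.markFailed(orderNo);
            }
        });
    }
}
```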
② MQ loss
Even if the producer guarantees the message reaches MQ, the message may still only be in memory when the broker goes down; if it has not yet been flushed to disk or synchronized to the slave nodes, it is lost.
Take RocketMQ: it supports synchronous and asynchronous disk flushing, and the default is asynchronous, which can lose messages that have not yet reached disk. Setting synchronous flushing guarantees reliability: even if MQ crashes, the messages can be recovered from disk on restart.
Kafka can also be configured:
acks=all: the producer is only acknowledged once every node participating in replication has received the message, so the message is not lost unless all nodes die.
replication.factor=N: set N greater than 1 so each partition has at least 2 copies.
min.insync.replicas=N: set N greater than 1 so the leader must perceive at least one follower still in sync.
retries=N: set a very large value so the producer keeps retrying failed sends.
Configuration like this makes MQ itself highly available, but at a performance cost; how to configure it is a trade-off that depends on the business.
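The producer-side half of the Kafka settings above might look like the sketch below: acks and retries are producer configs, while replication.factor and min.insync.replicas are topic/broker-level settings (shown only as comments). The bootstrap address and topic name are assumptions.

```java
// Sketch: reliability-oriented Kafka producer configuration.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ReliableKafkaProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(ProducerConfig.ACKS_CONFIG, "all");                // wait for all in-sync replicas
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // keep retrying failed sends
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Topic-level (not producer-level) settings, applied when the topic is created:
        //   replication.factor >= 2, min.insync.replicas >= 2

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("order-events", "order-100", "paid"));
        }
    }
}
```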
③ Consumer loss
The consumer-loss scenario: the consumer has just received the message when its server goes down; MQ believes the message has already been consumed and will not redeliver it, so the message is lost.
RocketMQ requires the consumer to return an ACK by default, while in Kafka you need to disable automatic offset commits and commit manually.
If the consumer does not return an ACK, the redelivery mechanism differs by MQ; once the retry limit is exceeded the message enters a dead-letter queue and must be handled manually. (Kafka has no such built-in mechanism.)
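A sketch of the Kafka side of this: disable automatic offset commits and commit manually only after the business logic has succeeded, so a crash before processing does not lose the message. The addresses, group id and topic are assumptions.

```java
// Sketch: manual offset commit so the message is only acknowledged after processing.
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ManualAckConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-consumer-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // no automatic offset commit
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("order-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process the message first (e.g. issue points); only then acknowledge.
                    System.out.println("processing " + record.value());
                }
                consumer.commitSync(); // commit offsets only after successful processing
            }
        }
    }
}
```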
Final consistency of messages
Transactional messages can achieve eventual consistency for distributed transactions; they are an XA-like distributed-transaction capability provided by MQ.
A half (semi-transactional) message is one that MQ has received from the producer but cannot yet deliver, because the second confirmation has not arrived.
The implementation works as follows (a sketch with RocketMQ's transactional producer appears after these steps):
The producer first sends a half message to MQ.
MQ returns an ACK after receiving it.
The producer begins to execute the local transaction.
If the local transaction succeeds, the producer sends commit to MQ; if it fails, it sends rollback.
If MQ does not receive the second confirmation (commit or rollback) for a long time, it initiates a check-back to the producer.
The producer queries the final status of the local transaction.
Based on that status, it submits the second confirmation again.
Finally, if MQ receives a commit as the second confirmation, it delivers the message to consumers; if it receives a rollback, the message is kept and deleted after 3 days.
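A minimal sketch of this half-message flow with RocketMQ's TransactionMQProducer is shown below; the local order-creation logic, the status lookup used in the check-back, and the name-server address are hypothetical placeholders.

```java
// Sketch: half message -> local transaction -> commit/rollback, with broker check-back.
import org.apache.rocketmq.client.producer.LocalTransactionState;
import org.apache.rocketmq.client.producer.TransactionListener;
import org.apache.rocketmq.client.producer.TransactionMQProducer;
import org.apache.rocketmq.common.message.Message;
import org.apache.rocketmq.common.message.MessageExt;

public class TransactionMessageSketch {
    public static void main(String[] args) throws Exception {
        TransactionMQProducer producer = new TransactionMQProducer("order-tx-group");
        producer.setNamesrvAddr("127.0.0.1:9876"); // assumed local name server
        producer.setTransactionListener(new TransactionListener() {
            @Override
            public LocalTransactionState executeLocalTransaction(Message msg, Object arg) {
                try {
                    // Run the local transaction (e.g. create the order) after the
                    // half message has been accepted by the broker.
                    createOrderLocally(new String(msg.getBody()));
                    return LocalTransactionState.COMMIT_MESSAGE;   // second confirmation: commit
                } catch (Exception e) {
                    return LocalTransactionState.ROLLBACK_MESSAGE; // second confirmation: rollback
                }
            }

            @Override
            public LocalTransactionState checkLocalTransaction(MessageExt msg) {
                // Broker check-back when no second confirmation arrived:
                // answer from the persisted local transaction status.
                return localTransactionSucceeded(msg.getKeys())
                        ? LocalTransactionState.COMMIT_MESSAGE
                        : LocalTransactionState.UNKNOW;
            }
        });
        producer.start();

        Message half = new Message("ORDER_CREATED", "create", "order-100", "payload".getBytes());
        producer.sendMessageInTransaction(half, null); // step 1: send the half message
    }

    // Hypothetical placeholders for the local business logic and status lookup.
    private static void createOrderLocally(String payload) { /* ... */ }
    private static boolean localTransactionSucceeded(String orderNo) { return true; }
}
```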
Database
For the whole system, all read and write traffic ultimately lands on the database, which makes the database the core of the system's ability to handle high concurrency.
Reducing the pressure on the database and improving its performance is the cornerstone of supporting high concurrency; the main means are read/write splitting and splitting databases and tables.
For the whole system, traffic should look like a funnel. For example, with 200,000 daily active users (DAU), perhaps only 30,000 of them reach the order page each day, and only 10,000 orders are placed and paid successfully.
For such a system, reads far outnumber writes, so read/write splitting can reduce the pressure on the database.
Read/write splitting is effectively a database cluster and reduces the pressure on any single node. But as data grows rapidly, the original single-database, single-table storage can no longer support the business, so databases and tables have to be split.
With microservices, vertical splitting by database has usually already been done; what remains is mostly the table-sharding scheme.
Horizontal table sharding
First, choose the sharding field (sharding_key) based on the business scenario. For example, with 10 million orders a day and most traffic coming from the C side, user_id can be used as the sharding_key.
Queries need to cover the last 3 months of orders, with older data archived. Three months of data is about 900 million rows; split into 1024 tables, each table holds roughly 900,000 rows.
For example, for user_id 100 we compute hash(100) and take it modulo 1024 to decide which table the record lands in.
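A standalone sketch of this routing calculation follows; the hash function and table-name format are illustrative assumptions, not any specific sharding framework's rule.

```java
// Sketch: route a record to one of 1024 tables by hashing the sharding key (user_id).
public class ShardingRouteSketch {
    private static final int TABLE_COUNT = 1024; // power of two, so "& (n - 1)" == "% n"

    static String routeTable(long userId) {
        int hash = Long.hashCode(userId);
        int index = hash & (TABLE_COUNT - 1);    // avoids negative results from %
        return String.format("t_order_%04d", index);
    }

    public static void main(String[] args) {
        System.out.println(routeTable(100L)); // t_order_0100: all of user 100's orders land here
    }
}
```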
ID uniqueness after sharding
Because primary keys are auto-increment by default, keys generated in different shards are bound to collide.
There are several options:
Set auto-increment offsets and a step size: for example, with 1024 tables, give each table a different starting offset from 1 to 1024 and a step of 1024, so the primary keys generated in different tables never collide.
Use distributed IDs: implement your own distributed ID generation, or use an open-source algorithm such as Snowflake (a compact sketch follows this list).
Stop using the primary key as the query basis after sharding; instead add a business-unique field to each table, for example the order number in the order table. Whichever table a row lands in, queries and updates go by the order number.
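For the distributed-ID option, here is a compact Snowflake-style sketch (41 bits of timestamp, 10 bits of worker id, 12 bits of per-millisecond sequence), so IDs are unique across tables and roughly time-ordered. The epoch and worker id are illustrative, and a production generator would also handle clock rollback.

```java
// Sketch: Snowflake-style ID generator.
public class SnowflakeSketch {
    private static final long EPOCH = 1700000000000L; // custom epoch (assumption)
    private static final long WORKER_BITS = 10L;
    private static final long SEQUENCE_BITS = 12L;
    private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

    private final long workerId;
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeSketch(long workerId) {
        this.workerId = workerId & ((1L << WORKER_BITS) - 1);
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {               // sequence exhausted in this millisecond
                while (now <= lastTimestamp) { // spin until the next millisecond
                    now = System.currentTimeMillis();
                }
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                | (workerId << SEQUENCE_BITS)
                | sequence;
    }

    public static void main(String[] args) {
        SnowflakeSketch generator = new SnowflakeSketch(1);
        System.out.println(generator.nextId());
        System.out.println(generator.nextId());
    }
}
```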
Master-slave synchronization principle
Master-slave synchronization works as follows:
After the master commits a transaction, it writes the change to the binlog.
The slave connects to the master to fetch the binlog.
The master creates a dump thread that pushes the binlog to the slave.
The slave starts an IO thread that reads the master's binlog and records it in the relay log.
The slave starts another SQL thread that reads relay-log events and replays them on the slave, completing the synchronization.
The slave records its own binlog.
Because MySQL replication is asynchronous by default, the master does not care whether the slave has processed the log after sending it. This raises a problem: if the master fails and a slave that has not caught up is promoted to master, those log entries are lost. Hence two further concepts.
① Fully synchronous replication
After writing the binlog, the master forces the log to be synchronized to the slaves and returns to the client only after all slaves have executed it; obviously this seriously hurts performance.
② Semi-synchronous replication
Unlike full synchronization, the logic of semi-synchronous replication is this: the slave returns an ACK to the master after the log has been successfully written on the slave side, and the master considers the write complete once it has received confirmation from at least one slave.
Caching
As the representative of high performance, the cache may carry more than 90% of the hot traffic in some special businesses.
For scenarios like flash sales where concurrency can reach hundreds of thousands of QPS, introducing cache pre-warming can greatly reduce the pressure on the database: 100,000 QPS would likely take down a single database, but is not a problem for a cache such as Redis.
Take a flash-sale system as an example: product information can be cached ahead of the event to serve queries, the activity's inventory can be pre-loaded into the cache, and deductions during ordering can be done entirely in the cache, written back to the database asynchronously after the sale ends. The pressure on the database becomes much smaller.
Of course, once a cache is introduced, a series of issues such as breakdown, avalanche and hot keys has to be considered.
Hot Key problem
The so-called hot-key problem is that hundreds of thousands of requests suddenly hit one particular key on Redis; the traffic is so concentrated that it can saturate the physical NIC, the Redis server goes down, and an avalanche follows.
Solutions for hot keys:
Break the hot key up across different servers in advance to spread the pressure.
Add a second-level (local) cache and load the hot-key data into memory ahead of time; if Redis goes down, the in-memory copy serves the queries.
Cache breakdown
Cache breakdown means the concurrency on a single key is too high, and when that key expires, all the requests hit the DB directly. It is similar to the hot-key problem, except that here it is the expiry that sends every request to the DB.
Solutions:
Lock the update: for example, when a request for A misses the cache, take a lock on key A, query the database, write the result back to the cache and return it to the user, so that subsequent requests can read it from the cache (see the sketch after this list).
Store the expiration time inside the value and keep refreshing it asynchronously before it runs out, so the key never actually goes cold.
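A minimal sketch of the lock-update idea follows, using an in-process lock and map as stand-ins for a distributed lock and Redis; the cache and loader are hypothetical placeholders.

```java
// Sketch: only one thread rebuilds an expired hot key; the others wait, then read the cache.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

public class LockUpdateSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();        // stand-in for Redis
    private final Map<String, ReentrantLock> keyLocks = new ConcurrentHashMap<>();

    public String get(String key, Function<String, String> dbLoader) {
        String value = cache.get(key);
        if (value != null) {
            return value;                    // cache hit, no lock needed
        }
        ReentrantLock lock = keyLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            value = cache.get(key);          // double check: another thread may have rebuilt it
            if (value == null) {
                value = dbLoader.apply(key); // only one thread hits the database
                cache.put(key, value);       // write back (with a TTL in a real cache)
            }
            return value;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        LockUpdateSketch sketch = new LockUpdateSketch();
        System.out.println(sketch.get("product:100", key -> "loaded-from-db-for-" + key));
    }
}
```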
Cache penetration
Cache penetration means querying data that exists in neither the cache nor the database, so every request falls through to the DB, as if the cache did not exist.
To solve this, add a layer of Bloom filter. A Bloom filter works by mapping each stored value through hash functions to K positions in a bit array and setting them to 1.
When a user queries key A again and any of A's positions in the Bloom filter is 0, we return immediately and no penetrating request reaches the DB.
The obvious downside of a Bloom filter is false positives, because multiple values can map to the same positions. In theory, the longer the bit array, the lower the false-positive rate, so it should be sized according to the actual situation.
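A sketch of placing a Bloom filter in front of the cache and DB, assuming Guava's BloomFilter as the dependency: keys that were never inserted are rejected before they can reach the database, while false positives remain possible (and false negatives do not occur).

```java
// Sketch: reject queries for keys that were never stored, before they hit the DB.
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

public class BloomFilterSketch {
    public static void main(String[] args) {
        // 1 million expected keys with a ~1% false-positive rate.
        BloomFilter<String> existingIds = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

        existingIds.put("order-100"); // load all known keys at startup / on write

        // A key the filter has never seen is definitely absent: return immediately.
        System.out.println(existingIds.mightContain("order-100"));   // true
        System.out.println(existingIds.mightContain("order-99999")); // false (or a rare false positive)
    }
}
```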
Cache avalanche
When caches fail on a large scale at the same moment, for example because the cache service goes down, the flood of requests goes straight to the DB and may bring the whole system down. That is called a cache avalanche.
The avalanche problem differs from breakdown and hot keys in that it is about large-scale cache expiry.
Several solutions for avalanches:
Set different expiration times for different keys to avoid simultaneous expiry (a sketch of adding TTL jitter follows this list).
Rate limiting: if Redis goes down, limiting the traffic prevents a flood of simultaneous requests from crushing the DB.
A second-level cache, the same as in the hot-key scheme.
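As a sketch of the first item, add random jitter to the base TTL so keys warmed at the same moment do not all expire together; the base TTL and jitter range are illustrative, and the resulting value would be passed to the cache's SETEX/expire call.

```java
// Sketch: base TTL plus random jitter to spread out expirations.
import java.util.concurrent.ThreadLocalRandom;

public class JitteredTtlSketch {
    static long ttlSeconds(long baseSeconds, long maxJitterSeconds) {
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitterSeconds + 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            // e.g. 30 minutes plus up to 5 minutes of jitter
            System.out.println("TTL = " + ttlSeconds(1800, 300) + "s");
        }
    }
}
```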
Stability
Circuit breaking: for example, if the marketing service hangs or a large number of its interfaces time out, that must not affect the main ordering link; operations involving point deductions can be remedied afterwards.
Rate limiting: under sudden high concurrency such as a big-promotion flash sale, interfaces without rate limiting may take the service down outright. Setting appropriate limits based on each interface's load-test results is particularly important (a minimal rate-limiting sketch appears after this list).
Degradation: what happens after a circuit breaks is really a form of degradation. For example, after the marketing interface is circuit-broken, the degradation plan might be not to call the marketing service for a short period and to resume calls once marketing recovers.
Pre-arranged plans: generally, even with a unified configuration center, no changes are allowed at business peaks, but pre-configuring reasonable contingency plans allows some changes to be made in an emergency.
Reconciliation: for the consistency issues produced by distributed transactions across systems, or data anomalies caused by attacks, a reconciliation platform that performs final data verification is essential.
For example, checking whether the amounts in the downstream payment system and the order system match, or whether data has been tampered with by a man-in-the-middle attack.
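As a sketch of interface-level rate limiting, the snippet below uses Guava's RateLimiter as an assumed dependency; the permitted QPS is an illustrative value that would in practice come from load-test results.

```java
// Sketch: per-interface rate limit with fast rejection instead of queuing up.
import com.google.common.util.concurrent.RateLimiter;

public class RateLimitSketch {
    // Allow roughly 1000 requests per second for this interface (illustrative value).
    private static final RateLimiter ORDER_LIMITER = RateLimiter.create(1000.0);

    public static String createOrder(String userId) {
        if (!ORDER_LIMITER.tryAcquire()) {
            return "SYSTEM_BUSY";                 // fast rejection / fallback
        }
        return "order-created-for-" + userId;     // normal order flow
    }

    public static void main(String[] args) {
        System.out.println(createOrder("100"));
    }
}
```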
Thank you for reading. That, in outline, is what a high-concurrency system looks like; hopefully this article has deepened your understanding, and the specifics still need to be verified in practice.