I. Background
Internet applications and large-scale enterprise applications are mostly expected to run 24 hours a day, yet truly uninterrupted operation is practically impossible to achieve. For this reason, availability is commonly measured on a scale from "three nines" (99.9%) to "five nines" (99.999%):
Availability    Unavailability    Annual downtime (minutes)
99.9%           0.1%              365 x 24 x 60 x 0.1%   = 525.6
99.99%          0.01%             365 x 24 x 60 x 0.01%  = 52.56
99.999%         0.001%            365 x 24 x 60 x 0.001% = 5.256
For an application whose features and data keep growing, maintaining high availability is not easy. To achieve it, the Yixin payment system has done a great deal of exploration and practice in avoiding single points of failure, ensuring the high availability of the application itself, and handling transaction-volume growth.
Excluding sudden failures of external dependencies, such as network problems or large-scale unavailability of third-party payment providers and banks, the Yixin payment system's service availability can reach 99.999%.
This article focuses on how to improve the availability of the application itself; avoiding single points of failure and handling transaction-volume growth will be covered in other articles in this series.
To improve the availability of an application, the first step is to avoid failures as much as possible, but avoiding them completely is impossible. The Internet is a place prone to the "butterfly effect": any seemingly negligible accident with near-zero probability may occur and then be magnified without limit.
We all know that RabbitMQ itself is very stable and reliable. The Yixin payment system originally ran a single RabbitMQ node, and since it never broke down, everyone assumed it was unlikely to fail.
Until one day, the aging physical host of that node suffered a hardware failure, RabbitMQ could no longer provide service, and the system instantly became unavailable.
A failure itself is not terrible; what matters most is finding and fixing it in time. The Yixin payment system requires of itself that faults be detected within seconds, then diagnosed and resolved quickly, to minimize their negative impact.
II. Problems
Learn from history. Let's first briefly review some of the problems the Yixin payment system has encountered:
(1) When integrating a newly connected third-party channel, a new developer, lacking experience, overlooked the importance of setting timeouts. This one small detail caused all transactions in that third party's queue to block, affecting transactions in other channels.
(2) The Yixin payment system is distributed and supports grayscale release, so its environments and deployed modules are numerous and complex. Once, after a new module went online, the database ran out of connections, because there are multiple environments and each environment runs two nodes; this affected the functions of other modules.
(3) Another timeout problem: a third-party timeout exhausted all of the configured worker threads, leaving no threads to process other transactions.
(4) Third party A provides authentication, payment, and other interfaces. A sudden increase in the Yixin payment system's transaction volume triggered A's DDoS protection at its network operator. Since a data center's egress IP is usually fixed, the operator mistook the transactions from this egress IP for attack traffic, making A's authentication and payment interfaces unavailable at the same time.
(5) Another database problem, also caused by a sudden increase in transaction volume: the colleague who created a sequence gave it an upper limit of 9999999999999, but the corresponding database field is only 32 characters long. While transaction volume was small, the generated values fit within the field's 32 characters; as volume grew, the sequence quietly gained digits, until 32 characters were no longer enough to store the value.
Problems like these are very common in Internet systems, and often well hidden, so knowing how to avoid them is very important.
III. Solutions
Below we look at the improvements the Yixin payment system made in three areas.
3.1 Avoid failures as much as possible
3.1.1 Design a fault-tolerant system
Take rerouting as an example. For a payment, users do not care which channel their money goes through; they only care that it succeeds. The Yixin payment system connects to more than 30 channels, and a payment may fail on channel A; it can then be dynamically rerouted to channel B or C, so the user is spared the failure and the system achieves payment fault tolerance.
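A minimal sketch of this rerouting idea (illustrative only; the PaymentChannel type and the channel ordering are hypothetical, not the actual Yixin implementation):

```java
import java.util.List;

// Hypothetical types for illustration; not the actual Yixin code.
interface PaymentChannel {
    // @return true on confirmed success, false on confirmed failure.
    boolean pay(String orderId, long amountInCents);
}

class ReroutingPayer {
    private final List<PaymentChannel> channels; // ordered by routing preference

    ReroutingPayer(List<PaymentChannel> channels) {
        this.channels = channels;
    }

    // Try each channel in turn; a confirmed failure on one channel triggers
    // a reroute to the next. Timeouts must NOT be rerouted, because the
    // first attempt may still have succeeded (see the Q&A at the end).
    boolean payWithReroute(String orderId, long amountInCents) {
        for (PaymentChannel ch : channels) {
            if (ch.pay(orderId, amountInCents)) {
                return true; // the user only cares that the payment succeeded
            }
        }
        return false; // every channel confirmed failure
    }
}
```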
There is also fault tolerance for OOM, as in Tomcat. A system's memory can always be exhausted; if the application reserves some memory for itself at startup, it can catch the OutOfMemoryError when OOM occurs and handle it gracefully instead of simply dying.
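One common way to implement this "reserve memory" trick, as a hedged sketch (the buffer size and the recovery behavior are assumptions, not Yixin's actual code):

```java
// Hold a spare buffer so that when an OutOfMemoryError is caught,
// releasing it leaves enough headroom to log, alert, and degrade.
public class OomGuard {
    // Reserve ~8 MB at startup purely as emergency headroom (illustrative size).
    private static volatile byte[] reserve = new byte[8 * 1024 * 1024];

    public static void run(Runnable task) {
        try {
            task.run();
        } catch (OutOfMemoryError e) {
            reserve = null; // release the reserve so recovery code can allocate
            System.gc();    // hint the JVM to reclaim the freed buffer now
            // with the freed headroom we can at least log and raise an alarm
            System.err.println("OOM caught, degrading: " + e);
        }
    }
}
```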
3.1.2 Fail fast on some links: the "fail fast" principle
The fail-fast principle: when any step of the main process goes wrong, the whole process should end quickly and reasonably, rather than wait until the negative impact spreads.
Here are a few examples:
(1) When the payment system starts, it needs to load queue information and configuration into the cache. If loading fails or the queue configuration is wrong, request processing is bound to fail. The best way to handle this is to have the JVM exit directly on load failure, preventing the application from starting in an unusable state.
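A minimal fail-fast startup sketch (the loadQueueConfig() helper and Config type are hypothetical):

```java
public class Bootstrap {
    public static void main(String[] args) {
        try {
            Config config = loadQueueConfig(); // hypothetical loader
            validate(config);
        } catch (Exception e) {
            System.err.println("FATAL: queue/config load failed, refusing to start: " + e);
            System.exit(1); // fail fast: a half-started node is worse than no node
        }
        // ... start processing requests only after config is known-good ...
    }

    static Config loadQueueConfig() throws Exception { /* load from DB/file */ return new Config(); }
    static void validate(Config c) throws Exception { /* throw if queues are misconfigured */ }
    static class Config {}
}
```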
(2) The payment system's longest response time for real-time transactions is 40s. Beyond 40s, the front-end system stops waiting, releases the thread, and informs the merchant that the transaction is being processed; the final result is delivered later by notification, or by the business line's active query.
(3) The Yixin payment system uses redis as a cache database, for functions such as real-time alarm burying points and deduplication checks. If a redis connection takes more than 50ms, the redis operation is abandoned automatically; in the worst case, the impact on a payment is 50ms, within the range the system allows.
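A sketch of this 50ms budget, assuming the Jedis client (the host name and the treat-failure-as-cache-miss policy are illustrative):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisException;

// Cap connect and socket timeouts at 50 ms and treat any Redis failure as a
// cache miss, so the worst-case cost to a payment is the 50 ms timeout itself.
public class BoundedCache {
    private static final int TIMEOUT_MS = 50;

    public String getOrNull(String key) {
        try (Jedis jedis = new Jedis("redis.internal", 6379, TIMEOUT_MS)) { // host is illustrative
            return jedis.get(key);
        } catch (JedisException e) {
            return null; // abandon the cache operation; the payment proceeds without it
        }
    }
}
```

(In production one would use a connection pool rather than a fresh connection per call; the point here is only the bounded timeout.)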
3.1.3 Design a system with the ability to protect itself
A system generally has third-party dependencies, such as databases and third-party interfaces. When developing the system, we need to be suspicious of these third parties, to avoid the chain reaction, and the resulting downtime, when a third party has problems.
(1) split message queue
The Yixin payment system provides merchants with a variety of payment interfaces, including quick payment, personal online banking, corporate online banking, refund, revocation, batch payment, batch withholding, single withholding, voice payment, balance inquiry, ×× authentication, bank card authentication, and so on. It connects to more than 30 payment channels, including WeChat Pay, ApplePay, and Alipay, and serves hundreds of merchants. Across these three dimensions, how do we ensure that different businesses, third parties, merchants, and payment types do not affect one another? What the Yixin payment system does is split the message queues. The following figure shows the split of some business message queues:
(2) restrict the use of resources
Limiting resource use is the most important point in designing a high-availability system, and also the easiest to overlook. Resources are finite; consume too much of them and the application will naturally go down. To this end, the Yixin payment system has done the following homework:
Limit the number of connections
With the horizontal scaling of a distributed system, the number of database connections must be planned, not maximized endlessly. Database connections are limited, and all modules need to be budgeted globally, especially accounting for the growth brought by horizontal expansion.
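As an illustration, with a pool such as HikariCP the per-node budget can be made explicit (all numbers and the URL here are made up):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Give every module an explicit, budgeted pool size so that total connections
// across all modules, environments, and nodes stay below the database's cap.
public class DbPool {
    public static HikariDataSource create() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://db.internal:3306/pay"); // illustrative URL
        cfg.setMaximumPoolSize(10);      // per-node budget: modules x environments x nodes must fit the DB cap
        cfg.setMinimumIdle(2);
        cfg.setConnectionTimeout(3_000); // fail fast if the pool is exhausted
        return new HikariDataSource(cfg);
    }
}
```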
Limit the use of memory
Excessive memory usage leads to frequent GC and to OOM. It mainly comes from two sources:
A: collections that grow too large
B: objects that are no longer referenced but are never released. For example, objects placed in a ThreadLocal are only reclaimed when the thread exits; with thread pools, worker threads never exit, so the values must be removed explicitly, as the sketch below shows.
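A sketch of the ThreadLocal pitfall and its fix:

```java
// In a thread pool the worker thread never exits, so a value left in a
// ThreadLocal is never reclaimed. Always remove() in a finally block.
public class RequestContext {
    private static final ThreadLocal<String> ORDER_ID = new ThreadLocal<>();

    public static void handle(String orderId, Runnable work) {
        ORDER_ID.set(orderId);
        try {
            work.run();
        } finally {
            ORDER_ID.remove(); // without this, pooled threads leak one value per request
        }
    }
}
```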
Restrict thread creation
Unrestricted thread creation eventually becomes uncontrollable, especially thread creation hidden deep in the code.
When the system's sy (system CPU) value is too high, Linux is spending too much time on thread switching. In Java the main cause is that many threads have been created, and they constantly shift between blocking (lock waits, IO waits) and running, producing a large number of context switches.
In addition, Java allocates physical memory outside the JVM heap when creating threads, so too many threads also consume too much physical memory.
Thread creation is best done through thread pools, to avoid the context switching caused by too many threads.
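A sketch of an explicitly bounded pool (the sizes are illustrative, not Yixin's actual configuration):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A bounded pool with a bounded queue keeps thread count (and hence context
// switching and off-heap memory) under control; the caller-runs policy
// applies back-pressure instead of creating unbounded work.
public class Workers {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                8, 32,                          // core and max threads: explicit ceilings
                60, TimeUnit.SECONDS,           // idle threads above core are reclaimed
                new ArrayBlockingQueue<>(1000), // bounded queue, never an unbounded one
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when saturated
    }
}
```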
Limit concurrency
Anyone who has built a payment system knows that some third-party payment companies impose concurrency requirements on merchants. The third party evaluates the allowed concurrency from actual transaction volume, so if concurrency is not controlled and all transactions are fired at the third party, the third party will simply reply "please reduce the frequency of submission".
Therefore, both in the system design phase and in code review, special attention must be paid to limiting concurrency to the range the third party allows.
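A sketch of capping in-flight third-party calls with a Semaphore (the limit and wait time are illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Cap concurrent calls to a third party; MAX_CONCURRENT would be set to
// whatever the third party has agreed to.
public class ThirdPartyGate {
    private static final int MAX_CONCURRENT = 20; // illustrative, per agreement
    private final Semaphore permits = new Semaphore(MAX_CONCURRENT);

    public String call(java.util.function.Supplier<String> request) throws InterruptedException {
        if (!permits.tryAcquire(100, TimeUnit.MILLISECONDS)) {
            throw new IllegalStateException("third-party concurrency limit reached, failing fast");
        }
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }
}
```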
Above we covered the first of the three changes the Yixin payment system made for availability, avoiding failures as much as possible. Now the other two.
3.2 Find faults in time
Faults are like the devil entering the village: they catch you off guard. When the first line of defense, prevention, is breached, a second line of defense must be raised in time, finding the fault, to preserve availability. This is where the alarm and monitoring system comes into play. In a car without a dashboard you cannot know your speed, your fuel level, or even whether the turn signal is on; no matter how skilled the "old driver" is, that is dangerous. Likewise, a system needs monitoring, and ideally it alarms before danger strikes, so problems can be resolved before a fault creates real risk.
3.2.1 Real-time alarm system
Without real-time alarms, the uncertainty of the system's running state leads to unquantifiable disasters. The Yixin payment system's monitoring targets the following:
Real-time: achieve second-level monitoring.
Comprehensive: cover all system businesses, leaving no blind spots.
Practical: warnings are divided into several levels, so monitoring staff can easily make accurate decisions according to severity.
Diverse: warnings are provided in both push and pull modes, including SMS, email, and a visual interface, so monitoring staff can find problems in time.
Alarms divide into single-machine alarms and cluster alarms; the Yixin payment system is cluster-deployed. Real-time warning depends on real-time statistical analysis of each business system's data, so the difficulty lies mainly in the data burying points and the analysis system.
3.2.2 Data burying points
To achieve real-time analysis without affecting the trading system's response time, the Yixin payment system places real-time data burying points via redis in each module; the buried data is then collected into the analysis system, which analyzes it and raises alarms according to its rules.
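A sketch of what such a redis burying point might look like, assuming the Jedis client and an illustrative key scheme (not Yixin's actual format):

```java
import redis.clients.jedis.Jedis;

// Each module increments a per-minute counter keyed by metric name; the
// analysis system reads these counters and applies its alarm rules.
public class Metrics {
    public static void mark(Jedis jedis, String metric) {
        long minute = System.currentTimeMillis() / 60_000;
        String key = "metric:" + metric + ":" + minute; // illustrative key naming
        jedis.incr(key);
        jedis.expire(key, 3600); // keep an hour of history, then let keys expire
    }
}
```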
3.2.3 Analysis system
The hardest part of the analysis system is defining the business alarm points: which alarms must be sent the moment they appear, and which only need attention. A detailed introduction to the analysis system follows:
(1) system operation architecture
(2) system operation flow
(3) system service monitoring point
The Yixin payment system's business monitoring points were accumulated bit by bit through daily operation; they divide into an alarm class and an attention class.
A: alarm type
Network abnormal early warning
Warning for single-order timeout or incomplete processing
Real-time early warning of transaction success rate
Abnormal state early warning
Non-return early warning
Failure notification early warning
Abnormal failure warning
Response code frequent early warning
Check inconsistent early warning
Special state early warning
B: concern class
Abnormal trading volume warning
Warning when transaction volume exceeds 5 million (500W)
SMS backfill timeout warning
Illegal IP warning
3.2.4 Non-business monitoring points
Non-business monitoring points mainly refer to monitoring from the operations perspective, including network, host, storage, logs, and so on. The details are as follows:
(1) Service availability monitoring
Collect JVM information such as Young GC/Full GC counts and times, heap memory usage, the Top 10 most time-consuming thread stacks, and cache buffer lengths.
(2) Traffic monitoring
A monitoring agent is deployed on each server to collect traffic in real time.
(3) external system monitoring
Observe whether the third parties and the network are stable through periodic gap (interval) probing.
(4) Middleware monitoring
For MQ consumption queues, queue depth is analyzed in real time through RabbitMQ script detection.
For the database, performance is monitored in real time through the xdb plug-in.
(5) Real-time log monitoring
Distributed logs are collected through rsyslog, then monitored and analyzed in real time by the analysis system, and finally shown to users through a purpose-built visual page.
(6) system resource monitoring
Through Zabbix, monitor the host's CPU load, memory utilization, upstream and downstream traffic per network card, read/write rates per disk, read/write counts (IOPS) per disk, space utilization per disk, and so on.
The above is what the Yixin payment system's real-time monitoring does, split into two aspects: business-point monitoring and operations monitoring. Although the system is deployed in a distributed way, every warning point responds within seconds. There is one more difficulty with business alarm points: some alarms are harmless in small quantities but signal real trouble in large quantities, the proverbial qualitative change caused by quantitative change.
For example, a single network exception may just be network jitter, but repeated occurrences mean the network itself deserves attention. Examples of the Yixin payment system's network-anomaly alarms:
Single-channel network anomaly warning: 12 consecutive network anomalies occur on channel A within 1 minute, triggering the warning threshold.
Multi-channel network anomaly warning 1: within 10 minutes, 3 consecutive network anomalies occur, involving 3 channels, triggering the warning threshold.
Multi-channel network anomaly warning 2: within 10 minutes, 25 network anomalies occur in total, involving 3 channels, triggering the warning threshold.
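Rules like these are essentially sliding-window counters; an illustrative sketch (not the actual Yixin analysis system):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Alarm when at least `threshold` anomalies for one channel
// fall within the last `windowMillis` milliseconds.
public class AnomalyWindow {
    private final Deque<Long> timestamps = new ArrayDeque<>();
    private final int threshold;
    private final long windowMillis;

    public AnomalyWindow(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    // Record one anomaly; returns true if the alarm threshold is crossed.
    public synchronized boolean record(long nowMillis) {
        timestamps.addLast(nowMillis);
        while (!timestamps.isEmpty() && nowMillis - timestamps.peekFirst() > windowMillis) {
            timestamps.pollFirst(); // drop events outside the window
        }
        return timestamps.size() >= threshold;
    }
}
```

For the single-channel rule above, this would be instantiated as new AnomalyWindow(12, 60_000) and fed each channel-A anomaly.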
3.2.5 Logging and analysis system
For a large system, recording and analyzing the huge daily log volume is difficult. The Yixin payment system averages 2 million (200W) orders per day, and one transaction flows through more than a dozen modules; assuming 30 log entries per order, you can imagine the daily log volume.
Log analysis in the Yixin payment system serves two purposes: real-time log exception warnings, and providing the order track for operations staff.
(1) Real-time log early warning
Real-time log warning works on all real-time transaction logs: lines containing the keywords Exception or Error are captured in real time and an alarm is raised. The benefit is that any runtime exception in the code is discovered immediately. The Yixin payment system first collects logs with rsyslog; the analysis system then captures matching lines in real time and raises warnings.
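A minimal sketch of the keyword capture, assuming the collected logs land in a local file (the path and the alert hook are illustrative):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Tail the collected log stream and alarm on Exception/Error lines.
public class LogWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("/var/log/pay/app.log"))) {
            while (true) {
                String line = reader.readLine();
                if (line == null) { Thread.sleep(200); continue; } // wait for new lines
                if (line.contains("Exception") || line.contains("Error")) {
                    alert(line); // hypothetical hook into the alarm system
                }
            }
        }
    }

    static void alert(String line) {
        System.err.println("ALERT: " + line);
    }
}
```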
(2) order trajectory
A trading system needs to know an order's status in real time. The Yixin payment system initially recorded the order track in the database, but after running for a while found that the surge in orders made the table too large to maintain.
The current practice is that each module prints a log track, formatted according to the database table structure. rsyslog collects all the printed logs; the analysis system captures the standard-format lines in real time, parses them, stores them into the database day by day, and shows them on the operations staff's visual interface.
The log printing specifications are as follows:
(Sample order-track log line: pipe-delimited ("||") fields including timestamps such as 2016-07-22 18:15:00.512, the order ID CEX16XXXXXXX5751, the thread name pool-73-thread-4, the module name Channel Adapter, channel and merchant identifiers, the amount, the order status Pending, host IPs, and the notification callback URL http://10.100.444.59:8080/regression/notice.)
The visualized order track built from these logs looks as follows:
Besides the two functions above, the logging and analysis system also lets users download and view transaction request and response messages.
3.2.6 7×24-hour monitoring room
The alarms above reach operations staff in both push and pull forms: SMS and email push, plus report displays. In addition, because a payment system is more critical than most Internet systems, the Yixin payment system runs a 7×24-hour monitoring room to guarantee the system's security and stability.
3.3 handle failures in a timely manner
After a fault occurs, especially in production, the first thing to do is not to find its cause but to handle it as fast as possible to restore availability. The Yixin payment system's common faults and countermeasures are as follows:
3.3.1 automatic repair
Most of the Yixin payment system's automatically repaired faults are caused by third-party instability; for these, the system reroutes automatically, as described above.
3.3.2 Service degradation
Service degradation means turning off some functions during a fault that cannot be fixed quickly, to preserve the core functions. When a merchant runs a promotion and its transaction volume grows too large, the Yixin payment system adjusts that merchant's traffic in real time, degrading its service so other merchants are unaffected. There are many similar scenarios; specific degradation features will be described later in this series.
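A sketch of a per-merchant throttle in the spirit of this degradation, using Guava's RateLimiter (the limits are made-up numbers, and this is not Yixin's actual mechanism):

```java
import java.util.concurrent.ConcurrentHashMap;
import com.google.common.util.concurrent.RateLimiter;

// Cap one merchant's traffic in real time so that its promotion
// cannot starve other merchants.
public class MerchantThrottle {
    private final ConcurrentHashMap<String, RateLimiter> limiters = new ConcurrentHashMap<>();

    public boolean tryAdmit(String merchantId) {
        RateLimiter limiter = limiters.computeIfAbsent(
                merchantId, id -> RateLimiter.create(200.0)); // 200 TPS default cap (illustrative)
        return limiter.tryAcquire(); // false => degrade this merchant's request
    }

    // Called by operations to tighten one merchant's cap during an incident.
    public void setLimit(String merchantId, double permitsPerSecond) {
        limiters.put(merchantId, RateLimiter.create(permitsPerSecond));
    }
}
```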
IV. Q&A
Q1: Can you share the details of that RabbitMQ outage and how it was handled?
A1: That RabbitMQ downtime is what triggered our thinking about system availability. RabbitMQ itself did not go down (RabbitMQ is in fact very stable); what failed was the physical machine RabbitMQ was deployed on. The real problem was that RabbitMQ was deployed as a single point; everyone assumed RabbitMQ would never fail and so ignored the host underneath it. The lesson for us: no part of the business may be a single point, including application servers, middleware, network equipment, and so on. And a single point must be considered beyond the component itself, for example running double copies of every service, then AB testing, and of course dual data centers.
Q2: Are your development and operations staff in one team?
A2: Our development and operations are separate. Today's sharing considered things mainly from the level of overall system availability, so it involved mostly development and some operations. I have witnessed the Yixin payment system's journey all the way.
Q3: Do you use Java throughout the backend? Have you considered other languages?
A3: Most of our current systems are Java, with a little python, php, and C, depending on the business type. Java suits us best at this stage; other languages may be considered as the business expands.
Q4: You said to be skeptical of third-party dependencies. Can you give a concrete example of how? What if the third party is completely down?
A4: Systems generally depend on third parties, such as databases and third-party interfaces. When developing, be suspicious of the third party, to avoid the chain reaction and the downtime when it fails. We all know that once a problem appears in a system, it snowballs and grows. For example, our code-scanning channel: if there were only one, we would be helpless when it failed. So we doubted it from the start and connected multiple channels; if an exception occurs, the real-time monitoring system triggers an alarm and then switches the routing channel automatically to keep the service available. Second, we split the asynchronous messages by payment type, merchant, and transaction type, so that an unpredictable exception in one transaction type does not affect other channels. It is like a highway with multiple lanes, where fast and slow traffic do not interfere. The overall idea is fault tolerance + splitting + isolation, applied case by case.
Q5: After a payment timeout there can be network problems. Could money be paid but the order lost? How do you handle disaster recovery and data consistency? Do you replay logs to repair data?
A5: Payment's first priority is safety, so we take a conservative strategy on order status: orders with network anomalies are set to a "processing" status, and final consistency with the bank or the third party is reached through active queries or passively received notifications. Besides order status there are response codes: the bank or third party answers through response codes, and translating a response code into an order status must also be conservative, to ensure no overpayment or underpayment of funds. In short, fund safety comes first, and every strategy follows the whitelist principle.
Q6: You mentioned that when a payment channel times out, the routing policy sends it to another channel. But per the channel map, channels are different payment methods, such as Alipay or WeChat Pay. If I intend to pay through WeChat, why would a retry switch to another channel rather than WeChat again? Or does a "channel" here mean one request node?
A6: First, rerouting cannot be done on timeouts: with a socket timeout you cannot tell whether the transaction reached the third party, nor whether it succeeded or failed. If it actually succeeded and you retry, the user pays twice, and that loss of funds is unacceptable to the company. Second, routing depends on the business type. For a simple collection or payment transaction, the user does not care which channel the money moves through, so it can be rerouted. For a code-scanning payment, if the user scanned with WeChat, it must ultimately go through WeChat; but we have many intermediate channels through which WeChat transactions travel, and we can route among those intermediate channels. To the user it is still WeChat Pay.
Q7: Can you give an example of the automatic repair process? How do you get from detecting instability to rerouting?
A7: Automatic repair means fault tolerance through rerouting, and this is a very good question. Once instability is detected, you decide whether to reroute; rerouting requires certainty that the current transaction did not succeed, otherwise it causes double payment or double collection. Currently our system reroutes in two ways, before the fact and during the fact. Before the fact: if the real-time warning system finds a channel unstable within, say, the last 5 minutes, subsequent transactions are routed to other channels. During the fact: the failure response code returned for each order is analyzed; the response codes are catalogued, and rerouting happens only for codes that permit it. These are just two illustrative points; there are many more business details which, for reasons of length, I will not elaborate. The general idea is that there must be an in-memory real-time analysis system making second-level decisions; it must be fast, and it should combine real-time and offline analysis for decision support. Our real-time second-level warning system does exactly this.
Q8: Do merchants run promotions regularly? How does a promotion peak differ from normal load? Do you run technical drills? What is the degradation priority?
A8: We generally communicate with merchants in advance to learn the timing of their promotions, then prepare accordingly. Promotion peaks differ greatly from normal load; promotions usually concentrate within a 2-hour window, and for some merchants selling financial products within 1 hour, so the peak is very high. Our technical drill is to learn the merchant's promotion plan, estimate the required system capacity, and rehearse in advance. Degradation priority is mainly per merchant: since our merchants connect to many payment scenarios (wealth management, collection and payment on behalf, quick payment, code scanning, and so on), our overall principle is that merchants must not affect one another. Your promotion must not hurt other merchants' business.
Q9: How are the logs collected by rsyslog stored?
A9: A good question. At the beginning, the order-track log was recorded in a database table. But an order flows through many modules, so one order produces about 10 track records; at 4 million (400W) transactions a day the table becomes a problem, and even splitting it hurts database performance, all for what is only an auxiliary function. We then realized that writing a log is cheaper than writing to a database, so the real-time log is printed in table format to disk, in a fixed directory on the log server; since it is only the real-time log, the volume is small. The logs live on distributed machines and are gathered to a central place; that storage is handled by mounting, and a program written by a dedicated operations team parses these table-format logs in real time. Finally a visual page shows them to operations staff, so the order tracks operators see are nearly real time. Storage is really not a concern: we separate real-time logs from offline logs, and offline logs are rotated and eventually deleted after a retention period.
Q10: How do system monitoring and performance monitoring coordinate with each other?
A10: As I understand it, system monitoring includes system performance monitoring; performance monitoring is part of overall system monitoring, so there is no coordination problem. Performance monitoring has multiple dimensions, such as application level, middleware, and containers. See the non-business monitoring section of this article.
Author: Feng Zhongqi
Source: Yixin Institute of Technology