What are the stages of the construction of the payment system of the website?

2025-07-11 Update From: SLTechnology News&Howtos

I. Usable stage

In the early days, business traffic was small and the business logic of the channel gateway system was simple. In one sentence: let users pay smoothly when they trade. The work came down to three things: initiating a payment request, receiving the payment success notification, and returning the paid amount to the user when a refund is requested. At this stage the system practice was simple and followed a "short, flat, fast" approach: integrate new third-party payment channels quickly and make sure they can be used. The system architecture is shown in figure 1.

II. Available stage

During the rapid iteration early in the system's evolution, few third-party payment channels were connected, the system ran fairly smoothly, and simple problems could be fixed manually by developers. However, as the number of third-party payment channels grew, new problems were gradually exposed:

(1) All business logic lives in the same physical deployment unit, so different businesses affect each other (for example, a problem in the refund business also drags down the payment business).

(2) As business traffic grows, pressure on the database increases, and occasional database fluctuations make the system unstable, which seriously hurts the users' payment experience.

(3) Synchronization of payment and refund status depends largely on asynchronous notifications from the third-party payment channels. Once a third-party channel has problems, status falls out of sync, which leads to many customer complaints and a poor user experience, and leaves development and operations in a very passive position.

To solve the interference between businesses in (1), we first considered splitting the one large physical deployment unit into several. There were two obvious splitting strategies to choose from:

Split by channel: each third-party payment channel gets its own physical deployment unit, such as one deployment unit for WeChat, one for Alipay, and so on.

Split by business type: each business gets its own physical deployment unit, such as a payment deployment unit, a refund deployment unit, and so on.

At the traffic scale of the time, the payment business had the highest priority and businesses such as refund had lower priority. Meanwhile, some channels carried only a tiny share of traffic, so giving each its own deployment unit would waste resources and make the system harder to maintain. Based on this, we made a trade-off that fit the scale of the system at that time and chose the second strategy: split by business type.

For the DB pressure problem in (2), we analyzed the causes together with the DBAs and finally chose a Master-Slave scheme: add Slaves to relieve query pressure; force reads to the Master in business scenarios that require strong consistency; and use the company's DB middleware, Zebra, for load balancing and disaster recovery switching to keep the DB highly available.
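The read/write routing idea behind this scheme can be sketched in a few lines. This is a minimal illustration only, not Zebra's API: the `ReadWriteRouter` class, its parameters and the node names are hypothetical, and real middleware would also handle health checks and failover.

```python
import itertools


class ReadWriteRouter:
    """Hypothetical router: picks a DB node for each query."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over slaves; with no slaves, everything goes to the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, is_write, force_master=False):
        # Writes, and reads that need strong consistency, go to the master.
        if is_write or force_master or self._slaves is None:
            return self.master
        # Ordinary reads are spread across the slaves.
        return next(self._slaves)


router = ReadWriteRouter("master-db", ["slave-1", "slave-2"])
print(router.route(is_write=True))                       # write path
print(router.route(is_write=False, force_master=True))   # consistent read
print(router.route(is_write=False))                      # ordinary read
```

Forcing a read to the master trades query capacity for consistency, so it is presumably reserved for the few scenarios that cannot tolerate replication lag.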

For the state synchronization problem in (3), we reviewed the different channels and, on top of the existing asynchronous notifications from third-party payment channels, added scheduled batch jobs that actively query the channels and synchronize status. This solved most of the state synchronization problems. For the small number of cases that still fail to synchronize, the system exposes an internal API so that back-office staff and developers can correct orders manually.
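A minimal sketch of how notification-driven and query-driven synchronization can coexist is shown below. All names (`OrderStore`, `on_async_notify`, `batch_sync`, `query_channel`) are hypothetical; the key assumption is that status updates are idempotent, so a late notification and a batch query for the same order cannot conflict.

```python
from dataclasses import dataclass, field

PENDING, SUCCESS, FAILED = "PENDING", "SUCCESS", "FAILED"


@dataclass
class OrderStore:
    orders: dict = field(default_factory=dict)  # order_id -> status

    def apply_status(self, order_id, status):
        # Idempotent: only a pending order may move to a final state.
        if self.orders.get(order_id) == PENDING and status in (SUCCESS, FAILED):
            self.orders[order_id] = status
            return True
        return False


def on_async_notify(store, order_id, status):
    """Path 1: the third-party channel pushes a notification."""
    return store.apply_status(order_id, status)


def batch_sync(store, query_channel, limit=100):
    """Path 2: a scheduled job actively queries orders that are still pending."""
    pending = [oid for oid, s in store.orders.items() if s == PENDING][:limit]
    updated = 0
    for oid in pending:
        if store.apply_status(oid, query_channel(oid)):
            updated += 1
    return updated
```

In use, `batch_sync` would be called by a scheduler at a fixed interval, with `query_channel` wrapping the channel's own order query interface.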

After the above practices were completed, the channel gateway system reached the basic availability stage. The internal monitoring platform showed that the availability of core service interfaces exceeded 99.9%. The evolved system architecture is shown in figure 2.

III. Flexible availability stage

After solving service isolation, DB pressure, state synchronization and related problems, the channel gateway system went through a period of stable availability. However, it could not withstand the pressure of rapid business growth. Anomalies such as small system fluctuations and traffic spikes, which were tolerable at the earlier traffic scale, are sharply magnified when they meet a traffic peak, and any of them may become the last straw that crushes the system. At the new scale of business traffic, we faced new challenges:

(1) As the team grew, new team members who integrated new channels or added new logic tended to complete the task in whatever way they were familiar with. But familiar is not necessarily sound, and new risks may be introduced. This is especially true when integrating with third-party channels: the HTTP interaction frameworks used by the system at the time included JDK HttpURLConnection/HttpsURLConnection, HttpClient 3.x and HttpClient 4.x (with several different minor versions within 4.x). We ran into several painful pitfalls on this point alone.

(2) After the service was split by business type, different businesses no longer affected each other. Within a single business, however, occasional fluctuations had little impact when traffic was small, but as traffic increased, different channels began to affect each other. For example, the payment business exposes a distributed payment API, and all channels share the same service RPC connection pool. Once the payment interface of one channel degrades, it occupies a large number of RPC connections, and requests for other, healthy channels cannot get in. The degraded channel directly prevents users from paying through it, and in a chain reaction users retry many times, which makes the degradation worse and finally causes a system avalanche and denial of service. Even after a restart, the service may be knocked over again by the flood of retry requests for the failed channel.

(3) Current third-party payment channels, whether third-party payment companies, banks or other external payment institutions, basically guide users to complete the final payment action through redirection or an SDK. In this payment link, the channel gateway system interacts with the third-party payment channel only at the back end (generating the payment redirect URL or the pre-payment voucher), and it can learn the end user's payment result only from the channel's asynchronous notification or from its own active payment query. If a failure occurs inside a third-party payment channel, the channel gateway system has no way of knowing that the payment link is broken, and the user's payment experience suffers.

(4) Some services outside the channel gateway can still access the channel gateway's DB directly. This puts the DB stability and DB capacity planning of the channel gateway system at risk, which in turn affects the availability of the channel gateway system. Internally, this is jokingly called wearing a "green hat".

(5) For the refund link, the system does not yet collect, sort and classify abnormal refund cases, and it lacks clear monitoring of the refund link. As a result, after users applied for refunds, a small number of refund requests were not processed successfully and the users filed complaints. Because of the missing monitoring, these abnormal refunds also had no follow-up measure to push them forward. In extreme cases this led to a second complaint, which badly damaged the user experience and the company's credibility.

To reduce the risk described in problem (1) as far as possible, and having learned from those painful pitfalls, we collected and organized the different application scenarios of third-party channel integration and abstracted them into an access framework. The access framework defines the whole gateway interaction flow, including request assembly, request execution, response parsing and error retry. It hides the underlying HTTP or Socket interaction details and provides corresponding extension points. For the special scenario in which bank channel access goes through a front-end processor, a connection pool (Conn Pool) and a simple load balancing mechanism (LB, providing a Round Robin routing strategy) are abstracted on top of Netty. Each channel can plug in a custom assembly policy (extending the existing HttpReq, HttpsReq or NettyReq), execution policy (extending the existing Http, Https or Netty Sender/Receiver) and parsing policy (extending the existing HttpResp, HttpsResp or NettyResp), and can reuse the components the framework already provides for content parsing (binary/xml/json parser), certificate loading (keystore/truststore loader), and encryption, decryption and signing (encrypt/decrypt/sign/verify sign). This improves the efficiency of channel access while keeping the risk of integrating a new channel as low as possible. The process structure of the access framework is shown in figure 3.
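The assemble / execute / parse / retry flow can be illustrated with a small template-method sketch. It is written in Python for brevity and does not reproduce the framework's real classes: `ChannelGateway`, `DemoJsonChannel` and `max_retries` are hypothetical names, and the transport is simulated in memory. Retrying only on transport errors is also an assumption; a real payment integration would need idempotent requests before any retry is safe.

```python
import json
from abc import ABC, abstractmethod


class ChannelGateway(ABC):
    """Hypothetical base class: fixes the flow, channels fill in the steps."""

    max_retries = 2  # extra attempts after the first one

    @abstractmethod
    def assemble(self, order): ...

    @abstractmethod
    def send(self, request): ...

    @abstractmethod
    def parse(self, raw): ...

    def execute(self, order):
        request = self.assemble(order)          # request assembly
        last_error = None
        for _ in range(self.max_retries + 1):
            try:
                return self.parse(self.send(request))  # execution + parsing
            except ConnectionError as exc:      # retry transport errors only
                last_error = exc
        raise last_error


class DemoJsonChannel(ChannelGateway):
    """Simulated channel that fails a configurable number of times."""

    def __init__(self, failures_before_success=1):
        self.failures_left = failures_before_success
        self.attempts = 0

    def assemble(self, order):
        return json.dumps({"order_id": order["id"], "amount": order["amount"]})

    def send(self, request):
        self.attempts += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("simulated transport failure")
        return json.dumps({"echo": json.loads(request), "code": "OK"})

    def parse(self, raw):
        return json.loads(raw)["code"]
```

With this shape, a new channel only supplies its three steps, while retry handling stays in one place, which is the kind of risk reduction the framework is described as aiming for.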

For problem (2), the simple and intuitive solution is channel isolation. How to isolate, and to what extent? These are the two main questions:

How to isolate: We considered splitting the payment service further by channel and keeping each system small. After such a split, however, callers of the payment API would have to distinguish between channels and call a different payment API for each, which amounts to pushing the channel isolation problem onto the caller. The number of services would also grow, and callers would have to maintain several different RPC APIs for the same payment business, which adds complexity and maintenance burden for developers and is not desirable at the current scale of business traffic. So we chose to isolate channels within the same payment service API. Because channels share the connection pool of that API, the primary goal of channel isolation is to keep a faulty channel from occupying a large part of the API connection pool and harming the other, healthy channels. If a faulty channel can be detected automatically and its requests are failed quickly at an early stage of the fault, then the faulty channel is isolated automatically at the business logic level.

To what extent: One payment channel offers several payment methods (credit card payment, debit card payment, balance payment, etc.), and some payment methods (such as credit card payment) in turn span multiple banks. So we set the minimum granularity of channel isolation to payment channel -> payment method -> bank.
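One way to picture this granularity is as a hierarchical key derived from each payment request. The sketch below is an assumption about representation, not the system's actual format: the function name, the `/` separator and the sample values are all hypothetical.

```python
def fail_fast_paths(channel, method=None, bank=None):
    """Return isolation keys from coarsest to finest granularity."""
    parts = [channel]
    if method:
        parts.append(method)
        if bank:  # a bank only makes sense under a payment method
            parts.append(bank)
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


print(fail_fast_paths("alipay", "credit_card", "bank_a"))
print(fail_fast_paths("wechat"))
```

Keeping every level of the hierarchy would let a fault be isolated at the narrowest level that explains it, for example one bank under one payment method, without taking the whole channel offline.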

Based on these considerations, we designed and implemented a fail-fast mechanism for faulty channels:

The payment information attached to each payment request is abstracted as a specific fail-fast path, and the request itself as a fail-fast transaction. If the request succeeds, the transaction succeeds; otherwise, the transaction fails.

While a fail-fast transaction executes, it passes through two cascaded fail-fast circuit breakers:

A static switch, which decides whether a payment request should fail fast based on manual configuration (on/off).

A dynamic switch, which determines the current health state from historical statistics and then decides whether the current payment request should fail fast.

The dynamic circuit breaker defines three health states (closed: let all requests through; half_open: let part of the requests through; open: fail all requests fast) and maintains a state machine that moves between health states according to historical statistics (total requests / request failures / request exceptions / request timeouts). The state transitions are shown in figure 4.

Each state change of the state machine produces a health event. The cashier service can listen for this event and take the payment channel online or offline accordingly.

The historical statistics are updated dynamically after each payment request completes.
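The cascade of a static switch and a three-state dynamic breaker can be sketched as follows. Since figure 4 is not reproduced here, the transition rules are assumptions chosen for illustration: open when the failure ratio over a full sliding window reaches a threshold, move to half_open after a cool-down, and close again after a few successful probes. The class name, parameters and thresholds are all hypothetical.

```python
import time
from collections import deque

CLOSED, HALF_OPEN, OPEN = "closed", "half_open", "open"


class FailFastBreaker:
    def __init__(self, window=10, failure_threshold=0.5, cooldown=5.0,
                 probe_every=2, recover_after=3,
                 clock=time.monotonic, on_event=None):
        self.results = deque(maxlen=window)   # True = success (history)
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown              # seconds to stay open
        self.probe_every = probe_every        # half_open: allow 1 in N
        self.recover_after = recover_after    # successes needed to close
        self.clock = clock
        self.on_event = on_event or (lambda old, new: None)
        self.static_fail = False              # static switch (manual)
        self.state = CLOSED
        self._opened_at = 0.0
        self._seen = self._ok = 0

    def _transition(self, new_state):
        old, self.state = self.state, new_state
        if new_state == OPEN:
            self._opened_at = self.clock()
        elif new_state == HALF_OPEN:
            self._seen = self._ok = 0
        else:
            self.results.clear()
        self.on_event(old, new_state)         # health event for listeners

    def allow(self):
        if self.static_fail:                  # first breaker in the cascade
            return False
        if self.state == OPEN:
            if self.clock() - self._opened_at < self.cooldown:
                return False
            self._transition(HALF_OPEN)
        if self.state == HALF_OPEN:           # release only part of the traffic
            self._seen += 1
            return (self._seen - 1) % self.probe_every == 0
        return True

    def record(self, success):
        if self.state == HALF_OPEN:
            if not success:
                self._transition(OPEN)
            else:
                self._ok += 1
                if self._ok >= self.recover_after:
                    self._transition(CLOSED)
            return
        self.results.append(success)
        full = len(self.results) == self.results.maxlen
        ratio = self.results.count(False) / len(self.results)
        if self.state == CLOSED and full and ratio >= self.failure_threshold:
            self._transition(OPEN)
```

A caller would keep one breaker per fail-fast path, call `allow()` before contacting the channel, and call `record()` once the request completes. The `on_event` callback stands in for the health event that the cashier service listens to.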

In pressure tests that simulated online traffic, the fail-fast mechanism added about 1 to 5 ms to each payment request, roughly 1% to 2% of the latency of the third-party channels' payment interfaces, which is within a controllable range. After the fail-fast mechanism for channel faults went online, we tuned it several times against the pressure test configuration until the fail-fast parameters of the online environment stabilized.

In a recent payment failure on one channel, the company's internal monitoring platform clearly showed that the fail-fast mechanism isolated the fault well, as shown in figure 5.

For problem (3), monitoring of payment link availability relies on the company's internal monitoring platform to report and watch the trend curve of payment success notifications in real time. In addition, the channel gateway system implements end-to-end monitoring of the payment link at the business level: it tracks the end-to-end total of successful payments and the payment success rate at second-level granularity, and raises real-time email or SMS alarms for the payment link based on the historical statistics of these two indicators. During traffic peaks, this monitoring can also be degraded manually (made asynchronous or turned off). This greatly improves how quickly developers respond to failures on the core payment link.
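A per-second success-rate check against a historical baseline might look like the sketch below. The alarm rule, a drop of more than a fixed tolerance below the baseline with a minimum sample size, is an assumption for illustration, as are the class name, the parameter names and their default values.

```python
from collections import defaultdict


class PaymentLinkMonitor:
    def __init__(self, drop_tolerance=0.2, min_attempts=5):
        self.drop_tolerance = drop_tolerance  # allowed drop below baseline
        self.min_attempts = min_attempts      # ignore seconds with few samples
        self.buckets = defaultdict(lambda: [0, 0])  # second -> [attempts, successes]

    def record(self, timestamp, success):
        bucket = self.buckets[int(timestamp)]  # bucket by whole second
        bucket[0] += 1
        bucket[1] += 1 if success else 0

    def success_rate(self, second):
        attempts, successes = self.buckets.get(second, (0, 0))
        return successes / attempts if attempts else None

    def should_alarm(self, second, baseline_rate):
        attempts, _ = self.buckets.get(second, (0, 0))
        rate = self.success_rate(second)
        if rate is None or attempts < self.min_attempts:
            return False                       # too little data to judge
        return rate < baseline_rate - self.drop_tolerance
```

Here `baseline_rate` would come from the historical statistics for the same time of day, and a `True` result would trigger the email or SMS alarm.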

To remove the "green hat" in problem (4), the channel gateway system worked with the DBAs to revoke direct DB access from all external systems and provided replacement APIs for them. This lays a foundation for later improvements to DB stability and DB capacity planning, and for a possible geographically distributed multi-datacenter deployment.

For the refund cases in problem (5), the channel gateway system worked with the other transaction and payment systems on the refund link to collect, sort and classify abnormal third-party channel cases at the source, and built monitoring around the core indicators of the refund link (same-day refund success rate / next-day success rate / 7-day success rate). This part of the system practice will be shared in a follow-up article, "unified optimization of refund link".
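The three refund indicators can be computed from application and completion dates. The definitions used here are assumptions: each rate is the share of all refund requests that succeeded within 0, 1 or 7 days of the application date. The function and field names are hypothetical.

```python
from datetime import date


def refund_success_rates(refunds):
    """refunds: list of (applied_on, succeeded_on or None) date pairs."""
    total = len(refunds)
    rates = {}
    for label, max_days in (("same_day", 0), ("next_day", 1), ("seven_day", 7)):
        done = sum(
            1 for applied, succeeded in refunds
            if succeeded is not None and (succeeded - applied).days <= max_days
        )
        rates[label] = done / total if total else 0.0
    return rates


sample = [
    (date(2024, 1, 1), date(2024, 1, 1)),  # refunded the same day
    (date(2024, 1, 1), date(2024, 1, 2)),  # refunded the next day
    (date(2024, 1, 1), date(2024, 1, 5)),  # refunded within 7 days
    (date(2024, 1, 1), None),              # still unprocessed
]
print(refund_success_rates(sample))
```

Refunds that never appear in the 7-day figure are the abnormal cases that need follow-up, which is the gap the monitoring is meant to expose.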

As these practices were completed, the availability of the channel gateway system improved significantly, and the availability of the core-link API reached 99.99%. During the company's 917 promotion, the channel gateway system passed the traffic peak smoothly and set a new record: the TPS of payment requests submitted to third-party channels reached an all-time high. When some channel interfaces fail, the system keeps the core payment API stable, detects and recovers the faulty channel automatically, and switches the corresponding channel online or offline at the cashier in a linked way. In addition, by monitoring the payment success rate on the core payment link, a channel can be switched online or offline manually when a third-party channel fails internally. At this point, the channel gateway system is basically guaranteed to remain flexibly available when some third-party channels are damaged. The system architecture at this stage of evolution is shown in figure 6.
