This article is excerpted from "The Way of Architectural Practice" by Wang Xindong of JD.com.
The term "traditional" here is defined by the stage of gateway technology evolution: from synchronous, to semi-synchronous, to fully asynchronous. We call gateways built on synchronous and semi-synchronous technology "traditional" gateways. A synchronous gateway means the whole path, from receiving a request to calling the API provider, is a synchronous call. A semi-synchronous gateway separates the I/O request threads from the business processing threads, but the business threads still call the backend APIs synchronously. Fully asynchronous is self-explanatory: the entire link is handled asynchronously. Next, we will look at the circumstances under which "traditional" gateways go down.
An API gateway system has two characteristics: heavy traffic and heavy dependence on other systems. As shown in the figure below, even in the "simple" case (for example, the interface provided by system A is called only by the gateway), the gateway usually carries many times more traffic than any single dependent system, because the API gateway is the aggregation point for all the APIs it exposes. The gateway also calls many underlying systems via RPC; the stability of these systems varies, and the performance of each interface indirectly affects the overall stability of the gateway. Any precautions we take should therefore start from these two characteristics.
The above describes the two major characteristics of an API gateway; these are external factors. Now let's look at the internal factors. A program runs on a computer, and the utilization and load of each of the computer's components, such as CPU, memory, and disk, directly affect how the program behaves. In addition, interaction between systems goes over the network, which also needs to be considered. The operation of a program depends on the components shown in the figure below.
Pay attention to CPU
When a user request enters the gateway, we isolate the Servlet 3 request thread from the business processing thread, which can be done with the asynchronous feature of Servlet 3 (described in detail later), as shown in the figure below.
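As a reference, here is a minimal sketch (not the book's actual implementation) of how a Servlet 3 asynchronous servlet can hand a request off to a separate business thread pool; the pool size and the placeholder backend result are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.servlet.AsyncContext;
import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Minimal sketch: the container (Servlet) thread returns immediately,
// while a separate business thread pool finishes the work.
@WebServlet(urlPatterns = "/api", asyncSupported = true)
public class AsyncGatewayServlet extends HttpServlet {

    // Hypothetical business pool; size it for your own workload.
    private final ExecutorService businessPool = Executors.newFixedThreadPool(200);

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) {
        AsyncContext ctx = req.startAsync();   // detach the request from the container thread
        businessPool.submit(() -> {
            try {
                // In a semi-synchronous gateway, the backend API is still called
                // synchronously here; this placeholder stands in for that call.
                String result = "{\"ok\":true}";
                ctx.getResponse().getWriter().write(result);
            } catch (IOException e) {
                // omitted error handling, kept short for the sketch
            } finally {
                ctx.complete();                // finish the asynchronous request
            }
        });
    }
}
```

Note that this only frees the container thread; the business thread is still blocked for the duration of the backend call, which is exactly the limitation of the semi-synchronous model.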
There is no doubt that the business thread pool runs on the CPU. Threads are among the most valuable CPU resources, so we must pay close attention to CPU utilization and CPU load.
CPU utilization: shows the percentage of CPU that the program occupies in real time while it is running.
CPU load: the average number of tasks that are using or waiting to use the CPU over a period of time. On Linux, we can use the uptime or top command (top shows more detail) to see the system load. Running uptime produces a line like the following:
11:36 up 23 days, 2:31, 2 users, load averages: 1.74 1.58 1.60
The trailing "load averages" is the system's average load; the three numbers are the average load over the last 1, 5, and 15 minutes respectively. We usually take the first number, the 1-minute granularity, to judge the current system load.
The "23 days" above simply shows that the author has not restarted his computer in 23 days.
Note that high CPU utilization does not necessarily mean high load; the two are not necessarily correlated.
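For completeness, here is a small Java sketch, not part of the book's gateway code, showing how the load average can also be read programmatically through the standard OperatingSystemMXBean.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Reads the 1-minute system load average via the JVM's platform MXBean.
public class LoadAverageCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // -1.0 if the platform does not expose it
        int cpus = os.getAvailableProcessors();
        System.out.printf("1-minute load average: %.2f on %d cores%n", load, cpus);
        if (load > cpus) {
            System.out.println("More runnable tasks than cores: the CPU is likely saturated.");
        }
    }
}
```

A common rule of thumb is to compare the load average against the number of cores, as done above, rather than looking at the raw number in isolation.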
We can use an example to illustrate the two concepts. Eight people are queuing to play a whack-a-mole machine, and each player is required to hit 100 moles within one minute. Anyone who does not finish the task within the minute has to queue again for the next round. The game machine here is the CPU, and the number of people playing or waiting to play is the number of tasks.
During the game, some people hit 100 moles within a minute and leave after completing the task, some do not finish and go back to the queue, and new players may join; the change in the number of people corresponds to tasks being added or removed. Some players pick up the hammer and hit moles for the full minute, while others check their phone for the first 20 seconds and only start hitting moles in the last 40 seconds. Treating the game machine as the CPU and the queue length as the number of tasks, we say that the former type of player (task) has high CPU utilization, while the latter has low CPU utilization.
Of course, a real CPU does not rest for the first 20 seconds and work for the last 40; the point is that some programs involve heavy computation and thus high CPU utilization, while others involve little computation and low CPU utilization. Either way, CPU utilization has nothing to do with how many people (tasks) are queuing behind.
A fair amount of space has been devoted to these two CPU concepts because they are important enough that they must be monitored in any online production environment. Given the gateway's heavy traffic and heavy dependence on other systems, if the performance of a called API suddenly deteriorates under heavy traffic, the number of threads keeps growing until CPU resources are exhausted, and the failure then spreads to the entire gateway cluster. This is the avalanche effect.
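One common mitigation, sketched below under the assumption of a plain java.util.concurrent business pool, is to bound both the pool and its queue so that a slow dependency fails fast instead of growing threads without limit. The sizes shown are arbitrary examples, not recommendations from the book.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// A bounded business pool: when a downstream API slows down, excess requests
// are rejected quickly instead of piling up threads until the CPU is exhausted.
public class BoundedBusinessPool {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                100, 200,                               // core / max threads (illustrative sizes)
                60, TimeUnit.SECONDS,                   // idle thread keep-alive
                new ArrayBlockingQueue<>(1000),         // bounded queue, no unbounded backlog
                new ThreadPoolExecutor.AbortPolicy());  // fail fast when saturated
    }
}
```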
Pay attention to disk
Two important disk metrics are disk usage and the disk load percentage (%util). Disk usage is easy to understand, so let's focus on %util. On Linux the command is iostat -x 1 10 (if iostat is missing, install it with yum install sysstat). The example values shown in the figure below pose no threat, but if %util approaches 100%, there are too many I/O requests, the I/O system is saturated, and the disk may become a bottleneck.
While a program is running we may pay little attention to disk usage, but if it is not handled properly the disk can become a "time bomb". The gateway handles heavy traffic, and logging inside some programs is not standardized, for example the log level is set improperly and info-level logs are printed in production. Even when the log level is reasonable and only error logs are written, the gateway's second characteristic, dependence on other systems, comes into play: when a dependent API fails, a large volume of error logs is written to disk, so the disk can easily fill up. This is especially true in the container era, where the disk allocated to each instance is small compared with a physical machine. If the disks of all machines in the cluster fill up, it is a disaster for the gateway system.
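As an illustration only, the following Java sketch checks the remaining space of a hypothetical log partition (/export/logs is an assumed path) so that a service could alert, or reduce its log level, before the disk fills up.

```java
import java.io.File;

// Checks remaining capacity on the log partition so the service can react
// (alert, lower the log level, clean old files) before the disk is full.
public class DiskSpaceCheck {
    public static void main(String[] args) {
        File logDir = new File("/export/logs");   // hypothetical log path
        long totalBytes = logDir.getTotalSpace(); // 0 if the path does not exist
        if (totalBytes == 0) {
            System.out.println("Log path not found, skipping check.");
            return;
        }
        double usedRatio = 1.0 - (double) logDir.getUsableSpace() / totalBytes;
        System.out.printf("Disk used: %.1f%%%n", usedRatio * 100);
        if (usedRatio > 0.9) {
            System.out.println("Log partition above 90%: reduce log level or clean old files.");
        }
    }
}
```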
Pay attention to the network
In a microservice architecture, applications cannot be separated from the network, least of all the gateway, whose defining characteristic is dependence on many systems. Those dependencies are RPC calls, and RPC calls go over the network; the network accounts for a large share of the time spent in an RPC call, so network quality directly affects how long a request takes to pass through the API gateway and return to the user. As shown in the figure below, if the network between the gateway and dependent system B suddenly deteriorates, call latency increases; under heavy traffic, the one-request-one-thread model directly drives up the number of task threads in the gateway, and if the situation is not restored quickly, threads will exhaust the CPU resources of every machine in the API gateway cluster.
At the same time, the current online production deployment cannot fully guarantee that calls stay within the same data center, and they may even cross regions, so the network is an important factor to consider, and it must be considered together with the CPU thread resources discussed above.
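A minimal sketch of one common safeguard, assuming a plain HttpURLConnection client: bounding the connect and read timeouts so that a degraded network cannot hold a request thread indefinitely. The endpoint and timeout values are illustrative only.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Bounds how long one request thread can be held by a slow network, so a
// degraded dependency cannot pin gateway threads indefinitely.
public class TimedBackendCall {
    public static int callBackend(String endpoint) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
        conn.setConnectTimeout(500);    // ms allowed to establish the connection
        conn.setReadTimeout(1000);      // ms allowed to wait for the response
        try {
            return conn.getResponseCode();  // throws SocketTimeoutException when exceeded
        } finally {
            conn.disconnect();
        }
    }
}
```

Timeouts alone do not remove the blocking, but they cap the damage each slow call can do to the thread pool.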
We can now summarize the ways a traditional API gateway "dies". First, the API performance of a dependent system suddenly deteriorates and request threads pile up until they exhaust the CPU; this stems from the gateway's dependence on other systems and can be described as being "dragged to death" by them. Second, non-standard log output in production, excessive logging, or a sudden surge of requests outpaces the log cleaning tools until the disk fills up; this can be described as being "killed" by the logs. Third, the network has always been the least stable factor apart from the systems themselves; a network failure between systems slows requests down, which is similar to the first case of being "dragged to death" by other systems, except that this time the culprit is the network.
Charlie Munger has a famous saying: "If I knew where I was going to die, I would never go there." Likewise, for an API gateway, if we know which factors can bring the gateway down, we can take precautions in advance and avoid such a "disaster". This is not to say that the traditional gateway is bad; it has its own advantages, such as a simple programming model and convenient development, debugging, operation and maintenance. If the business scale is small, for example a daily call volume below ten million, or even below one hundred million, you can keep using this type of gateway, and even at the hundred-million scale, combined with an effective fault-tolerance mechanism (such as Netflix's Zuul 1 + Hystrix), it can still support hundreds of millions of calls. However, there is a better fully asynchronous gateway solution, which we will introduce next.
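To make the Zuul 1 + Hystrix idea concrete, here is a minimal Hystrix command sketch; the command group name, the callRemoteApi stub, and the fallback value are hypothetical, and this illustrates the isolation-and-fallback pattern rather than the author's code.

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

// Wraps one backend call so that slow or failing dependencies are isolated
// and answered with a fallback instead of exhausting gateway threads.
public class ProductApiCommand extends HystrixCommand<String> {

    private final String productId;

    public ProductApiCommand(String productId) {
        super(HystrixCommandGroupKey.Factory.asKey("ProductApi"));
        this.productId = productId;
    }

    @Override
    protected String run() throws Exception {
        // The real downstream API call would go here.
        return callRemoteApi(productId);
    }

    @Override
    protected String getFallback() {
        return "{}";   // degraded response when the dependency is slow or failing
    }

    // Hypothetical stub standing in for the real RPC/HTTP client.
    private static String callRemoteApi(String productId) {
        return "{\"id\":\"" + productId + "\"}";
    }
}
```

Usage is simply `new ProductApiCommand("42").execute()`; Hystrix runs the command in its own thread pool and trips a circuit breaker when the dependency keeps failing.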
Author: Wang Xindong
He works at JD.com and is the author of the WeChat official account "Program erection". He enjoys summarizing and sharing, and has deep research and practical experience in high-performance API gateways, thread tuning, NIO, microservice architecture, fault tolerance, and other technologies. He is currently focused on leading his team to achieve breakthroughs in open platform technology.