Full-link stress testing automation in practice

Updated 2025-02-14, from SLTechnology News&Howtos (shulou)

Domestic vacation is a low-frequency business that is strongly tied to holidays: traffic during holidays is five to ten times the usual level, which puts the production system at considerable risk. Before the 2018 Spring Festival, therefore, we used Quake, Meituan's basic stress testing platform, to bring the entire domestic vacation business into full-link stress testing, systematically evaluating capacity and surfacing hidden risks, and ultimately kept the system stable through the Spring Festival.

Throughout this process we came to see that full-link stress testing plays an important role in building system stability and is the most effective means we have. Given how often holidays occur for this business (roughly once a month), making full-link stress testing a routine stability safeguard would keep the quality of our systems well protected. To address the pain points of periodic, routine stress testing, namely high labor cost, duplicated work across teams, and uncontrolled, high-risk safety, we proposed automating full-link stress testing.

By cataloguing the concrete actions involved in a stress test, we push standardization and automation into every stage, raise the execution efficiency of the whole process as far as possible, and ultimately make stress testing routine, as shown in figure 1:

In addition, the safety and effectiveness of stress testing are quality attributes that need close attention throughout the full-link stress testing cycle. Based on these considerations, as shown in figure 2, we break down the key problems that stress test automation has to solve:

How to automate the basic process and improve human efficiency.

How to verify stress tests automatically to ensure their safety.

How to quantify the confidence of a stress test to ensure its effectiveness.

Finally, on top of Quake, Meituan's stress testing platform (which mainly provides traffic recording, replay, and load generation for the whole system), we designed and implemented a full-link automated stress testing system that improves the efficiency of full-link stress testing for different businesses and ensures its safety. The system:

Provides a link mapping tool that automatically builds the complete dependency information for a stress test entry link and assists with link mapping.

Supports link labeling and configuration. For dependent interfaces that stress test traffic must not reach, the Mock setup can be completed through configuration alone, without embedding stress test judgment logic in business code.

Provides an abstract data construction interface, so that users can configure arbitrary data construction logic and flows through the platform.

Automatically checks the services and traffic under test before and during a stress test to ensure safety.

Runs periodic low-traffic stress test verification on ordinary days, based on the stress test plan, so that safety risks introduced by business iterations are discovered as early as possible.

Provides stress test plan management: automatic scheduling and control of load generation frees up manpower, and a mandatory pre-stress test improves safety.

Supports one-click stress testing with automatically generated reports, collects link entry and alarm information, and provides problem recording and follow-up.

System design

Overall system design

The overall logical architecture of the system, shown in figure 3, consists mainly of the following functional modules: link construction / comparison, event / metric collection, link management, stress test configuration management, stress test verification, data construction, stress test plan management, and report output. Together these modules support the whole full-link stress testing process and lower, as far as possible, the threshold and cost for business teams to adopt it.

Link construction / comparison: builds, updates, and stores the invocation links of service interface methods.

Link management: based on the constructed link relationships, provides labeling of core dependencies and of exit interfaces to Mock, upstream and downstream analysis and display, and exit Mock configuration.

Stress test configuration management: automatically discovers the stress test configuration of registered services for Mafka (a distributed message middleware developed by Meituan based on Kafka), Cellar (a distributed KV storage service based on Tair), Squirrel (a distributed cache system based on Redis Cluster mode), and Zebra (Meituan's database access layer middleware), and helps the testing team check and set the related configuration items.

Stress test verification: ensures that a system is ready to be stress tested, using a range of verification methods and mechanisms to keep the stress test safe.

Data construction: prepares basic data and traffic data for the stress tests of different businesses.

Stress test plan management: defines the stress test execution plan and relies on the "stress test control" module to schedule the whole execution process automatically.

Fault diagnosis: based on the collected key business / service metrics, alarms, and other information, determines whether a service is abnormal and whether the stress test should be terminated.

Confidence evaluation: evaluates the confidence of stress test results, that is, their similarity to real traffic, along the dimensions of data coverage, link coverage, and technical metrics.

Non-functional requirements:

Extensibility

Accommodates the different data construction logic of different business lines.

Supports different traffic recording methods.

Security

Integrates SSO, groups users by team, and shows each user only the stress test service information of their own team. Operation logs are kept for key operations.

Stress test verification is the key to ensuring stress test safety. Periodic verification is supported, so that any degradation over time in a service's readiness for stress testing can be detected.

Reusability

In the long run, modules such as link construction, event / metric collection, and fault diagnosis are reusable infrastructure for the stability domain, so they are built as independent, general-purpose modules.

Constraints:

The system is built on Quake and depends on it for traffic recording, replay, and load generation.

The design of several key modules is described in detail below.

Design of the link management module

The link management module builds on the link construction module, which stores link relationships in two dimensions (service and interface) in a closure table and builds or updates them automatically on a periodic basis.
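To make the closure-table idea concrete, here is a minimal in-memory sketch. A closure table keeps one row per ancestor/descendant pair, including each node paired with itself at depth 0, so everything below a stress test entry can be read without recursive traversal. The class name `LinkClosureTable`, its methods, and the map-based storage are assumptions for illustration only; the article does not describe the actual schema or storage used by the system.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative closure table for call links; names and storage are hypothetical. */
class LinkClosureTable {
    // ancestor -> (descendant -> depth); every node also maps to itself at depth 0.
    private final Map<String, Map<String, Integer>> rows = new LinkedHashMap<>();

    private void ensureNode(String node) {
        rows.computeIfAbsent(node, k -> new LinkedHashMap<>()).putIfAbsent(node, 0);
    }

    /** Records a direct call caller -> callee and derives all transitive rows. */
    void addCall(String caller, String callee) {
        ensureNode(caller);
        ensureNode(callee);
        // Snapshot of everything at or below the callee (the callee itself is at depth 0).
        Map<String, Integer> below = new LinkedHashMap<>(rows.get(callee));
        for (Map<String, Integer> fromAncestor : rows.values()) {
            Integer depthToCaller = fromAncestor.get(caller);
            if (depthToCaller == null) {
                continue; // this node is not the caller or one of its ancestors
            }
            for (Map.Entry<String, Integer> d : below.entrySet()) {
                // Keep the shortest path when a node is reachable in several ways.
                fromAncestor.merge(d.getKey(), depthToCaller + d.getValue() + 1, Math::min);
            }
        }
    }

    /** All interfaces reachable from an entry, read with a single lookup. */
    Set<String> descendantsOf(String entry) {
        Set<String> result = new LinkedHashSet<>();
        Map<String, Integer> row = rows.getOrDefault(entry, Collections.emptyMap());
        for (Map.Entry<String, Integer> e : row.entrySet()) {
            if (e.getValue() > 0) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Call depth from ancestor to descendant, or -1 if there is no such link. */
    int depth(String ancestor, String descendant) {
        Integer d = rows.getOrDefault(ancestor, Collections.emptyMap()).get(descendant);
        return d == null ? -1 : d;
    }
}
```

In a relational store the same idea would typically be a table with ancestor, descendant, and depth columns, with the entry lookup becoming a single filtered query.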

The link management module mainly provides link entry selection, link labeling, service exit analysis, and exit Mock configuration. As shown in figure 4, the registered stress test services make up the stress test scope, which determines the boundary of each link. The tree-structured link relationships that the system builds automatically help stress testers sort out the whole link, which addresses problems such as inefficient link mapping by reading through code and the lack of a full-link view.

At the same time, dependent interfaces across the whole stress test scope can be labeled manually as requiring Mock or not, so that link information specific to stress testing is maintained continuously.

For external interfaces that require Mock (such as interface C in figure 4), the service under test gains exit configuration capability by introducing a dedicated Mock SDK. As shown in figure 5, this relies on the basic capabilities of Meituan's hotel and travel Mock platform and uses JVM-Sandbox as the AOP tool to dynamically enhance the external interfaces configured for Mock. When such an interface is called, the SDK determines whether the call is stress test traffic; if so, it follows the Mock logic, simulates the configured delay, and returns the response data configured in advance. This has three benefits. First, exit Mock operations are simplified, with zero intrusion of Mock logic into business code. Second, the two existing solutions, local Mock and Mock via a Mock server, are replaced by a single solution that is easier to manage. Third, during an actual stress test the platform can collect, through the SDK, data on which Mock logic was executed and automatically compare it with the Mock labels recorded in the backend, ensuring that every exit that should be mocked really is mocked.
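The decision made at a mocked exit can be sketched in plain Java. This is not the JVM-Sandbox API and not Meituan's SDK: the class `ExitMockInterceptor`, the `invoke` signature, and the boolean traffic flag are all hypothetical stand-ins for what the real AOP enhancement would do around the intercepted call.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Supplier;

/** Illustrative exit Mock decision logic; all names are hypothetical. */
class ExitMockInterceptor {
    /** Mock configuration for one external interface. */
    private static final class MockRule {
        final long delayMillis;
        final Object response;

        MockRule(long delayMillis, Object response) {
            this.delayMillis = delayMillis;
            this.response = response;
        }
    }

    private final Map<String, MockRule> rules = new ConcurrentHashMap<>();
    private final List<String> executed = new CopyOnWriteArrayList<>();

    /** Registers the simulated delay and canned response for an external interface. */
    void configure(String api, long delayMillis, Object response) {
        rules.put(api, new MockRule(delayMillis, response));
    }

    /** Wraps a call to an external interface; only stress test traffic is mocked. */
    Object invoke(String api, boolean stressTraffic, Supplier<Object> realCall) {
        MockRule rule = rules.get(api);
        if (!stressTraffic || rule == null) {
            return realCall.get(); // normal traffic, or no Mock configured for this interface
        }
        try {
            Thread.sleep(rule.delayMillis); // simulate the latency of the real dependency
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while simulating delay", e);
        }
        executed.add(api); // recorded so it can later be compared with the Mock labels
        return rule.response;
    }

    /** Interfaces whose Mock logic actually ran, in call order. */
    List<String> executedMocks() {
        return executed;
    }
}
```

Keeping the record of executed Mocks next to the decision is what would make the later comparison with the labeled exits possible.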

Design of the data construction module

The data construction module addresses the differences between businesses in how basic data and traffic data are constructed. It introduces two key concepts: data construction logic and data construction flow. Data construction logic is a fine-grained, reusable unit of data construction, expressed as a piece of Java code. The platform provides a unified, abstract data construction interface and, using Java dynamic compilation, a Java script engine that supports online editing and updating of construction logic. In addition, a generic invocation tool built on the generic invocation capability of Meituan's RPC middleware helps users wrap calls to external basic data construction interfaces into a piece of data construction logic.

The data construction flow defines the whole process by which the basic data and traffic data for a stress test are generated. Original, real online data is obtained through interaction with Quake, and a simplified flow engine executes the flow. Within a uniformly defined flow, as shown in figure 6, the whole data construction process is specified by configuring different types of data construction logic, and their execution order, in standard extension slots. Finally, the constructed traffic data is bound to a Quake stress test scenario and serves as the source of replayed traffic in the subsequent Quake stress test.

This design supports arbitrary data construction logic and is both general and flexible. It also integrates Quake's existing traffic recording and runs the data construction flow with one click, which greatly improves efficiency.
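As a rough sketch of how logic units and a flow might fit together, the code below models data construction logic as a small interface and the flow as an ordered list of slots applied to each recorded request. The names `DataConstructionLogic` and `DataConstructionFlow` and the map-based record format are assumptions; the platform's real interface, its dynamic compilation, and its interaction with Quake are not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One reusable unit of data construction (hypothetical interface). */
interface DataConstructionLogic {
    Map<String, Object> apply(Map<String, Object> record);
}

/** Simplified flow engine: runs the configured logic units in slot order. */
class DataConstructionFlow {
    private final List<DataConstructionLogic> slots = new ArrayList<>();

    /** Appends a logic unit to the next slot; returns this for chaining. */
    DataConstructionFlow slot(DataConstructionLogic logic) {
        slots.add(logic);
        return this;
    }

    /** Builds stress test traffic from recorded traffic without modifying the input. */
    List<Map<String, Object>> run(List<Map<String, Object>> recorded) {
        List<Map<String, Object>> constructed = new ArrayList<>();
        for (Map<String, Object> original : recorded) {
            // Work on a copy so the recorded online data stays untouched.
            Map<String, Object> current = new LinkedHashMap<>(original);
            for (DataConstructionLogic logic : slots) {
                current = logic.apply(current);
            }
            constructed.add(current);
        }
        return constructed;
    }
}
```

A flow could then be assembled from lambdas, for example one unit that swaps a real user ID for a test account and another that shifts dates into the future; in the real platform each unit would presumably be a compiled piece of user-edited Java rather than a lambda.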

Design of the stress test verification module

Ensuring stress test safety has always been a difficult point in automation. Past practice relied mostly on manual confirmation by the owners of each service during the stress test, or on pre-stress testing in a non-production environment. This article offers two new angles on stress test verification: the stress-testability of the service under test, and the traffic characteristics of the stress test. From the first angle, a service that supports stress testing must isolate stress test data and traffic. What that requires differs between system ecosystems. For services in Meituan's ecosystem, the conditions include component versions that support stress testing, shadow storage configured as expected, and so on.

These conditions lead to the following static check items:

Verification of the versions of the middleware that the service depends on.

Zebra stress test configuration check.

Cellar / Squirrel stress test configuration check.

Synchronization and verification of the Mafka stress test switch.

Check that the service's Mock logic exists.
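The static check items above can be thought of as named predicates evaluated against facts collected about a service. The sketch below assumes a flat map of facts with made-up keys and a dotted numeric version format; how the real system collects middleware versions and configuration is not described in the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Illustrative static check runner; fact keys and check names are hypothetical. */
class StaticCheckRunner {
    /** Compares dotted numeric versions such as "2.9.10"; missing parts count as 0. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Runs every check against the service facts and returns the names of those that failed. */
    static List<String> run(Map<String, String> facts,
                            Map<String, Predicate<Map<String, String>>> checks) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Predicate<Map<String, String>>> check : checks.entrySet()) {
            if (!check.getValue().test(facts)) {
                failed.add(check.getKey());
            }
        }
        return failed;
    }
}
```

Comparing version parts numerically matters here: a plain string comparison would rank "2.10.0" below "2.9.3".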

The second angle concerns the characteristic data that is produced only under stress test traffic and can be used to ensure safety. There are three main kinds: the stress test mark carried by call links in MTrace, Meituan's distributed tracing system (a normal stress test link should carry the mark all the way to the boundary node of the stress test scope, see figure 4); the execution data reported when an external interface labeled for Mock is called; and the monitoring data specific to stress test traffic collected by the monitoring system. From this data we derive three kinds of dynamic check items, which detect anomalies such as a lost stress test mark or a call reaching an exit that should have been mocked.

MTrace link mark check: starting from the stress test link entry, collect the stress test link information and verify that the stress test mark is propagated as expected.

Figure 8: Schematic of the MTrace link mark check

Service Mock logic check: the enhanced verification logic reports execution information to the platform, where it is compared with the labels in the Mock configuration.

Figure 9: Schematic of the service Mock stress test check

Comparison of the stress test link with the real link: the real link built through the link management module is compared with the stress test link reconstructed from the collected stress test monitoring data.
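The mark propagation check can be illustrated as a walk over a collected call tree: every node inside the stress test scope must carry the mark, and the walk stops at the scope boundary. The `Span` type below is a hypothetical simplification and is not the MTrace data model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Set;

/** Illustrative stress test mark propagation check over a simplified call tree. */
class StressMarkChecker {
    /** One node of a collected call link (hypothetical, not the MTrace model). */
    static final class Span {
        final String service;
        final boolean marked;
        final List<Span> children = new ArrayList<>();

        Span(String service, boolean marked) {
            this.service = service;
            this.marked = marked;
        }

        /** Adds a downstream call and returns this span for chaining. */
        Span call(Span child) {
            children.add(child);
            return this;
        }
    }

    /** Returns the in-scope services that were reached without the stress test mark. */
    static List<String> findUnmarked(Span entry, Set<String> scope) {
        List<String> violations = new ArrayList<>();
        Deque<Span> stack = new ArrayDeque<>();
        stack.push(entry);
        while (!stack.isEmpty()) {
            Span span = stack.pop();
            if (!scope.contains(span.service)) {
                continue; // outside the stress test scope: stop following this branch
            }
            if (!span.marked) {
                violations.add(span.service);
            }
            for (Span child : span.children) {
                stack.push(child);
            }
        }
        return violations;
    }
}
```

A non-empty result would correspond to the lost-mark anomaly described above and could trigger an alarm or termination of the stress test.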

Besides defining the static and dynamic verification rules, we arrange for them to run in two periods: during stress tests and on ordinary days. This spreads the load of stress test verification over ordinary days and also surfaces new risks introduced by code iteration as early as possible.

During a stress test, safety is ensured by the pre-stress test step in the process and by automatic execution of the static and dynamic check items. If verification fails, an alarm is raised and, where the configuration allows it, the stress test plan is terminated directly.

On ordinary days, periodic low-traffic stress test verification is run with finely controlled QPS during load generation, so that any degradation of stress test safety within the stress test scope is found quickly and at the lowest possible cost.

Design of the stress test plan management module

The stress test plan management module lets a stress test plan be set in advance and then automatically schedules and controls the whole load generation process. As shown in figure 11, a stress test plan is a combination of several stress test scenarios and includes information such as the QPS ramp-up schedule. It is divided into two stages: pre-stress test and formal stress test. Executing the plan automatically solves several problems of combined multi-scenario stress tests: operations are time-consuming, the QPS of multiple scenarios cannot change in step, and the tester cannot both operate and observe. This improves efficiency. In addition, in the plan execution state machine, the state can move to the start of the formal stress test only after the pre-stress test has completed normally, which improves safety.

As figure 11 shows, the stress test plan module is the core of the whole automated stress testing system and coordinates the other modules. Events generated by the execution of planned tasks trigger stress test verification, progress broadcasts, and the collection of stress test monitoring and alarm data, which are used to detect whether services are abnormal and, depending on configuration, to terminate the stress test so that losses are cut promptly when something fails. Finally, the report generation module receives the stress test termination event, aggregates the collected information, and automatically generates a stress test report covering multiple dimensions, including the basic information of the stress test, which saves part of the analysis time afterwards.
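The rule that the formal stress test can start only after a pre-stress test has completed normally can be expressed as a small state machine. The states and method names below are assumptions made for illustration; the article does not list the actual states of the plan execution state machine.

```java
/** Illustrative plan execution state machine; states and transitions are hypothetical. */
class StressTestPlan {
    enum State { CREATED, PRE_TEST_RUNNING, PRE_TEST_PASSED, FORMAL_RUNNING, FINISHED, TERMINATED }

    private State state = State.CREATED;

    State state() {
        return state;
    }

    void startPreTest() {
        require(State.CREATED);
        state = State.PRE_TEST_RUNNING;
    }

    /** A failed pre-stress test terminates the plan instead of unlocking the formal stage. */
    void completePreTest(boolean checksPassed) {
        require(State.PRE_TEST_RUNNING);
        state = checksPassed ? State.PRE_TEST_PASSED : State.TERMINATED;
    }

    /** Allowed only after a pre-stress test has completed normally. */
    void startFormalTest() {
        require(State.PRE_TEST_PASSED);
        state = State.FORMAL_RUNNING;
    }

    void finish() {
        require(State.FORMAL_RUNNING);
        state = State.FINISHED;
    }

    /** Emergency stop, for example when verification or fault diagnosis reports a problem. */
    void terminate() {
        if (state != State.FINISHED) {
            state = State.TERMINATED;
        }
    }

    private void require(State expected) {
        if (state != expected) {
            throw new IllegalStateException("expected " + expected + " but was " + state);
        }
    }
}
```

Encoding the precondition in the transition itself means that no caller can skip the pre-stress test, whether the formal stage is started by the scheduler or by hand.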

Case sharing

The following is a case study based on the actual stress test process.

Team / service registration

Set up the virtual team that carries out the stress test and the application services that the stress test covers.

Link management

Select the stress test link entry to obtain the interface link relationship tree below it, which makes the link easier to sort out.

Identify the external interfaces that require Mock and configure them; see the section on the design of the link management module.

Application modification and stress test configuration

Modify the applications being onboarded so that they meet the conditions for a stress-testable service; see figure 7.

The system automatically discovers, from the link information, the middleware configuration that the applications under test depend on, and provides pages for unified configuration and verification.

Quake preparation

The automated stress testing system is built on Quake and depends on it for traffic recording, replay, and load generation. The "flow task" for traffic recording and the "stress test scenario" for stress test execution therefore need to be configured in Quake.

Data construction

Configure the data construction logic. Existing logic units are reusable, so first check whether one already meets your needs.

Configure the data construction flow.

Stress test execution

Set the stress test plan; at the start time, the system starts the stress test automatically.

During the stress test, watch for stress test verification alarms and handle them promptly.

After the stress test, view the stress test report, then record and follow up on the problems found.

Summary and prospect

The automated stress testing system is now in use, and all teams in Meituan's hotel and domestic vacation businesses have been onboarded, which has effectively improved stress testing efficiency. Future work will continue in two main directions. The first is to bring full-link stress testing into capacity evaluation and optimization, attending not only to overall system stability but also to the balance with cost. The second is integration with other areas of stability work, such as fault drills and elastic scaling, so that stress testing plays a role in more scenarios. Through these efforts, we hope to make the stability of online systems something that can be determined with confidence.

References

[1] The practice of the full-link stress testing platform (Quake) at Meituan

[2] Alibaba JVM-Sandbox

[3] Dubbo generic invocation

[4] Java dynamic compilation

About the author

Ou long, R&D engineer at Meituan, joined the company in 2013 and is currently responsible mainly for the stability of domestic vacation transactions.
