High-Availability Practice for Microservices with ChaosBlade and SkyWalking

Updated 2025-04-17. Source: SLTechnology News&Howtos (Shulou.com), 05/31 report.

Preface

In a distributed architecture there are many service components with complex dependencies between them, so it is hard to assess the impact of a single fault on the whole system. Request links are also long: if basic services such as monitoring, alerting, and logging are incomplete, responding to faults and locating them becomes difficult. Building a highly available distributed system is therefore a major challenge. Chaos engineering addresses it by injecting faults into the system within a controlled scope or environment, observing how the system behaves, and uncovering its defects. The aim is to build the system's ability to withstand the chaos caused by unexpected conditions, and confidence in that ability, and so to keep improving its stability and availability.

A chaos engineering exercise consists of drawing up an experiment plan, defining steady-state metrics, forming hypotheses about the system's fault-tolerant behavior, running the experiment, and then checking the steady-state metrics. The process therefore needs two kinds of tooling: a reliable, easy-to-use chaos experiment tool with rich scenarios to inject the faults, and complete distributed tracing and monitoring tools to trigger the emergency-response alerts, locate faults quickly, and observe system metrics throughout. This article introduces the chaos experiment tool ChaosBlade and the distributed system monitoring tool SkyWalking, and uses a microservice case study to share a high-availability practice built on the two.

Tool introduction

1. ChaosBlade

ChaosBlade is a chaos engineering tool that follows the principles of chaos engineering experiments, provides rich fault scenarios, and helps distributed systems improve their fault tolerance and recoverability. It can inject faults at the underlying layers and helps ensure business continuity while a system migrates to the cloud or to a cloud-native architecture. It is simple to operate, non-intrusive, and highly extensible. By injecting faults within a controlled scope or environment, it helps continuously improve the stability and availability of the system.

ChaosBlade is not only easy to use, but also supports rich experimental scenarios, including:

Basic resources: CPU, memory, network, disk, process, and other scenarios

Java applications: databases, caches, messaging, the JVM itself, microservices, and so on; you can also target any method of any class to inject more complex faults

C++ applications: injecting a delay at any method or line of code, tampering with variables and return values, and so on

Docker containers: killing a container, plus CPU, memory, network, disk, and process scenarios inside a container

Cloud-native platforms: CPU, memory, network, disk, and process scenarios on Kubernetes nodes; Pod network scenarios and scenarios on the Pod itself; and container scenarios like the Docker container scenarios above

ChaosBlade packages scenario implementations into separate projects by domain, which both standardizes how scenarios are implemented within a domain and makes it easy to extend scenarios horizontally and vertically. Because every project follows the chaos experiment model, all scenarios can be invoked uniformly through the chaosblade CLI.

2. SkyWalking

SkyWalking is an open-source APM system that provides monitoring, tracing, and diagnostics for distributed systems in cloud-native architectures. Its core features are:

Service, service instance, and endpoint metrics analysis

Root cause analysis

Service topology analysis

Service, service instance, and endpoint dependency analysis

Slow service and endpoint detection

Performance optimization

Distributed tracing and context propagation

Database access metrics, including detection of slow database access statements (SQL statements included)

Alerting

Installation and use of tools

ChaosBlade is easy to install and use. Every ChaosBlade scenario is invoked through the chaosblade CLI: download the tar package for your platform, extract it, and use the blade executable to run chaos experiments.

1. ChaosBlade installation

Our environment here is linux-amd64, so we download the chaosblade-linux-amd64.tar.gz package; the commands below use version 0.9.0. The installation steps are as follows:

# Download
wget https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github/0.9.0/chaosblade-0.9.0-linux-amd64.tar.gz
# Extract
tar -zxf chaosblade-0.9.0-linux-amd64.tar.gz
# Set the environment variable
export PATH=$PATH:chaosblade-0.9.0/
# Test
blade -h
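The installation steps can also be wrapped in a small script parameterized by version. This is only a sketch: the function names are hypothetical, and it assumes that other versions are published under the same URL layout as the 0.9.0 package above, which the article does not confirm.

```shell
#!/bin/sh
# Base URL copied from the download step above.
BLADE_BASE_URL="https://chaosblade.oss-cn-hangzhou.aliyuncs.com/agent/github"

# Print the tarball name for a given version, e.g. 0.9.0.
blade_tarball() {
  printf 'chaosblade-%s-linux-amd64.tar.gz\n' "$1"
}

# Print the full download URL for a given version.
# Assumption: every version uses the same directory layout as 0.9.0.
blade_url() {
  printf '%s/%s/%s\n' "$BLADE_BASE_URL" "$1" "$(blade_tarball "$1")"
}

# Download, extract, and put blade on PATH for the current shell.
install_blade() {
  version="$1"
  wget "$(blade_url "$version")" || return 1
  tar -zxf "$(blade_tarball "$version")" || return 1
  # An absolute path keeps blade reachable after changing directories.
  PATH="$PATH:$(pwd)/chaosblade-$version"
  export PATH
  blade -h >/dev/null
}
```

Calling install_blade 0.9.0 reproduces the four steps above. The only deliberate difference is the absolute PATH entry: the relative chaosblade-0.9.0/ entry works only while the shell stays in the download directory.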

2. ChaosBlade usage

Once ChaosBlade is installed, the blade executable is all you need to create chaos experiments for every supported scenario. Start with blade -h to see the usage; after choosing a subcommand, add -h at each level to see complete examples and detailed parameter descriptions. For example:

1) How to use blade

Run blade -h to see which commands are supported:

An easy to use and powerful chaos engineering experiment toolkit

Usage:
  blade [command]

Available Commands:
  create      Create a chaos engineering experiment
  destroy     Destroy a chaos experiment
  ...

2) Create an experiment scenario

For example, to create a CPU full-load scenario, run blade create cpu fullload -h to view the scenario's parameters, then choose suitable parameters and run the command:

Create chaos engineering experiments with CPU load

Usage:
  blade create cpu fullload

Aliases:
  fullload, fl, load

Examples:
# Create a CPU fullload experiment
blade create cpu load

# Specifies two random kernel's full load
blade create cpu load --cpu-percent 60 --cpu-count 2
...

Flags:
      --blade-release string   Blade release package, use this flag when the channel is ssh
      --channel string         Select the channel for execution, and you can now select SSH
      --climb-time string      durations(s) to climb
      --cpu-count string       Cpu count
      --cpu-list string        CPUs in which to allow burning (0-3 or 1,3)
      --cpu-percent string     percent of burn CPU (0-100)

3) Restore an experiment

ChaosBlade supports three ways to restore an experiment:

When ChaosBlade creates an experiment successfully it returns a UID, and you can run blade destroy <UID>.

If you cannot find the UID, run blade destroy with the target and action instead, for example blade destroy cpu fullload.

Create the experiment with the --timeout 10 parameter, and the scenario is restored automatically after running for 10 seconds. Duration expressions are also supported, such as --timeout 30m for thirty minutes.
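The first restore method can be scripted so that an experiment is always destroyed by its UID. The sketch below assumes that blade create prints a JSON line of the form {"code":200,"success":true,"result":"<UID>"}; check the actual output of your ChaosBlade version before relying on it. The function names blade_uid and run_experiment are hypothetical.

```shell
#!/bin/sh
# Extract the experiment UID from the JSON printed by `blade create`.
# Assumed shape: {"code":200,"success":true,"result":"<UID>"}
blade_uid() {
  printf '%s\n' "$1" | sed -n 's/.*"result":"\([^"]*\)".*/\1/p'
}

# Run one experiment for a fixed number of seconds, then restore it.
# Usage: run_experiment <seconds> <blade create arguments...>
run_experiment() {
  hold="$1"; shift
  out=$(blade create "$@") || { printf '%s\n' "$out" >&2; return 1; }
  uid=$(blade_uid "$out")
  # Without a UID there is nothing to destroy later, so stop here.
  [ -n "$uid" ] || { echo "no UID in output: $out" >&2; return 1; }
  echo "experiment $uid started"
  sleep "$hold"
  blade destroy "$uid"
}
```

For example, run_experiment 30 cpu fullload would hold a CPU full-load experiment for 30 seconds and then destroy it. Passing --timeout to blade create as well gives a second safety net if the script itself is interrupted.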

3. SkyWalking installation & use

With the tools deployed, we now go on the offensive with concrete cases: inject faults, observe system behavior, locate problems, and identify system defects, in order to build a highly available microservice system.

Application fault-tolerance cases

We deploy a microservice application in a daily (test) environment for the experiments and use ab (ApacheBench) to simulate requests. The application's services include the front end, shopping cart, recommendation service, products, orders, and so on, and it uses components such as Spring Boot, Nacos, MySQL, Redis, Lettuce, and Dubbo. ChaosBlade supports most of these components. We use ChaosBlade to inject chaos experiments that verify the application's fault tolerance, and SkyWalking to monitor the application and locate problems.

1. Case environment

Linux amd64, CentOS 7.x distribution

JDK 1.8

2. Application topology

The overall structure of the application is as follows: the front end (frontend) calls the shopping cart (cart), product (product), and other services through Dubbo as strong dependencies.

3. Chaos experiment procedure

Draw up a chaos experiment plan

Define the system's steady-state metrics

Form a hypothesis about the system's fault-tolerant behavior

Run the chaos experiment

Check the steady-state metrics

Record and restore the chaos experiment

Fix the problems found

Automate for continuous verification
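The core of the procedure above (check the steady state, inject, check again, always restore) can be sketched as a small driver function. The hook names steady_state, inject, and restore are hypothetical: each experiment would define them as shell functions, for example around a blade create and a blade destroy call. This is an illustration of the control flow, not part of ChaosBlade.

```shell
#!/bin/sh
# Sketch of one experiment run. The hooks steady_state, inject, and restore
# are hypothetical names that each experiment defines as shell functions.
chaos_experiment() {
  # Never inject into a system that is already unhealthy.
  steady_state || { echo "aborted: not in steady state"; return 2; }
  inject || { echo "aborted: injection failed"; return 3; }
  # Check the steady-state metrics while the fault is active.
  if steady_state; then
    verdict="hypothesis held"; rc=0
  else
    verdict="hypothesis violated"; rc=1
  fi
  # Restore regardless of the verdict, then report it.
  restore
  echo "$verdict"
  return "$rc"
}
```

Returning a distinct exit code for each outcome makes the function usable from a scheduler or CI job, which is one way to approach the last step, automated continuous verification.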

Next, we follow these steps to run real chaos experiments with ChaosBlade.

4. Case one

1) Scenario

Chaos experiment plan: calls to the downstream service are frequently delayed. We use ab to simulate normal access to the shopping cart interface, with 2 concurrent requests and 10000 requests in total.

ab -n 10000 -c 2 http://127.0.0.1:8083/cart

2) Monitoring metrics

Define the system's steady-state metrics by selecting the /cart endpoint in the SkyWalking console. The steady-state metrics are:

The average response time (RT) is about 15 ms.

P99 is within 20 ms.
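The article reads these two numbers from the SkyWalking console. As a client-side cross-check, the same steady-state rule can be sketched over a file of latency samples. Everything here is an assumption for illustration: the function names, the input format (one latency in milliseconds per line), and the nearest-rank definition of P99, which may differ from how SkyWalking computes its percentile.

```shell
#!/bin/sh
# Print "<mean> <p99>" for a file with one latency in milliseconds per line.
# P99 uses the nearest-rank method: the value at position ceil(0.99 * n).
latency_stats() {
  sort -n "$1" | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 99 + 99) / 100)   # integer form of ceil(0.99 * NR)
      printf "%.1f %s\n", sum / NR, v[rank]
    }'
}

# Succeed only if mean <= max_mean and p99 <= max_p99 (both in ms).
# Usage: steady_state_ok <file> <max_mean> <max_p99>
steady_state_ok() {
  file="$1"; max_mean="$2"; max_p99="$3"
  latency_stats "$file" | awk -v mm="$max_mean" -v mp="$max_p99" \
    '{ ok = ($1 <= mm && $2 <= mp) } END { exit !ok }'
}
```

With the metrics above, the check would be steady_state_ok latencies.txt 15 20. Since the article only says the mean is "about" 15 ms, the exact mean threshold is a judgment call.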

3) Expected hypothesis

A call timeout is configured, so client requests are not blocked for a long time.

A service circuit-breaker policy or service degradation is configured.

4) Chaos experiment

The previous section covered the installation and basic usage of ChaosBlade. In this case we use ChaosBlade to inject a delay fault (30 seconds) into calls to the downstream Dubbo shopping cart service. Run blade create dubbo delay -h to view the usage of the Dubbo call-delay command:

Dubbo interface to do delay experiments, support provider and consumer

Usage:
  blade create dubbo delay

Examples:
# Invoke com.alibaba.demo.HelloService.hello() service, do delay 3 seconds experiment
blade create dubbo delay --time 3000 --service com.alibaba.demo.HelloService --methodname hello --consumer

Flags:
      --appname string          The consumer or provider application name
      --consumer                To tag consumer role experiment.
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
      --group string            The service group
  -h, --help                    help for delay
      --methodname string       The methodname
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --process string          Application process name
      --provider                To tag provider experiment
      --service string          The service interface
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds
      --version string          the service version

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

Following the example and the parameter descriptions, we need to inject the delay fault (30 seconds) on the client side of the upstream service. SkyWalking makes it easy to find the Dubbo service information on the call chain. First query the traces for the /cart endpoint and find the Dubbo service in the trace:

Find the trace

Get the protocol details

Click through to view the detailed span information for the Dubbo service. Once you have the Dubbo service URL, you have the parameters ChaosBlade needs to inject the delay into the upstream service. The final parameters are:

--time 30000: delay of 30 s

--service com.alibabacloud.hipstershop.cartserviceapi.service.CartService: the service interface

--methodname viewCart: the service method

--process frontend: the Java process

--consumer: the target is the Dubbo service consumer (client)

Run the command to inject the fault:

blade create dubbo delay --time 30000 --service com.alibabacloud.hipstershop.cartserviceapi.service.CartService --methodname viewCart --process frontend --consumer

5) Monitoring metrics

After injecting the fault, check the system metrics in SkyWalking:

The average response time (RT) is about 2000 ms, and P99 is about 2000 ms.

The /cart API call reports an error: an exception occurs in the com.alibabacloud.hipstershop.cartserviceapi.service.CartService service.

A timeout exception occurs. The timeout is 2000 ms.

Conclusion: the upstream service has a call timeout configured but no circuit-breaker policy, which does not meet expectations.

6) Fix the problem

Configure a service circuit-breaker policy or service degradation.

5. Case two

1) Scenario

While the system is running, the Dubbo service provider loses access to the registry. We simulate this by injecting 100% network packet loss on the registry machine.

2) Monitoring metrics

Define the system's steady-state metric by selecting the service endpoint in the SkyWalking console. The steady-state metric is:

The com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal

3) Expected hypothesis

The upstream service's business is not affected, and neither is the downstream service.

4) Chaos experiment

We inject packet loss on the registry port. Nacos is the Dubbo registry here; its default port is 8848 and the network interface is eth0. The command parameters are:

--interface eth0: the network interface

--percent 100: the packet loss rate (100%)

--local-port 8848: the local port

Run the command to inject the fault:

blade create network loss --interface eth0 --percent 100 --local-port 8848
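Because 100% packet loss on a shared registry is easy to forget about, it can help to wrap the command with input validation and a built-in time limit. The sketch below is one way to do that. The function name and the DRY_RUN variable are hypothetical; the flags are the ones used above plus the --timeout flag described in the restore section.

```shell
#!/bin/sh
# Inject packet loss on the registry port with a built-in time limit.
# Set DRY_RUN=1 to print the command instead of running it.
# Usage: inject_registry_loss <interface> <percent> <port> <timeout-seconds>
inject_registry_loss() {
  iface="$1"; percent="$2"; port="$3"; timeout="$4"
  # Reject anything that is not an integer between 0 and 100.
  case "$percent" in
    ''|*[!0-9]*) echo "percent must be an integer" >&2; return 2 ;;
  esac
  if [ "$percent" -gt 100 ]; then
    echo "percent must be between 0 and 100" >&2; return 2
  fi
  set -- blade create network loss \
    --interface "$iface" --percent "$percent" \
    --local-port "$port" --timeout "$timeout"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Running DRY_RUN=1 first shows the exact command before anything is injected, and the timeout means the registry recovers on its own even if nobody runs blade destroy.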

5) Monitoring metrics

After injecting the fault, select the service endpoint in the SkyWalking console. The steady-state metric is:

The com.alibabacloud.hipstershop.cartserviceapi.service.CartService.viewCart service is normal

Conclusion: the service depends only weakly on the registry and keeps a local cache, which matches the expected hypothesis.

If the application were deployed in a Kubernetes cluster, you could likewise verify that the registry scales horizontally; ChaosBlade also supports Kubernetes cluster scenarios.

6. Try it yourself

In the cases above, we checked whether the service is configured with timeout and circuit-breaker policies, and verified that Dubbo depends only weakly on the registry and that the service keeps a local cache. Eager to try the same on your own system? ChaosBlade offers a wealth of experiment scenarios: it covers basic resources and applications, and is a powerful tool for cloud-native platforms as well. It is easy to use and provides detailed parameters for keeping the blast radius of a fault to a minimum.

Knowledge from reading alone stays shallow, so here is one more small case for you to practice on. Applications talk to relational databases all the time, and when traffic grows quickly the bottleneck often appears on the database side in the form of many slow SQL statements. Without slow-SQL alerts it is hard to find the offending statements and optimize them, so slow-SQL alerting matters. To verify that your application has this capability, ChaosBlade can inject a MySQL slow-SQL fault. Run blade create mysql delay -h to view the usage of the MySQL call-delay command:

Mysql delay experiment

Usage:
  blade create mysql delay

Examples:
# Do a delay 2s experiment for mysql client connection port=3306 INSERT statement
blade create mysql delay --time 2000 --sqltype select --port 3306

Flags:
      --database string         The database name which used
      --effect-count string     The count of chaos experiment in effect
      --effect-percent string   The percent of chaos experiment in effect
  -h, --help                    help for delay
      --host string             The database host
      --offset string           delay offset for the time
      --override                only for java now, uninstall java agent
      --pid string              The process id
      --port string             The database port which used
      --process string          Application process name
      --sqltype string          The sqltype, for example, select, update and so on.
      --table string            The first table name in sql.
      --time string             delay time (required)
      --timeout string          set timeout for experiment in seconds

Global Flags:
  -d, --debug        Set client to DEBUG mode
      --uid string   Set Uid for the experiment, adapt to docker

As you can see, ChaosBlade provides a complete example and supports finer-grained parameters such as SQL type and table name. The command below delays select operations on connections to port 3306 by 10 seconds. When traffic arrives, does your application raise an alert?

blade create mysql delay --time 10000 --sqltype select --port 3306

The command parameters are:

--time 10000: delay of 10 s

--sqltype select: affects only SQL statements of type select

--port 3306: affects only connections on port 3306
