Monitoring and Analysis of Skywalking Micro Service

Updated 2025-07-06. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Reprints of this article must credit the source: WeChat official account EAWorld. Violators will be prosecuted.

Introduction:

After a microservice framework is adopted, the problems of a distributed deployment architecture quickly come to the fore. When services call one another and an error or exception occurs in the business, how do we locate the problem quickly? How do we trace the service call link? How do we analyze and resolve business bottlenecks? This article looks at how to address these questions.

Table of contents:

I. A preliminary study of SkyWalking

II. Service call link monitoring

III. Service performance index monitoring

IV. Service alarm

I. A preliminary study of SkyWalking

Introduction to Skywalking

SkyWalking is an open source application performance monitoring (APM) tool that originated in China. It supports monitoring, tracing, and diagnosis of distributed systems.

It provides the following main features:

Skywalking technical architecture

Overall, SkyWalking (SW) can be divided into four parts:

1. SkyWalking Agent: uses a Java agent to instrument bytecode, collects data non-intrusively, and sends it to the SkyWalking Collector over HTTP or gRPC.

2. SkyWalking Collector: the trace data collector. It integrates, analyzes, and processes the data sent by agents and writes it to the configured storage.

3. Storage: SkyWalking's storage layer, which has changed over time. SW has iterated to version 6.x, which supports Elasticsearch, MySQL, TiDB, and H2 for data storage.

4. UI: a web visualization platform that displays the stored data.

Skywalking Agent configuration

Looking at a component's configuration gives a general picture of what it does, so let's look at SkyWalking's configuration.

Unpack the SkyWalking package and you will find the agent configuration file in the agent/config folder.

SkyWalking supports loading configuration from environment variables. At startup, settings present in environment variables are read first.

agent.namespace: namespace applied to the headers carried across processes. Agents with different namespaces do not recognize each other's headers, so the cross-process link is broken.

agent.service_name: the unique identity of a service (project). This field determines the display name of the service in the SW UI.

agent.sample_n_per_3_secs: client-side sampling, expressed as the number of traces sampled every 3 seconds. The default is -1, which means full sampling.

agent.authentication: security authentication for communicating with the collector. It must match the configuration in the collector.

agent.ignore_suffix: request suffixes whose traces are ignored.

collector.backend_service: the IP and port of the collector that the agent sends data to.

logging.level: agent logging level.

The SkyWalking agent, attached non-intrusively as a Java agent, works with the collector to trace distributed systems and propagate the related context data.
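To make the "environment variables are read first" behavior concrete, here is a minimal sketch that resolves placeholders of the form ${ENV_NAME:default}, the style used in the agent.config file shipped with the 6.x agent. The function names, the sample excerpt, and the environment variable names are assumptions for illustration and should be checked against your own agent.config; this is not the agent's actual loader.

```python
import re

# Matches ${ENV_NAME:default}; the default may itself contain colons (e.g. host:port).
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve_placeholders(value, env):
    """Replace each ${NAME:default} with env[NAME] if set, else the default."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


def load_agent_config(text, env):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, raw = line.partition("=")
        config[key.strip()] = resolve_placeholders(raw.strip(), env)
    return config


# Hypothetical excerpt in the style of agent/config/agent.config.
SAMPLE = """
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}
agent.sample_n_per_3_secs=${SW_AGENT_SAMPLE:-1}
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}
"""

# Only SW_AGENT_NAME is set in this pretend environment.
config = load_agent_config(SAMPLE, {"SW_AGENT_NAME": "service-consumer"})
print(config["agent.service_name"])         # value taken from the environment
print(config["collector.backend_service"])  # default kept from the file
```

Passing the environment in as a plain dictionary keeps the sketch testable; in a real process you would pass os.environ instead.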

Skywalking Collector key configuration

The collector supports cluster deployment. ZooKeeper, Kubernetes (if your application is deployed in containers), and Consul (a service discovery tool written in Go) are the optional cluster management tools for SW; choose one according to how you deploy. For detailed configuration, download the release package from the SkyWalking official website.

Collector port settings

Downsampling: aggregates metric data along the time dimension. Metrics are aggregated by minute, and optionally by hour, day, and month.

Data can be cleaned up automatically by setting the TTL-related configuration items.

SkyWalking simplified its configuration in 6.x. The collector provides two communication modes: gRPC and HTTP.

The UI communicates over REST/HTTP. Agents use gRPC in most scenarios and fall back to HTTP where the language does not support gRPC.

One thing to note about binding the IP and port: once an IP is bound, the agent and the collector must both be configured with that IP in order to communicate.

Collector storage configuration

In the storage module of application.yml, select the type of database to use and fill in its connection settings.

Collector Receiver

Receiver is a new concept introduced in SkyWalking 6.x. It is responsible for receiving metric data from the monitored system. Users can upload custom monitoring data by following the OpenTracing specification. SkyWalking officially provides receivers for service mesh, Istio, and Zipkin.

SkyWalking now supports server-side sampling. The configuration item is sampleRate, a proportion expressed in units of 1/10000: a value of 5000 gives a sampling rate of 50%.

A note about sampling settings

A suggestion on server-side sampling: if the collector is deployed as a cluster, for example Acollector and Bcollector, set Acollector.sampleRate = Bcollector.sampleRate. Data may be lost if the sampling rates differ.

Suppose the agent side sends all of its data to the back-end collectors, with A's sampling rate set to 30% and B's set to 50%.

Suppose 30% of the data is sent to A and all of it is accepted and stored. In the extreme case where this exactly matches the amount A expects to sample, everything is fine as long as the remaining 20% of the data to be sampled is sent to B. If part of that 20% is sent to A instead, A ignores it, and that data is lost.
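The effect of mismatched rates can be simulated with a small sketch. It assumes that each collector decides per trace by hashing the trace ID into 10000 buckets and keeping the trace when the bucket number is below sampleRate; that rule, the CRC32 hash, and all names below are assumptions made for illustration, not the collector's confirmed implementation.

```python
import zlib

BUCKETS = 10000  # sampleRate is expressed in units of 1/10000


def keeps(trace_id, sample_rate):
    """Assumed rule: keep the trace if its hash bucket is below sample_rate."""
    bucket = zlib.crc32(trace_id.encode("utf-8")) % BUCKETS
    return bucket < sample_rate


def classify(trace_id, rate_a, rate_b):
    """Outcome for a trace whose segments are split between collectors A and B."""
    kept_a, kept_b = keeps(trace_id, rate_a), keeps(trace_id, rate_b)
    if kept_a and kept_b:
        return "complete"
    if kept_a or kept_b:
        return "partial"  # one collector stored its segments, the other dropped them
    return "dropped"


def summarize(rate_a, rate_b, n=20000):
    """Count outcomes over n synthetic trace IDs."""
    counts = {"complete": 0, "partial": 0, "dropped": 0}
    for i in range(n):
        counts[classify(f"trace-{i}", rate_a, rate_b)] += 1
    return counts


print(5000 / BUCKETS)         # sampleRate 5000 as a fraction
print(summarize(3000, 5000))  # mismatched rates: some traces are only partly stored
print(summarize(5000, 5000))  # equal rates: no trace is partly stored
```

Under this assumed rule, a trace that falls between the two thresholds is kept by B but dropped by A, which is the kind of loss the recommendation above is meant to avoid.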

II. Service call link monitoring

Service Topology monitoring

Call link monitoring can be viewed from two perspectives. Let's first look at the monitored system as a whole.

After a probe is added to each service and real calls are made, we can view the call relationships between services in the SkyWalking front-end UI.

We simulate a simple call between services: create two new services, service-provider and service-consumer, and have one call the other remotely through a Feign client.

You can see from the figure:

There are two service nodes: provider & consumer

There is a database node: localhost [MySQL]

A registry node

Consumer consumes the interface provided by provider.

A system's topology diagram gives a clear picture of the dependencies between applications and of how business traffic currently flows. If you look carefully, you may notice that part of the consumer node's icon is red. What does the red mean?

Red indicates that some requests flowing through the consumer node received abnormal responses during the selected period. When a node turns entirely red, the service is completely unavailable at that time. Operations staff can use the topology to spot a potential problem in a service quickly, then investigate further and prevent it.

Skywalking Trace monitoring

SkyWalking performs dependency analysis by monitoring business calls, giving us the call topology between services and the trace records for each endpoint.

Earlier we saw an error on the consumer node. Let's find out where and why it occurred.

Each trace record shows the time of the request, the GlobalId, and how long the call took. Let's look at a successful call and a failed call in turn.

Trace call link monitoring

The figure shows a normal response that took 19 ms in total and consists of four spans:

Span1 /getStore = 19 ms, the total time of the whole response flow

Span2 /demo2/stores = 14 ms, the total response time measured from when the Feign client starts invoking the remote service

Span3 /stores = 14 ms, the total response time of the interface service

Span4 Mysql = 1 ms, the time the service provider spends querying the database

Span2 and Span3 show the same duration here, but the actual values differ; the times shown have been rounded to whole milliseconds.
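The four durations can be turned into the time spent inside each span itself by subtracting the duration of its direct children. The sketch below assumes the spans are strictly nested in the order listed (each span is the only child of the one before it); the Span class and helper names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    duration_ms: int
    children: list = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, excluding its direct children."""
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield the span and all of its descendants, parents first."""
    yield span
    for child in span.children:
        yield from walk(child)


# The trace from the example, using the rounded durations shown in the UI.
mysql = Span("Mysql", 1)
stores = Span("/stores", 14, [mysql])
feign = Span("/demo2/stores", 14, [stores])
root = Span("/getStore", 19, [feign])

for span in walk(root):
    share = 100 * span.self_time_ms() / root.duration_ms
    print(f"{span.name}: self {span.self_time_ms()} ms ({share:.0f}% of trace)")
```

With the rounded figures this gives 5 ms for /getStore, 0 ms for /demo2/stores, 13 ms for /stores, and 1 ms for the database query. The same subtraction is how the share of time consumed by one service in the whole chain can be worked out by hand.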

You can view the relevant properties of the current Span in each Span.

Component types: SpringMVC, Feign

Span status (is error): false

HttpMethod: GET

Url: http://192.168.16.125:10002/demo2/stores

This is the trace record of a normal request. We may not care much about the timings of normal calls; after all, everything working normally is exactly what we expect.

Let's take a look at what our Trace and Span look like under abnormal conditions.

In the call chain where the error occurred, the is error flag on the failing span becomes true, and the specific cause can be seen in the tab named Logs. From the exception details we can easily pin down what is affecting the business and resolve it quickly.

If the log shows that the connection was refused, there may be a network problem (unlikely, since with a network problem we might not even see the trace), or the server may be configured in a way that prevents the connection from being established. The exception log leads us quickly to the crux of the problem.

In fact, I had stopped the server to run a simple simulation. The topology diagram shows clearly which of many services has a problem, and the trace log lets us locate and fix the problem in the shortest possible time.

III. Service performance index monitoring

SkyWalking can also show performance metrics for a specific service. From these metrics we can analyze system bottlenecks and propose optimizations.

Skywalking performance monitoring

Click a node in the service call topology diagram and we can see the following:

SLA: service availability (mainly calculated by the number of successful and failed requests)

CPM: number of calls per minute

Avg Response Time: average response time
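As a rough illustration of how these three figures relate to raw request data, the sketch below computes them for one time window. The Call record and the formulas are assumptions based on the descriptions above (SLA as the share of successful requests); SkyWalking's own aggregation may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class Call:
    latency_ms: float
    success: bool


def summarize_calls(calls, window_minutes):
    """Return SLA (%), CPM, and average response time (ms) for one window."""
    if not calls or window_minutes <= 0:
        raise ValueError("need at least one call and a positive window")
    succeeded = sum(1 for c in calls if c.success)
    return {
        "sla_percent": 100.0 * succeeded / len(calls),
        "cpm": len(calls) / window_minutes,
        "avg_response_ms": sum(c.latency_ms for c in calls) / len(calls),
    }


# Four calls observed over two minutes; the slow one failed.
calls = [Call(10, True), Call(20, True), Call(30, True), Call(1000, False)]
print(summarize_calls(calls, window_minutes=2))
```

For this sample the SLA is 75%, the CPM is 2, and the average response time is 265 ms. A single slow failure pulls the average up sharply, which is why it helps to read the three metrics together.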

From the perspective of the application as a whole, we can monitor the application within a certain period of time.

Service availability indicator SLA

Average number of responses per minute

Average response time

Service process PID

IP, HostName, Operation System of the physical machine where the service resides

Service JVM information monitoring

You can also monitor the CPU, heap memory, and non-heap memory usage and the GC activity of the service at runtime. This information comes from the JVM. Note that it does not describe the machine itself.

IV. Service alarm

As mentioned earlier, problems can be located through the topology diagram and call links, but operators cannot watch the data all the time. We need an alarm capability that proactively prompts us to check the system status when exceptions reach a certain threshold.

SkyWalking 6.x adds alarms on service status. Through webhooks, we can customize how alarm information is delivered to us, for example by email, WeChat, or SMS.

Skywalking service alarm

First, let's look at how alarm rules are configured. Alarm rules are defined in alarm-settings.yml and can be customized.

An alarm configuration consists of the following parts:

service_resp_time_rule: the alarm rule name, in the form **_rule (the name can be customized but must end with '_rule')

indicator-name: the name of the metric; for definitions see http://t.cn/EGhfbmd

op: the operator: >, <, or = (you can of course extend it with operators of your own)

threshold: the target value for the metric. In the sample, 1000 is a service response time, so with the > operator the rule matches when the service response time exceeds 1000 ms.

period: the alarm check period, that is, the time window over which metric data is checked against the rule

count: how many times the threshold must be reached before an alarm is raised

silence-period: how long the same alarm is suppressed after it has been raised

message: the alarm message

webhooks: the addresses of the services that receive alarm notifications
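How these fields interact can be sketched with a toy rule evaluator. It assumes one metric value arrives per minute, that period is a sliding window of that many values, and that an alarm fires when at least count values in the window satisfy the operator, after which the rule stays quiet for silence-period checks. The AlarmRule class, the sample values, and the message text are hypothetical and do not reproduce SkyWalking's alarm core.

```python
import operator
from collections import deque

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


class AlarmRule:
    def __init__(self, name, op, threshold, period, count, silence_period, message):
        if not name.endswith("_rule"):
            raise ValueError("rule name must end with '_rule'")
        self.name = name
        self.matches = OPS[op]
        self.threshold = threshold
        self.count = count
        self.silence_period = silence_period
        self.message = message
        self.window = deque(maxlen=period)  # assumed: one metric value per minute
        self.silent_for = 0

    def check(self, value):
        """Feed one per-minute value; return the message if an alarm fires."""
        self.window.append(value)
        if self.silent_for > 0:
            self.silent_for -= 1
            return None
        hits = sum(1 for v in self.window if self.matches(v, self.threshold))
        if hits >= self.count:
            self.silent_for = self.silence_period
            return self.message
        return None


rule = AlarmRule("service_resp_time_rule", ">", 1000, period=10, count=3,
                 silence_period=5, message="Response time of service is over 1000ms")
samples = [1200, 1500, 900, 1300, 1400, 1100]  # response times in ms, one per minute
fired = [minute for minute, value in enumerate(samples) if rule.check(value)]
print(fired)
```

In this run the third value above 1000 ms arrives at minute 3, so the alarm fires once there and the following minutes fall inside the silence period.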

SkyWalking uses an HttpClient to call the alarm notification service addresses defined in the webhooks configuration item.

Once we know the data format that SW sends, we can receive and process the alarm information and build whatever notification service we need.
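A receiving service can be very small. The sketch below assumes the webhook is an HTTP POST whose body is a JSON array of alarm objects with the fields scopeId, name, id0, id1, alarmMessage, and startTime (epoch milliseconds); that shape is recalled from the 6.x alarm documentation and should be verified against the version you run. The handler class, the serve helper, the port, and the sample payload are all hypothetical.

```python
import json
from datetime import datetime, timezone
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_alarms(body):
    """Turn the posted JSON array into one human-readable line per alarm."""
    lines = []
    for alarm in json.loads(body):
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f"[{started:%Y-%m-%d %H:%M:%S} UTC] {alarm['name']}: {alarm['alarmMessage']}"
        )
    return lines


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        for line in format_alarms(self.rfile.read(length)):
            print(line)  # replace with email, WeChat, or SMS delivery
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Blocking helper; the port is an arbitrary choice for this sketch."""
    HTTPServer(("", port), AlarmHandler).serve_forever()


# Hypothetical payload in the assumed format.
sample = (
    b'[{"scopeId": 1, "name": "service-provider", "id0": 12, "id1": 0, '
    b'"alarmMessage": "Response time of service is over 1000ms", '
    b'"startTime": 1560524171000}]'
)
print(format_alarms(sample))
```

Calling serve() starts the listener; its address is then what would go into the webhooks list of the alarm configuration. Keeping the parsing in format_alarms, separate from the HTTP handler, makes it easy to test without opening a socket.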

To test this, we stop one service and make the interface exposed by another service sleep for a while, then call it a number of times and observe the service's status information and alarms.

Summary:

This article used SkyWalking's configuration to give a first look at what SkyWalking can do, with a brief explanation of its new concepts and features to help everyone understand and use it. With an APM tool, we can easily see system bottlenecks and performance problems in a microservice architecture.

Selected questions:

Q1: When choosing a tool, should I use Pinpoint or SW?

A: On tool selection:

1. Consider your specific business scenario, such as whether your code runs on Java, PHP, .NET, or something else.

2. Pinpoint is slightly more complex to install and deploy than SkyWalking.

3. Pinpoint and SW support different lists of components. You can refer to the SW support list at https://github.com/apache/incubator-skywalking/blob/master/docs/en/setup/service-agent/java-agent/Supported-list.md and compare it with what Pinpoint supports.

4. In tests, SW has shown better throughput than Pinpoint under high concurrency.

Q2: Are there metrics such as the top 10 requests for a URL or the 10 slowest requests? What about the percentage of time a service consumes within the entire chain?

A: 1. SW provides top 10 statistics for the slowest requests across all endpoints.

2. SW does not compute top 10 statistics per URL itself, but the data is all there, and a simple search will give you the result you want.

3. There is no ready-made percentage of time consumed, but the total time of the link and the time spent in each service are both available, so you can calculate the proportion yourself. See the explanation of span times under call link monitoring in the slides.

Q3: Can you tell us more about how it is used in your system?

A: In the EOS8LA version, we integrated SW to provide monitoring of topology, call links, and performance metrics for applications, and we added a system dimension on top of the SW data.

When there are very many services, the overall topology becomes a dense spider web. With the system dimension, we can select just the applications under a specific system.

The SW in 8LA is version 5.0.0alpha. Because of the limits of that SW version, we do not provide alarms yet; that is a goal for the future.

Q4: Our business access logs are about 100 GB per day, deployed in a Kubernetes environment. Is SW stable to use?

A: There is no need to store monitoring data for a long time unless you have specific requirements. The data is time-sensitive, and you can set a TTL to remove outdated information automatically. An ES cluster can easily handle a volume of 100 GB.

Q5: Are there any advantages over Pinpoint?

A: 1. Easier to deploy and use.

2. More features are supported.

3. Better performance under high concurrency.

Q6: SkyWalking's intrusive tracing makes it easy to trace the service chain of a single service. But is there an overall design for tracing the whole service chain across multiple servers and multiple projects?

A: SW is by nature a tool for tracing distributed systems, and it is non-intrusive. It does not matter how many servers your application is deployed on.

Q7: Application performance degrades after the agent is added. Do you have any solutions?

A: Some performance degradation is inevitable, but as far as I know, and according to the official tests, the impact is very low. Here is SW's test data for reference: https://skywalkingtest.github.io/Agent-Benchmarks/README_zh.html

Q8: Can I use SW if we have heterogeneous systems?

A: As long as a SkyWalking probe supports it, it should be possible.

Q9: How well does SW support commercial web middleware such as BES, TongWeb, WebSphere, and WebLogic?

A: Support for commercial components is limited. Because of licensing issues, the SW project team needs the vendors' cooperation to report data. As far as I know, the support is not very good.
