Example Analysis of Monitoring in Internet

2025-03-31 Update From: SLTechnology News&Howtos


This article presents "Example Analysis of Monitoring in the Internet". The content is kept simple and clear, and I hope it helps resolve your doubts. Let me lead you through it.

I. The Origin of Monitoring Requirements

When a program is delivered and deployed to a production environment, the longest part of its life cycle begins. People need to know whether everything in production is working normally, and the need for monitoring arises naturally.

Over the course of the Internet's development, a large number of monitoring tools and systems have emerged, such as Ganglia, Zabbix, RRDtool, and Graphite, each answering the question "is it normal?" at a different level.

Monitoring itself, in both the industry's understanding of it and the capabilities of monitoring tools and systems, has been evolving in two directions:

From black box to white box

From resources to business

At this stage the vision for monitoring was already clear; what differed was how to put it into practice.

That changed in 2011, when Etsy disclosed their monitoring practices on their blog, using StatsD (which they had open-sourced) to collect, store, and analyze data at both the resource and the business level in a very simple, unified way. Later monitoring systems, especially metrics-based ones, were mostly inspired and influenced by StatsD.
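Part of what made StatsD so influential is how simple its wire protocol is: plain text over UDP. A minimal sketch in Python follows; the metric names are illustrative, and 8125 is StatsD's conventional default port.

```python
import socket

def statsd_packet(name, value, metric_type):
    """Format one metric in the StatsD wire protocol: <name>:<value>|<type>.
    Types include "c" (counter), "ms" (timer), and "g" (gauge)."""
    return f"{name}:{value}|{metric_type}".encode()

def send_metric(sock, addr, name, value, metric_type):
    # StatsD listens on UDP, so sending is fire-and-forget and never
    # blocks the application being measured.
    sock.sendto(statsd_packet(name, value, metric_type), addr)

if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Illustrative business-level metrics alongside resource-level ones.
    send_metric(sock, ("127.0.0.1", 8125), "login.success", 1, "c")
    send_metric(sock, ("127.0.0.1", 8125), "login.latency_ms", 42, "ms")
```

Because the client is this thin, instrumenting both resource and business metrics becomes a one-line affair, which is much of the reason the approach spread.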

II. The Emergence of Observability

In the field of Internet engineering, Twitter was probably the first organization to put forward observability. In a series of blog posts, Twitter described its observability technology stack, including Zipkin, its open-source implementation of Google's Dapper.

As mentioned in the preface, this article does not dwell on the containment relationships among these terms.

Despite the debate over terminology, the biggest change observability brings compared with past monitoring is that the data a system needs to handle has expanded from metrics alone to a much wider range. Taken together, several types of data are commonly regarded as the pillars of observability:

Metrics

Logging

Tracing

Events

Therefore, a modern monitoring system, or observability engineering system, must be able to store all of the above data types properly.

III. Storage

Metrics

Metrics are usually numeric time series data. This requirement is so widespread that a dedicated database subclass, the time series database (TSDB), has emerged to serve it.

TSDBs have undergone important evolution in the following aspects:

Data model. Descriptive information is stripped out of the metric name and expressed as tags; modern TSDBs have generally adopted the tag-based data model.

Data types. From simple numeric records to richer types such as gauge, counter, and timer for different scenarios.

Index structure. The index structure is closely tied to the data model; in modern tag-oriented TSDBs, the inverted index has become the mainstream structure.

Data storage. From the era of RRDtool writing ring buffers to files, to OpenTSDB encoding and decoding data itself on top of an underlying database, to the time series compression algorithm proposed by Facebook; implementations usually combine several techniques and adopt different schemes for different data types.
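The core of Facebook's compression idea can be sketched in a few lines. This is a simplified delta-of-delta encoder in the spirit of the Gorilla paper, without the bit-level packing a real TSDB would add: metrics scraped at a regular interval compress to long runs of zeros.

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp raw, then the
    change in successive deltas. Regular intervals yield mostly zeros."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev_delta, prev = 0, timestamps[0]
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # delta-of-delta
        prev_delta, prev = delta, t
    return out

def decode_timestamps(encoded):
    """Inverse of encode_timestamps: accumulate deltas back into timestamps."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out
```

For a series scraped every 60 seconds, `[1000, 1060, 1120, 1180]` encodes to `[1000, 60, 0, 0]`; a real implementation then spends very few bits on those zeros.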

We will cover Metrics storage, that is, the research and evolution of TSDBs, in a separate article.

Logging

Logging is usually the most direct way for engineers to locate problems in production. Log processing has evolved in the following aspects:

Centralized storage and retrieval. So that engineers do not have to log on to individual machines to view logs, logs are collected uniformly, stored centrally in a log service, and exposed through a unified retrieval service. This process involves issues such as unifying log formats, parsing, and structuring.

Log monitoring.

Keywords in the raw text, such as error or fatal, most likely indicate that an error of concern has occurred.

Metrics extracted from logs: the large amounts of data carried in access logs, for example, can be distilled into useful information. As for the means of extraction, some schemes parse logs locally on the client, while others parse them during centralized ingestion.
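The extraction step above can be sketched as follows. The log format and regular expression here are hypothetical stand-ins; a real deployment would match whatever its access logs actually look like.

```python
import re
from collections import Counter

# Hypothetical nginx-style access-log line; adapt the pattern to your logs.
LOG_LINE = re.compile(
    r'"(?P<method>\w+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) \S+ (?P<ms>\d+)ms'
)

def extract_metrics(lines):
    """Distill raw access-log lines into metrics: request counts per
    status class (2xx/4xx/5xx) and total latency for averaging."""
    status_classes = Counter()
    latency_total_ms = 0
    for line in lines:
        m = LOG_LINE.search(line)
        if not m:
            continue  # unparseable lines are skipped, not fatal
        status_classes[m.group("status")[0] + "xx"] += 1
        latency_total_ms += int(m.group("ms"))
    return status_classes, latency_total_ms
```

Whether this code runs in an agent on each host or inside the central ingestion pipeline is exactly the design choice the paragraph above describes.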

Tracing

With the increasing complexity of Internet engineering, especially under the trend toward microservices, distributed tracing is usually the most important means of understanding and locating system faults.

At the storage level, tracing already has relatively mature solutions. Both OpenZipkin and CNCF's Jaeger provide almost out-of-the-box backend software, including, of course, storage.

The storage design of tracing mainly considers the following:

1. Sparse data. Tracing data is usually sparse, for several reasons:

Different businesses usually have different trace paths, that is, different spans, so the data is sparse across businesses.

Traces of the same business take different paths under different internal and external conditions; for example, whether the database is accessed or the cache is hit produces different span chains.

Normal and abnormal requests produce different spans.

2. Multi-dimensional queries. The usual solutions:

Secondary indexes: common in schemes based on HBase and Cassandra.

Introducing an inverted index: when the secondary-index scheme cannot satisfy all query requests, an auxiliary Elasticsearch index may be introduced to improve query flexibility.
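A minimal sketch of the inverted-index idea, assuming tags are simple key=value strings and trace IDs are opaque: each tag maps to the set of trace IDs carrying it, so a multi-dimensional query becomes a set intersection over the relevant postings lists.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy inverted index over trace tags. Real systems (Elasticsearch,
    or a TSDB's tag index) add compression, persistence, and scoring."""

    def __init__(self):
        self.postings = defaultdict(set)  # "key=value" -> {trace_id, ...}

    def add(self, trace_id, tags):
        for k, v in tags.items():
            self.postings[f"{k}={v}"].add(trace_id)

    def query(self, **tags):
        # Intersect the postings list of every requested tag.
        sets = [self.postings[f"{k}={v}"] for k, v in tags.items()]
        return set.intersection(*sets) if sets else set()
```

Because any combination of tags resolves to the same intersection machinery, no query dimension has to be chosen in advance, which is exactly why the structure also suits event storage below.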

Events

Events is also a term that is hard to define but easy to describe. We call changes such as a deployment, a configuration change, or a DNS switch events.

They usually mean changes to the production environment, and failures are usually caused by inappropriate changes.

The processing of events mainly includes:

Centralized storage: events come in many kinds, so it is hard to generalize their common query dimensions in advance; an inverted index is therefore a very suitable storage structure for this scenario, where the query dimensions cannot be determined ahead of time.

Dashboards: query and display events in an appropriate way. The Etsy blog mentioned above shows good practice: through a dashboard, engineers could easily confirm the relationship between website login failures and deployment events for the login module.

That is all of "Example Analysis of Monitoring in the Internet". Thank you for reading, and I hope the content has been helpful.
