This article describes how Uber built its alerting (early warning) ecosystem: two in-house alerting systems, a shared notification pipeline, and the deduplication platform that ties them together.
Uber's software architecture consists of thousands of microservices that enable teams to iterate quickly and support our company's global growth. These microservices support a variety of solutions, such as mobile applications, internal and infrastructure services, and products, along with complex configurations of these products that vary by city and sub-city.
To maintain our growth and architecture, Uber's Observability team built a robust, scalable metrics and alerting pipeline that detects, mitigates, and notifies engineers of problems with our services as soon as they occur. Specifically, we built two in-datacenter alerting systems, called uMonitor and Neris, which feed into the same notification and alerting pipeline. uMonitor is our metrics-based alerting system that runs checks against our metrics database, M3, while Neris primarily handles alerts originating from host-level infrastructure.
Both Neris and uMonitor share a common pipeline for sending notifications and deduplication. We will delve into these systems and discuss our move toward richer mitigation actions, our new alert deduplication platform Origami, and the challenges of creating alerts with a high signal-to-noise ratio.
In addition, we have developed a blackbox alerting system that detects high-level outages from outside our data centers, in case internal systems fail or a data center goes down entirely. That setup will be discussed in a future blog post.
Alerting system
In our alerting architecture, services send metrics to M3. uMonitor runs metrics-based alert checks against M3. Host check results are sent to Neris for aggregation and alerting. Blackbox probes the API infrastructure from outside Uber.
At Uber's scale, monitoring and alerting require thinking beyond traditional off-the-shelf solutions. Alerting at Uber began with Nagios, issuing Graphite threshold checks against metrics using source-controlled scripts. Scalability problems with our Carbon metrics cluster led us to build our own large-scale metrics platform, M3. To improve the availability of the alerting system, we developed uMonitor, our in-house alerting system built on time-series metrics stored in M3. For metrics not stored in M3, we built Neris to perform host-level alert checks.
uMonitor is built with flexibility and a diversity of use cases in mind. Some alerts are generated automatically on standard metrics, such as endpoint errors and CPU/memory consumption. Other alerts are created by individual teams for metrics specific to their needs. We built uMonitor as a platform that handles these different use cases, in particular:
Easy alert management: iterating on alert queries and thresholds to find the right functions and values
Flexible actions: notifications such as paging, email, and chat, plus support for automated mitigations such as rolling back deployments and configuration changes
Handling high cardinality: being able to alert on critical issues at a fine granularity without burying teams in notifications during an outage
Handling metrics-based alerts with uMonitor
uMonitor has three separate components: a storage service with an alert management API that wraps our Cassandra store for alert definitions and state; a scheduler that keeps track of all alerts and dispatches an alert check task to the workers every minute for each alert; and workers that execute the alert checks against the underlying metrics defined by each alert.
The workers maintain alert check state in the Cassandra store and ensure, via an aggressive retry mechanism, that notifications are sent at least once. The workers are also responsible for re-firing notifications at regular intervals (typically every hour) so that an alert keeps paging until it is resolved. Currently, uMonitor has 125,000 alert configurations, checking 700 million data points per second across 1.4 million time series.
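As a rough illustration of this split of responsibilities, the Go sketch below shows a scheduler fanning out per-alert check tasks to a pool of workers once a minute. The type and names are hypothetical stand-ins, not uMonitor's actual code; a real worker would query M3, apply thresholds, and persist check state to Cassandra.

```go
package main

import (
	"fmt"
	"time"
)

// Alert is a simplified, hypothetical stand-in for a uMonitor alert definition.
type Alert struct {
	ID    string
	Query string // Graphite or M3QL query evaluated against M3
}

// scheduler fans out one check task per alert every minute.
func scheduler(alerts []Alert, tasks chan<- Alert) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		for _, a := range alerts {
			tasks <- a
		}
	}
}

// worker pulls check tasks off the queue; the real system would query M3,
// apply thresholds, and persist check state to Cassandra.
func worker(id int, tasks <-chan Alert) {
	for a := range tasks {
		fmt.Printf("worker %d checking alert %s (%s)\n", id, a.ID, a.Query)
	}
}

func main() {
	alerts := []Alert{{ID: "api-error-rate", Query: "sumSeries(api.errors.*)"}}
	tasks := make(chan Alert, len(alerts))
	for i := 0; i < 4; i++ {
		go worker(i, tasks)
	}
	scheduler(alerts, tasks) // blocks, dispatching a round of checks every minute
}
```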
An alert definition has an M3 query (Graphite or M3QL) and thresholds that determine whether the alert is in violation. The query returns one or more time series from M3, and the thresholds are applied to each underlying series. If the query violates a threshold, an alert action is dispatched. With the help of the state stored in Cassandra, the workers maintain a state machine that ensures notifications are sent at least once when an alert fires, resent periodically while it keeps firing, and resolved once the issue is mitigated.
There are two kinds of thresholds: static thresholds and anomaly thresholds. We generally use static thresholds for metrics that have a clear steady state, or where we can construct queries that return consistent values, for example by computing success/failure percentages. For periodic metrics such as per-city trip counts and other business metrics, uMonitor uses our anomaly detection platform, Argos, to generate dynamic thresholds that represent out-of-bounds values based on historical data.
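The following minimal Go sketch shows how a static-threshold check might drive such a per-series state machine. The types and transitions are illustrative assumptions rather than uMonitor's implementation; an Argos-style anomaly threshold would simply replace the static constant with a dynamically computed bound.

```go
package main

import "fmt"

// AlertState models the simplified alert lifecycle described above: a firing
// alert notifies at least once, re-notifies periodically while it keeps
// firing, and resolves once the underlying metric recovers.
type AlertState int

const (
	OK AlertState = iota
	Firing
	Resolved
)

// staticViolation checks one time series returned by the alert query against
// a static upper threshold; each series in the result set is evaluated
// independently.
func staticViolation(points []float64, threshold float64) bool {
	for _, v := range points {
		if v > threshold {
			return true
		}
	}
	return false
}

// step advances the per-series state machine and reports whether a
// notification should be sent for this transition. Periodic re-notification
// while Firing is omitted for brevity.
func step(state AlertState, violating bool) (AlertState, bool) {
	switch {
	case state != Firing && violating:
		return Firing, true // first notification for this incident
	case state == Firing && !violating:
		return Resolved, true // send a resolution notice
	default:
		return state, false
	}
}

func main() {
	series := []float64{0.2, 0.9, 1.3}
	state, notify := step(OK, staticViolation(series, 1.0))
	fmt.Println(state, notify) // prints "1 true": the alert is Firing and should notify
}
```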
Handling host alerts with Neris
Neris is our internal host-based alerting system, designed for high-resolution, high-cardinality per-host metrics that are not available in our M3 metrics system. Host metrics are not in M3 for two reasons. First, checking the roughly 1.5 million host metrics generated every minute across the 40,000 hosts in each data center is more efficient to do on the hosts themselves than in the central metrics store, and doing so avoids the overhead of ingesting and storing those metrics. Second, until recently, M3's retention policy stored 10-second metrics for 48 hours and one-minute metrics for 30 days, and host metrics did not warrant that retention and resolution. Because Nagios required code to be written and deployed for each check and could not scale as our infrastructure grew, we decided to build a system in-house.
The Neris agent runs on every host in our data centers and performs alert checks on the host itself every minute. The agent then sends check results to an aggregation tier, which in turn forwards the aggregated results to Origami. Origami decides which alerts to send based on rules that consider the number of failing checks and the severity of the underlying alerts. Using Origami, Neris runs roughly 1.5 million checks per minute across the hosts in each data center.
When the agent starts on a host, Neris pulls the host's alert definitions from a central configuration store called Object Config, which is widely used by Uber's low-level infrastructure services. Which alerts run on a given host depends on its role: a host running Cassandra, for example, has checks on Cassandra status, disk usage, and other metrics. Most of these host-level checks are created and maintained by infrastructure platform teams.
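A rough Go sketch of the agent's behavior might look like the following. The check names and the role lookup are made up for illustration; in the real system, the definitions come from Object Config and the results flow to the aggregation tier and then Origami.

```go
package main

import (
	"fmt"
	"time"
)

// HostCheck is a hypothetical per-host check definition, of the kind the
// agent would load from a central config store (Object Config in this post).
type HostCheck struct {
	Name string
	Run  func() (ok bool, detail string)
}

// checksForRole picks which checks apply to this host; a Cassandra host,
// for example, gets Cassandra health checks in addition to generic ones.
func checksForRole(role string) []HostCheck {
	checks := []HostCheck{
		{Name: "disk-usage", Run: func() (bool, string) { return true, "41% used" }},
	}
	if role == "cassandra" {
		checks = append(checks, HostCheck{
			Name: "cassandra-status",
			Run:  func() (bool, string) { return true, "UN (up/normal)" },
		})
	}
	return checks
}

func main() {
	checks := checksForRole("cassandra")
	for range time.Tick(time.Minute) {
		for _, c := range checks {
			ok, detail := c.Run()
			// The real agent forwards results to the aggregation tier,
			// which hands summaries to Origami for deduplication.
			fmt.Printf("check=%s ok=%v detail=%s\n", c.Name, ok, detail)
		}
	}
}
```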
Dealing with high cardinality
High cardinality has always been the biggest challenge for our alerting platform. Traditionally we handled it by having an alert query return multiple series and firing only under simple rules, such as when a certain percentage of the series violate the threshold. uMonitor also allows users to make alerts dependent on other alerts: an alert tracking a narrowly scoped problem can depend on one tracking a broader problem, and the dependent alert is suppressed when the broader alert fires.
These techniques work well as long as the query returns a limited number of series and the dependencies are easy to define. But as Uber has grown to operate many different product lines across hundreds of cities, the cardinality challenge has called for a more general solution. We use Origami to help handle high cardinality: Neris uses it as its primary deduplication and notification engine, and it enables rolled-up notifications for uMonitor alerts.
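As a concrete illustration of the percentage-of-series rule mentioned above, here is a minimal Go sketch; the function name, parameters, and thresholds are assumptions for illustration, not uMonitor's actual logic.

```go
package main

import "fmt"

// fireIfFractionViolating pages only when at least minFraction of the
// returned series breach the threshold, which keeps a single noisy city
// or host from paging anyone.
func fireIfFractionViolating(seriesValues map[string]float64, threshold, minFraction float64) bool {
	if len(seriesValues) == 0 {
		return false
	}
	violating := 0
	for _, v := range seriesValues {
		if v > threshold {
			violating++
		}
	}
	return float64(violating)/float64(len(seriesValues)) >= minFraction
}

func main() {
	errRateByCity := map[string]float64{"sfo": 0.02, "nyc": 0.31, "sao": 0.28}
	// Page only if a third or more of cities exceed a 25% error rate.
	fmt.Println(fireIfFractionViolating(errRateByCity, 0.25, 1.0/3.0)) // true
}
```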
For business metrics, Origami is useful when we need to alert on a per-city, per-product, per-app-version basis. Users create underlying alerts/checks for each combination of city, product, and app version, and then alert on aggregate policies so they receive notifications scoped to a given city/product/app version. During a larger outage, for example when many cities have problems at the same time, Origami sends rolled-up notifications listing the underlying alerts that fired.
For host alerts, Origami lets us send notifications of different severities based on the aggregate state of the underlying alerts. Consider disk space usage on a Cassandra cluster; the Origami notification policy for it might look like the following (a sketch of the policy appears after the list):
Send an email notification if fewer than three hosts exceed 70% disk usage
Send a page if more than three hosts exceed 70% disk usage
Send a page if one or more hosts reach 90% disk usage
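A minimal Go sketch of such an aggregation policy follows, using the thresholds from the list above; everything else is an illustrative stand-in rather than an actual Origami rule.

```go
package main

import "fmt"

// Severity is the notification level an aggregation policy resolves to.
type Severity string

const (
	None  Severity = "none"
	Email Severity = "email"
	Page  Severity = "page"
)

// diskPolicy mirrors the example policy above: fewer than three hosts above
// 70% disk usage is email-worthy; more than three such hosts, or any host at
// 90%, warrants a page.
func diskPolicy(diskUsageByHost map[string]float64) Severity {
	over70, over90 := 0, 0
	for _, pct := range diskUsageByHost {
		if pct >= 90 {
			over90++
		}
		if pct >= 70 {
			over70++
		}
	}
	switch {
	case over90 >= 1 || over70 > 3:
		return Page
	case over70 > 0:
		return Email
	default:
		return None
	}
}

func main() {
	usage := map[string]float64{"cass-01": 72, "cass-02": 68, "cass-03": 91}
	fmt.Println(diskPolicy(usage)) // "page", because cass-03 crossed 90%
}
```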
Alert notifications
Keeping alert notifications useful is a major challenge as the alerting system scales. Alert actions began primarily as notifications: paging engineers for high-priority issues and using chat or email for informational ones. Our focus has since shifted toward developing mitigation actions. Most incidents and outages are caused by configuration changes or deployments, so uMonitor provides first-class support for mitigation actions that roll back recent configuration changes and deployments. For teams with more complex mitigation runbooks, we support webhooks, which make a POST call to an endpoint with the full context of the alert, allowing the team's own runbooks to run. Additionally, by leveraging the deduplication pipeline in Origami, we can suppress finer-grained notifications during a larger outage.
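To make the webhook flow concrete, here is a hedged Go sketch of delivering alert context to a team-owned mitigation endpoint. The payload fields and URL are hypothetical, since uMonitor's actual webhook schema is not documented here; the post only says that the full alert context is POSTed to the endpoint.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// AlertContext is a hypothetical payload shape for the "full context of the
// alert" that uMonitor POSTs to a team's mitigation endpoint.
type AlertContext struct {
	AlertID   string            `json:"alert_id"`
	Service   string            `json:"service"`
	Severity  string            `json:"severity"`
	Labels    map[string]string `json:"labels"`
	FiredAtMs int64             `json:"fired_at_ms"`
}

// postWebhook delivers the alert context to a mitigation endpoint as JSON.
func postWebhook(url string, ctx AlertContext) error {
	body, err := json.Marshal(ctx)
	if err != nil {
		return err
	}
	resp, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	ctx := AlertContext{
		AlertID:  "api-error-rate",
		Service:  "rider-api",
		Severity: "page",
		Labels:   map[string]string{"city": "sfo", "deploy": "2024-05-01T12:00Z"},
	}
	// example.invalid is a placeholder; a real endpoint would trigger a
	// rollback or other mitigation runbook on receipt.
	if err := postWebhook("https://example.invalid/mitigate", ctx); err != nil {
		log.Println("webhook delivery failed:", err)
	}
}
```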
Beyond this, we have been working to make notifications more relevant and to target the right people. Recent work includes identifying the owners of configuration changes and deployments and notifying them directly when alerts fire for the services they modified. We have also combined alert information with tracing data from Jaeger to provide more context in notifications about failures in related services.
Alert management
As mentioned earlier, we have been building uMonitor as a platform that other teams can build on for their specific use cases. Host alert setup and management tend to be specialized, mainly serving teams that maintain their own dedicated hardware and teams building company-wide infrastructure platforms, including storage, metrics, and compute solutions. Those alerts are configured in each team's own git repositories, which are synced to Object Config.
At a high level, uMonitor has three classes of alerts:
Alerts generated automatically for all services, based on standard metrics such as CPU, disk utilization, and RPC statistics
One-off alerts created through the UI to detect specific problems
Alerts created and managed on top of uMonitor via scripts and external configuration systems
We see the most growth in the last category, as teams strive to detect alertable issues at the finest possible granularity. The need for this granularity stems from Uber's global growth. Code changes to the services behind Uber's mobile apps typically roll out to specific groups of cities over a matter of hours, so it is critical for us to monitor the health of the platform at the city level in order to catch problems before they spread widely. In addition, configuration parameters controlled by engineering and local operations teams vary from city to city; for example, streets in a city may be blocked off for an ongoing event such as a parade, or other events may change traffic patterns.
Many teams have built alert generation tools on top of uMonitor to address such use cases. Some of the challenges these tools tackle are (a sketch of dimension-based generation follows the list):
Iterating over each relevant dimension to generate alerts
Determining alert schedules from business information specific to a country or city, such as holidays, and configuring that information in uMonitor to prevent false alarms
Where static or current anomaly thresholds do not suffice, deriving thresholds from historical data or from more complex queries over the underlying metrics of a specific line of business (more on alert queries below)
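As a sketch of what such a generation tool might do, the Go snippet below expands one alert template across city and product dimensions and derives a per-city threshold from a baseline. All names, metrics, and numbers are illustrative assumptions, not any team's actual tooling.

```go
package main

import "fmt"

// GeneratedAlert is a hypothetical record for one programmatically created
// uMonitor alert; a generation tool expands a template over every
// city/product combination a team cares about.
type GeneratedAlert struct {
	Name      string
	Query     string
	Threshold float64
}

// generate expands one alert template across dimensions, deriving the
// threshold per city from a baseline that a real tool might compute from
// historical data.
func generate(cities, products []string, baseline map[string]float64) []GeneratedAlert {
	var out []GeneratedAlert
	for _, c := range cities {
		for _, p := range products {
			out = append(out, GeneratedAlert{
				Name:      fmt.Sprintf("trip-drop.%s.%s", c, p),
				Query:     fmt.Sprintf("sumSeries(trips.%s.%s.completed)", c, p),
				Threshold: 0.7 * baseline[c], // alert if trips drop below 70% of baseline
			})
		}
	}
	return out
}

func main() {
	alerts := generate(
		[]string{"sfo", "nyc"},
		[]string{"uberx", "pool"},
		map[string]float64{"sfo": 1200, "nyc": 2500},
	)
	for _, a := range alerts {
		fmt.Printf("%s  %s  threshold=%.0f\n", a.Name, a.Query, a.Threshold)
	}
}
```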
In addition, many of these tools generate dashboards that stay in sync with the generated alerts.
uMonitor also provides powerful editing and root-causing UIs. The editing and experimentation features are critical because most metrics cannot be alerted on as-is due to variance and spikes. The Observability team provides guidance on crafting queries that are better suited to alerting.
Alert queries
The Graphite query language and M3QL offer many functions for building more tailored solutions. Below we outline some examples of shaping queries so they return more consistent values and the metrics become more alertable (illustrative expressions follow the list):
Alert on a moving average of the metric over a few minutes to smooth out spikes
Combine the above with a sustain period, so a notification is sent only after the threshold has been violated for a certain amount of time
For metrics with up-and-down patterns, use derivative functions to ensure that swings in either direction are not too sudden
Alert on percentages or ratios so metrics are less sensitive to changes in overall scale
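The snippet below collects a few illustrative Graphite-style expressions for these techniques, wrapped in Go only for consistency with the other sketches. The function names follow standard Graphite; exact M3QL syntax may differ, and the metric names are made up.

```go
package main

import "fmt"

func main() {
	// Illustrative Graphite-style expressions for the query-shaping
	// techniques above; metric names are hypothetical.
	examples := map[string]string{
		// Smooth spikes with a moving average before applying a threshold.
		"smoothed error count": "movingAverage(api.rider.errors.count, '5min')",
		// Alert on the rate of change rather than the absolute level for
		// metrics with natural up-and-down patterns.
		"sudden swing": "absolute(derivative(payments.queue.depth))",
		// Ratios are robust to traffic growth; alert on the error
		// percentage instead of raw error counts.
		"error ratio": "asPercent(api.rider.errors.count, api.rider.requests.count)",
	}
	for name, q := range examples {
		fmt.Printf("%-22s %s\n", name, q)
	}
	// A sustain period (only notify after N consecutive violating minutes)
	// is typically configured on the alert itself rather than in the query.
}
```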
This concludes our look at how Uber built its alerting ecosystem.