

How to use Druid for real-time analysis of service quality in Netflix

2025-02-28 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article introduces how Netflix uses Druid for real-time service quality analysis. Many people run into difficulties with situations like the ones covered in these real-world cases, so the editor will walk you through how to handle them. I hope you read carefully and come away with something useful!

I. An introduction to Druid

Apache Druid is a high-performance real-time analytics database, designed for workflows where fast queries and fast ingestion matter. Druid's strengths are real-time data visibility, ad-hoc querying, operational analytics, and handling high concurrency.

Druid is not a relational database, though some concepts carry over. Instead of tables, Druid has datasources. As in a relational database, a datasource is a logical grouping of data represented as columns. Unlike a relational database, there is no concept of joins. Netflix therefore needs to make sure each datasource contains every column it will want to filter or group by. There are three main types of columns in a datasource: time, dimensions, and metrics.

Everything in Druid is keyed by time. Each datasource has a timestamp column, which is the primary partitioning mechanism. Dimensions are values that can be used to filter, query, or group by. Metrics are values that can be aggregated.
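The three column types can be sketched as the schema portion of a Druid ingestion spec. This is a minimal illustration, not Netflix's actual schema: every field name (`event_time`, `device_type`, `play_delay_ms`, and so on) is a hypothetical placeholder.

```python
# Sketch of the three column types in a hypothetical Druid datasource
# schema for playback metrics; all field names are illustrative.
datasource_schema = {
    # Time: the timestamp column, Druid's primary partitioning key.
    "timestampSpec": {"column": "event_time", "format": "iso"},
    # Dimensions: values used to filter, query, or group by.
    "dimensionsSpec": {"dimensions": ["device_type", "app_version", "country"]},
    # Metrics: values that are aggregated.
    "metricsSpec": [
        {"type": "count", "name": "event_count"},
        {"type": "longSum", "name": "play_delay_ms", "fieldName": "play_delay_ms"},
    ],
}
```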

By eliminating the ability to perform joins and assuming that the data is keyed by a timestamp, Druid can optimize the way data is stored, allocated, and queried, enabling Netflix to extend the data source to trillions of rows and still achieve query response times of less than ten milliseconds.

To achieve this level of scalability, Druid partitions the stored data into time chunks. The duration of a time chunk is configurable, so you can choose an appropriate duration based on your data and use case.

II. Problems encountered by Netflix

Netflix uses real-time logs from playback devices as its event source. From these events it derives measurements that describe and quantify how smoothly user devices handle browsing and playback.

Once these metrics exist, they are fed into a database. Each measurement is tagged with anonymized details about the device that produced it, such as whether the device is a smart TV, an iPad, or an Android phone. This lets Netflix classify devices and view the data along various dimensions, which in turn lets the system isolate problems that affect only specific populations, such as a particular version of the application, a specific type of device, or specific countries. This aggregated data is immediately available for querying, either through dashboards or ad-hoc queries. The metrics are also continuously checked for alarm signals, such as whether a new version is hurting playback or browsing for some users or devices. These checks alert the responsible team so they can resolve the problem as quickly as possible.

During a software update, Netflix enables the new version for some users and uses these real-time metrics to compare how the new version performs against the previous one. Any regression in the metrics signals Netflix to abort the update and roll the users who received the new version back to the previous one.

Because this data stream can exceed 2 million events per second, getting it into a database that can be queried quickly is very difficult. Netflix also needs enough dimensions to make the data useful for isolating problems, and as a result it produces more than 115 billion rows per day.

III. How Netflix handles massive data analysis with Druid

Data ingestion

Ingestion into the database happens in real time. Instead of inserting individual records into the datasource, events (metrics, in Netflix's case) are read from a Kafka stream. Each datasource uses exactly one topic. In Druid, Netflix uses the Kafka indexing task, which creates multiple indexing workers distributed across the real-time nodes (middle managers).

Each of these indexers subscribes to the topic and reads its share of events from the stream. An indexer extracts values from the event messages according to the ingestion spec and accumulates the resulting rows in memory. As soon as a row is created, it can be queried: queries that hit a time chunk the indexer is still filling are served by the indexer itself. Because the indexing task is really doing two jobs, ingestion and serving live queries, it is important to hand data off to the historical nodes promptly so that query work can be offloaded to them in a more optimized way.
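The Kafka indexing setup above is driven by a supervisor spec submitted to Druid as JSON. The sketch below shows its general shape; the datasource name, topic, broker address, and task count are illustrative assumptions, not Netflix's actual configuration.

```python
import json

# Sketch of a Druid Kafka supervisor spec. Submitting this (as JSON) to
# Druid launches indexing tasks on the middle managers, which divide the
# topic's partitions among themselves.
supervisor_spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "playback_metrics",
            "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "minute"},
        },
        "ioConfig": {
            "topic": "playback-events",  # exactly one topic per datasource
            "consumerProperties": {"bootstrap.servers": "kafka:9092"},
            "taskCount": 4,              # parallel indexers sharing the partitions
        },
    },
}
payload = json.dumps(supervisor_spec)  # body of the POST to the supervisor endpoint
```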

Druid can roll up data as it is ingested to minimize the amount of raw data that needs to be stored. Rollup is a form of summarization or pre-aggregation. In some cases, rolling up data can dramatically shrink the data that needs to be stored, sometimes reducing the number of rows by orders of magnitude. This storage reduction comes at a price, though: individual events can no longer be queried, only aggregates at a predefined query granularity. For its use case, Netflix chose a query granularity of one minute.

During ingestion, if any rows have identical dimensions and their timestamps fall within the same minute (Netflix's query granularity), those rows are rolled up. This happens by summing all the metrics together and incrementing a counter, so Netflix knows how many events contributed to each stored row's values. This form of rollup can significantly reduce the number of rows in the database and thus speed up queries, since there are fewer rows to scan and aggregate.
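The rollup described above can be simulated in a few lines: truncate each timestamp to the minute, key rows by (minute, dimensions), sum the metrics, and keep a counter. Field names (`device_type`, `play_delay_ms`, etc.) are illustrative, not Netflix's schema.

```python
from collections import defaultdict

# Simulates Druid's ingestion-time rollup: rows with identical dimensions
# whose timestamps fall in the same minute are merged, metrics are summed,
# and a count records how many raw events contributed to each stored row.
def rollup(events, granularity_secs=60):
    buckets = defaultdict(lambda: {"play_delay_ms": 0, "count": 0})
    for e in events:
        minute = e["ts"] - e["ts"] % granularity_secs  # truncate to the minute
        key = (minute, e["device_type"], e["country"])
        buckets[key]["play_delay_ms"] += e["play_delay_ms"]
        buckets[key]["count"] += 1
    return dict(buckets)

raw = [
    {"ts": 120, "device_type": "smart_tv", "country": "US", "play_delay_ms": 40},
    {"ts": 150, "device_type": "smart_tv", "country": "US", "play_delay_ms": 60},
    {"ts": 150, "device_type": "ipad",     "country": "US", "play_delay_ms": 30},
]
rolled = rollup(raw)
# Three raw events collapse into two stored rows; the smart-TV row keeps a
# count of 2, so averages can still be computed later (100 / 2 = 50 ms).
```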

Once the number of accumulated rows reaches a certain threshold, or a segment has been open for too long, the rows are written to a segment file and handed off to deep storage. The indexer then informs the coordinator that the segment is ready, and the coordinator tells one or more historical nodes to load it. Once the segment has been successfully loaded by a historical node, it is unloaded from the indexer, and the historical node serves all further queries against that data.

Data processing

As dimension cardinality grows, the probability of identical events occurring within the same minute drops. Managing cardinality, and therefore rollup, is a powerful lever for achieving good query performance.

To reach its required ingestion rate, Netflix runs many indexer instances. Even though rollup merges identical rows within an indexing task, the chance of all identical rows landing in the same indexing task instance is very low. To solve this and achieve the best possible rollup, Netflix schedules a compaction task to run after all segments for a given time chunk have been handed off to the historical nodes. The scheduled compaction task fetches all the segments for the time chunk from deep storage and runs a map/reduce job to recreate the segments with perfect rollup. The historical nodes then load and publish the new segments, replacing the original, less-rolled-up ones. With this extra compaction step, Netflix has seen a twofold improvement in rollup.

Knowing when all events for a given time chunk have been received is not trivial: data may arrive late on the Kafka topic, and the indexers may take time to hand their segments over to the historical nodes.
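A scheduled compaction of one time chunk might be expressed with a task spec shaped roughly like the sketch below. The datasource name and interval are illustrative placeholders, and the exact field layout is an assumption about Druid's compaction task format, not Netflix's configuration.

```python
# Sketch of a Druid compaction task spec for a single time chunk.
# Scheduling this after all segments for the chunk have been handed off
# lets Druid rewrite them with perfect rollup across rows that were
# originally ingested by separate indexing tasks.
compaction_task = {
    "type": "compact",
    "dataSource": "playback_metrics",
    "ioConfig": {
        "type": "compact",
        "inputSpec": {
            "type": "interval",
            "interval": "2020-01-01T00:00:00/2020-01-01T01:00:00",
        },
    },
}
```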

Querying the data

Druid supports two query languages: Druid SQL and native queries. Under the hood, Druid SQL queries are translated into native queries. Native queries are submitted as JSON to a REST endpoint, and this is the main mechanism Netflix uses.
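A native query is just a JSON document describing the query type, datasource, interval, filters, and aggregations. The sketch below shows a timeseries query of the kind that might back a dashboard; the datasource, filter value, and aggregator names are illustrative assumptions.

```python
import json

# Sketch of a Druid native timeseries query as it would be POSTed (as JSON)
# to a Broker's REST endpoint. Names and values are illustrative.
native_query = {
    "queryType": "timeseries",
    "dataSource": "playback_metrics",
    "granularity": "minute",
    "intervals": ["2020-01-01T00:00:00/2020-01-01T01:00:00"],
    "filter": {"type": "selector", "dimension": "device_type", "value": "smart_tv"},
    "aggregations": [
        {"type": "longSum", "name": "play_delay_ms", "fieldName": "play_delay_ms"}
    ],
}
body = json.dumps(native_query)  # request body for the POST
```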

Most queries to the cluster are generated by custom internal tools such as dashboards and alarm systems.

To speed up adoption of Druid queries and reuse existing tooling, Netflix added a translation layer that accepts Atlas queries, rewrites them as Druid queries, issues the queries, and reformats the results as Atlas results. This abstraction layer lets existing tools be used as-is and adds no extra learning curve for users accessing the data in Netflix's Druid datastore.
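The idea behind such a translation layer can be sketched as a single rewrite function: accept a query in the caller's existing format and emit an equivalent Druid native query. The Atlas-like input shape and every field name below are hypothetical placeholders, greatly simplified from whatever Netflix actually does.

```python
# Highly simplified sketch of a query translation layer: rewrite a query
# from an existing (Atlas-like) format into a Druid native query, so
# existing tools need no knowledge of Druid. All names are hypothetical.
def translate_to_druid(atlas_like_query):
    return {
        "queryType": "timeseries",
        "dataSource": atlas_like_query["name"],
        "granularity": "minute",
        "intervals": [atlas_like_query["interval"]],
        "filter": {
            "type": "selector",
            "dimension": "device_type",
            "value": atlas_like_query["tags"]["device_type"],
        },
        "aggregations": [
            {"type": "longSum", "name": "value", "fieldName": "value"}
        ],
    }

druid_q = translate_to_druid({
    "name": "playback_metrics",
    "interval": "2020-01-01/2020-01-02",
    "tags": {"device_type": "smart_tv"},
})
```

A real layer would also reshape the Druid response back into the caller's result format, so the tools on top never see Druid at all.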

This concludes "How Netflix uses Druid for real-time analysis of service quality". Thank you for reading.
