What are the characteristics of Druid


This article introduces the characteristics of Druid: the kind of data it targets, its node architecture, its storage model, and how it partitions data.

Druid takes its name from the druid character class in games such as Dota: a shapeshifter that can, for example, turn into a bear. The analogy is that Druid is meant to adapt to all kinds of scenarios.

Background

The data Druid targets is log data. What is a log? A log is a record of something that happened in the system. For example, if you edit an entry in Baidu Encyclopedia, a log is generated containing attributes such as the time of the edit, your name, the entry you edited, and how many words were added or removed. Among these attributes, time is essential: each log has a timestamp of long type, and the timestamp is mainly used as a filter condition in queries. Attributes such as your name and the entry are dimensions, usually of string type. Attributes such as the number of words added or removed are measure columns, or metrics, usually of numeric type.
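
To make the three kinds of columns concrete, here is a minimal sketch of what one such edit-log event might look like. The field names are hypothetical, chosen only for illustration.

```python
# A hypothetical wiki-edit log event, split into the three column types
# Druid distinguishes: a timestamp, dimensions, and metrics.
edit_event = {
    # timestamp column: a long (epoch milliseconds), used to filter queries
    "timestamp": 1712380800000,
    # dimension columns: string attributes used for grouping and filtering
    "user": "alice",
    "page": "Apache Druid",
    # metric columns: numeric values that get aggregated
    "chars_added": 120,
    "chars_removed": 35,
}
```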

Log data is usually generated in chronological order; the system cannot generate a 2 o'clock log before a 1 o'clock log. However, after logs are generated, they may arrive out of order when transferred elsewhere.

Log data is usually stored in Hadoop, but Hadoop does not provide good support for query filtering and cannot meet users' interactive query needs. At the same time, users need high availability and zero-downtime restarts.

Architecture

There are four types of nodes in Druid. Note that these are four types; there can be more than one node of each type. Let's introduce them one by one:

Real-time node:

These nodes are essentially Kafka consumers, used to process data that arrives in real time, so they behave just like Kafka consumers. For example, they can belong to the same consumer group consuming a Kafka topic, in which case their data does not overlap. They can also consume as different consumer groups, in which case their data is duplicated; duplication is not necessarily a bad thing, since the duplicates can serve as replicas.
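
As a rough illustration of the consumer-group behaviour described above, the sketch below uses the kafka-python client (an assumption; Druid itself is written in Java and does not literally work this way). Consumers sharing a group_id split the topic's partitions, so their data does not overlap; consumers with different group_ids each receive the full stream.

```python
import json
from kafka import KafkaConsumer  # kafka-python client (assumed to be installed)

# Two real-time nodes sharing one consumer group split the topic's
# partitions between them, so the events they ingest do not overlap.
consumer = KafkaConsumer(
    "edit-logs",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="druid-realtime",         # same group_id => non-overlapping data
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # hand the event to the in-memory index (see the next paragraph)
    print(event["timestamp"], event.get("page"))
```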

The Real-time node maintains an index in memory. When log data arrives, it is first added to the in-memory index, and the index and the current in-memory data are periodically persisted to disk, for example every 10 minutes or every 10,000 records processed. The data and indexes persisted each time are immutable, which simplifies the system design.
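
The persist-on-threshold behaviour can be sketched as follows. The thresholds, class names, and file layout are illustrative, not Druid's actual implementation.

```python
import json
import time
from pathlib import Path


class InMemoryIndex:
    """A toy stand-in for a real-time node's in-memory index."""

    def __init__(self, persist_dir, max_rows=10_000, max_age_s=600):
        self.rows = []
        self.persist_dir = Path(persist_dir)
        self.max_rows = max_rows      # e.g. every 10,000 rows ...
        self.max_age_s = max_age_s    # ... or every 10 minutes
        self.last_persist = time.time()

    def add(self, event):
        self.rows.append(event)
        too_many = len(self.rows) >= self.max_rows
        too_old = time.time() - self.last_persist >= self.max_age_s
        if too_many or too_old:
            self.persist()

    def persist(self):
        """Write the current rows to an immutable spill file and start fresh."""
        self.persist_dir.mkdir(parents=True, exist_ok=True)
        path = self.persist_dir / f"spill-{int(time.time() * 1000)}.json"
        path.write_text(json.dumps(self.rows))
        self.rows = []
        self.last_persist = time.time()
```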

The data a real-time node is responsible for is bounded in time. For example, a node may only accept data for the 1:00-2:00 window; after 2:00 it no longer accepts 1:00-2:00 data and starts accepting 2:00-3:00 data, because the 1:00-2:00 data has already been written to disk. A merge task then merges these persisted pieces of data and their indexes into one unit, called a Segment. The Segment is the basic unit of data storage in Druid.
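
Continuing the toy example above, handing off one hour of data could look like merging all the spill files for that window into a single immutable Segment file. Again, this is a sketch of the idea, not Druid's real merge task.

```python
import json
from pathlib import Path


def merge_spills_into_segment(persist_dir, segment_path):
    """Merge all persisted spill files for one time window into a single
    immutable Segment file, sorted by timestamp (a sketch, not Druid's code)."""
    rows = []
    for spill in sorted(Path(persist_dir).glob("spill-*.json")):
        rows.extend(json.loads(spill.read_text()))
    rows.sort(key=lambda r: r["timestamp"])
    Path(segment_path).write_text(json.dumps(rows))
    return len(rows)
```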

Historical node:

The Segments assembled by the Real-time nodes are handed off to the underlying storage. The Historical node is responsible for reading Segments from the underlying storage and loading them into memory for querying.

Historical nodes work independently and are unaware of each other. Each Historical node maintains a cache that holds some of the Segments. When a Segment needs to be read, the node checks the cache first; on a cache miss, it downloads the Segment from the underlying storage.
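
The cache-first lookup on a Historical node can be summarised like this; the `deep_storage` object and its `download` method are hypothetical stand-ins for whatever fetches a Segment from the underlying storage.

```python
class HistoricalNode:
    """Sketch of a Historical node's segment cache (illustrative only)."""

    def __init__(self, deep_storage):
        self.deep_storage = deep_storage  # hypothetical object with .download()
        self.cache = {}                   # segment_id -> loaded segment data

    def get_segment(self, segment_id):
        # Check the local cache first; on a miss, download from the
        # underlying storage and keep a copy for later queries.
        if segment_id not in self.cache:
            self.cache[segment_id] = self.deep_storage.download(segment_id)
        return self.cache[segment_id]
```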

Historical nodes can be divided into tiers, and each tier can be configured separately. For example, the better-performing Historical nodes can form a hot data tier and the average-performing nodes a cold data tier, so that the hot tier serves the frequently accessed data. This is mainly used for load balancing across a heterogeneous cluster.
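
A hot/cold split might be expressed with a tier layout along these lines; the names and structure here are hypothetical and do not reflect Druid's actual configuration format.

```python
# Hypothetical tier layout: fast nodes serve recent ("hot") data,
# slower nodes hold older ("cold") data.
tiers = {
    "hot":  {"nodes": ["hist-ssd-1", "hist-ssd-2"], "serves": "last 7 days"},
    "cold": {"nodes": ["hist-hdd-1"],               "serves": "older than 7 days"},
}
```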

Broker node:

These nodes are responsible for routing queries and merging results. The Broker node also has a cache, which maps query requests to their results. This cache only holds results from Historical nodes, because the data on real-time nodes is constantly changing. For the Historical Segments of a query that hit the cache, the Broker reads the cached result directly; for the rest, it sends requests to the relevant nodes to read the data.

This node is also responsible for merging results: because a single query may involve multiple nodes, the partial results need to be merged.
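
The Broker's per-segment caching and result merging can be sketched as follows; the `query_node` function and the merge-by-summation step are assumptions made only for illustration.

```python
class Broker:
    """Sketch of a Broker: route per-segment queries, cache Historical
    results, and merge the partial results (illustrative only)."""

    def __init__(self, query_node):
        # query_node: hypothetical callable (node, segment_id, query) -> dict
        self.query_node = query_node
        self.cache = {}  # (segment_id, query_key) -> cached partial result

    def run(self, query, segment_locations):
        # segment_locations: list of (segment_id, node, is_realtime)
        partials = []
        for segment_id, node, is_realtime in segment_locations:
            key = (segment_id, repr(query))
            if not is_realtime and key in self.cache:
                partials.append(self.cache[key])  # cache hit: skip the node
                continue
            result = self.query_node(node, segment_id, query)
            if not is_realtime:                   # only Historical results are cached
                self.cache[key] = result
            partials.append(result)
        # Merge the partial results; here, simply sum numeric values per key.
        merged = {}
        for part in partials:
            for k, v in part.items():
                merged[k] = merged.get(k, 0) + v
        return merged
```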

Coordinator node:

The meta-information for each Segment is stored in MySQL; every time a Real-time node produces a Segment, it registers the Segment's information in MySQL. MySQL also stores a table of rules that define which Segments are hot and which are cold. In this kind of distributed system, the basic role of a relational database such as MySQL is to manage the system's metadata.
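
The metadata registered for each Segment might look roughly like the record below; the field names and the rule format are hypothetical stand-ins, not Druid's actual metadata schema.

```python
# Hypothetical shape of one Segment's metadata row in MySQL.
segment_metadata = {
    "datasource": "edit-logs",
    "interval_start": "2024-04-06T01:00:00Z",
    "interval_end": "2024-04-06T02:00:00Z",
    "version": "2024-04-06T02:05:00Z",   # increasing version number
    "partition_num": 0,
    "storage_path": "edit-logs/2024-04-06T01/0/index.zip",
}

# A hypothetical rule-table entry marking data older than 30 days as cold.
load_rule = {"datasource": "edit-logs", "older_than_days": 30, "tier": "cold"}
```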

The Coordinator node connects to MySQL, reads the information about all Segments, and then assigns each Segment to a Historical node, taking responsibility for load balancing across the Historical nodes. It can also control the replication factor of Segments. Because replicas exist, any node can be taken out of service at any time, allowing software upgrades without downtime.
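
A greatly simplified view of the Coordinator's job: for each Segment read from the metadata store, pick the least-loaded Historical nodes until the desired replication factor is met. This is a sketch of the idea, not Druid's actual balancing algorithm.

```python
def assign_segments(segments, historical_nodes, replication=2):
    """Assign each segment to the `replication` least-loaded Historical nodes
    (a sketch of the Coordinator's role, not Druid's real balancer)."""
    assignments = {node: [] for node in historical_nodes}
    for segment in segments:
        # Pick the nodes currently holding the fewest segments.
        targets = sorted(assignments, key=lambda n: len(assignments[n]))[:replication]
        for node in targets:
            assignments[node].append(segment)
    return assignments


# Example: three Historical nodes, replication factor 2.
print(assign_segments(["seg-1", "seg-2", "seg-3"],
                      ["hist-1", "hist-2", "hist-3"]))
```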

Storage model

Segments are divided by data source and time period. For example, if the data spans one year, one Segment per day is appropriate. Once Segments are divided by time, queries can be filtered by time, which means that time is the primary index.
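
Because each Segment is bounded by time, a query's time filter can prune Segments before any data is read. A minimal sketch, using toy numeric intervals:

```python
def prune_segments(segments, query_start, query_end):
    """Keep only the segments whose time interval overlaps the query's
    [query_start, query_end) range -- time acts as the primary index."""
    return [
        s for s in segments
        if s["start"] < query_end and s["end"] > query_start
    ]


segments = [
    {"id": "day-01", "start": 1, "end": 2},
    {"id": "day-02", "start": 2, "end": 3},
    {"id": "day-03", "start": 3, "end": 4},
]
print(prune_segments(segments, query_start=2, query_end=3))  # only "day-02"
```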

Second, Segments are stored column by column, and each column can be encoded and compressed. String columns typically use dictionary encoding, together with techniques such as RLE and bitmap indexes.

There is nothing particularly special about the storage model; these are essentially the standard characteristics of column storage.
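
To illustrate dictionary encoding and bitmap indexes on a string column, here is a toy sketch; real column formats are far more compact, and the bitmaps themselves would be compressed.

```python
def dictionary_encode(values):
    """Dictionary-encode a string column and build a bitmap per value
    (a toy illustration of the column encodings mentioned above)."""
    dictionary = sorted(set(values))               # value -> small integer id
    ids = {v: i for i, v in enumerate(dictionary)}
    encoded = [ids[v] for v in values]             # the column, as integers
    bitmaps = {v: [1 if x == v else 0 for x in values] for v in dictionary}
    return dictionary, encoded, bitmaps


column = ["alice", "bob", "alice", "carol", "alice"]
dictionary, encoded, bitmaps = dictionary_encode(column)
print(dictionary)        # ['alice', 'bob', 'carol']
print(encoded)           # [0, 1, 0, 2, 0]
print(bitmaps["alice"])  # [1, 0, 1, 0, 1] -- rows where the value is "alice"
```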

Data partition

The basic unit of data organization in Druid is the Segment, which is uniquely identified by a data source identifier, a time period, an increasing version number, and a partition id (partition number). When real-time nodes belong to the same consumer group, the data they consume does not overlap, so the Segments these nodes generate for the same time period have the same version but different partition ids. All the Segments for a time period together form a block, and a query is not executed until the data of that block is complete.
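
The four-part Segment identity and the block-completeness rule can be sketched like this; the class and naming conventions are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentId:
    """A Segment is uniquely identified by these four fields."""
    datasource: str
    interval: str    # e.g. "2024-04-06T01/2024-04-06T02"
    version: str     # increasing version number
    partition: int   # partition id within the interval


def block_is_ready(segments, interval, expected_partitions):
    """A query over `interval` only runs once all partitions of that
    interval's block are present (a sketch of the completeness check)."""
    present = {s.partition for s in segments if s.interval == interval}
    return present == set(range(expected_partitions))


segs = [SegmentId("edit-logs", "01/02", "v1", 0),
        SegmentId("edit-logs", "01/02", "v1", 1)]
print(block_is_ready(segs, "01/02", expected_partitions=2))  # True
```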

This concludes the look at the characteristics of Druid. Combining theory with practice is the best way to learn, so go and try it out!
