How to get started with Druid Real-time OLAP data Analysis Storage system

2025-03-28 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/01 Report

In this issue, the editor introduces how to get started with the Druid real-time OLAP data analysis and storage system. The article is rich in content and approaches the topic from a professional point of view; I hope you find it useful after reading.

Brief introduction

Druid is an open-source, distributed, column-oriented storage system suited to real-time data analysis, capable of fast aggregation, flexible filtering, millisecond-level queries, and low-latency data ingestion.

Druid is designed with high availability in mind: the failure of individual nodes will not cause Druid to stop working (although cluster state will stop being updated).

The components of Druid are loosely coupled; if real-time data is not needed, the real-time nodes can simply be omitted.

Druid uses bitmap indexes to accelerate queries over its column storage, and compresses those indexes with the CONCISE algorithm, so that the generated segments are much smaller than the raw text files.
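To make the idea concrete, here is a minimal Python sketch of how a bitmap index answers a filter without scanning the whole column. This is illustrative only: real Druid bitmaps are compressed (CONCISE/Roaring) rather than plain sets, and the function names here are made up.

```python
# Illustrative bitmap index over one column (not Druid's actual code).

def build_bitmap_index(column):
    """Map each distinct value to the set of row ids where it occurs."""
    index = {}
    for row_id, value in enumerate(column):
        index.setdefault(value, set()).add(row_id)
    return index

def filter_rows(index, value_a, value_b):
    """Rows matching value_a OR value_b: a cheap union of two bitmaps."""
    return sorted(index.get(value_a, set()) | index.get(value_b, set()))

city = ["SF", "NY", "SF", "LA", "NY"]
idx = build_bitmap_index(city)
print(filter_rows(idx, "SF", "LA"))  # → [0, 2, 3]
```

The filter touches only the two bitmaps involved, which is why adding filters tends to make Druid queries cheaper rather than more expensive.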

Architecture

Overall architecture

A Druid cluster contains different types of nodes, and each type is designed to do one well-defined set of things. This design isolates concerns and reduces the complexity of the whole system.

The different node types operate almost independently of one another and interact minimally, so communication failures within the cluster have little impact on data availability.

The composition and data flow of the Druid cluster is shown in figure 1:

(figure 1)

Druid itself contains five kinds of nodes: Realtime, Historical, Coordinator, Broker, Indexer

The Historical node stores and serves queries over "historical" (non-real-time) data. It loads data segments (Data/Segments) from deep storage (Deep Storage), responds to query requests from Broker nodes, and returns the results.

A historical node usually keeps local copies of some segments, so even if deep storage becomes inaccessible, it can still answer queries over the segments it has already downloaded.

The Realtime node stores and serves queries over real-time data; it likewise responds to query requests from Broker nodes and returns the results.

Real-time nodes periodically build their accumulated data into segments and hand them off to the historical nodes.

The Coordinator node can be thought of as Druid's master: it manages the historical and real-time nodes through Zookeeper, and manages data segments through metadata stored in MySQL.

The Broker node responds to external query requests. It consults Zookeeper to decide which historical and real-time nodes should serve a query, forwards the request to them, then merges their results and returns the final merged result to the caller.

The Indexer node is responsible for data ingestion: it loads batch and real-time data into the system, and can also modify data already stored in the system.

Druid has three external dependencies: MySQL, deep storage, and Zookeeper.

MySQL:

Stores Druid's metadata rather than the actual data, and contains three tables:

"druid_config" (usually empty), "druid_rules" (rule information used by the Coordinator nodes, such as which segments should be loaded on which nodes), and "druid_segments" (metadata for each segment)

Deep storage: stores the segments. Druid currently supports local disk, NFS-mounted disks, HDFS, S3, and others.

There are two sources of deep storage data: one is batch data ingestion, the other is the real-time nodes.
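As an illustration, the choice of deep storage backend is made through runtime properties. The snippet below sketches the `druid.storage.*` settings; the directory paths are made-up examples, so check your own deployment's configuration reference for the exact values.

```properties
# Local-disk deep storage (single-machine setups):
# druid.storage.type=local
# druid.storage.storageDirectory=/tmp/druid/localStorage

# HDFS deep storage (example path, adjust to your cluster):
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```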

ZooKeeper: used by Druid to manage the current cluster state, for example recording which segments have been moved from real-time nodes to historical nodes.

Real-time node

Real-time nodes encapsulate the functionality of ingesting and querying event data; events ingested through these nodes can be queried immediately. A real-time node only cares about events from a short, recent window, and periodically hands the data collected during that window off to deep storage. Real-time nodes announce their online status, and the data they serve, through Zookeeper.

(figure 2)

As shown in figure 2, the real-time node buffers event data in an in-memory index and then periodically persists it to disk. The persisted indexes are merged periodically before hand-off, and queries hit both the in-memory and the persisted indexes. Every real-time node periodically runs a background task that searches the local persisted indexes and merges them into an immutable block of data containing all the events the node has ingested over a period of time; such a block is called a "Segment". During the hand-off phase, the real-time node uploads these segments to a persistent backup store, usually a distributed file system such as S3 or HDFS, called "Deep Storage".
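The buffer-persist-merge-handoff cycle above can be sketched in Python as follows. The class and method names are hypothetical, and "disk" and the merged segment are just in-process data structures standing in for real persistence.

```python
# Illustrative sketch of a real-time node's ingest/persist/hand-off flow.

class RealtimeNode:
    def __init__(self, persist_every=3):
        self.in_memory = []           # recent events, queryable immediately
        self.persisted = []           # periodically flushed index chunks ("disk")
        self.persist_every = persist_every

    def ingest(self, event):
        self.in_memory.append(event)
        if len(self.in_memory) >= self.persist_every:
            # periodic persist: flush the in-memory index to "disk"
            self.persisted.append(list(self.in_memory))
            self.in_memory.clear()

    def query(self):
        # queries hit both the persisted and the in-memory indexes
        return [e for chunk in self.persisted for e in chunk] + self.in_memory

    def hand_off(self):
        # merge persisted chunks into one immutable segment for deep storage
        segment = tuple(e for chunk in self.persisted for e in chunk)
        self.persisted.clear()
        return segment

node = RealtimeNode()
for e in ["e1", "e2", "e3", "e4"]:
    node.ingest(e)
print(node.query())     # → ['e1', 'e2', 'e3', 'e4']
print(node.hand_off())  # → ('e1', 'e2', 'e3')
```

Note how "e4" is still queryable from memory even though it has not been persisted yet, matching the paragraph above.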

Historical node

Historical nodes follow a shared-nothing architecture, so there is no single point of failure among them. The nodes are independent of each other and the service they provide is simple: they only need to know how to load, drop, and serve segments. Like real-time nodes, historical nodes advertise in Zookeeper their online status and which data they serve. Instructions to load and drop segments are issued through Zookeeper, and each instruction states where the segment lives in deep storage and how to extract and process it.

(figure 3)

As shown in figure 3, before a historical node downloads a segment from deep storage, it checks its local cache to see whether the segment is already present on the node. If the segment is not in the cache, the historical node downloads it from deep storage to local disk. Once this phase completes, the segment is announced in Zookeeper and becomes queryable; the segment must be loaded into memory before it can be queried.
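A rough sketch of that load path, with hypothetical names: a dict stands in for deep storage, and a set stands in for the Zookeeper announcement.

```python
# Illustrative sketch of the historical-node segment load path (figure 3).

deep_storage = {"seg-2024-01": b"...column data..."}  # stand-in for S3/HDFS

class HistoricalNode:
    def __init__(self):
        self.local_cache = {}    # segments already on local disk
        self.announced = set()   # segments advertised (via "Zookeeper") as queryable

    def load_segment(self, segment_id):
        if segment_id not in self.local_cache:
            # cache miss: pull the segment down from deep storage
            self.local_cache[segment_id] = deep_storage[segment_id]
        # announce the segment so brokers can route queries to this node
        self.announced.add(segment_id)
        return self.local_cache[segment_id]

node = HistoricalNode()
node.load_segment("seg-2024-01")
print("seg-2024-01" in node.announced)  # → True
```

Because the segment stays in `local_cache`, a repeat load is served locally even if `deep_storage` later becomes unreachable, which is the availability property described earlier.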

Coordination node

The Coordinator node is mainly responsible for managing segments and their distribution across the historical nodes. It tells historical nodes to load new data, drop expired data, replicate data, and move data for load balancing. To maintain a stable view, Druid uses a multi-version concurrency control swap protocol to manage immutable segments. If an immutable segment's data has been completely superseded by newer segments, the expired segment is dropped from the cluster. The Coordinator nodes run a leader-election process to choose a single node to perform the coordination function; the remaining Coordinator nodes act as redundant backups.

Broker node

The Broker node is the query router for the historical and real-time nodes. Because it knows which segments are published in Zookeeper, the Broker node can route an incoming query to the correct historical or real-time nodes, merge their partial results, and return the final merged result to the caller. The Broker node also contains a cache with an LRU eviction policy.

(figure 4)

As shown in figure 4, each time a Broker node receives a query, it first maps the query to a set of segments. Results for some of these segments may already exist in the cache and need not be recomputed. For results not in the cache, the Broker node forwards the query to the appropriate historical and real-time nodes; once a historical node returns its result, the Broker caches it for later use, as shown in figure 6. Real-time data is never cached, so queries over data on real-time nodes are always forwarded to the real-time nodes: real-time data is constantly changing, which makes caching it unreliable.
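The caching rule described above (cache per-segment results from historical nodes, never cache real-time results) can be sketched as follows; the names are hypothetical and the real Broker uses an LRU cache rather than a plain dict.

```python
# Illustrative sketch of per-segment result caching at the broker.

cache = {}  # segment id -> cached result (real brokers use LRU eviction)

def query_segment(seg):
    # stand-in for forwarding the query to a historical or real-time node
    return f"result({seg})"

def broker_query(segments, realtime_segments):
    results = []
    for seg in segments:
        if seg in realtime_segments:
            # real-time data changes constantly: always recompute, never cache
            results.append(query_segment(seg))
        elif seg in cache:
            results.append(cache[seg])           # cache hit, no recomputation
        else:
            r = query_segment(seg)
            cache[seg] = r                        # cache the historical result
            results.append(r)
    return results

out = broker_query(["s1", "s2"], realtime_segments={"s2"})
print(out)             # → ['result(s1)', 'result(s2)']
print("s1" in cache)   # → True
print("s2" in cache)   # → False: real-time results are never cached
```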

Indexer node

The indexing service is a highly available, distributed service that runs indexing tasks. It creates (and sometimes destroys) Druid's segments. The indexing service has a master/slave architecture.

(figure 5)

The indexing service consists of three main components: the Peon component, which runs a single task; the MiddleManager component, which manages Peons; and the Overlord component, which assigns tasks to MiddleManagers. The Overlord and MiddleManager components may run on the same node or across multiple nodes, while a MiddleManager and its Peons always run on the same node.

ZooKeeper

Druid uses ZooKeeper (ZK) to manage the current cluster state. Actions that occur on ZK are as follows:

1. Coordinator leader election

2. The segment "publish" protocol from historical and real-time nodes

3. The segment load/drop protocol between Coordinator nodes and historical nodes

4. Overlord leader election

5. Indexing service task management

Druid vs other systems

Druid vs Impala/Shark

The comparison between Druid, Impala, and Shark basically boils down to what kind of system each was designed to be.

Druid is designed to:

Services that are always online

Get real-time data

Handle slice-n-dice-style real-time queries

The query speed is different:

Druid uses a column-oriented storage model in which the data is compressed and augmented with index structures. Compression shrinks the data's footprint in RAM, so more data fits in RAM for fast access.

The index structures mean that when you add filters to a query, Druid does less work, so the query runs faster.

Impala/Shark can be thought of as daemon caching layers on top of HDFS.

But they do not go beyond caching to genuinely improve query speed.

The acquisition of data is different:

Druid can get real-time data.

Impala/Shark are based on HDFS or other backing storage, which limits the speed of data ingestion.

The form of the query is different:

Druid supports timeseries and groupBy-style queries, but not joins.

Impala/Shark supports SQL-style queries.
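For reference, a native Druid query is a JSON object posted to the Broker over HTTP. Below is a sketch of a timeseries query built in Python; the datasource name, metric, and interval are made up for illustration, and the broker endpoint is omitted.

```python
import json

# A native Druid timeseries query (illustrative values throughout).
query = {
    "queryType": "timeseries",
    "dataSource": "sample_data",   # hypothetical datasource name
    "granularity": "day",
    "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
    ],
    "intervals": ["2024-01-01/2024-01-08"],
}

# The broker accepts this JSON in the body of an HTTP POST.
body = json.dumps(query)
print(json.loads(body)["queryType"])  # → timeseries
```

This JSON-over-HTTP style is the "timeseries and groupBy" query form referred to above, as opposed to the SQL-style interface of Impala/Shark.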

Druid vs Elasticsearch

Elasticsearch (ES) is a search server based on Apache Lucene. It provides full-text search and access to raw event-level data, and it also supports analytics and aggregation. According to published comparisons, ES uses more resources for ingestion and aggregation than Druid does.

Druid focuses on OLAP workflows. It is optimized for high performance (fast aggregation and ingestion) at low cost, and supports a wide range of analytic operations. Druid provides some basic search support for structured event data.

Druid vs Spark

Spark is a cluster computing framework built around the concept of Resilient Distributed Datasets (RDDs) and can be viewed as a background analysis platform. RDDs enable data reuse by keeping intermediate results in memory, giving Spark fast iterative computation. This is especially beneficial for workflows such as machine learning, where the same operation may be applied again and again until a result converges. Spark gives analysts the ability to run many kinds of queries and algorithms over large amounts of data.

Druid focuses on ingesting data and serving queries over it; once a web interface is built on top, users can browse the data at will.

The above is an introduction to the Druid real-time OLAP data analysis and storage system. If you have similar questions, the analysis above may help you understand them; to learn more, follow the industry information channel.
