How to use Druid component to realize data Statistical Analysis in OLAP

2025-01-17 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article shares how to use the Druid component to implement data statistical analysis in OLAP. The editor finds it quite practical and shares it here for your study; I hope you can get something out of it after reading.

1. Introduction to Druid

Druid is an OLAP engine built on a distributed architecture. It supports data writes with low latency, high-performance data analysis, and excellent data aggregation and real-time query capabilities. It is applied in big data analysis, real-time computing, monitoring, and similar fields, and is an important component of big data infrastructure.

Compared with the currently popular Clickhouse engine, Druid offers relatively good and stable support for high concurrency. Clickhouse has excellent query performance in its task-queue mode, but its support for high concurrency is not friendly enough, so it requires extensive service monitoring and alerting. There are many OLAP engines to choose from among big data components, and the data query layer usually runs two or more of them; choosing the appropriate component to solve the business requirement is the priority.

2. Basic characteristics

Distributed system

In a distributed OLAP data engine, data is distributed across multiple service nodes. When the amount of data increases sharply, capacity can be expanded horizontally by adding nodes, and data is replicated across multiple nodes so that the copies back each other up. If a single node fails, the data can be rebuilt based on the Zookeeper scheduling mechanism. These are the basic features of a distributed OLAP engine; the same strategy was mentioned in the earlier Clickhouse series.
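To make the replication idea concrete, here is a minimal sketch of how segment replicas might be spread over nodes so that a single node failure loses no data. The node names, segment names, and round-robin placement are illustrative assumptions, not Druid's actual balancing algorithm (which the Coordinator handles with more sophisticated cost-based rules).

```python
# Sketch: spreading segment replicas across nodes so that losing any
# one node still leaves a full copy of every segment.
# Node/segment names and the round-robin policy are illustrative.

def assign_replicas(segments, nodes, replication=2):
    """Place each segment on `replication` distinct nodes, round-robin."""
    assignment = {}
    for i, seg in enumerate(segments):
        assignment[seg] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return assignment

nodes = ["hop01", "hop02", "hop03"]
segments = ["seg-2023-01-01", "seg-2023-01-02", "seg-2023-01-03"]
placement = assign_replicas(segments, nodes)

# Every segment lives on two different nodes.
for seg, holders in placement.items():
    assert len(set(holders)) == 2
```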

Aggregate query

Druid mainly provides low-latency writes and fast aggregate queries over time-series data, with the write characteristics of a time-series database. Druid pre-aggregates data at write time, which reduces the volume of raw data, saves storage space, and improves query efficiency. The aggregation granularity can follow a chosen strategy, such as minute, hour, or day. It must be emphasized that Druid is suited to data analysis scenarios and is not suitable for point queries on a single data primary key.
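The pre-aggregation (roll-up) described above can be sketched in a few lines: raw events are collapsed into one row per (time bucket, dimension) at a chosen granularity. The event fields below are illustrative, not Druid's ingestion format.

```python
from collections import defaultdict
from datetime import datetime

# Sketch of Druid-style roll-up: raw (timestamp, page, count) events are
# pre-aggregated at write time to a chosen granularity (here: minute),
# shrinking storage and speeding up later aggregate queries.

def rollup(events, granularity="minute"):
    """Aggregate events into (bucketed_time, page) -> summed count."""
    buckets = defaultdict(int)
    for ts, page, count in events:
        t = datetime.fromisoformat(ts)
        if granularity == "minute":
            t = t.replace(second=0, microsecond=0)
        elif granularity == "hour":
            t = t.replace(minute=0, second=0, microsecond=0)
        buckets[(t.isoformat(), page)] += count
    return dict(buckets)

events = [
    ("2023-01-01T10:15:02", "Main_Page", 1),
    ("2023-01-01T10:15:40", "Main_Page", 1),
    ("2023-01-01T10:16:05", "Main_Page", 1),
]
rolled = rollup(events)
# Three raw rows collapse into two pre-aggregated rows.
assert len(rolled) == 2
```

Changing the granularity to "hour" would collapse all three events into a single row, trading detail for even less storage.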

Column storage

Druid is column-oriented and can run large-scale parallel queries across a cluster. Because only the columns needed by a particular query have to be loaded, query speed can be greatly improved.
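A tiny sketch of why column orientation helps: an aggregate over one column scans a single contiguous array instead of touching every field of every row. The table contents are made up for illustration.

```python
# Row store vs. column store for a single-column aggregate.

rows = [
    {"page": "A", "user": "u1", "edits": 3},
    {"page": "B", "user": "u2", "edits": 5},
    {"page": "A", "user": "u3", "edits": 2},
]

# Row layout: SUM(edits) still walks whole row objects.
row_total = sum(r["edits"] for r in rows)

# Column layout: the same data pivoted into per-column arrays;
# SUM(edits) scans only the one array it needs.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_total = sum(columns["edits"])

assert row_total == col_total
```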

3. Infrastructure

Overlord node

Namely Overlord-Node, the task management node. This process monitors the MiddleManager processes and is the controller of data ingestion into Druid, responsible for assigning ingestion tasks to the MiddleManagers and coordinating Segment publishing.

Coordination node

Coordinator-Node, which is mainly responsible for data management and distribution on the Historical nodes. The coordinator tells Historical nodes to load new data, drop expired data, replicate data, and move data for load balancing.

Intermediate management node

MiddleManager-Node, which ingests real-time data and generates Segment data files; it can be understood as the worker node of the Overlord node.

Historical node

Historical-Node, which is mainly responsible for storing and querying historical data; it receives load and drop instructions from the Coordinator node. Historical nodes are the core of the whole cluster's query performance, because they handle most of the segment queries.

Query node

Broker-Node, which acts as the query router for Historical and real-time nodes. It receives client query requests and forwards them to Historicals and MiddleManagers; when Brokers receive the results of these subqueries, they merge them and return the result to the caller.
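The Broker's scatter/gather role can be sketched as follows: partial results come back from the nodes that own the relevant segments, and the Broker merges them into one answer. The per-node result shape (a dict of group-by counts) is an illustrative simplification.

```python
# Sketch of the Broker merging partial GROUP BY results from the
# Historical nodes (older segments) and MiddleManagers (in-flight data).

def broker_merge(partial_results):
    """Merge per-node {group_key: count} maps into one result set."""
    merged = {}
    for node_result in partial_results:
        for key, count in node_result.items():
            merged[key] = merged.get(key, 0) + count
    return merged

historical_part = {"Main_Page": 10, "Druid": 4}  # from Historical nodes
realtime_part = {"Main_Page": 2}                 # from a MiddleManager
result = broker_merge([historical_part, realtime_part])
assert result == {"Main_Page": 12, "Druid": 4}
```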

Data file repository

DeepStorage, which stores the generated Segment data files.

Meta database

MetadataStorage, which stores metadata information about the Druid cluster, such as information related to Segment.

Coordination middleware

Zookeeper, which provides coordination services for the Druid cluster, such as internal service discovery, coordination, and leader election.

2. Druid deployment

1. Installation package

Imply packages Druid together with a complete solution from deployment to configuration to various visualization tools.

https://static.imply.io/release/imply-2.7.10.tar.gz

Extract and rename.

[root@hop01 opt]# tar -zxvf imply-2.7.10.tar.gz
[root@hop01 opt]# mv imply-2.7.10 imply2.7

2. Zookeeper configuration

Configure the nodes of the Zookeeper cluster, separated by commas.

[root@hop01 _common]# cd /opt/imply2.7/conf/druid/_common
[root@hop01 _common]# vim common.runtime.properties
druid.zk.service.host=hop01:2181,hop02:2181,hop03:2181

Turn off the built-in Zookeeper so that it does not start, since the external cluster is used instead.

[root@hop01 supervise]# cd /opt/imply2.7/conf/supervise
[root@hop01 supervise]# vim quickstart.conf

Comment out the following:

3. Service startup

Start the related components in turn: Zookeeper first, then the Hadoop-related components, and finally the imply service.

[root@hop01 imply2.7]# /opt/imply2.7/bin/supervise -c /opt/imply2.7/conf/supervise/quickstart.conf

Pay attention to the virtual machine's memory. The JVM configuration for each Druid component lives in the directory below; if resources do not allow, lower the JVM memory parameters appropriately, and raise them when resources permit.

[root@hop01 druid]# cd /opt/imply2.7/conf/druid
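For reference, each component directory under that path contains a plain-text jvm.config of JVM flags. A minimal sketch for a small test VM might look like the following; the values are illustrative for constrained hardware, not recommendations:

```
-server
-Xms512m
-Xmx512m
-XX:MaxDirectMemorySize=1g
-Duser.timezone=UTC
-Dfile.encoding=UTF-8
```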

After startup, the default port is 9095, and the access interface is as follows:

3. Basic usage

1. Data source configuration

Choose the Http loading method above, based on the JSON test file provided by imply.

https://static.imply.io/data/wikipedia.json.gz

2. Data online loading

Perform the above: Sample and continue.

Sample data loading configuration:

Configuration of data columns:

General overview of configuration items:

Finally, you can perform the data loading task.

3. Local sample load

[root@hop01 imply2.7]# bin/post-index-task --file quickstart/wikipedia-index.json

With this, both sample data sets have been loaded.

4. Data cube

After the data is loaded, view the visual data cube:

Some basic view analysis is provided in the data cube, which can split datasets and analyze data in multiple dimensions:

5. SQL query

You can run SQL queries against Druid from the visualization tool, and the syntax is almost the same as the common rules:

SELECT COUNT(*) AS Edits FROM wikipedia;
SELECT * FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP 'start' AND TIMESTAMP 'end';
SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page LIMIT 2;
SELECT * FROM wikipedia ORDER BY __time DESC LIMIT 5;
SELECT * FROM wikipedia LIMIT 3;
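Besides the visualization tool, Druid also exposes SQL over an HTTP endpoint (POST to /druid/v2/sql on the Broker). A small sketch of building such a request follows; the broker host/port is an assumption for this cluster, and the commented-out send step requires a running cluster.

```python
import json

# Sketch: preparing a query for Druid's SQL-over-HTTP API.
# The broker address below is assumed for this example cluster.
DRUID_SQL_URL = "http://hop01:8082/druid/v2/sql"

def build_sql_request(sql):
    """Build the JSON body Druid's SQL endpoint expects."""
    return json.dumps({"query": sql})

body = build_sql_request('SELECT COUNT(*) AS "Edits" FROM wikipedia')

# Send with any HTTP client against a live cluster, e.g.:
#   requests.post(DRUID_SQL_URL, data=body,
#                 headers={"Content-Type": "application/json"})
```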

6. Segment file

File location:

/opt/imply2.7/var/druid/segments/wikipedia/

Druid partitions data based on Segments: data is distributed in time order, and data from different time ranges is stored in different Segment blocks. When querying by time range, this largely avoids the cost of a full data scan; at the same time, column-oriented compressed storage improves analysis efficiency.
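The time-range pruning described above can be sketched as an interval lookup: each segment carries a time interval, and a query only scans segments whose interval overlaps the requested range. Day-granularity segments and the segment names are illustrative.

```python
from datetime import date

# Sketch of segment pruning: segments are keyed by [start, end) time
# intervals, and a time-range query scans only overlapping segments.

segments = {
    "wikipedia_2023-01-01": (date(2023, 1, 1), date(2023, 1, 2)),
    "wikipedia_2023-01-02": (date(2023, 1, 2), date(2023, 1, 3)),
    "wikipedia_2023-01-03": (date(2023, 1, 3), date(2023, 1, 4)),
}

def segments_for_range(start, end):
    """Return the segments whose interval overlaps [start, end)."""
    return sorted(name for name, (s, e) in segments.items()
                  if s < end and e > start)

# A one-day query touches one segment instead of the whole table.
assert segments_for_range(date(2023, 1, 2), date(2023, 1, 3)) == \
    ["wikipedia_2023-01-02"]
```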

The above is how to use Druid components in OLAP to achieve data statistical analysis. The editor believes these are knowledge points you may see or use in daily work, and hopes you can learn more from this article.

