
How to analyze Apache Druid

2025-04-19 Update. From: SLTechnology News&Howtos (shulou)

Shulou(Shulou.com)06/01 Report--

This article introduces Apache Druid: what it is, where it is used, and how it works. The content is fairly detailed, and I hope it serves as a useful reference.

Overview

Apache Druid is a high-performance real-time analytics database.

A modern cloud-native, stream-native, analytical database

Druid is designed for workflows where fast queries and fast data ingestion matter. Its strengths include powering interactive UIs, running ad hoc queries on live data, and handling highly concurrent workloads. Druid can be seen as an open source alternative to a data warehouse for a variety of use cases.

Easily integrate with existing data pipelines

Druid can stream data from message buses (such as Kafka, Amazon Kinesis) or bulk load files from data lakes (such as HDFS, Amazon S3, and other similar data sources).

Up to 100 times faster than traditional solutions

In benchmarks, Druid significantly outperforms traditional solutions for both data ingestion and data queries.

Druid's architecture combines the best features of data warehouses, time series databases, and search systems.

Unlock new workflows

Druid unlocks new types of queries and workflows for clickstream, APM (application performance management), supply chain, network telemetry, digital marketing, and other event-driven scenarios. Druid is built for fast ad hoc queries over both real-time and historical data.

Deploy on AWS/GCP/Azure, hybrid clouds, Kubernetes, and bare metal

Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deployment is straightforward: you scale up or down by adding or removing services.

Usage scenarios

Apache Druid is suitable for scenarios with high requirements for real-time data ingestion, high-performance queries, and high availability. Druid is therefore often used to power analytics applications with a rich GUI, or as the backend for highly concurrent APIs that need fast aggregations. Druid works best with event-oriented data.

Common usage scenarios:

Clickstream analytics (web and mobile analytics)

Risk control analysis

Network telemetry analysis (network performance monitoring)

Server metrics storage

Supply chain analysis (manufacturing metrics)

Application performance metrics

Business intelligence / OLAP (online analytical processing)

These usage scenarios are analyzed in detail below:

User activity and behavior

Druid is often used for clickstream, view stream, and activity stream data. Specific scenarios include measuring user engagement, tracking A/B test data for product launches, and understanding usage patterns. Druid can compute user metrics such as distinct counts either exactly or approximately. This means, for example, that a metric like daily active users can be computed approximately in about a second (with an average accuracy of 98%) to view overall trends, or computed exactly to present to stakeholders. Druid can also be used for "funnel analysis", measuring how many users took one action but did not take another. This is useful for tracking user sign-ups.
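To make the difference between exact and approximate user counts concrete, here is a minimal Python sketch. It is not Druid's implementation: it uses a K-minimum-values (KMV) sketch as a simple stand-in for the approximate distinct-count algorithms a real system would use, and all names (`KMVSketch`, `funnel_drop_off`, the sample events) are made up for illustration. It also shows funnel analysis as a plain set difference.

```python
import hashlib
import heapq


class KMVSketch:
    """K-minimum-values sketch: estimates a distinct count in bounded memory."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes: a max-heap over the k smallest values
        self._members = set()  # hashes currently held, used to skip duplicates

    @staticmethod
    def _hash(value):
        # Map a value to a pseudo-random float in [0, 1).
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest one kept: swap them.
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values: exact
        # The k-th smallest hash tells us how densely values fill [0, 1).
        return (self.k - 1) / -self._heap[0]


def funnel_drop_off(events, first_action, second_action):
    """Users who performed first_action but never second_action."""
    did_first = {user for user, action in events if action == first_action}
    did_second = {user for user, action in events if action == second_action}
    return did_first - did_second


# Hypothetical events: 5000 users view the sign-up page, half of them complete it.
events = [(f"user-{i % 5000}", "view_signup") for i in range(20000)]
events += [(f"user-{i}", "complete_signup") for i in range(0, 5000, 2)]

exact_users = len({user for user, _ in events})

sketch = KMVSketch()
for user, _ in events:
    sketch.add(user)
approx_users = sketch.estimate()

dropped = funnel_drop_off(events, "view_signup", "complete_signup")
```

The exact count has to remember every user it has seen, while the sketch keeps at most `k` hashes no matter how many users there are. That trade of a small error for bounded memory is the idea behind fast approximate counts.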

Network flows

Druid is often used to collect and analyze network flow data, slicing and dicing it along arbitrary combinations of attributes. Druid can ingest large volumes of flow records and quickly group and rank across dozens of attributes at query time, which helps network flow analysis. These attributes include core attributes such as IP address and port number, as well as enriched attributes such as geolocation, service, application, device, and ASN. Druid handles flexible schemas, so you can add whatever attributes you want.
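As a rough illustration of grouping flow records by an arbitrary combination of attributes, the following Python sketch sums bytes per attribute combination. The function name `bytes_by`, the field names, and the sample records are assumptions made for this example. Records do not need to share the same set of attributes.

```python
from collections import Counter


def bytes_by(flows, *attributes):
    """Sum bytes per combination of the given attributes, largest first."""
    totals = Counter()
    for flow in flows:
        # Records need not share a schema: missing attributes group under None.
        key = tuple(flow.get(attr) for attr in attributes)
        totals[key] += flow["bytes"]
    return totals.most_common()


flows = [
    {"src_ip": "10.0.0.1", "dst_port": 443, "bytes": 1200, "country": "DE"},
    {"src_ip": "10.0.0.2", "dst_port": 443, "bytes": 800, "country": "US"},
    {"src_ip": "10.0.0.1", "dst_port": 53, "bytes": 100},  # not enriched yet
    {"src_ip": "10.0.0.1", "dst_port": 443, "bytes": 300, "country": "DE", "asn": 64512},
]

by_port = bytes_by(flows, "dst_port")
by_ip_and_country = bytes_by(flows, "src_ip", "country")
```

The attribute list is chosen at query time, so adding a new enriched attribute such as `asn` to some records requires no change to the grouping code.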

Digital marketing

Druid is often used to store and query online advertising data. This data, usually from ad servers, is crucial to measuring and understanding metrics such as campaign effectiveness, click-through rates, conversion rates (consumption rates), and more.

Druid was originally designed to power a user-facing analytics application for ad data. It has seen extensive production use for this kind of data, with many users worldwide storing petabytes of data on thousands of servers.

Application performance management

Druid is often used to track operational data generated by applications. As in the user activity scenario, this data can describe how users interact with the application, or it can consist of metrics reported by the application itself. Druid can be used to drill down into how different components of an application perform, locate bottlenecks, and discover problems.

Unlike many traditional solutions, Druid does not force you to limit the volume, complexity, or throughput of your data. It can quickly analyze application events with thousands of attributes and compute complex metrics on load, performance, and utilization. For example, you can rank API endpoints by 95th percentile query latency, and slice these metrics by any ad hoc set of attributes, such as time of day, user demographics, or data center location.
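The percentile example above can be sketched in a few lines of Python. This is only an illustration: it computes an exact nearest-rank percentile over in-memory data, and the names (`rank_by_p95`, `endpoint`, `datacenter`, `latency_ms`) are hypothetical.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct is an integer 0-100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rank_by_p95(events, **filters):
    """Rank endpoints by p95 latency, keeping only events that match the filters."""
    latencies = defaultdict(list)
    for event in events:
        if all(event.get(key) == value for key, value in filters.items()):
            latencies[event["endpoint"]].append(event["latency_ms"])
    p95 = {endpoint: percentile(vals, 95) for endpoint, vals in latencies.items()}
    return sorted(p95.items(), key=lambda item: item[1], reverse=True)


events = (
    [{"endpoint": "/search", "datacenter": "eu", "latency_ms": ms} for ms in range(1, 101)]
    + [{"endpoint": "/login", "datacenter": "eu", "latency_ms": 20} for _ in range(50)]
    + [{"endpoint": "/login", "datacenter": "us", "latency_ms": 400} for _ in range(50)]
)

overall = rank_by_p95(events)
eu_only = rank_by_p95(events, datacenter="eu")
```

Slicing by `datacenter` changes the ranking: `/login` looks slowest overall, but within the `eu` data center `/search` is the slower endpoint.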

IoT and device metrics

Druid can be used as a time series database solution to store and process server and device metrics. It can ingest machine-generated data in real time and support rapid ad hoc analysis to measure performance, optimize hardware resources, and locate problems.

Unlike many traditional time series databases, Druid is essentially an analytics engine. Druid combines the concepts of time series databases, columnar analytics databases, and search systems. It supports time-based partitioning, columnar storage, and search indexes in a single system. This means time-based queries, numerical aggregations, and search and filter queries are all exceptionally fast.

Your metrics can include millions of unique dimension values, and you can freely group and filter by any combination of dimensions (a dimension in Druid is similar to a tag in a time series database). You can group and rank by tags and compute a variety of complex metrics. You can also search and filter on tags much faster than in a traditional time series database.

OLAP and Business Intelligence

Druid is often used in business intelligence scenarios. Companies deploy Druid to speed up queries and power applications. Unlike SQL-on-Hadoop engines such as Presto or Hive, Druid is designed for high concurrency and sub-second queries, powering interactive data exploration through a UI. This makes Druid better suited to truly interactive visual analysis.

Technology

Apache Druid is an open source distributed data store. Druid's core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system for a wide range of use cases. Druid merges the key characteristics of these three kinds of systems into its ingestion layer, storage format, query layer, and core architecture.


Druid's main features include:

Columnar storage

Druid stores and compresses each column separately and reads only the columns a query needs, which supports fast scans, rankings, and groupBys.

Native search indexes

Druid creates inverted indexes for string values for fast searching and filtering of data.

Streaming and batch ingestion

Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, stream processors, and more.

Flexible data model

Druid elegantly adapts to changing data schemas and nested data types.

Time-optimized partitioning

Druid intelligently partitions data based on time. Time-based queries are therefore significantly faster than in traditional databases.

SQL support

In addition to its native JSON-based query language, Druid also supports SQL over HTTP or JDBC.

Horizontal scalability

Ingestion rates of millions of events per second, massive data retention, and sub-second queries.

Easy to operate

You can scale up or down by adding or removing servers. Druid supports automatic rebalancing and failover.
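Of the features listed above, time-based partitioning is the easiest to illustrate with a small example. The Python sketch below is a simplified model, not Druid's actual segment format: it buckets rows into hourly segments (the hourly granularity is an assumption) and shows how a time-bounded query can skip every segment outside its interval. The names `build_segments` and `query` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def build_segments(rows):
    """Bucket rows into hourly segments keyed by the start of their hour."""
    segments = defaultdict(list)
    for row in rows:
        hour = row["timestamp"].replace(minute=0, second=0, microsecond=0)
        segments[hour].append(row)
    return dict(segments)


def query(segments, start, end):
    """Return rows with start <= timestamp < end, plus the number of segments scanned."""
    scanned, matches = 0, []
    for hour, rows in segments.items():
        # Prune: skip segments whose hour does not overlap [start, end).
        if hour + timedelta(hours=1) <= start or hour >= end:
            continue
        scanned += 1
        matches.extend(r for r in rows if start <= r["timestamp"] < end)
    return matches, scanned


# One row every 10 minutes for a full day: 144 rows in 24 hourly segments.
base = datetime(2024, 1, 1, tzinfo=timezone.utc)
rows = [{"timestamp": base + timedelta(minutes=10 * i), "value": i} for i in range(144)]
segments = build_segments(rows)

matches, scanned = query(
    segments,
    start=base + timedelta(hours=3, minutes=30),
    end=base + timedelta(hours=5),
)
```

Here the query touches only 2 of the 24 segments, which is why restricting a query to a time range is so effective on time-partitioned data.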

Data ingestion

Druid supports both streaming and batch ingestion. Druid typically connects to raw data sources through a message bus such as Kafka (for streaming data) or a distributed file system such as HDFS (for batch data).

Through a process called indexing, Druid converts raw data into segments stored on data nodes. A segment is a data structure optimized for queries.
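For streaming ingestion from Kafka, Druid is configured with a JSON "supervisor spec". The Python sketch below builds such a spec as a dictionary. The field names reflect my understanding of Druid's Kafka ingestion documentation and should be checked against the version you run. The datasource, topic, broker address, and dimension names are placeholders, and the helper `kafka_supervisor_spec` is hypothetical.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Build a minimal Kafka supervisor spec (field names to be verified against the docs)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries an ISO 8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "type": "kafka",
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


spec = kafka_supervisor_spec(
    datasource="clickstream",          # placeholder datasource name
    topic="clicks",                    # placeholder Kafka topic
    bootstrap_servers="localhost:9092",
    dimensions=["user_id", "page"],
)
payload = json.dumps(spec, indent=2)
```

In a typical setup the resulting JSON would be submitted to the cluster's supervisor API, after which Druid starts reading from the topic on its own.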

Data storage

Like most analytical databases, Druid stores data in columns. Druid compresses and encodes each column differently depending on its data type (string, number, etc.). Druid also builds different types of indexes for different column types.
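A small Python sketch can show what column-oriented storage with type-specific encoding means. It is a toy model with made-up data and helper names (`to_columns`, `dictionary_encode`). Dictionary encoding is used here as one common way to encode string columns, not as a description of Druid's exact on-disk format.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def dictionary_encode(values):
    """Replace repeated strings with small integer ids plus a lookup table."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[v] for v in values]


rows = [
    {"country": "US", "page": "/home", "latency_ms": 120},
    {"country": "DE", "page": "/home", "latency_ms": 95},
    {"country": "US", "page": "/cart", "latency_ms": 210},
    {"country": "US", "page": "/home", "latency_ms": 130},
]

columns = to_columns(rows)

# String column: store each distinct value once and keep compact integer ids.
country_dict, country_ids = dictionary_encode(columns["country"])

# Numeric column: an aggregation reads this one column and nothing else.
total_latency = sum(columns["latency_ms"])
```

Summing `latency_ms` never touches the `country` or `page` columns, which is the main reason columnar layouts suit analytical queries.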

Similar to search systems, Druid creates inverted indexes for string columns for faster searching and filtering. Similar to time series databases, Druid intelligently partitions data based on time for faster time-based queries.
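The inverted index idea can be sketched as follows. This is a minimal illustration using Python sets of row ids. Production systems typically use compressed bitmaps instead, and the column data and the `build_inverted_index` helper are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the set of row ids where it appears."""
    index = defaultdict(set)
    for row_id, value in enumerate(column):
        index[value].add(row_id)
    return index


countries = ["US", "DE", "US", "FR", "DE", "US"]
devices = ["mobile", "mobile", "desktop", "mobile", "desktop", "mobile"]

country_index = build_inverted_index(countries)
device_index = build_inverted_index(devices)

# WHERE country = 'US' AND device = 'mobile' becomes a set intersection.
us_mobile = sorted(country_index["US"] & device_index["mobile"])

# WHERE country IN ('DE', 'FR') becomes a set union.
de_or_fr = sorted(country_index["DE"] | country_index["FR"])
```

Filters are resolved by combining the per-value row sets, so the matching rows are known before any row data is read.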

Unlike most traditional systems, Druid can optionally pre-aggregate data at ingestion time. This pre-aggregation step is called rollup, and it can significantly reduce storage costs.
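Rollup is easiest to understand with a small example. The Python sketch below pre-aggregates raw events into one row per minute and dimension combination, keeping a count and a sum. The one-minute granularity, the field names, and the `rollup` function are assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions):
    """Pre-aggregate events per (minute, dimension values) into a count and a byte sum."""
    buckets = defaultdict(lambda: {"count": 0, "bytes": 0})
    for event in events:
        minute = event["timestamp"].replace(second=0, microsecond=0)
        key = (minute,) + tuple(event[d] for d in dimensions)
        buckets[key]["count"] += 1
        buckets[key]["bytes"] += event["bytes"]
    return dict(buckets)


events = [
    {"timestamp": datetime(2024, 1, 1, 12, 0, 5), "page": "/home", "bytes": 100},
    {"timestamp": datetime(2024, 1, 1, 12, 0, 40), "page": "/home", "bytes": 250},
    {"timestamp": datetime(2024, 1, 1, 12, 0, 59), "page": "/cart", "bytes": 80},
    {"timestamp": datetime(2024, 1, 1, 12, 1, 10), "page": "/home", "bytes": 120},
]

rolled = rollup(events, ["page"])
```

Four raw events become three stored rows here. On real event streams with many repeated dimension values the reduction can be much larger, at the cost of no longer being able to query individual events.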

Query

Druid supports both native JSON-over-HTTP queries and SQL. In addition to standard SQL operators, Druid supports a number of unique operators that use its suite of approximation algorithms to perform fast counting, ranking, and quantile calculations.
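As a sketch of what a SQL query over HTTP might look like, the Python code below builds (but does not send) a request for Druid's SQL endpoint. The host and port, the `clickstream` datasource, and the `user_id` column are placeholders, and `build_sql_request` is a hypothetical helper. `APPROX_COUNT_DISTINCT` is one of Druid's approximate aggregation functions.

```python
import json
import urllib.request


def build_sql_request(base_url, sql):
    """Build a POST request carrying a SQL query as a JSON body."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Approximate daily unique users from a hypothetical "clickstream" datasource.
sql = """
SELECT FLOOR(__time TO DAY) AS "day",
       APPROX_COUNT_DISTINCT(user_id) AS daily_users
FROM clickstream
GROUP BY 1
ORDER BY 1
"""

request = build_sql_request("http://localhost:8888", sql)

# To run it against a live cluster:
# with urllib.request.urlopen(request) as response:
#     rows = json.load(response)
```

Because the request is plain HTTP with a JSON body, the same query can be issued from any language or tool that can make a POST request.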

Architecture

Druid has a microservice-based architecture that can be thought of as a database broken down into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed separately or together on commodity hardware.

Druid explicitly names every main service so that operators can tune each one according to its use case and workload. For example, when the workload demands it, operators can allocate more resources to the ingestion service and fewer to the query service.

Druid services can fail independently without affecting the operation of other services.

Operation and maintenance

Druid is designed to be a robust system that runs 24/7. Druid has the following features to keep it running over the long term without data loss.

Data replication

Druid creates multiple copies of data according to the configured replication factor, so a single server failure does not affect queries.

Independent services

Druid explicitly names every main service, and each service can be tuned according to how it is used. Services can fail independently without affecting the normal operation of other services. For example, if the data ingestion service fails, no new data will be loaded into the system, but existing data can still be queried.

Automatic data backup

Druid automatically backs up all indexed data to a file system, which can be a distributed file system such as HDFS. Even if you lose all the data in a Druid cluster, you can quickly reload it from this backup.

Rolling updates

With rolling updates, you can update a Druid cluster without downtime and without end users noticing. Druid releases are backward compatible with the previous version.

That concludes this introduction to Apache Druid. I hope it has been helpful.
