2025-01-23 | SLTechnology News&Howtos | Shulou (Shulou.com)
This article gives a detailed introduction to Apache Druid.
Apache Druid is a high-performance real-time analytical database.
Overview
A modern cloud-native, stream-native analytical database
Druid is designed for workflows that require fast queries and fast data ingestion. It offers a powerful UI, runtime-operable queries, and high-performance concurrent processing, and can be seen as an open-source alternative to a data warehouse for a wide range of use cases.
Easy integration with existing data pipelines
Druid can ingest streaming data from a message bus (such as Kafka or Amazon Kinesis), or batch-load files from a data lake (such as HDFS, Amazon S3, and similar data sources).
100x faster than traditional solutions
Druid's benchmark results for data ingestion and query performance significantly exceed those of traditional solutions.
Druid's architecture combines the best features of data warehouses, time series databases, and search systems.
Unlock new workflows
Druid unlocks new query methods and workflows for clickstream, APM (application performance management), supply chain, network telemetry, digital marketing, and other event-driven scenarios. Druid is built for fast, ad-hoc queries over both real-time and historical data.
Deploy on AWS/GCP/Azure, hybrid clouds, Kubernetes, or leased servers
Druid can be deployed in any *NIX environment, whether on-premises or in the cloud. Deploying Druid is easy: scale up or down simply by adding or removing services.
Use cases
Apache Druid suits scenarios that demand real-time data ingestion, high-performance queries, and high availability. Druid is therefore often used to power feature-rich analytics GUIs, or as the backend for highly concurrent APIs that need fast aggregations. Druid works best with event-oriented data.
More common usage scenarios:
Clickstream analysis (web and mobile analysis)
Risk control analysis
Network telemetry analysis (network performance monitoring)
Server metrics storage
Supply chain analysis (manufacturing indicators)
Application performance index
Business intelligence / real-time OLAP
These usage scenarios are analyzed in detail below:
User activity and behavior
Druid is often used for clickstream, access-stream, and activity-stream data. Specific scenarios include measuring user engagement, tracking A/B test data for product releases, and understanding how users behave. Druid can compute user metrics such as count-distinct both exactly and approximately. This means, for example, that a daily-active-user count can be approximated within one second (with roughly 98% average accuracy) to see the overall trend, or computed exactly for reporting to stakeholders. Druid can also be used for funnel analysis, measuring how many users performed one action but not another. This is very useful for tracking user sign-up conversion.
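The "did step A but not step B" funnel question can be illustrated with plain set arithmetic. This is a toy sketch over made-up events, not Druid's actual mechanism (Druid typically answers such questions with approximate set sketches):

```python
# Minimal funnel-analysis sketch (hypothetical event data): count users
# who reached step A but never completed step B.
events = [
    {"user": "alice", "action": "signup_page"},
    {"user": "alice", "action": "signup_done"},
    {"user": "bob",   "action": "signup_page"},
    {"user": "carol", "action": "signup_page"},
]

step_a = {e["user"] for e in events if e["action"] == "signup_page"}
step_b = {e["user"] for e in events if e["action"] == "signup_done"}

dropped_off = step_a - step_b  # reached A but never completed B
print(sorted(dropped_off))     # ['bob', 'carol']
```

At Druid scale the two sets are too large to materialize exactly, which is why approximate set operations are attractive for this workload.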
Network flow
Druid is often used to collect and analyze network flow data, managing flows that are sliced and aggregated along arbitrary attributes. Druid can ingest large volumes of flow records and, at query time, quickly group and rank across dozens of attributes, which helps with network flow analysis. These attributes include core fields such as IP address and port number, as well as enrichments such as geographic location, service, application, device, and ASN. Druid handles non-fixed schemas, meaning you can add any attributes you want.
Digital marketing
Druid is often used to store and query online advertising data. This data usually comes from ad-serving providers and is essential for measuring and understanding campaign effectiveness, click-through rates, conversion rates, and so on.
Druid was originally designed to power a user-facing analytics application for advertising data. Druid has seen extensive production use for storing advertising data, and users around the world have stored petabytes of data on thousands of servers.
Application performance management
Druid is often used to track operational data generated by applications. Similar to the user-activity scenario, this data can describe how users interact with the application, or it can be metrics reported by the application itself. Druid can be used to drill down into how different components of the application perform, locate bottlenecks, and troubleshoot problems.
Unlike many traditional solutions, Druid offers a smaller storage footprint, lower complexity, and higher throughput. It can quickly analyze application events with thousands of attributes and compute complex load, performance, and utilization metrics, such as the 95th-percentile query latency of an API endpoint. Data can be organized and sliced by any ad-hoc attribute, such as day, user profile, or data center location.
Internet of Things and device metrics
Druid can be used as a time series database solution to store and process metric data from servers and devices. It collects machine-generated data in real time and enables fast, ad-hoc analysis to measure performance, optimize hardware usage, and locate problems.
Unlike many traditional time series databases, Druid is at its core an analytics engine. Druid combines the ideas of time series databases, columnar analytic databases, and search systems, supporting time-based partitioning, columnar storage, and search indexes in a single system. This means time-based queries, numeric aggregations, and search-and-filter queries are all very fast.
Your metrics can include millions of unique dimension values, and you can freely group and filter by any dimension (dimensions in Druid are similar to tags in a time series database). You can compute a large number of complex metrics grouped and ranked by tag, and you can search and filter on tags faster than in traditional time series databases.
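Grouping metrics by a tag dimension, as described above, amounts to the following query shape. This is a toy sketch over made-up metric rows (Druid executes this over columnar segments; the code only illustrates the semantics):

```python
from collections import defaultdict

# Hypothetical metric rows: per-host CPU readings tagged with a
# data-center dimension ("dc"), analogous to tags in a TSDB.
rows = [
    {"host": "web1", "dc": "us-east", "cpu": 0.72},
    {"host": "web2", "dc": "us-east", "cpu": 0.41},
    {"host": "web3", "dc": "eu-west", "cpu": 0.90},
]

# Equivalent of: SELECT dc, AVG(cpu) FROM metrics GROUP BY dc
sums = defaultdict(lambda: [0.0, 0])
for r in rows:
    sums[r["dc"]][0] += r["cpu"]
    sums[r["dc"]][1] += 1

avg_cpu = {dc: round(total / n, 3) for dc, (total, n) in sums.items()}
print(avg_cpu)  # {'us-east': 0.565, 'eu-west': 0.9}
```

In Druid the same question would be a groupBy query (or SQL GROUP BY) over the "dc" dimension, with the aggregation pushed down to the storage layer.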
OLAP and Business Intelligence
Druid is often used in business intelligence scenarios. Companies deploy Druid to speed up queries and power analytics applications. Unlike Hadoop-based SQL engines such as Presto or Hive, Druid is designed for high concurrency and sub-second queries, enhancing interactive data exploration through a UI. This makes Druid better suited to truly interactive visual analytics.
Technology
Apache Druid is an open-source distributed data store. Druid's core design combines ideas from OLAP/analytic databases, time series databases, and search systems to create a unified system for a wide range of use cases. Druid integrates the key features of these three types of systems into its ingestion layer, storage format, query layer, and core architecture.
The main features of Druid include:
Columnar storage
Druid stores and compresses each column separately, reading only the columns a given query needs, and supports fast scans, rankings, and groupBy operations.
Native search indexes
Druid creates inverted indexes on string values for fast search and filtering.
Streaming and batch ingestion
Out-of-the-box connectors for Apache Kafka, HDFS, AWS S3, and stream processors.
Flexible data model
Druid adapts gracefully to changing data schemas and nested data types.
Optimal time-based partitioning
Druid intelligently partitions data by time, making time-based queries significantly faster than in traditional databases.
SQL support
In addition to its native JSON-based queries, Druid supports SQL over both HTTP and JDBC.
Horizontal scalability
Ingestion of millions of events per second, retention of massive datasets, and sub-second queries.
Easy operation
Scale up or down by adding or removing servers; Druid automatically rebalances and handles failover.
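The time-based partitioning listed above can be sketched as follows: data is stored in segments keyed by time chunk, and a query touches only the segments whose interval overlaps the query's. A toy in-memory illustration with made-up rows, not Druid's storage format:

```python
from datetime import date

# Toy segment map: one segment per day (illustrative only).
segments = {
    date(2023, 1, 1): ["event-a", "event-b"],
    date(2023, 1, 2): ["event-c"],
    date(2023, 1, 3): ["event-d", "event-e"],
}

def query(start, end):
    # Pruning: scan only segments whose day falls inside [start, end).
    return [row for day, rows in sorted(segments.items())
            if start <= day < end for row in rows]

print(query(date(2023, 1, 2), date(2023, 1, 3)))  # ['event-c']
```

Because segments outside the interval are never read, the cost of a time-bounded query scales with the interval's size rather than the table's.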
Data ingestion
Druid supports both streaming and batch ingestion. Druid typically connects to raw data sources through a message bus such as Kafka (for streaming data) or a distributed file system such as HDFS (for batch data).
Through an indexing process, Druid stores the raw data on data nodes as segments, a query-optimized data structure.
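A streaming ingestion job is configured by submitting a JSON spec to the Overlord's supervisor API. The following is an abbreviated sketch: the datasource and topic names are placeholders, and a real spec carries more fields (for example, inputFormat and tuningConfig):

```python
import json

# Hypothetical, abbreviated Kafka ingestion spec; "clicks" and the
# bootstrap address are made-up placeholders.
spec = {
    "type": "kafka",
    "spec": {
        "dataSchema": {
            "dataSource": "clicks",
            "timestampSpec": {"column": "ts", "format": "iso"},
            "dimensionsSpec": {"dimensions": ["user", "page"]},
        },
        "ioConfig": {
            "topic": "clicks",
            "consumerProperties": {"bootstrap.servers": "localhost:9092"},
        },
    },
}

# This payload would be POSTed to the Overlord, e.g.
# http://<overlord>:8081/druid/indexer/v1/supervisor
print(json.dumps(spec)[:60])
```

Once the supervisor is running, Druid indexes incoming Kafka messages into segments continuously, making new events queryable within seconds.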
Data storage
Like most analytical databases, Druid uses columnar storage. Druid applies different compression and encoding methods to each column depending on its data type (string, number, etc.), and builds different types of indexes for different column types.
Like a search system, Druid creates inverted indexes on string columns for fast search and filtering. Like a time series database, Druid intelligently partitions data by time for fast time-based queries.
Unlike most traditional systems, Druid can pre-aggregate data as it is ingested. This pre-aggregation, called rollup, yields significant storage savings.
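Rollup can be sketched as grouping raw events by (truncated timestamp, dimensions) at ingest time, storing only the aggregated rows. A toy illustration with made-up events, not Druid's implementation:

```python
from collections import defaultdict

# Raw events: (timestamp, page, count). Rollup at hour granularity
# collapses rows that share the same (hour, page) key.
raw = [
    ("2023-01-01T10:05", "home",  1),
    ("2023-01-01T10:42", "home",  1),
    ("2023-01-01T10:50", "about", 1),
    ("2023-01-01T11:03", "home",  1),
]

rolled = defaultdict(int)
for ts, page, count in raw:
    hour = ts[:13]              # truncate timestamp to hour granularity
    rolled[(hour, page)] += count

print(len(raw), "->", len(rolled))  # 4 -> 3 stored rows
```

The savings grow with event volume: millions of raw events per hour can collapse into one stored row per unique dimension combination, at the cost of losing the individual raw records.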
Query
Druid supports both JSON-over-HTTP and SQL queries. Beyond standard SQL operations, Druid supports a set of unique operations, using its bundled algorithm suites for fast counting, ranking, and quantile computation.
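The two query styles can be sketched side by side as request payloads. Nothing is sent here; the datasource name "clicks" is a placeholder, and the payloads are simplified:

```python
import json

# Native JSON query, normally POSTed to a broker at /druid/v2
native_query = {
    "queryType": "timeseries",
    "dataSource": "clicks",
    "granularity": "hour",
    "intervals": ["2023-01-01/2023-01-02"],
    "aggregations": [{"type": "count", "name": "events"}],
}

# Equivalent SQL query, normally POSTed to /druid/v2/sql
sql_query = {
    "query": "SELECT FLOOR(__time TO HOUR) AS hr, COUNT(*) AS events "
             "FROM clicks GROUP BY 1",
}

print(json.dumps(native_query)[:50])
```

Druid's SQL layer plans queries into native queries internally, so the two forms above express the same computation.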
Architecture
Druid has a microservice architecture and can be thought of as a database disassembled into multiple services. Each of Druid's core services (ingestion, querying, and coordination) can be deployed separately or together on commodity hardware.
Druid names each service explicitly so that operators can tune each one according to its usage and load. For example, under heavy ingestion load, operators can give more resources to the ingestion service and take resources away from the query service.
Each Druid service can fail independently without affecting the operation of the others.
Operation and maintenance
Druid is designed as a robust system that runs around the clock. The following features ensure long-term operation without data loss.
Data replication
Druid creates multiple copies of the data according to the configured replication factor, so a single-node failure does not affect queries.
Independent service
Druid names each main service explicitly, and each service can be tuned according to its usage. Services can fail independently without affecting the normal operation of the others. For example, if the ingestion service fails, no new data is loaded into the system, but existing data can still be queried.
Automatic data backup
Druid automatically backs up all indexed data to a filesystem, which can be distributed (such as HDFS). You can lose an entire Druid cluster's data and quickly reload it from this backup.
Rolling updates
With rolling updates, you can upgrade a Druid cluster without downtime, imperceptibly to users. All Druid versions are backward compatible.