A Quick Introduction to Druid, Real-Time Big Data Analysis Software

2025-04-20 Update. From: SLTechnology News&Howtos (Shulou.com)

What is Druid?

The word Druid refers to the priestly class of the ancient Celts, known largely through Roman-era accounts; in Chinese it is usually transliterated as 德鲁伊.

The Druid introduced in this article is a distributed data storage system (Data Store) that supports real-time analysis. MetaMarkets, an American advertising technology company, created the Druid project in 2011 and open-sourced it in late 2012. Druid was designed for analytics from the start. Compared with traditional OLAP systems, it offers significantly better performance in both data processing scale and real-time data processing, and it embraces the mainstream open source ecosystem, including Hadoop. Druid has been a very active open source project for many years.

The official website of Druid is http://druid.io.

In addition, Alibaba created an open source project that is also called Druid (Ali Druid for short), which is a database connection pool. Ali Druid has nothing to do with the Druid discussed in this article; the two solve completely different problems.

Big Data Analysis and Druid

Big data has been a hot topic in recent years. With the rapid growth of data, the scale of data processing has increased from the GB level to the TB level, and many image applications have begun to deal with PB-level data analysis. The core goal of big data is to improve the competitiveness of the business by finding actionable insights (Actionable Insight). Data analysis is the core technology for this; it covers data collection, processing, modeling, and analysis, and ultimately finds ways to improve the business.

In recent years, with the explosive growth in demand for big data analysis, many companies have migrated their data platforms from commercial relational databases to open source big data platforms, such as Hadoop or Spark, in order to handle larger amounts of data at a controllable hardware and software cost. Hadoop was designed to process big data in batches, and real-time data processing is often its weakness. For example, it is often difficult to estimate how long a MapReduce job will take, which cannot satisfy the many data analysts who expect query results within seconds.

To solve the real-time problem, most companies have gone through a shift toward more real-time, interactive data analysis solutions, which involves introducing new software and improving data flows. Several common approaches to data analysis are shown in the following figure.

The infrastructure for data analysis as a whole usually falls into the following categories.

(1) Batch analysis with MapReduce (MR) on Hadoop/Spark.

(2) Loading the results of Hadoop/Spark into an RDBMS to provide real-time analysis.

(3) Loading the results into a NoSQL store with larger capacity, such as HBase.

(4) Streaming the data source into a stream computing framework, such as Storm, with the results landing in an RDBMS/NoSQL store.

(5) Streaming the data source directly into an analytical database, such as Druid or Vertica.

Three Design Principles of Druid

At the beginning of the design, the developers identified three design principles (Design Principle).

(1) Fast query (Fast Query): partial data aggregation (Partial Aggregate) + in-memory processing (In-Memory) + index (Index).

(2) Horizontal scalability (Horizontal Scalability): distributed data (Distributed Data) + parallel query (Parallelizable Query).

(3) Real-time analysis (Realtime Analytics): immutable past, append-only future (Immutable Past, Append-Only Future).

1 Fast query (Fast Query)

In data analysis scenarios, we usually care only about data aggregated at a certain granularity, not about the details of each row of raw data. The aggregation granularity can be 1 minute, 5 minutes, 1 hour, 1 day, and so on. Partial data aggregation (Partial Aggregate) gives Druid a lot of room for performance optimization.
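The idea of aggregating at a fixed granularity can be illustrated with a minimal Python sketch. This is not Druid's implementation; the event fields (`timestamp`, `publisher`, `country`, `clicks`) and the function names are hypothetical, chosen only to show how many raw rows collapse into fewer pre-aggregated rows.

```python
from collections import defaultdict
from datetime import datetime


def truncate_to_hour(ts):
    """Truncate an ISO-8601 timestamp string to the start of its hour."""
    dt = datetime.fromisoformat(ts)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()


def roll_up(events):
    """Aggregate raw events by (hour, publisher, country)."""
    totals = defaultdict(lambda: {"count": 0, "clicks": 0})
    for e in events:
        key = (truncate_to_hour(e["timestamp"]), e["publisher"], e["country"])
        totals[key]["count"] += 1            # number of raw rows merged
        totals[key]["clicks"] += e["clicks"]  # summed metric
    return dict(totals)


# Hypothetical raw events: four rows in, three aggregated rows out.
raw_events = [
    {"timestamp": "2016-01-01T10:05:00", "publisher": "a.com", "country": "US", "clicks": 1},
    {"timestamp": "2016-01-01T10:40:00", "publisher": "a.com", "country": "US", "clicks": 0},
    {"timestamp": "2016-01-01T10:59:00", "publisher": "b.com", "country": "CN", "clicks": 1},
    {"timestamp": "2016-01-01T11:01:00", "publisher": "a.com", "country": "US", "clicks": 1},
]
rolled = roll_up(raw_events)
for key, metrics in sorted(rolled.items()):
    print(key, metrics)
```

The coarser the granularity and the fewer the distinct dimension combinations, the more rows are merged, at the cost of losing the individual raw events.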

Keeping data in memory is another powerful way to speed up queries. Memory access can be a hundred or more times faster than hard disk access, but memory capacity is very limited, so memory must be used carefully; for example, Druid uses bitmaps (Bitmap) and various compression techniques.

In addition, to support drill-down (Drill-Down) on specific dimensions, Druid maintains inverted indexes. This speeds up Boolean operations such as AND and OR.
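To show why an inverted index makes AND and OR cheap, here is a small sketch that uses plain Python integers as bitmaps. It is an illustration under simplifying assumptions (no compression, hypothetical `country` and `device` dimensions), not a description of Druid's actual bitmap format.

```python
def build_inverted_index(rows, dimension):
    """Map each value of `dimension` to a bitmap (int) whose set bits are row ids."""
    index = {}
    for row_id, row in enumerate(rows):
        value = row[dimension]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def row_ids(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical rows with two dimensions.
rows = [
    {"country": "US", "device": "mobile"},
    {"country": "CN", "device": "desktop"},
    {"country": "US", "device": "desktop"},
    {"country": "CN", "device": "mobile"},
]
by_country = build_inverted_index(rows, "country")
by_device = build_inverted_index(rows, "device")

# Filters become single bitwise operations on the bitmaps.
us_and_mobile = by_country["US"] & by_device["mobile"]
us_or_mobile = by_country["US"] | by_device["mobile"]

print(row_ids(us_and_mobile))
print(row_ids(us_or_mobile))
```

Only the rows whose bits survive the bitwise operation need to be read, so the filter never scans the dimension columns row by row.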

2 Horizontal scalability (Horizontal Scalability)

Druid's query performance depends largely on making optimal use of memory. Data can be distributed across the memory of multiple nodes, so when the data grows, the cluster can be expanded simply by adding machines. To keep the load balanced, Druid partitions the aggregated data by time range. For high-cardinality dimensions, splitting by time alone is sometimes not enough (each Druid Segment holds no more than 20 million rows), so Druid also supports further partitioning of a Segment.
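The two-level partitioning described above, first by time range and then by a row cap, might be sketched as follows. The cap `MAX_ROWS_PER_SEGMENT` is set to a tiny hypothetical value so the example stays readable, and the `(hour, shard)` key is an assumption for illustration rather than Druid's real segment identifier.

```python
from collections import defaultdict
from datetime import datetime

MAX_ROWS_PER_SEGMENT = 2  # hypothetical, deliberately tiny cap for the example


def segment_hour(ts):
    """Return the hourly time bucket an ISO-8601 timestamp falls into."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:00")


def partition(rows):
    """Split rows by hour, then split oversized hours into numbered shards."""
    by_hour = defaultdict(list)
    for row in rows:
        by_hour[segment_hour(row["timestamp"])].append(row)

    segments = {}
    for hour, hour_rows in sorted(by_hour.items()):
        for start in range(0, len(hour_rows), MAX_ROWS_PER_SEGMENT):
            shard = start // MAX_ROWS_PER_SEGMENT
            segments[(hour, shard)] = hour_rows[start:start + MAX_ROWS_PER_SEGMENT]
    return segments


rows = [
    {"timestamp": "2016-01-01T10:05:00", "user": "u1"},
    {"timestamp": "2016-01-01T10:15:00", "user": "u2"},
    {"timestamp": "2016-01-01T10:45:00", "user": "u3"},
    {"timestamp": "2016-01-01T11:10:00", "user": "u4"},
]
segments = partition(rows)
for key, seg_rows in segments.items():
    print(key, len(seg_rows))
```

Because every segment is bounded in both time and size, segments can be spread evenly over nodes and scanned independently.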

Historical Segment data can be stored in a deep storage system, which can be a local disk, HDFS, or a remote cloud service. If some nodes fail, ZooKeeper can be used to coordinate the other nodes to rebuild the data.

Druid's query module is aware of cluster state changes and handles them, so queries always run against a valid cluster topology. Queries on the cluster can be flexibly scaled horizontally. Druid has built-in aggregation operations that are easy to parallelize, such as Count, Mean, and Variance. Some operations that cannot be parallelized, such as Median, are not supported by Druid at this time. For histograms (Histogram), Druid relies on approximate algorithms in order to preserve overall query performance; these approximate methods also include cardinality estimation with HyperLogLog and DataSketches.
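Count, Mean, and Variance parallelize well because each node can compute a small partial result that merges exactly with the others. The sketch below uses the standard pairwise merge formula for mean and variance; the `Partial` class and the per-node value lists are hypothetical and only illustrate the principle, not Druid's aggregator code.

```python
import statistics
from dataclasses import dataclass
from functools import reduce


@dataclass
class Partial:
    """Mergeable summary of a set of values held by one node."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    @classmethod
    def from_values(cls, values):
        n = len(values)
        if n == 0:
            return cls()
        mean = sum(values) / n
        return cls(n, mean, sum((v - mean) ** 2 for v in values))

    def merge(self, other):
        """Combine two partial summaries without revisiting the raw values."""
        n = self.count + other.count
        if n == 0:
            return Partial()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta ** 2 * self.count * other.count / n
        return Partial(n, mean, m2)

    @property
    def variance(self):
        """Population variance; 0.0 for an empty summary."""
        return self.m2 / self.count if self.count else 0.0


# Hypothetical values held by three different nodes.
node_values = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0], [5.0]]
merged = reduce(Partial.merge, (Partial.from_values(v) for v in node_values))

all_values = [v for values in node_values for v in values]
print(merged.count, merged.mean, merged.variance)
print(statistics.fmean(all_values), statistics.pvariance(all_values))
```

An exact median has no such fixed-size summary: the median of per-node medians is generally not the global median, which is why it does not fit this merge pattern.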

3 Real-time analysis (Realtime Analytics)

Druid provides a storage service for time-based data, in which every row is a historical event. It was therefore decided at design time that once an event enters the system, it cannot be changed.

Druid organizes historical data into Segment data files and stores them in a deep storage system, such as a file system or Amazon S3. When the data needs to be queried, Druid loads it from deep storage into memory.
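The "immutable past, append-only future" principle can be pictured with a toy in-memory store. Everything here is hypothetical (the `AppendOnlyStore` class, its `append`, `hand_off`, and `query` methods, and the interval label); it only shows that new events are appended to a buffer, while a finished interval is frozen and can no longer be modified.

```python
from types import MappingProxyType


class AppendOnlyStore:
    """Toy store: events are appended, finished intervals become read-only."""

    def __init__(self):
        self._segments = {}  # interval label -> tuple of read-only events
        self._buffer = []    # events still being ingested

    def append(self, event):
        """Accept a new event; existing events are never updated in place."""
        self._buffer.append(dict(event))

    def hand_off(self, interval):
        """Freeze the buffered events as the segment for `interval`."""
        if interval in self._segments:
            raise ValueError(f"segment {interval!r} already exists and is immutable")
        self._segments[interval] = tuple(MappingProxyType(e) for e in self._buffer)
        self._buffer = []

    def query(self, interval):
        """Return the frozen events of a finished interval."""
        return self._segments[interval]


store = AppendOnlyStore()
store.append({"page": "/home", "clicks": 1})
store.append({"page": "/buy", "clicks": 2})
store.hand_off("2016-01-01T10")

segment = store.query("2016-01-01T10")
print(sum(e["clicks"] for e in segment))
```

Because a frozen segment never changes, it can be copied to deep storage, cached, and read by many nodes without locking.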

Technical Characteristics of Druid

Druid has the following technical features.

High data throughput.

Support for streaming data ingestion.

Flexible and fast queries.

Strong community support.

1 High data throughput

Many companies choose Druid as their analysis platform because of its data throughput. Processing billions to tens of billions of events per day is a very suitable scenario for Druid, and this has been proven in practice by a large number of Internet companies. Many companies therefore choose Druid to cope with exploding data volumes.

2 Support for streaming data ingestion

Much data analysis software trades off throughput against streaming capability. For example, Hadoop favors batch processing, while Storm is a stream computing platform. Not many systems connect directly to a variety of streaming data sources at the analysis platform level.

3 Flexible and fast queries

Data analysts often think in unconstrained ways and want to analyze data from many different angles. To address this, OLAP's Star Schema in effect defines a space in which analysts can explore data freely. When the data volume is small, everything works well; once it grows, an analysis system that cannot return results in seconds draws criticism. Druid supports queries on any combination of dimensions and returns results very quickly, and these two capabilities are the most important strengths of an analysis platform.
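Querying by an arbitrary combination of dimensions can be sketched by building a Druid native groupBy query as a Python dictionary. The top-level fields (`queryType`, `dataSource`, `granularity`, `dimensions`, `aggregations`, `intervals`) follow Druid's JSON query format, but the data source name `ad_events`, the dimension names, and the `clicks` metric are hypothetical, and the helper function is only an illustration.

```python
import json


def build_group_by_query(data_source, dimensions, interval, granularity="hour"):
    """Build a native groupBy query for any chosen set of dimensions."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": granularity,
        "dimensions": list(dimensions),
        "aggregations": [
            # "clicks" is a hypothetical metric column.
            {"type": "longSum", "name": "clicks", "fieldName": "clicks"},
        ],
        "intervals": [interval],
    }


# The same helper serves different dimension combinations.
by_country = build_group_by_query(
    "ad_events", ["country"], "2016-01-01/2016-01-02")
by_country_device = build_group_by_query(
    "ad_events", ["country", "device"], "2016-01-01/2016-01-02", granularity="day")

print(json.dumps(by_country_device, indent=2))
```

In a typical deployment, a JSON body like this would be sent over HTTP to a query node; the analyst changes only the `dimensions` list to look at the data from another angle.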

4 Strong community support

After being open-sourced, Druid was adopted by many Internet companies, including Yahoo, eBay, and Alibaba. Yahoo has five Committers, Google has one, and Alibaba has one. Recently, several of Druid's original inventors from MetaMarkets also founded a new company, Imply.io, to promote the development of the Druid ecosystem and to foster Druid's growth and adoption.

Application Scenarios of Druid

In terms of technical positioning, Druid is a distributed data analysis platform. Functionally it is very similar to a traditional OLAP system, but its implementation makes many deliberate focus decisions and trade-offs. To support larger data volumes, more flexible distributed deployment, and more real-time data ingestion, Druid leaves out the more complex operations of OLAP queries, such as JOIN. Compared with a traditional database, Druid is a kind of time series database: it aggregates data at a chosen time granularity to speed up analytical queries.

In terms of application scenarios, Druid started as an advertising data analysis platform and has since been widely adopted across industries and by many Internet companies. The latest list is available at http://druid.io/druidpowered.html.

Druid's ecosystem is expanding and maturing, and Druid is addressing more and more business scenarios. It is hoped that the book "Principles and Practice of Real-time Big Data Analysis with Druid" can help engineers make better technology choices, understand Druid's functions and principles in depth, and better solve big data analysis problems.

This article is excerpted from "Principles and Practice of Real-time Big Data Analysis with Druid".