How Kudu uses Bloom filters to optimize joins and filters

2025-04-11 Update From: SLTechnology News&Howtos


Today I will talk about how Kudu uses Bloom filters to optimize joins and filtering, a topic that is not widely understood. To make it easier to follow, the key points are summarized below; I hope you take something away from this article.

Introduction

In database systems, one of the most effective ways to improve performance is to avoid unnecessary work, such as network transfers and reading data from disk. One way Apache Kudu achieves this is by supporting column predicates in its scanners. Pushing column predicates down to Kudu speeds up execution by skipping the column values of filtered-out rows and by reducing network IO between Kudu and clients such as the distributed query engine Apache Impala. For more information, see the documentation on runtime filtering in Impala.

CDP Runtime 7.1.5 and CDP Public Cloud added support for push-down of Bloom filter column predicates in Kudu and related integration in Impala.

Bloom filter (Bloom Filter)

A Bloom filter is a space-efficient probabilistic data structure used to test set membership, with the possibility of false positive matches. In database systems, Bloom filters are used to determine whether a block of data can be skipped entirely when only a subset of records is needed. See the Wikipedia page for more details.
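
The idea can be sketched in a few lines of Python. This is a minimal classic Bloom filter for illustration only (Kudu's real filter is a block-based C++ implementation, described next); the class name, sizes, and hash scheme here are all illustrative choices, not Kudu's.

```python
import hashlib

class BloomFilter:
    """A minimal classic Bloom filter: k bit positions per key, no deletes."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k positions from two 64-bit halves of a SHA-256 digest
        # (Kirsch-Mitzenmacher double hashing).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def insert(self, key: str) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: str) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(key))

bf = BloomFilter(num_bits=8192, num_hashes=4)
for k in ("alice", "bob", "carol"):
    bf.insert(k)
```

Note the asymmetry that makes this useful for skipping data: `may_contain` can return a false positive, but never a false negative, so a `False` answer is always safe to act on.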

The implementation used in Kudu is a cache-, hash- and space-efficient block-based Bloom filter from "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al. It derives from Impala's implementation and has been further enhanced. Block-based Bloom filters are designed to fit in the CPU cache, and lookups and insertions use SIMD operations via AVX2, where available, for efficiency.
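
The block-based idea can be shown in miniature: one hash selects a small fixed-size block (sized to a cache line), and every probe bit for a key falls inside that block, so a lookup touches a single cache line instead of scattering reads across the whole filter. This is an illustrative Python sketch; the block size, hash count, and class names are assumptions, and the SIMD aspect is not modeled.

```python
import hashlib

BLOCK_BITS = 512  # one 64-byte cache line per block

class BlockedBloomFilter:
    """All bits for a key live in one cache-line-sized block."""

    def __init__(self, num_blocks: int, num_hashes: int = 8):
        self.num_blocks = num_blocks
        self.num_hashes = num_hashes
        self.blocks = [0] * num_blocks  # each int holds BLOCK_BITS bits

    def _block_and_bits(self, key: str):
        digest = hashlib.sha256(key.encode()).digest()
        # First hash picks the block; the rest pick bits inside that block.
        block = int.from_bytes(digest[:8], "little") % self.num_blocks
        h1 = int.from_bytes(digest[8:16], "little")
        h2 = int.from_bytes(digest[16:24], "little") | 1
        bits = [(h1 + i * h2) % BLOCK_BITS for i in range(self.num_hashes)]
        return block, bits

    def insert(self, key: str) -> None:
        block, bits = self._block_and_bits(key)
        for b in bits:
            self.blocks[block] |= 1 << b

    def may_contain(self, key: str) -> bool:
        block, bits = self._block_and_bits(key)
        return all(self.blocks[block] >> b & 1 for b in bits)
```

The trade-off is a slightly higher false positive rate than a classic filter of the same size, in exchange for one memory access per lookup, which is also what makes the bit tests amenable to SIMD.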

Consider a broadcast hash join between a small table and a large table when no predicates are available. It usually involves the following steps:

Read the entire small table and construct a hash table from it.

Broadcast the resulting hash table to all worker nodes.

On each worker node, fetch and iterate over slices of the large table, check whether each key from the large table exists in the hash table, and return only the matching rows.

Step 3 is the heaviest, because it involves reading the entire large table and, if the worker and the node hosting the large table's data are not on the same server, it can involve heavy network IO.
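
The steps above, and the role a pushed-down Bloom filter plays in step 3, can be sketched as follows. The table contents, helper names, and the toy filter (a big integer used as a bitset) are all made up for illustration; in the real system the filter is evaluated inside Kudu's scan, so rejected rows never cross the network.

```python
def build_hash_table(small_table):
    # Step 1: read the small table and build a hash table keyed on the join key.
    return {row["key"]: row for row in small_table}

NUM_BITS = 1 << 16
SEEDS = (0x9E3779B9, 0x85EBCA6B)

def simple_bloom(keys):
    # Toy Bloom filter over the small table's keys, broadcast alongside
    # the hash table in step 2.
    bits = 0
    for k in keys:
        for seed in SEEDS:
            bits |= 1 << (hash((seed, k)) % NUM_BITS)
    return bits

def may_contain(bits, key):
    return all(bits >> (hash((seed, key)) % NUM_BITS) & 1 for seed in SEEDS)

def scan_large_table(large_table, bloom_bits):
    # Filter push-down: rows failing the Bloom check are dropped at the scan,
    # before any network transfer or hash-table probe.
    for row in large_table:
        if may_contain(bloom_bits, row["key"]):
            yield row

def broadcast_hash_join(small_table, large_table):
    ht = build_hash_table(small_table)
    bloom = simple_bloom(ht)
    # Step 3: probe only the rows that survived the Bloom filter; the exact
    # hash-table check removes any false positives.
    return [(row, ht[row["key"]])
            for row in scan_large_table(large_table, bloom)
            if row["key"] in ht]

small = [{"key": k, "name": f"n{k}"} for k in (1, 5, 9)]
large = [{"key": k, "val": k * 10} for k in range(100)]
joined = broadcast_hash_join(small, large)
```

Because the hash table still performs the exact membership check, Bloom filter false positives cost only some wasted transfer, never wrong join results.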

Prior to 7.1.5, Impala only supported pushing minimum/maximum (MIN_MAX) runtime filters down to Kudu, filtering out values outside the specified range. Starting with CDP 7.1.5, Impala also supports pushing runtime Bloom filters down to Kudu, in addition to the MIN_MAX filters. With the newly introduced Bloom filter predicate support in Kudu, Impala can perform considerably more efficient joins on data stored in Kudu.

Performance

For the scenario above, we ran an Impala query that joins a large table stored in Kudu with a small table stored in Parquet on HDFS. The small table was kept in Parquet on HDFS to isolate the new functionality, but it could equally have been stored in Kudu. We first ran the query with only the MIN_MAX filter, then with both the MIN_MAX and Bloom filters (all runtime filters). For comparison, we created the same large table in Parquet on HDFS. Parquet on HDFS is a good baseline because Impala already supports both MIN_MAX and Bloom filters there.

Set up

The following tests were performed on a 6-node cluster with CDP runtime 7.1.5.

Hardware configuration:

Dell PowerEdge R430; 20-core/40-thread Xeon E5-2630 v4 @ 2.2 GHz; 128 GB RAM; 4 × 2 TB hard disks, one for the WAL and three for data directories.

Schema:

The large table consists of 260 million rows of randomly generated data, hash-partitioned on the primary key across 20 partitions in Kudu. The Kudu table was explicitly rebalanced after loading to ensure a balanced layout.

The small table consists of 2000 rows: the first 1000 and the last 1000 keys of the large table, stored in Parquet on HDFS. This prevents the MIN_MAX filter from filtering anything on the large table, because every row falls within the filter's range.
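
Why this schema defeats the MIN_MAX filter is easy to show in miniature: the small table's keys span the full key range of the large table, so a [min, max] range check passes every row, while a set-membership (Bloom-style) check still rejects everything in the middle. The sizes below are scaled to the article's 260-million-row table, but the keys themselves are made up.

```python
# Large table keys 1..260,000,000 scaled down by 1000 for the sketch.
large_keys = range(1, 260_001)

# Small table: first 1000 keys plus last 1000 keys of the large table.
small_keys = set(range(1, 1001)) | set(range(259_001, 260_001))

# MIN_MAX filter: the range covers the whole table, so nothing is filtered.
lo, hi = min(small_keys), max(small_keys)
passed_min_max = [k for k in large_keys if lo <= k <= hi]

# Membership filter (what a Bloom filter approximates): only true matches pass.
passed_membership = [k for k in large_keys if k in small_keys]
```

A Bloom filter built over `small_keys` would pass slightly more than the exact membership check (false positives), but still a tiny fraction of what the range check lets through.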

COMPUTE STATS was run on all tables to gather table metadata and help Impala optimize its query plans.

All queries were run 10 times; the average query runtimes are reported below.

Join query

For join queries, the Bloom filter predicate push-down improved Kudu's performance by three to five times. We expect even larger multiples with bigger data sizes and more selective queries.

Kudu now performs about 17-33% better than Parquet on HDFS.

Update query

For an update query that essentially inserts the entire small table into the existing large table, we saw a 15x improvement, mainly from faster selection of the rows to update.

For more information about the schema of the table, the loading process, and the query running, see the reference section below.

TPC-H

We also ran the TPC-H benchmark on a single-node cluster with a scale factor of 30, and the performance improved by 19% to 31% under different block cache capacity settings.

Kudu automatically disables Bloom filter predicates that are not filtering data effectively, to avoid a performance penalty from the new feature. During feature development, query 9 of the TPC-H benchmark (TPC-H Q9) showed a 50-96% regression, with the time spent scanning rows from Kudu increasing by up to three times. Investigating the regression, we found that fewer than 10% of the rows were being filtered out by the pushed-down Bloom filter predicate, so the extra CPU spent evaluating it in Kudu outweighed its benefit. To fix this, we added a heuristic in Kudu: if a Bloom filter predicate does not filter out a sufficient fraction of rows, it is automatically disabled for the remainder of the scan.
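
The shape of that heuristic can be sketched as follows. The threshold, window size, and function names are made up for illustration; Kudu's actual cut-off and sampling strategy may differ.

```python
def scan_with_adaptive_bloom(rows, bloom_may_contain,
                             min_reject_rate=0.1, window=4096):
    """Yield rows that pass the Bloom predicate, but stop evaluating the
    predicate for the rest of the scan if, over a sample window, it rejects
    too small a fraction of rows to be worth its CPU cost."""
    scanned = rejected = 0
    bloom_enabled = True
    for row in rows:
        if bloom_enabled:
            scanned += 1
            rejected_this = not bloom_may_contain(row)
            if rejected_this:
                rejected += 1
            if scanned >= window:
                # Filter isn't earning its keep: disable it from here on.
                if rejected / scanned < min_reject_rate:
                    bloom_enabled = False
                scanned = rejected = 0
            if rejected_this:
                continue
        yield row
```

Once disabled, the scan returns every remaining row and relies on the exact check upstream (the join's hash-table probe), which is always performed anyway; disabling the filter can never produce wrong results, only more transferred rows.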

Functional availability

This feature is enabled by default for users querying Kudu through Impala, starting with CDP 7.1.5 and in CDP Public Cloud. We strongly recommend upgrading to get this and the many other performance enhancements in the release. For custom applications that use the Kudu client API directly, the Kudu C++ client also supports Bloom filter predicates as of CDP 7.1.5. Bloom filter predicate support in the Kudu Java client is not yet available; it is tracked as KUDU-3221.

By now you should have a clearer picture of how Kudu uses Bloom filters to optimize joins and filtering.
