Understanding Impala Metadata
This article explains how Impala metadata works. The topic is quite practical, and I hope you get something out of it.
Background
Impala is a high-performance OLAP query engine. Unlike other SQL-on-Hadoop ROLAP solutions such as Presto and SparkSQL, Impala caches metadata (Metadata/Catalog), so it does not depend on external systems (such as Hive, HDFS, Kudu) when generating query plans and can produce a plan in milliseconds. Caching metadata also greatly reduces the load on the master nodes of the underlying systems (Hive Metastore, HDFS NameNode, Kudu Master).
However, there are two sides to everything, and metadata caching introduces a great deal of complexity into Impala's system design. On the one hand, to keep the cache correct, two Impala-specific SQL statements were introduced, INVALIDATE METADATA and REFRESH, along with the query option SYNC_DDL; all of these require users to take part in maintaining the cache. On the other hand, the architecture contains many complex designs related to metadata, such as incremental propagation of metadata and maintenance of cache consistency.
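As a quick illustration (the database, table, and partition names below are placeholders), these cache-maintenance statements are used roughly like this:

-- Reload the file and partition metadata of one table, e.g. after data
-- was written by Hive or directly to HDFS:
REFRESH my_db.my_table;
-- Reload just one partition, which is much cheaper on large tables:
REFRESH my_db.my_table PARTITION (dt='2020-01-01');

-- Discard all cached metadata of a table; it is reloaded on first use.
-- Needed when tables are created or altered outside of Impala:
INVALIDATE METADATA my_db.my_table;

-- Make DDL statements return only after the change has been broadcast
-- to all coordinators:
SET SYNC_DDL=1;
CREATE TABLE my_db.new_table (id INT);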
When the warehouse is large, Impala's metadata design exposes many problems: loading and refreshing the metadata of a large table takes very long; metadata broadcasts can be blocked by DDL, causing large broadcast delays; the metadata cache can drive nodes into full GC or OOM; and so on. The metadata design has therefore kept evolving, and the latest progress centers on the fetch-on-demand coordinator (also known as local catalog mode, catalog-v2, etc.).
Introduction to Impala Server
An Impala cluster consists of one Catalog Server (Catalogd), one Statestore Server (Statestored), and several Impala Daemons (Impalad). Catalogd is mainly responsible for loading metadata and executing DDL, Statestored is mainly responsible for broadcasting messages and metadata, and Impalad is mainly responsible for receiving and executing queries.
Impalad can be configured in three modes: coordinator only, executor only, or both coordinator and executor (the default). An Impalad in the Coordinator role receives, plans, and schedules queries, while an Impalad in the Executor role reads and computes the data. By default, every Impalad is both a Coordinator and an Executor. For production, separating the roles is recommended, i.e. each Impalad is either a Coordinator or an Executor, with a ratio of around 1:50. Please refer to the official documentation [1] for details.
The composition of Impala metadata
Impala's metadata is cached in Catalogd and in every Coordinator-role Impalad. The cache in Catalogd is the most up-to-date, and each Coordinator holds a copy of the metadata in Catalogd: metadata is fetched from the external systems by Catalogd and propagated to each Coordinator through Statestored.
The metadata cache is implemented mainly in Java, and the bulk of the code lives in the FE (frontend). There is also some C++ code, which mainly handles the interaction between FE and BE as well as metadata broadcasting. In the code, the management logic shared by Catalogd and the Coordinators (Impalad) is extracted into Catalog.java; the Catalogd-side implementation is CatalogServiceCatalog.java, and the Coordinator-side implementation is ImpaladCatalog.java.
The Catalog is a hierarchical structure. The first level maps database names to Db objects, and under each Db there are a function map and a table map:
Catalog
|-- dbCache_ = Map<String, Db>
      |-- functions_ = Map<String, List<Function>>
      |-- tableCache_ = CatalogObjectCache<Table>
The value in the functions_ map is a list of Function objects, representing the different overloads of a function name. tableCache_ is maintained by a CatalogObjectCache, which wraps a ConcurrentHashMap and adds version-management logic, such as preventing a lower-version update from overwriting a higher-version cache entry and tracking the version numbers of all cached objects. This version logic is particularly important in Impalad; we will cover it in detail in a later article.
Table has six concrete subclasses in the code: HdfsTable, KuduTable, HBaseTable, View, IncompleteTable, and DataSourceTable. The first four are straightforward; the last two deserve an explanation:
IncompleteTable represents a table or view whose metadata has not been loaded. To reduce startup time, Catalogd loads only the names of all tables at startup, and each table is represented by an IncompleteTable. When INVALIDATE METADATA is executed on a table, its metadata is dropped as well, and it reverts to an IncompleteTable. An IncompleteTable may actually be a view, but that cannot be determined before the metadata is loaded. This is why, in visual interfaces such as HUE, you will often see a view displayed with a table icon; once the view has been used, it shows the view icon again.
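To see the IncompleteTable life cycle in action, a small sketch (the table name is a placeholder):

-- Drops the cached metadata; the catalog entry reverts to an IncompleteTable:
INVALIDATE METADATA my_db.my_table;
-- The first statement that touches the table makes Catalogd load its
-- metadata before the plan can be generated, so it runs noticeably slower:
SELECT COUNT(*) FROM my_db.my_table;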
DataSourceTable is Impala's external data source mechanism. It is not mentioned in any documentation because it has always been experimental. Its intent is to let users plug in custom external data sources through a Java interface; only the prepare, open, getNext, and close methods need to be implemented. For more information, see EchoDataSource and AllTypesDataSource in the code.
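As a rough sketch of how it is wired up (the jar path, class name, and schema here are hypothetical, and since the feature is experimental the DDL may differ between versions):

-- Register a Java implementation of the data source interface:
CREATE DATA SOURCE my_source
LOCATION 'hdfs:///user/impala/ext-data-source.jar'
CLASS 'com.example.MyDataSource'
API_VERSION 'V1';

-- Create a table whose rows are produced by that data source:
CREATE TABLE my_db.ext_tbl (id INT, val STRING)
PRODUCED BY DATA SOURCE my_source('init-string');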
Next, we focus on the metadata composition of the first three.
HdfsTable
HdfsTable represents a Hive table stored on HDFS. The metadata of an unpartitioned table is relatively simple, so we take a partitioned table as the example; its main metadata objects are described below.
Here msTable and msPartition are the objects returned by the HMS API:

org.apache.hadoop.hive.metastore.api.Table
org.apache.hadoop.hive.metastore.api.Partition
HdfsPartition represents the metadata of one partition, and a large part of it is information about HDFS files and blocks. The FileDescriptor and BlockDescriptor objects are generated by extracting data from the FileStatus and BlockLocation objects returned by the HDFS API. To save space, the actual cache does not hold FileDescriptor and FileBlock directly: IMPALA-4029 introduced FlatBuffers to compress them. The advantage of FlatBuffers is that, unlike protobuf or thrift, no serialization or deserialization is needed; the object contents can be accessed directly, while still providing a certain compression ratio. For more about FlatBuffers, see reference [2] at the end of the article.
Another large part of HdfsPartition is the incremental statistics, cached as bytes compressed with the deflate algorithm; see PartitionStatsUtil#partStatsFromCompressedBytes() for details. Decompressed, it is a TPartitionStats object, which mainly contains the statistics of each column in the partition, each represented by a TIntermediateColumnStats:
struct TIntermediateColumnStats {
  1: optional binary intermediate_ndv  // intermediate result of the NDV HLL computation
  2: optional bool is_ndv_encoded      // whether the HLL intermediate result is RLE-compressed
  3: optional i64 num_nulls            // number of NULLs in this column in this partition
  4: optional i32 max_width            // maximum width of this column in this partition
  5: optional double avg_width         // average width of this column in this partition
  6: optional i64 num_rows             // number of rows in this partition, used to aggregate the HLL intermediate results
}
For Impala's implementation of ndv(), refer to the logic of HllInit(), HllUpdate(), HllMerge(), and HllFinalEstimate() in be/src/exprs/aggregate-functions-ir.cc. The intermediate result of the NDV computation is represented by a string of length 1024, which is generally RLE (Run Length Encoding) compressed for transmission.
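For intuition (table and column names are placeholders), ndv() exposes the same HLL-based estimate described above, while COUNT(DISTINCT ...) is exact but more expensive:

-- HLL-based estimate, the same machinery used when collecting stats:
SELECT NDV(user_id) FROM my_db.events;
-- Exact distinct count, typically much more expensive:
SELECT COUNT(DISTINCT user_id) FROM my_db.events;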
Impala's statistics are limited by Hive (because they are stored in the Hive Metastore); for value-typed (numeric) columns there are currently no statistics such as maximum, minimum, or average values. There is an old JIRA for this, IMPALA-2416, which has seen no progress so far.
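What is available can be checked with SHOW COLUMN STATS; the output sketched below is from memory and the exact columns vary by version, but note the absence of min/max values:

SHOW COLUMN STATS my_db.events;
-- Typical result columns: Column, Type, #Distinct Values, #Nulls,
-- Max Size, Avg Size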
After all these forms of compression, the in-memory size of the metadata of a partitioned HDFS table is roughly:

number of partitions * 2048 + number of partitions * number of columns * 400 + number of files * 500 + number of blocks * 150 (bytes)
In practice, reducing the metadata size of a large table means looking for savings in the number of partitions, columns, files, and blocks. The numbers 2048, 400, 500, and 150 are estimates of the compressed size of each kind of object. The "number of partitions * number of columns * 400" term is the size of the incremental statistics; if the table's statistics are non-incremental, i.e. always collected with COMPUTE STATS, this term disappears. In practice, however, it is rare to run COMPUTE STATS directly on a large table because it can take very long; COMPUTE INCREMENTAL STATS is generally used instead, so this part of the memory footprint cannot be ignored.
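To make the formula concrete with made-up numbers: a table with 10,000 partitions, 100 columns, 100,000 files, and 300,000 blocks would need roughly 10,000 * 2048 + 10,000 * 100 * 400 + 100,000 * 500 + 300,000 * 150, i.e. about 20 MB + 400 MB + 50 MB + 45 MB, or around 515 MB of cached metadata, with the incremental statistics clearly dominating. The statements involved (table name and partition spec are placeholders):

-- Full statistics: scans the whole table and replaces any incremental stats:
COMPUTE STATS my_db.big_table;
-- Incremental statistics: computes and caches stats per partition:
COMPUTE INCREMENTAL STATS my_db.big_table PARTITION (dt='2020-01-01');
-- Drop the incremental stats of one partition:
DROP INCREMENTAL STATS my_db.big_table PARTITION (dt='2020-01-01');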
KuduTable
KuduTable represents a Hive table stored in Kudu. The Kudu metadata cached by Impala is quite limited:
msTable: the Table object returned by the HMS API, i.e. the metadata kept in Hive
tableStats: the statistics stored in HMS, mainly the per-column statistics and the row count of the whole table
kuduTableName: the actual table name in Kudu, which can differ from the table name in Hive
kuduMasters: the master addresses of the Kudu cluster
primaryKeyColumnNames: the primary key columns of the Kudu table
partitions: the partition information of the Kudu table
kuduSchema: the Schema information returned by the Kudu API
As for the partition information, only the partition columns and the number of hash partitions are cached, not the boundaries of each range partition. So when you use SHOW CREATE TABLE, the range-partition information you see contains only the column names. In the following example, the individual ranges in the "PARTITION BY RANGE (id)" part are elided:
Query: show create table functional_kudu.dimtbl
+------------------------------------------------------------------------------+
| result                                                                       |
+------------------------------------------------------------------------------+
| CREATE TABLE functional_kudu.dimtbl (                                        |
|   id BIGINT NOT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION, |
|   name STRING NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,   |
|   zip INT NULL ENCODING AUTO_ENCODING COMPRESSION DEFAULT_COMPRESSION,       |
|   PRIMARY KEY (id)                                                           |
| )                                                                            |
| PARTITION BY RANGE (id) (...)                                                |
| STORED AS KUDU                                                               |
| TBLPROPERTIES ('STATS_GENERATED'='TASK', 'impala.lastComputeStatsTime'='1573922577', 'kudu.master_addresses'='localhost', 'numRows'='10') |
+------------------------------------------------------------------------------+
If you need to see the actual range partitions, you still have to use the SHOW RANGE PARTITIONS statement. Impala fetches the result from Kudu and returns it, but does not cache the range information:
Query: show range partitions functional_kudu.dimtbl
+-----------------------+
| RANGE (id)            |
+-----------------------+
| VALUES < 1004         |
| 1004 <= VALUES < 1008 |
| VALUES >= 1008        |
+-----------------------+
Fetched 3 row(s) in 0.07s
Personally, I think there is still a lot of worthwhile work here, such as caching the range-partition boundaries, which could be used to optimize INSERT statements and improve bulk-import performance into Kudu (IMPALA-7751). More detailed information, such as the replica locations of each Kudu tablet and the addresses of the Kudu tservers, is not cached either; with that information a lot of optimizations would become possible. Contributions are welcome!
HBaseTable
Impala's support for HBase began as compatibility with Hive (Hive can read HBase data), but it is now in maintenance mode and the community no longer invests in it. On the one hand, Kudu is the more suitable replacement for HBase in OLAP scenarios; on the other hand, Impala is not suited to high-concurrency DML workloads.
HBaseTable represents a Hive table stored in HBase. It caches the table definition and basic statistics (such as the row count) kept in HMS, as well as the column family names of the underlying HBase table.
Summary
Impala caches the metadata of external systems (Hive, HDFS, Kudu, etc.) so that the query-plan generation phase no longer needs to interact with them. Metadata is uniformly fetched from the external systems by Catalogd and broadcast to all Coordinators through Statestored.
The metadata that is cached is exactly what query-plan generation needs:
Table: the schema (table name, column names, column types, partition columns, etc.) and column statistics
HdfsTable: the partition directories, file paths, block and replica locations, and per-partition incremental statistics
KuduTable: the partition columns and partition types (hash, range)
HBaseTable: the column family names
View: the underlying query statement