How to analyze Impala

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article shows you how to analyze Impala. The content is fairly detailed; interested readers can use it for reference, and I hope it is helpful to you.

Impala introduction

Impala is a distributed, massively parallel processing (MPP) database engine, composed mainly of the CLI, Statestore, Catalog, and Impalad components, as shown in the following figure.

[Figure: Impala architecture diagram]

Introduction to the functions of each component:

Impalad: runs on each DataNode, represented by the impalad process. It receives query requests from clients: the impalad that receives a request acts as the Coordinator, parses the SQL statement, generates the query plan tree, and distributes plan fragments through the scheduler to the other impalads that hold the relevant data. Those impalads read and write data, execute the query fragments in parallel, and stream their results back over the network to the Coordinator, which returns the final result to the client. Each impalad also maintains a connection to the Statestore to determine which impalads are healthy and able to accept new work, and receives broadcast messages from the catalogd service.
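The coordinator/executor flow described above can be sketched as a toy simulation. This is purely illustrative and not Impala's real implementation; the names `run_fragment` and `coordinate` are invented for this sketch, and the network streaming is replaced by plain function calls.

```python
# Conceptual sketch of the Impalad query flow (illustrative only):
# a coordinator fans plan fragments out to executors that hold the
# relevant data, then merges the partial results it gets back.

def run_fragment(fragment_rows):
    """Executor: scan its local rows, apply the filter, return matches."""
    return [r for r in fragment_rows if r["amount"] > 100]

def coordinate(data_by_node):
    """Coordinator: send the fragment to every node, merge the results."""
    partials = [run_fragment(rows) for rows in data_by_node.values()]
    return [row for part in partials for row in part]

# Each "node" holds a different slice of the table.
data_by_node = {
    "node1": [{"amount": 50}, {"amount": 150}],
    "node2": [{"amount": 250}],
}
print(len(coordinate(data_by_node)))  # 2 rows survive the filter
```

In real Impala the fragments run in parallel on separate hosts and stream rows back over the network; the point here is only the fan-out/merge shape of the plan.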

Statestore: tracks the health status and location of each impalad in the cluster. When an impalad starts, it registers with the Statestore and maintains a heartbeat connection, and each impalad caches a copy of the Statestore's information. So even if the Statestore dies, Impala can still serve queries, but the status of each impalad can no longer be updated.

Catalog: synchronizes metadata changes made by Impala statements to all nodes in the cluster. It solves two problems that existed before version 1.2:

1) after CREATE DATABASE, DROP DATABASE, CREATE TABLE, ALTER TABLE, or DROP TABLE statements, you had to execute INVALIDATE METADATA on other nodes before querying the changed objects there;

2) after an INSERT statement ran on one node, you had to execute REFRESH on the other nodes before they could see the new data. Note that after operations performed through Hive, you still need to execute REFRESH or INVALIDATE METADATA before querying on an Impala node.
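The rule of thumb above can be condensed into a small decision helper. This is an illustrative mnemonic, not part of Impala; the function name and change labels are invented, and the returned strings are the actual Impala SQL statements you would run in impala-shell.

```python
# Rule-of-thumb helper (illustrative, not Impala code): which metadata
# statement to run in impala-shell after a change made OUTSIDE Impala
# (e.g. through Hive or by loading files directly into HDFS).
def metadata_statement(change, table):
    if change == "new_table":   # table created or altered outside Impala
        return f"INVALIDATE METADATA {table};"
    if change == "new_data":    # rows added outside Impala
        return f"REFRESH {table};"
    return None                 # changes made through Impala itself are
                                # broadcast to all nodes by catalogd

print(metadata_statement("new_data", "sales"))   # REFRESH sales;
print(metadata_statement("new_table", "sales"))  # INVALIDATE METADATA sales;
```

REFRESH is the cheaper of the two (it reloads file metadata for one table); INVALIDATE METADATA discards and reloads the full metadata for the object.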

CLI: the command-line tools users query with, including the shell (an Impala shell implemented in Python) as well as the JDBC and ODBC interfaces.

2. Impala applicable scenarios

1) Impala is ideal for interactive SQL exploration and analysis on large datasets.

2) The business logic is relatively simple, similar to that of a relational database.

3) Ad-hoc queries are Impala's main application scenario, where it is more efficient than Hive.

4) The query result set must be small enough to fit in memory; otherwise the query will not execute successfully.

3. Relationship between Impala and Hive

Both Impala and Hive are data query tools built on Hadoop, with different emphases, but from the client's point of view they have a lot in common: table metadata, ODBC/JDBC drivers, SQL syntax, flexible file formats, storage resource pools, and so on. The relationship between Impala and Hive in Hadoop is shown in the following figure.

Hive is suited to long-running batch query analysis, while Impala is suited to real-time interactive SQL queries; it gives data analysts a big-data tool for quickly testing and verifying ideas. A common pattern is to first use Hive for data transformation, then use Impala for rapid analysis on the dataset that Hive produces.

Most HiveQL statements can be run directly under Impala; HiveQL features that Impala does not support include:

Impala only supports simple data types and does not support maps, arrays, or structs.

Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes (as of Impala 1.2).

Impala does not perform implicit conversions between string and numeric or Boolean types, nor between numeric or string types and timestamps.

XML and JSON functions

Some HiveQL aggregate functions: variance, var_pop, var_samp, stddev_pop, stddev_samp, covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set. Impala supports these aggregate functions: MAX(), MIN(), SUM(), AVG(), COUNT().

And so on.

[Figure: Impala and Hive in the Hadoop ecosystem]

4. Why Impala is more efficient than MapReduce

Impala differs from Hive and Pig in that it uses its own daemons to execute queries distributed across the cluster. Because Impala does not depend on MapReduce, it avoids the startup overhead of MapReduce jobs and can return results in real time.

Impala does not store intermediate results on disk; it makes maximal use of memory and streams results promptly over the network.

Impala avoids MapReduce startup costs. For interactive queries, MapReduce startup time becomes very noticeable; Impala runs as a service and has virtually no startup time.

Impala can distribute query plans more naturally, without forcing them into a pipeline of map and reduce jobs. This lets Impala process multiple stages of a query in parallel and avoid overheads such as sort and shuffle when they are unnecessary.

Impala generates code at run time. It uses LLVM to generate native machine code for the query being executed, so individual queries do not pay the overhead of running on a system that must be able to execute arbitrary queries.

Impala exploits the latest hardware instructions where possible, using the SSE4.2 instruction set, which provides significant speedups in some cases.

Impala uses better I/O scheduling. It knows the location of blocks on disk and can schedule the order of block processing to keep all disks busy. Impala also supports direct block reads and native-code checksum computation.

Impala is designed for performance. A lot of time went into performance-oriented fundamentals such as tight inner loops, inlined function calls, minimal branching, better cache usage, and minimal memory usage.

Best performance is achieved by choosing an appropriate data storage format (Impala supports multiple storage formats).

5. Memory dependence

Although Impala is not an in-memory database, the impalad daemon allocates a large amount of physical memory when processing large tables and large result sets, and if the memory required to process intermediate result sets on a node exceeds the memory available to Impala on that node, the query is cancelled.

The memory required for Impala operations depends on several factors:

The file format of the table. The same data stored in different file formats yields different numbers of data files, and depending on the compression and encoding used, different amounts of temporary memory may be needed to decompress each file for analysis.

Whether it is a SELECT or an INSERT operation. For example, querying Parquet tables requires relatively little memory because Impala reads and decompresses data in 8 MB blocks. Inserting into a Parquet table is memory-intensive because the data for each data file (up to 1 GB) is held in memory until it is encoded, compressed, and written to disk.
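The Parquet insert cost above lends itself to a back-of-envelope estimate. This is a deliberately simplified sketch: the function name is invented, and it assumes one in-flight data-file buffer per partition being written at the 1 GB size the text mentions (real block sizes are configurable and have changed across versions).

```python
# Rough estimate (illustrative sketch) of writer-side memory for an
# INSERT into a partitioned Parquet table: one data-file buffer, up to
# ~1 GB, is held in memory per partition being written until it is
# encoded, compressed, and flushed to disk.
GB = 1024 ** 3

def parquet_insert_memory_bytes(partitions_written, block_size=1 * GB):
    return partitions_written * block_size

# Writing 20 partitions at once can demand ~20 GB on the writing node.
print(parquet_insert_memory_bytes(20) // GB)  # 20
```

This is why inserting into many partitions in one statement is a common cause of out-of-memory cancellations, while reads stay cheap at 8 MB per block.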

Whether the table is partitioned, and whether queries against partitioned tables can benefit from partition pruning.

Impala requires that any query containing an ORDER BY clause also contain a LIMIT clause. The coordinator node needs enough memory to perform the sort; the memory actually required is roughly the space occupied by LIMIT rows multiplied by the number of cluster nodes.
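That sizing rule can be made concrete with a small calculation. The function below is an illustrative sketch, not an Impala API; it assumes each node returns up to LIMIT rows to the coordinator, and the average row size is a number you supply.

```python
# Rough sketch of coordinator sort memory for ORDER BY ... LIMIT:
# each of the N nodes can send up to LIMIT rows, so the coordinator
# holds about LIMIT * N rows while merging. avg_row_bytes is assumed.
def order_by_limit_memory_bytes(limit, num_nodes, avg_row_bytes):
    return limit * num_nodes * avg_row_bytes

# LIMIT 1000 on a 50-node cluster with ~2 KB rows -> ~100 MB to merge.
print(order_by_limit_memory_bytes(1000, 50, 2048))  # 102400000
```

The estimate grows linearly with both LIMIT and cluster size, which is why large LIMIT values on wide rows are a coordinator-memory hazard.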

The size of the result set. When intermediate result sets are transferred between nodes, the amount of data depends on the number of columns the query returns. Avoid SELECT *; query only the columns you need.

The amount of memory used is not directly related to the size of the input dataset. For aggregation, memory usage is related to the number of rows after grouping. For joins, it is related to the combined size of all tables except the largest one; Impala can use a join strategy that splits a large join table across the nodes rather than transferring the entire table to each node.


GROUP BY on a unique or high-cardinality column. Impala allocates a handler structure for each distinct value in a GROUP BY query, so a very large number of distinct GROUP BY values may exceed the memory limit.
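A toy model of hash aggregation shows why cardinality, not input size, drives this cost. The code is an illustrative sketch (not Impala internals): it keeps one dictionary entry per distinct key, mirroring the per-value handler structures described above.

```python
# Toy model of hash aggregation: memory holds one state entry per
# DISTINCT group key, so it scales with cardinality, not row count.
def aggregate_count(rows, key):
    groups = {}                                  # one entry per distinct value
    for row in rows:
        groups[row[key]] = groups.get(row[key], 0) + 1
    return groups

low = [{"user": f"u{i % 3}"} for i in range(1000)]   # 1000 rows, 3 keys
high = [{"user": f"u{i}"} for i in range(1000)]      # 1000 rows, 1000 keys

print(len(aggregate_count(low, "user")))   # 3 group states in memory
print(len(aggregate_count(high, "user")))  # 1000 group states in memory
```

Both inputs have 1000 rows, but the high-cardinality one needs over 300 times as many in-memory group states, which is exactly the failure mode the paragraph warns about.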

Queries involving very wide tables with thousands of columns, especially tables with many STRING columns. Because Impala allows STRING values up to 32 KB, the intermediate result sets of such queries may require large memory allocations.

Impala uses tcmalloc to allocate memory, a memory allocator optimized for high concurrency. Once Impala allocates memory, it keeps it reserved for future queries, so it is normal for Impala to show high memory usage while idle. If Impala detects that it is about to exceed its memory limit (set by the -mem_limit startup option or the MEM_LIMIT query option), it frees all memory not needed by the current queries.
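To make the -mem_limit / MEM_LIMIT setting concrete, here is a small hypothetical helper (not Impala code) that converts limit strings like "2g" or "512m" into byte counts, in the spirit of how such options are commonly specified.

```python
# Hypothetical helper (illustrative, not part of Impala): convert a
# memory-limit string such as "2g" or "512m" into a byte count.
def parse_mem_limit(value):
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    value = value.strip().lower()
    if value and value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)                 # bare number: already in bytes

print(parse_mem_limit("2g"))    # 2147483648
print(parse_mem_limit("512m"))  # 536870912
```

Keeping per-query limits well under the node's physical memory leaves headroom for tcmalloc's reserved-but-idle pages described above.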

That concludes this look at how to analyze Impala. I hope the content above is of some help to you and helps you learn more. If you found the article useful, feel free to share it so more people can see it.
