What are the components and architecture of Impala

This article focuses on the components and architecture of Impala. The material presented here is simple, fast, and practical, so interested readers may wish to take a look and follow along.

I. Overview

1.1 Introduction

Impala is a query system developed by Cloudera that can run fast, interactive SQL queries against data stored in HDFS, HBase, and S3. Impala and Hive share the same storage systems, the same metastore, the same SQL syntax (Hive SQL), the same ODBC driver, and the same user interface (Hue), so Impala provides a unified platform for both real-time and batch-oriented queries. In terms of performance, Impala can be up to 30 times faster than Hive.
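As a quick illustration of that shared SQL interface, here is a minimal sketch that connects to an impalad and runs a query over a table defined in the shared metastore. It uses the impyla Python client, which is only one of several possible clients (the article itself mentions JDBC, ODBC, Hue, and impala-shell); the host, port, and table names are placeholders, not values from the article.

```python
# Minimal sketch using the impyla client (one possible client; JDBC/ODBC work similarly).
# Host, port, and table names are placeholders.
from impala.dbapi import connect

# Connect to any impalad in the cluster; 21050 is the commonly used HiveServer2-protocol port.
conn = connect(host="impalad-host.example.com", port=21050)
cursor = conn.cursor()

# The table could have been created by Hive; Impala sees it through the shared metastore.
cursor.execute(
    "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)

cursor.close()
conn.close()
```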

Impala complements the existing tools for querying big data; it is not a replacement for batch frameworks built on top of MapReduce, such as Hive. Hive and other MapReduce-based frameworks remain well suited to long-running batch jobs, such as ETL workloads.

1.2 Advantages

Impala runs distributed queries over large data sets in a cluster environment, which makes it easy to scale out on cheap commodity hardware and to share data between different analysis engines: for example, writing data with Pig, transforming it with Hive, and querying it with Impala. Impala can read and write Hive tables, so data produced by Hive can be analyzed directly with Impala, giving a single system for big data processing and analysis and avoiding costly remodeling and ETL steps.

1.3 Main features

- Supports the most common SQL-92 features of the Hive Query Language (HiveQL), including SELECT, JOIN, and aggregate functions
- Supports HDFS, HBase, and S3 storage, including the HDFS file formats delimited text, Parquet, Avro, SequenceFile, and RCFile, and the compression codecs Snappy, GZIP, Deflate, and BZIP (a short sketch after this list shows Parquet and Snappy in use)
- Provides common data access interfaces, including a JDBC driver and an ODBC driver
- Provides the impala-shell command-line interface
- Supports Kerberos authentication
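To make the storage-format support concrete, here is a hedged sketch (again using the impyla client, with placeholder table names) that creates a Parquet table compressed with Snappy from an existing text-format table.

```python
# Hedged sketch: create a Parquet table with Snappy compression, populated from an
# existing text-format table. Host and table names are placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cursor = conn.cursor()

# Choose the compression codec for subsequent writes (Snappy is one of the codecs listed above).
cursor.execute("SET COMPRESSION_CODEC=snappy")

# Parquet is one of the supported HDFS file formats; CREATE TABLE ... AS SELECT copies the data.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS web_logs_parquet STORED AS PARQUET AS SELECT * FROM web_logs"
)

cursor.close()
conn.close()
```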

II. Impala Architecture

To avoid the latency introduced by MapReduce, Impala bypasses it entirely and uses a distributed query engine similar to those found in commercial parallel relational databases. It interacts directly with HDFS and HBase, which makes it faster than Hive.

The Impala server is a distributed, massively parallel processing (MPP) database engine consisting of different daemons that run on specific hosts in the cluster. These daemons are described below.

2.1 Impala Daemon

This process runs as a daemon on every DataNode in the cluster and is the core component of Impala; on each node the process is named impalad. It is responsible for reading and writing data, accepting query requests from impala-shell, Hue, JDBC, or ODBC, executing query fragments in parallel with the other nodes in the cluster, and returning this node's partial results to the central coordinator node.

We can submit a query to the impalad process running on any DataNode, and the node that receives the query then acts as its "coordinator". The other nodes send their partial results to the coordinator node, which returns the final result. When using the impala-shell command for functional testing, it is convenient to always connect to the impalad on the same node. For a production Impala cluster, however, the load on each node must be taken into account, and it is recommended to submit queries through the JDBC/ODBC interfaces in round-robin fashion across the different impalad processes.
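Here is a hedged sketch of that round-robin idea, again with the impyla client and placeholder host names; in practice many deployments instead put a load balancer in front of the impalad hosts, but the principle is the same.

```python
# Hedged sketch: rotate the coordinator role across several impalad hosts so that no
# single node coordinates every query. Host names are placeholders.
import itertools
from impala.dbapi import connect

IMPALAD_HOSTS = [
    "impalad-1.example.com",
    "impalad-2.example.com",
    "impalad-3.example.com",
]
_host_cycle = itertools.cycle(IMPALAD_HOSTS)

def run_query(sql):
    """Send the query to the next impalad in round-robin order; that node coordinates it."""
    host = next(_host_cycle)
    conn = connect(host=host, port=21050)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

# Each call is coordinated by a different impalad.
print(run_query("SELECT COUNT(*) FROM web_logs"))
print(run_query("SELECT MAX(log_time) FROM web_logs"))
```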

To stay aware of the health and load of the other nodes, each impalad process communicates continuously with the statestore, so that it always knows which nodes are healthy and able to accept new work.

When objects in the Impala cluster are created, modified, or deleted, or when INSERT/LOAD DATA operations are performed, the catalogd process broadcasts the change to all nodes so that every impalad stays up to date with the metadata of the objects in the entire cluster. This communication between background processes minimizes the need for REFRESH/INVALIDATE METADATA commands. (For nodes running versions earlier than Impala 1.2, however, the metadata refresh still has to be issued explicitly.)

In Impala 2.9 and later, you can control which nodes act as query coordinators and which act as query executors, which improves the scalability of highly concurrent workloads on large clusters.

2.2 Impala Statestore

The statestore checks the health of the impalad processes in the cluster and continuously forwards its findings to all impalad nodes. The statestore process is named statestored, and an Impala cluster needs only one of them. If an Impala node becomes unavailable because of hardware failure, network error, software problems, or other reasons, the statestore ensures that this information reaches all impalad processes promptly, so that new query requests are not sent to the unavailable node.

Because the purpose of the statestore is to propagate information when something in the cluster goes wrong, it is not a critical process for a functioning Impala cluster. If the statestore is unavailable, the impalad processes can still coordinate with one another and serve distributed queries; the cluster simply becomes less robust, since failures of impalad nodes will not be propagated while the statestore is down. When the statestore comes back, it re-establishes communication with the impalad processes and resumes its monitoring of the cluster.

Load balancing and high availability apply to the impalad daemon. The statestore and catalog processes have no special high-availability requirements, because problems with these daemons do not cause data loss. If they become unavailable because of an outage, you can stop the Impala service, remove the Impala StateStore and Impala Catalog roles, add the roles on different hosts, and restart the Impala service.

2.3 Impala Catalog Service

When a SQL statement executed in the Impala cluster changes metadata, the catalog service pushes that change to the other impalad processes. The corresponding process is named catalogd, and an Impala cluster needs only one of them. Because all of its requests pass through the statestore process, it is recommended to run statestored and catalogd on the same host.

The catalog service greatly reduces the need for the REFRESH and INVALIDATE METADATA metadata-synchronization statements. When tables are created or dropped through Impala, the catalogd process connects to the metastore and applies the metadata update, so no explicit synchronization statement is required. However, if you create tables, load data, or make similar changes through Hive, you must run REFRESH or INVALIDATE METADATA in Impala before querying the affected tables.
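A hedged sketch of that workflow with the impyla client and a placeholder table name: after Hive has loaded new data into an existing table, a REFRESH in Impala picks up the new files, while INVALIDATE METADATA would be needed if the table itself had been created in Hive.

```python
# Hedged sketch: after Hive has loaded data into an existing table, tell Impala to
# pick up the new files before querying. Host and table names are placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cursor = conn.cursor()

# REFRESH reloads file and block metadata for a table Impala already knows about.
cursor.execute("REFRESH web_logs")

# If the table had been *created* in Hive, the heavier statement would be needed instead:
# cursor.execute("INVALIDATE METADATA web_logs")

cursor.execute("SELECT COUNT(*) FROM web_logs")
print(cursor.fetchall())

conn.close()
```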

III. The Execution Process of an Impala Query

3.1 Impala Query Process Chart

(The query-flow chart is not reproduced here; the steps it depicts are listed in the next subsection.)

3.2 The Specific Steps of Query Execution

Step 0: Before a user submits any query, Impala starts an impalad process that will be responsible for coordinating queries submitted by clients, and this process registers its subscription information with the Impala statestore. The statestore runs as the statestored process, which uses multiple threads to handle the registration and subscription information from Impala.

Step 1: The user submits a query to an impalad process through the CLI client. The Query Planner of that impalad parses the SQL statement and generates a parse tree; the Planner then turns the parse tree into several PlanFragments, which are handed to the Query Coordinator (the EXPLAIN sketch after this step list shows what such a fragmented plan looks like).

Step 2: Using the metadata retrieved from the MySQL metastore, the Coordinator asks the HDFS NameNode for the data locations, i.e. all of the data nodes that store data relevant to the query.

Step 3: The Coordinator initializes task execution on the corresponding impalad processes, that is, it assigns the query tasks to all of the data nodes that store data relevant to the query.

Step 4: Each Query Executor streams its intermediate output, and the Query Coordinator aggregates the results from every impalad.

Step 5: The Coordinator returns the aggregated result to the CLI client.
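To see the fragmented, distributed plan that the Planner produces in step 1, a query can be prefixed with EXPLAIN. The hedged sketch below does this through the impyla client with a placeholder table and simply prints the plan text that Impala returns.

```python
# Hedged sketch: print the distributed execution plan that Impala's planner builds
# before the fragments are handed to the impalad processes. Names are placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cursor = conn.cursor()

cursor.execute("EXPLAIN SELECT page, COUNT(*) FROM web_logs GROUP BY page")
for (line,) in cursor.fetchall():
    # Each result row is one line of the plan (scans, aggregations, exchanges, ...).
    print(line)

conn.close()
```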

IV. Comparison of Impala and Hive

4.1 Impala and Hive Comparison

4.2 Similarities Between Hive and Impala

Hive uses the same storage data pool as Impala, and both support storing data in HDFS and HBase

Hive uses the same metadata as Impala

Hive and Impala interpret SQL in a similar way: both generate an execution plan through lexical analysis.

4.3 Differences Between Hive and Impala

Hive is suitable for long-running batch query analysis, while Impala is suitable for real-time, interactive SQL queries.

Hive relies on the MapReduce computing framework, whereas Impala represents the execution plan as a complete plan tree and distributes it directly to each impalad to execute the query.

During execution, if all of the data does not fit in memory, Hive will spill to external storage so that the query can still run to completion; Impala does not fall back to external storage when the data does not fit in memory, so Impala is currently somewhat limited in the queries it can handle.
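Because Impala works in memory, it is common in practice to cap the memory a single query may use. The hedged sketch below uses the MEM_LIMIT query option (assuming the Impala version in use supports it) through the impyla client with placeholder names.

```python
# Hedged sketch: cap the memory a single query may use on each node with the MEM_LIMIT
# query option (assumed to be available in the Impala version in use). Names are placeholders.
from impala.dbapi import connect

conn = connect(host="impalad-host.example.com", port=21050)
cursor = conn.cursor()

# Limit this session's queries to roughly 2 GB of memory per node.
cursor.execute("SET MEM_LIMIT=2gb")

# A large aggregation whose memory use is now bounded by the limit above.
cursor.execute("SELECT page, COUNT(DISTINCT user_id) FROM web_logs GROUP BY page")
print(cursor.fetchall())

conn.close()
```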

At this point, I believe you have a deeper understanding of the components and architecture of Impala; you might as well try it out in practice. Follow us to keep learning about related content!
