What are the principles and usage of Impala, the big data analytical query engine?


2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Many newcomers are unclear about how the big data analytical query engine Impala works and how it is used. This article explains both in detail for anyone who needs to learn them.

I. Overview of Impala

Impala is a near-real-time analysis system that provides SQL semantics and delivers fast, interactive SQL queries over large-scale data stored in Hadoop's HDFS and HBase. Hive, the traditional warehouse query tool, is built on the MapReduce engine, which is a batch process and struggles to deliver fast query responses. Impala, by contrast, is a query system based on MPP (massively parallel processing), and its most important feature is speed.

II. Impala components

Impala consists of the following components:

1. Clients: Hue, ODBC clients, JDBC clients, and the Impala Shell can all interact with Impala. These interfaces can be used both for querying data and for administering Impala.

2. Hive Metastore: stores metadata about the data accessible to Impala. For example, this metadata tells Impala which databases exist and what their structure is. When you create, drop, or alter database objects, or load data into a table, the relevant metadata changes are automatically broadcast to all Impala nodes. The catalog service carries out this notification.

3. Cloudera Impala: the Impala process runs on each data node (DataNode). Each Impala instance can receive queries from an Impala client, generate execution plans, and coordinate execution. Queries are distributed across the Impala nodes, which act as workers and execute them in parallel.

4. HBase and HDFS: store the data to be queried.
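
The metadata broadcast described in item 2 can be pictured with a small simulation. This is only a sketch of the idea, not Impala's actual implementation: the `CatalogService` and `ImpalaNode` classes, their methods, and the `sales` table are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ImpalaNode:
    """Hypothetical stand-in for one Impala daemon's local metadata cache."""
    name: str
    tables: dict = field(default_factory=dict)  # table name -> list of columns

    def on_metadata_update(self, table, columns):
        # Each node applies the change to its own cached copy.
        if columns is None:
            self.tables.pop(table, None)      # table was dropped
        else:
            self.tables[table] = list(columns)


class CatalogService:
    """Toy model of a catalog service that pushes metadata changes to every node."""

    def __init__(self):
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)

    def broadcast(self, table, columns):
        # One metadata change is delivered to all registered nodes.
        for node in self.nodes:
            node.on_metadata_update(table, columns)


catalog = CatalogService()
nodes = [ImpalaNode(f"node{i}") for i in range(3)]
for n in nodes:
    catalog.register(n)

catalog.broadcast("sales", ["id", "amount"])            # e.g. after creating a table
catalog.broadcast("sales", ["id", "amount", "region"])  # e.g. after adding a column
print(nodes[2].tables)  # {'sales': ['id', 'amount', 'region']}
```

The point of the sketch is that a client can send its next query to any node and still see the same table definitions, because every node's cache received the same change.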

III. Impala system architecture

Impala as a whole is divided into two parts: StateStore and Impalad.

StateStore is a sub-service of Impala that monitors the health of each node in the cluster and provides node registration, failure detection, and related functions.

Impalad is a daemon process running on each node of the cluster, and it plays two roles. First, it coordinates the execution of queries submitted by clients: it assigns tasks to other Impalad instances and collects their results. Second, it executes tasks assigned to it by other Impalad instances. These tasks mainly operate on the portion of the data held in the local HDFS and HBase.
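
The StateStore's registration and failure-detection role can be sketched as a heartbeat table. This is a simplified model under stated assumptions: the `StateStore` class below, its method names, and the 10-second timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
class StateStore:
    """Toy health tracker: nodes register, send heartbeats, and are flagged when silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical failure-detection threshold
        self.last_seen = {}         # node name -> time of last heartbeat

    def register(self, node, now):
        self.last_seen[node] = now

    def heartbeat(self, node, now):
        if node not in self.last_seen:
            raise KeyError(f"{node} is not registered")
        self.last_seen[node] = now

    def healthy_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t <= self.timeout_s)

    def failed_nodes(self, now):
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


store = StateStore(timeout_s=10)
for name in ("impalad-1", "impalad-2", "impalad-3"):
    store.register(name, now=0)

store.heartbeat("impalad-1", now=8)
store.heartbeat("impalad-2", now=9)
# impalad-3 stays silent, so at t=12 it has exceeded the 10-second timeout.
print(store.healthy_nodes(now=12))  # ['impalad-1', 'impalad-2']
print(store.failed_nodes(now=12))   # ['impalad-3']
```

A coordinator that consults such a list before assigning tasks would simply skip the nodes reported as failed.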

IV. Impala query processing flow

1. Three types of clients can interact with Impala:

Driver-based clients, using the ODBC driver or the JDBC driver

The Hue interface, which can interact with Impala through the Hue Beeswax interface

The Impala shell command line

2. Impala uses the Hive Metastore to store metadata. Impala starts processes on the DataNodes of the HDFS cluster and coordinates the Impala processes (Impalad) across the cluster to execute queries. In the Impala architecture, every Impala node can receive query requests from clients. The receiving node parses the query, produces and optimizes a query plan, and coordinates the request so that it is processed in parallel on multiple Impalad instances. Finally, the node that received the request aggregates the results and returns them to the client.
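
The flow in step 2 (fan the work out, compute partial results next to the data, merge on the receiving node) can be sketched in a few lines. This is an illustrative model only: threads stand in for separate Impalad processes, and the node names, the `LOCAL_DATA` rows, and the functions `execute_fragment` and `coordinate` are assumptions made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local data: one list of (region, amount) rows per Impalad.
LOCAL_DATA = {
    "impalad-1": [("east", 10), ("west", 5)],
    "impalad-2": [("east", 7), ("north", 3)],
    "impalad-3": [("west", 8), ("north", 2)],
}


def execute_fragment(node):
    """Worker side: aggregate only the rows stored on this node."""
    partial = Counter()
    for region, amount in LOCAL_DATA[node]:
        partial[region] += amount
    return partial


def coordinate(nodes):
    """Coordinator side: fan the fragment out, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(execute_fragment, nodes))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the counts together
    return dict(total)


# Models: SELECT region, SUM(amount) FROM sales GROUP BY region
result = coordinate(list(LOCAL_DATA))
print(sorted(result.items()))  # [('east', 17), ('north', 5), ('west', 13)]
```

Because each worker returns only a small partial aggregate, the coordinator merges three short dictionaries instead of pulling every row across the network.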

V. Relationship and comparison between Impala and Hive

1. Hive is suited to long-running batch query analysis, while Impala is suited to real-time, interactive SQL queries.

2. Hive performs parallel computation on MapReduce, while Impala turns the whole query into an execution plan tree instead of a series of MapReduce tasks. Its query mechanism is similar to that of commercial parallel (MPP) relational databases.

3. Impala is faster than Hive because it does not need to write intermediate results to disk, which saves a great deal of I/O cost, and because it avoids the overhead of starting MapReduce jobs.

4. Impala is suited to queries that produce moderate or small amounts of output and require a fast response, while MapReduce remains the better choice for batch processing tasks over large amounts of data.

5. Impala can be used together with Hive. For example, Hive can transform the data, and Impala can then analyze the processed data quickly.
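
The difference described in point 3 can be illustrated with a toy contrast between a staged job that writes its intermediate result to disk and a pipeline that streams rows in memory. This is a conceptual sketch only and does not reflect the internals of either engine; the function names and sample rows are hypothetical.

```python
import json
import os
import tempfile


def staged_on_disk(rows):
    """Batch style: stage 1 writes its full output to a file before stage 2 starts."""
    with tempfile.TemporaryDirectory() as tmp:
        stage1 = os.path.join(tmp, "stage1.json")
        with open(stage1, "w") as f:
            json.dump([r for r in rows if r["amount"] > 5], f)  # stage 1: filter
        with open(stage1) as f:
            filtered = json.load(f)                              # stage 2 reads it back
        return sum(r["amount"] for r in filtered)


def pipelined_in_memory(rows):
    """Pipelined style: rows stream from the filter into the aggregate, with no file."""
    filtered = (r for r in rows if r["amount"] > 5)  # lazy generator, nothing stored
    return sum(r["amount"] for r in filtered)


rows = [{"amount": a} for a in (3, 8, 12, 1, 6)]
print(staged_on_disk(rows), pipelined_in_memory(rows))  # 26 26
```

Both functions return the same answer. The staged version pays for an extra write and read of the intermediate data, which is the kind of cost the comparison above attributes to MapReduce-based execution.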

VI. Comparison between Impala and Presto

What the two have in common is that both are memory-hungry. With sufficient memory and an appropriately sized cluster, both perform well. Impala's performance is slightly ahead of Presto's, but Presto supports a much richer set of data sources, including Hive, graph databases, traditional relational databases, Redis, and so on.

VII. Performance test comparison of Impala, Presto, and Spark SQL

The performance of Impala is similar to that of Presto, while Spark SQL lags well behind both.
