What are the characteristics of Impala

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)05/31 Report--

This article explains the main characteristics of Impala. The material is simple, quick, and practical; read on to learn what makes Impala distinctive.

Impala is an open-source implementation of Dremel (a massive-scale data query tool), based on Google's three well-known papers. It is similar in function to Shark (which depends on Hive) and Apache Drill. Impala was developed and open-sourced by Cloudera. It builds on Hive, performs its computation in memory, and retains data-warehouse capabilities, offering real-time querying, batch processing, and high concurrency. It is the preferred real-time query and analysis engine for PB-scale big data on CDH. (Impala works best within CDH; the official site says it can run standalone, but running it alone causes many problems.)

Simple comparison between Impala and Shark, sparkSQL, Drill, etc.

Impala started early and is currently one of the few big data query engines ready for commercial use.

CDH5 does not support SparkSQL.

Drill started late and is not yet mature.

Shark is similar to Impala in function and architecture, and the project has been discontinued.

Characteristics of Impala

Based on memory computing, interactive real-time query / analysis of PB-level data

Reads HDFS data directly, without converting queries to MapReduce jobs

Written in C++; queries are compiled and run via LLVM

Compatible with HiveSQL

It has the characteristics of data warehouse and can analyze hive data directly.

Supports data locality

Support for column storage

Support for JDBC/ODBC remote access

Supports the SQL-92 standard, with its own parser and optimizer

Impala core components

Strictly speaking, Impala has no master node, but two roles play a master-like, cluster-coordinating function: the Impala StateStore and the Catalog Server. Because the real query work happens on the nodes running the impalad daemon, each daemon needs its own memory configuration, and the total memory available to Impala is the sum across all daemon nodes; if you want results aggregated on a particular machine, increase the memory on that machine. Since the daemons are the real query providers, which one should a client connect to? Any daemon will do, chosen according to your needs: (1) for relatively large queries, connect to a machine with plenty of memory; (2) when every machine is suitable, pick one at random, or write your own round-robin or weighted selection to handle high concurrency.
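The round-robin or weighted selection mentioned above can be sketched in a few lines. This is a client-side illustration only, not part of Impala; the host names and memory weights are made up:

```python
import itertools
import random

# Hypothetical list of impalad daemon endpoints (host names are illustrative).
IMPALADS = ["cdh1:21000", "cdh2:21000", "cdh3:21000"]

# (1) Simple round-robin: cycle through the daemons in order.
_rr = itertools.cycle(IMPALADS)

def next_impalad_round_robin():
    """Return the next daemon endpoint in round-robin order."""
    return next(_rr)

# (2) Weighted choice: favor machines with more memory.
# Weights are illustrative (e.g. GB of memory per node).
WEIGHTS = {"cdh1:21000": 64, "cdh2:21000": 128, "cdh3:21000": 64}

def next_impalad_weighted():
    """Return a daemon endpoint, chosen with probability proportional to its weight."""
    hosts = list(WEIGHTS)
    return random.choices(hosts, weights=[WEIGHTS[h] for h in hosts], k=1)[0]
```

Whichever endpoint the picker returns is the impalad the client connects to for that query.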

Statestore Daemon

Collects resource information and health status from every impalad process in the cluster, and synchronizes node information.

Assists with query scheduling (not strictly required: if it is running it helps; if not, the cluster can manage without it).

For an already-running cluster, it is not a critical process.

Catalog Daemon (added in version 1.2)

Distributes the metadata of Impala tables to every impalad. Because Impala builds on Hive, table metadata must be pushed out to the impalad daemons; before this process existed, synchronization was done manually. Even now it is not fully automatic and does not guarantee that all metadata is synchronized: for example, inserted data can be propagated to the other nodes, but for DDL statements such as CREATE TABLE a manual refresh is still recommended. The catalog daemon receives requests via the statestore: when an impalad node inserts or queries data (that is, when data changes), it reports the result to the statestore, which then asks the catalog daemon to push the updated metadata to the impalads. For this reason the catalog daemon and the statestore run on the same machine, and it is recommended not to run an impalad process on that machine, to avoid cluster-management problems caused by also serving queries there.

Impala Daemon (the main query-serving process)

The daemon process (service instance: impalad) that receives query requests from clients (impala-shell, Hue, JDBC, or ODBC), executes the query, and returns the results; the node that receives a query acts as its central coordinator. Each impalad also maintains communication with, and reports to, the statestore.

The client (shell, JDBC, ODBC) sends requests to an impalad process. The receiving node can be any impalad; the impalads communicate with one another.

The statestore and catalog are placed on the same node to avoid failures caused by network problems when the two processes cooperate.

The Hive metastore is important: the statestore communicates with the catalog and synchronizes metadata to the other nodes.

Each impalad is best colocated with an HDFS DataNode so that queries can read and compute locally and return results faster (ideally, data locality).

Impalad contains three components:

i. Query Planner (query parser): interprets the SQL string into an execution plan.

ii. Query Coordinator (central coordination node): designates the head node for a query; once designated, the other nodes are informed which node is the coordinator, and they return their partial results to it.

iii. Query Executor: the component that actually executes the query fragments.
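As a rough illustration of the coordinator/executor split (a simplified simulation, not Impala's actual code), a coordinator scatters a query fragment to executors over data partitions and merges their partial results:

```python
from concurrent.futures import ThreadPoolExecutor

def executor_count(rows, predicate):
    """Executor role: evaluate a query fragment over a local slice of rows."""
    return sum(1 for r in rows if predicate(r))

def coordinator_count(partitions, predicate):
    """Coordinator role: scatter the fragment to all executors, gather partials."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda p: executor_count(p, predicate), partitions)
    return sum(partials)

# Three hypothetical data partitions (one per executor node).
partitions = [[1, 5, 9], [2, 6], [3, 7, 8, 10]]
total = coordinator_count(partitions, lambda x: x > 4)  # count rows > 4
```

In real Impala the planner first turns the SQL text into the fragment; here the predicate stands in for that plan.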

Impala external shell

-h (--help) view help documentation for all commands

-r (--refresh_after_connect) refresh all metadata (when Hive creates data, you need to refresh to see the change in Hive metadata). This is a full refresh and should only be used as a last resort; do not refresh the Hive source data on a schedule, because when the data volume is large the refresh is likely to fail. Typical use: create a Hive table, then refresh.

-B (--delimited) turn off formatted output * formatting a large result set hurts performance

--output_delimiter=character specifies the delimiter (combine with other options)

--print_header prints column names (output is unformatted, but column names are shown)

-v check the corresponding version (a common pitfall)

Impala queries are based on the latest version; if versions are inconsistent, queries will fail.

The impala-shell version and the impala version must match.

-f execute query file *

select name, count(name) as name_count from person group by name -- create a file containing this SQL

--query_file specifies the query file (write each SQL statement on a single line, because the shell reads the file line by line)

impala-shell --query_file=xxx

-i connect to the specified impalad

--impalad assigns the impalad that performs the task

-- fe_port specifies an alternate port (usually not specified)

-o Save the execution result to a file *

-- output_file specifies the output file name

Combined applications:

impala-shell -B --print_header -f test.sss -o result.txt
impala-shell -B -f test.xxx -o result.txt

Less important shell options

impala-shell --user root

impala-shell -d database (-d specifies the database name)

--quiet does not display redundant information

impala-shell -q "select * from impala.rstest limit 5" >

-- user specifies the user to execute the shell command

-- ssl is executed through ssl authentication

-- ca_cert specifies a third-party user certificate

-- config_file temporarily specifies the profile

-u run impala-shell as the given user

-p displays the execution plan

-q execute a query without entering impala-shell

-c ignore the error statement and continue execution

-d specify access to a database

impala-shell command usage:

impala-shell (internal shell)

Help option

Help

Connect to an impalad instance

Connect

Refresh a table metadata

* refresh // incremental refresh

Refresh Metabase

* invalidate metadata // full refresh; high performance cost

Displays the execution plan and step information of a query

* explain // the detail level can be set with set explain_level (four levels, starting at 0; level 2 is typical); shows the execution plan and other details

Do not exit impala-shell to execute operating system commands

Shell

Shell ls

Displays the underlying query information (underlying execution plan for performance optimization)

* profile // run after the query completes

The execution plan is stored for analysis

impala-shell -q "select name from person" -p >> impalalog.123

View the StateStore (monitoring and management)

http://cdh2:25010/

View the Catalog (monitoring and management)

http://cdh3:25020/

Impala Storage and Partition

Note that in addition to all the file types Hive supports, Impala also supports formats such as Parquet. Of course, this format is not unique to Impala; Spark SQL and Shark support it too. RCFile itself is faster, but it is not as convenient as text.

Compression mode

Add Partition Mode

-- 1. Specify the partition column(s) with partitioned by when creating the table
-- 2. Use alter table to add and delete partitions
create table t_person (id int, name string, age int) partitioned by (type string);
alter table t_person add partition (type='man');
alter table t_person drop partition (type='man');
alter table t_person drop partition (sex='man', type='boss');

Add data within the partition

insert into t_person partition (type='boss') values (1, 'zhangsan', 30), (2, 'lisi', 25);
insert into t_person partition (type='coder') values (3, 'wangwu', 28), (4, 'zhaoliu', 24);

Query specified partition data

Select id,name from t_person where type='coder'

Impala-SQL, JDBC, performance optimization

Load data:

Insert statements: each insert produces a separate data file, so this method is not recommended for bulk loading.

Load data mode: it is more appropriate to use this method for batch insertion.

From an intermediate table: read from a large table made up of many small files and write into a new table, producing a small number of larger data files. Format conversion can also be done this way.

Null value handling:

Impala represents NULL as "\N" in text files. When using it together with Sqoop, filter the corresponding empty fields, or handle them as follows:

alter table name set tblproperties ("serialization.null.format"="null");
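When post-processing a text export where NULL is encoded as \N, the token can be mapped back to a real null value. A minimal sketch (parse_row is a hypothetical helper, not part of Impala or Sqoop):

```python
NULL_TOKEN = r"\N"  # Impala/Hive text representation of NULL

def parse_row(line, sep="\t"):
    """Split a delimited text row, mapping the NULL token to None."""
    return [None if field == NULL_TOKEN else field
            for field in line.rstrip("\n").split(sep)]
```

This mirrors what a Sqoop-side null filter would do before loading the rows elsewhere.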

Configuration:

- impala.driver=org.apache.hive.jdbc.HiveDriver
- impala.url=jdbc:hive2://node2:21050/;auth=noSasl
- impala.username=
- impala.password=

Whenever possible, use PreparedStatement to execute SQL statements:

PreparedStatement performs better than Statement.

There are cases where a query through Statement returns no data.
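The principle behind PreparedStatement is parameter binding: let the driver bind values rather than concatenating them into the SQL string. A minimal sketch using Python's built-in sqlite3 as a stand-in database (not Impala) to show the parameterized form:

```python
import sqlite3

# In-memory stand-in database; the person table is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("create table person (id integer, name text)")
conn.execute("insert into person values (1, 'alice'), (2, 'bob')")

# Parameterized query: the driver binds the value, similar in spirit
# to JDBC's PreparedStatement (and safe from SQL injection).
rows = conn.execute("select name from person where id = ?", (1,)).fetchall()
```

With a JDBC client against Impala, the same shape is a PreparedStatement with `?` placeholders and setXxx calls.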

Execution plan

- Before the query SQL is executed, analyze the SQL and list a detailed plan (commands: explain sql, profile)

Summary:

1. SQL optimization: review the execution plan before running the query

2. Select the appropriate file format for storage

3. Avoid generating a lot of small files (if other programs generate small files, you can use an intermediate table)

4. Use appropriate partition technology and calculate according to the partition granularity.

5. Use compute stats to collect table information

6. Network I/O optimization:

a. Avoid sending the entire data to the client

b. Do conditional filtering as much as possible

c. Use the LIMIT clause

d. Avoid pretty-printed output when exporting files

7. Use profile to output the low-level execution information and tune the environment accordingly.

Impala SQL VS HiveQL

Supported data types:

INT

TINYINT

SMALLINT

BIGINT

BOOLEAN

CHAR

VARCHAR

STRING

FLOAT

DOUBLE

REAL

DECIMAL

TIMESTAMP

The following complex types are not supported until after CDH 5.5:

ARRAY

MAP

STRUCT

In addition, Impala does not support the following features of HiveQL:

- Some aggregate functions: covar_pop, covar_samp, corr, percentile, percentile_approx, histogram_numeric, collect_set (Impala supports only AVG, COUNT, MAX, MIN, SUM)

- Multiple DISTINCT expressions in one query

- UDF, UDAF

- Extensibility mechanisms such as TRANSFORM, custom file formats, custom SerDes

- XML and JSON functions

Impala SQL (similar to Hive)

View

You cannot insert into an Impala view, but an insert can select from a view.

- create a view: create view v1 as select count(id) as total from tab_3;
- query the view: select * from v1;
- view the definition: describe formatted v1;

At this point you should have a deeper understanding of "what are the characteristics of Impala"; try it out in practice, and keep learning with us.
