How to build an interactive query system based on the Impala platform

2025-01-18 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces how to build an interactive query system based on the Impala platform. It goes into some detail and should be a useful reference; interested readers are encouraged to read it to the end.

Interactive queries have three defining characteristics. First, the data volume is large. Second, the relational schema is relatively complex; depending on your design, there can be many kinds of relational patterns. Third, the response-time requirement is strict: even the slowest queries should return in under 10 seconds. Storage is chosen according to data volume: for million-row data, MySQL or PostgreSQL is enough; from millions up to around 10 billion rows, traditional databases can no longer cope, so an analytical data warehouse such as Impala, Presto, Greenplum, or Apache Drill is used; above 10 billion rows, interactive analysis becomes difficult, and an offline data warehouse built on Hive or Spark is used instead.
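The tiering above can be sketched as a small selection function. This is only an illustration of the thresholds described in the text, not a hard rule; the names and cutoffs are approximate.

```python
def pick_storage_engine(row_count):
    """Rough storage selection by data volume, following the tiers in the text.

    Thresholds are illustrative approximations, not hard rules.
    """
    if row_count < 1_000_000:
        # million-level data: a traditional RDBMS is enough
        return "RDBMS (MySQL/PostgreSQL)"
    elif row_count < 10_000_000_000:
        # millions to ~10 billion rows: analytical MPP engine
        return "MPP analytical engine (Impala/Presto/Greenplum/Drill)"
    else:
        # beyond ~10 billion rows: offline warehouse
        return "offline warehouse (Hive/Spark)"
```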

Many real systems use wide tables: because there are many dimensions, a single user may accumulate hundreds of dimensions over time. If you need to filter on, say, 50 dimensions, wide tables combined with special data structures such as inverted indexes make this easy. Elasticsearch and Solr are search engines; ClickHouse, developed in Russia, performs very well but has limited join support; Druid is used mostly in advertising platforms. There are also combination models that pair a search engine such as Elasticsearch or Solr with an MPP engine; typical MPP engines here are Greenplum, Presto, and Impala.
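To make the inverted-index idea concrete, here is a minimal sketch: instead of scanning every row of a wide table, each (dimension, value) pair maps to the set of matching row ids, and a multi-dimension filter becomes a set intersection. The rows and dimension names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of a wide user table: user_id -> {dimension: value}
rows = {
    1: {"city": "beijing", "device": "ios"},
    2: {"city": "shanghai", "device": "android"},
    3: {"city": "beijing", "device": "android"},
}

# Build an inverted index: (dimension, value) -> set of user_ids
index = defaultdict(set)
for uid, dims in rows.items():
    for dim, val in dims.items():
        index[(dim, val)].add(uid)

def filter_users(*conditions):
    """Intersect posting lists instead of scanning every row."""
    sets = [index[cond] for cond in conditions]
    return set.intersection(*sets) if sets else set()
```

Filtering on 50 dimensions is then 50 set intersections over precomputed posting lists, which is why search engines handle high-dimension filtering so well.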

Next, let's talk about what determines platform choice. The first factor is familiarity: if the project lead already knows a platform, he will choose it. The second is big-company endorsement: if nobody on the team is familiar with the candidates, you pick what a large company is already running in production. If neither of the first two applies, you evaluate whether the platform fits your system on its performance and its strengths and weaknesses.

Focusing on the third point: the first criterion is data volume, since the platform must at least meet the minimum performance targets for the system's data capacity. The next is architectural complexity: the system must eventually go to production and meet its SLA, and a complex architecture brings many problems, so a relatively simple architecture is preferable. The last is operations cost: operations are expensive, so frequent changes are impractical, and making any change requires deep familiarity with the platform, which also affects the selection.

Next, how we actually made the selection, mainly among Impala, Presto, and Greenplum. The first consideration was the data source: much of our data lives on HDFS, so Greenplum was ruled out, because it is completely closed and uses its own storage architecture. The remaining two have similar communities and broadly similar architectures, with little structural difference; Impala performs slightly better than Presto. There are other factors such as the implementation language: C++ tends to run a bit faster than Java, which also favored the C++-based platform. In the end, we chose Impala.

All three are MPP architectures, and Impala's execution nodes are stateless, so losing a node and restarting it is not a problem. Impala is also compatible with Hive storage. Being an Apache top-level project with a mature community, supporting multiple data-source formats, and delivering efficient query performance were the deciding factors in our selection.

Next, the Impala architecture. Impala is compatible with multiple data sources: the metastore connects to various databases, and the catalogd daemon provides the metadata service. Metadata can be read directly from the database or through catalogd; usually it is obtained from the Hive metastore. One reason Impala is efficient is that it caches this metadata: on startup, catalogd loads it into its cache. There is also a statestored service, a publish/subscribe service through which all state changes are propagated. On the other side are the impalad execution nodes: a query can be sent to any of them, and after planning, the work is distributed to all relevant nodes. The whole impalad layer is stateless, and every node can act as a coordinator.

Catalogd is the metadata service. The key point is that a plain SELECT does not go through catalogd, because impalad caches the needed metadata locally; DDL operations, however, do go through the catalogd service. Statestored (pub/sub) maintains many topics, and all impalad nodes subscribe to the relevant ones; statestored is effectively a message-subscription hub across those topics. An impalad node handles SQL parsing, execution-plan generation, data scanning, aggregation, and returning results.

The figure above shows how the nodes coordinate when a query arrives. The node that receives the query acts as the Query Planner: it generates the execution plan, sends plan fragments to the surrounding nodes, and collects the results. If the query involves aggregation, each node pre-aggregates its local data first and returns the partial result to the Query Planner, which performs the final aggregation and returns the result.
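The scatter-gather flow just described can be sketched as a toy simulation: a coordinator fans a GROUP BY out to each node's local partition, each node pre-aggregates, and the coordinator merges the partials. All names and data here are illustrative, not Impala internals.

```python
from collections import Counter

# Each node holds a local partition of (city, count) rows.
partitions = [
    [("beijing", 1), ("shanghai", 1), ("beijing", 1)],   # node 1's local data
    [("beijing", 1), ("shenzhen", 1)],                   # node 2's local data
]

def local_aggregate(rows):
    """Runs on each executor: partial GROUP BY city, SUM(cnt)."""
    partial = Counter()
    for city, cnt in rows:
        partial[city] += cnt
    return partial

def coordinator_merge(partials):
    """Runs on the coordinator: merge the pre-aggregated partial results."""
    total = Counter()
    for p in partials:
        total.update(p)
    return dict(total)

result = coordinator_merge(local_aggregate(p) for p in partitions)
```

Pre-aggregating before the merge is what keeps the traffic to the coordinator proportional to the number of groups rather than the number of rows.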

Impala's first advantage is metadata caching: impalad caches the HDFS block locations of each table's data, so at query time it knows whether the data is local and can use short-circuit local reads. The second is parallel computing: the Query Planner generates an execution plan, sends fragments to the surrounding nodes, and then converges the results. The third is codegen: Impala generates execution code at runtime for the specific query, which greatly improves performance. Another is extensive operator pushdown: filtering happens in the storage layer rather than the compute layer, which shrinks the interaction between the two and greatly improves the overall performance of the platform.

Broadcast join caches the small table on every node when joining with a large table; each node joins its local slice and the results are then aggregated. Partitioned join is for two large tables: for example, if an event table has accumulated tens of billions of rows and there are 500 million users, the user table cannot be broadcast to all nodes, so both tables are hash-partitioned on the join key and each node joins its own partition. There is also a cost-based optimizer (CBO), which is currently not very accurate and sometimes deviates significantly. And alongside parallel computing there is parallel aggregation: data is pre-aggregated on each node, then the partials are merged according to the GROUP BY columns.
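The two distributed join strategies can be contrasted in a small simulation. This is a sketch of the general technique, not Impala's implementation; the data and function names are made up.

```python
def broadcast_join(big_partitions, small_table):
    """Broadcast join: every node gets a full copy of the small table and
    joins its local slice of the big table against it."""
    out = []
    small = dict(small_table)          # replicated to every node
    for part in big_partitions:
        for key, val in part:
            if key in small:
                out.append((key, val, small[key]))
    return out

def partitioned_join(big_rows, small_rows, n_nodes=2):
    """Partitioned join: hash both inputs on the join key so matching keys
    land on the same node, then join shard-by-shard."""
    shards = [([], []) for _ in range(n_nodes)]
    for key, val in big_rows:
        shards[hash(key) % n_nodes][0].append((key, val))
    for key, val in small_rows:
        shards[hash(key) % n_nodes][1].append((key, val))
    out = []
    for big_part, small_part in shards:
        small = dict(small_part)
        for key, val in big_part:
            if key in small:
                out.append((key, val, small[key]))
    return out
```

Broadcast avoids shuffling the big table but replicates the small one N times; once the "small" side is 500 million users, replication is too expensive and hash partitioning both sides wins.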

Next, the storage engines Impala supports. The common ones are HDFS and Kudu, a product built to bridge the gap between HDFS and HBase: HBase is mainly for key-value lookups but performs poorly on large scans, while mass scanning is HDFS's strength but it cannot do key-value lookups. Alluxio is a distributed file cache whose underlying layer can also sit on HDFS, with support for multi-level caching. We use Alluxio mainly to handle hot data, a problem we previously solved with our own caching.

To put the Impala platform into production, you first need the full authentication and authorization mechanisms. Authentication can integrate with Kerberos or LDAP, together with audit logging, so that only authenticated identities can access the system. Authorization is enabled through Apache Sentry, with database, table, and column granularity and SELECT, INSERT, and ALL privileges (authorization_policy_provider_class=org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider). These are prerequisites for going live.
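As a rough illustration, a Sentry policy-file setup of this kind is typically wired into impalad via startup flags along the following lines. The server name and policy-file path here are hypothetical placeholders; consult your Impala version's documentation for the exact flags it supports.

```
# Illustrative impalad startup flags for policy-file Sentry authorization
--server_name=server1
--authorization_policy_file=/user/hive/warehouse/auth-policy.ini
--authorization_policy_provider_class=org.apache.sentry.provider.file.LocalGroupResourceAuthorizationProvider
```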

A platform serves many users running many tasks, and these need to be managed. We currently use the Admission Control mechanism: the user/queue configuration is present on every impalad node, and each queue can set a total resource cap and a per-SQL resource size. The configuration applies per impalad node: for example, if a user's pool is capped at 300 GB across the cluster, each node may only get 2-3 GB, and exceeding that limit is forbidden. Resource isolation must therefore be considered both in aggregate and per node. Impalad nodes synchronize this information through statestored's impalad-statistics topic; because statestored communicates with impalad via heartbeats, the resource information is in fact slightly stale. In the current implementation only the memory limits take real effect; vcores are not isolated. If a queue name matches the authenticated user name, that user's SQL is automatically routed to the queue.
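The arithmetic behind the 300 GB / 2-3 GB example, and the basic admission decision, can be sketched as follows. This is a simplified model of the idea, not Impala's actual admission-control code.

```python
def per_node_mem_limit(pool_mem_gb, n_nodes):
    """A pool capped at pool_mem_gb spread across n_nodes leaves roughly
    this much memory per impalad node."""
    return pool_mem_gb / n_nodes

def admit(query_mem_gb, running_mem_gb, pool_limit_gb):
    """Admit a query only if the pool's aggregate memory cap is respected."""
    return running_mem_gb + query_mem_gb <= pool_limit_gb
```

For example, a 300 GB pool on a 100-node cluster leaves about 3 GB per node, matching the figure in the text; a query that would push the pool past its cap is queued or rejected.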

Impala has a web UI, simple but useful, and we rely on it constantly for troubleshooting. Each component exposes its own web endpoint on a dedicated port. The basic information covers cluster nodes, catalog information, memory information, and query information. The web UI can show per-node memory consumption (per pool and per query at a given moment), query analysis on a node (execution analysis, SQL diagnosis, terminating abnormal queries), and metrics. The picture above shows some of the queues we have configured and the resource consumption of each. When analyzing joins with Impala, each SQL's execution plan can be inspected through the query, summary, and memory tabs on the interface.

We have covered Impala's advantages, characteristics, and usage, but as an open-source platform it also has shortcomings. First, the catalogd and statestored services are single points of failure. Fortunately queries are not affected: if catalogd dies, metadata updates simply stop propagating to the impalad nodes; if statestored dies, updates stop being synchronized and only the previous state is retained. Second, the web UI information is not persisted: the history it displays lives in memory, so restarting impalad loses it. Third, resource isolation is not precise, and the underlying storage cannot distinguish between users. Fourth, load balancing: any impalad can accept SQL, but spreading clients across, say, 100 impalad nodes is hard to solve by hand, so we put haproxy in front of Impala. Finally, synchronizing with the Hive metastore requires manual operations: Impala caches metadata and does not perceive changes made directly through Hive or HDFS.
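The manual synchronization mentioned above is done with Impala's metadata statements. The database and table names below are illustrative; the statements themselves are standard Impala SQL.

```sql
-- After new files are written to an existing table from outside Impala
-- (e.g. by an ETL job through Hive), reload that table's file metadata:
REFRESH mydb.events;

-- After DDL performed outside Impala (new tables, schema changes),
-- discard and rebuild the cached metadata (a heavier operation):
INVALIDATE METADATA mydb.events;
```

REFRESH is the cheaper, per-table option for new data; INVALIDATE METADATA is needed when the table structure itself changed outside Impala.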

Where there are defects, there are improvements. First, ZK-based load balancing: since Impala is tied to Hive, and HiveServer natively uses ZooKeeper for multi-node discovery, we register Impala's URIs under a ZK namespace and reuse that mechanism. Second, a management server: because the Impala web UI does not persist its information, the management server archives it, so history survives restarts and you do not have to chase information across many impalad nodes. Third, fine-grained permission agents: access to HDFS goes through Impala so that the underlying permission control is enforced. Fourth, JSON format support, driven by application requirements. Fifth, compatibility with Ranger permission management, because our whole project's permissions are based on Ranger. Sixth, batch metadata refresh: in practice dozens of tables may change at once, and refreshing them one by one is troublesome. Seventh, metadata synchronization: Hive and Impala are modified so that every Hive change is written to a middle layer, which Impala consumes to stay in sync. Eighth, metadata filtering: when the data volume is very large, most tables are never queried interactively and Impala only needs a subset, so regular expressions filter out the unneeded metadata. Finally, an Elasticsearch connector that pushes the relevant operators down to ES: for multi-dimensional filter queries, aggregating via ES's inverted indexes is faster than hashing.

As for application scenarios, the figure above shows a departmental big-data platform architecture: data flows from Kafka to HDFS, both structured and semi-structured, which forms the ingestion layer. After data cleaning, it feeds the upper layer, which uses ES for storage, and the top layer queries directly through Impala. This is essentially the skeleton of an analysis system.

Above is one of our BI products, called Qishu; its bottom layer also runs on the Impala platform. It is a data analysis and reporting platform that binds charts to data. Structured or unstructured data is written directly to Hive, Impala perceives it through metadata synchronization, and users query through Impala. Metadata synchronization has to be considered because Impala is not aware of data written by ETL. Real-time data raises another issue: to keep the NameNode stable you must avoid producing huge numbers of small files, so each write batch cannot be too small. An alternative is to use Kudu to solve the small-file problem: write real-time data into Kudu and query Kudu and HDFS jointly, since both Kudu and HDFS tables are visible in Impala.
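The joint Kudu-plus-HDFS query described above is commonly expressed as a union view in Impala. The table and column names here are hypothetical; the pattern is what matters.

```sql
-- Real-time rows land in a Kudu table; history lives in Parquet on HDFS.
-- A view unions them so clients query one logical table:
CREATE VIEW events_all AS
SELECT event_time, user_id, action FROM events_kudu
UNION ALL
SELECT event_time, user_id, action FROM events_hdfs;
```

A background job can periodically move aged rows from the Kudu table into the HDFS table in large batches, which keeps file sizes healthy and the NameNode stable.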

The above is the full content of "How to build an interactive query system based on the Impala platform". Thank you for reading! We hope it helps; for more related knowledge, follow the Internet Technology channel.
