How do you tame Apache Impala users with admission control? Many newcomers are unclear on this, so this article walks through the process in detail; anyone facing the problem can follow along, and hopefully you will come away with something useful.
A common problem when introducing Apache Impala is resource management. Everyone wants to use as many resources (that is, as much memory) as possible to increase speed and/or hide query inefficiencies. However, this is unfair to other users and can starve the queries that support important business processes. What we see with many customers is that when a cluster is first built and the initial use cases come online, resources are plentiful and nobody worries about them. Then more use cases, data scientists, and business units running ad-hoc queries are added, and together they consume enough resources to prevent the original use cases from completing on time. Queries start to fail, which frustrates users and causes problems for the existing workloads.

To manage Apache Impala resources effectively, we recommend the admission control feature. With admission control, we can set up resource pools for Impala: we can limit the number of concurrent queries and the amount of memory in each pool, and enforce per-query settings within a pool. Admission control has many settings, which can be daunting at first. We will focus on the memory settings, which are critical on a cluster that already has dozens of active users and applications running.
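As a quick illustration of the per-query settings a pool can enforce, here is a minimal sketch of routing a session into a pool and capping its memory from a Python client. It assumes the impyla package and an Impala daemon listening on the default HiveServer2 port 21050; the host name and the pool name etl_pool are hypothetical.

```python
# Minimal sketch (assumptions: impyla installed, Impala HS2 endpoint on
# port 21050, and a resource pool named "etl_pool" already defined by the
# administrator in Cloudera Manager).
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# Route this session's queries into a specific admission-control pool.
cur.execute("SET REQUEST_POOL=etl_pool")

# Cap per-node memory for this session's queries; the pool's default query
# memory limit can also be enforced server-side for every query in the pool.
cur.execute("SET MEM_LIMIT=4gb")

cur.execute("SELECT 1")
print(cur.fetchall())
```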
Step 1: Obtaining Memory Statistics

The first challenge with admission control is collecting metrics about individual users and the queries they run, in order to define the memory settings for each resource pool. You can manually walk through each user's queries with the Impala query window and chart builder in Cloudera Manager to collect some statistics, but doing so, and re-evaluating later, is time-consuming and tedious. To make informed and accurate decisions about how to allocate resources to the various users and applications, we need detailed metrics, so we wrote a Python script to simplify the process.
The script can be found on GitHub: https://github.com/phdata/blog-2019-10-impala-admcontrol. It generates a CSV report and makes no changes to the cluster; see the README and run it against your environment (a minimal sketch of the collection idea follows the column list below). The CSV report contains overall statistics, and per-user statistics for:

- queries_count - number of queries run
- queries_count_missing_stats - number of queries run without statistics
- aggregate_avg_gb - average aggregate memory used across nodes
- aggregate_99th_gb - 99th percentile of aggregate memory used across nodes
- aggregate_max_gb - maximum aggregate memory used across nodes
- per_node_avg_gb - average memory used per node
- per_node_99th_gb - 99th percentile of memory used per node
- per_node_max_gb - maximum memory used per node
- duration_avg_minutes - average query duration in minutes
- duration_99th_minutes - 99th percentile of query duration in minutes
- duration_max_minutes - maximum query duration in minutes
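The script itself handles pagination, time windows, and percentile math; the following is only a minimal sketch of the underlying idea, not the phData script. It assumes the Cloudera Manager REST API's impalaQueries endpoint; the host, credentials, cluster and service names are placeholders, and the memory_per_node_peak attribute name should be checked against your CM version's API documentation.

```python
# Minimal sketch, not the phData script: pull recent Impala queries from the
# Cloudera Manager API and bucket peak per-node memory by user.
# Assumptions: requests installed; CM host, credentials, cluster and service
# names are placeholders; attribute names may differ across CM versions.
from collections import defaultdict
import requests

CM = "https://cm.example.com:7183/api/v19"
URL = f"{CM}/clusters/Cluster1/services/impala/impalaQueries"

resp = requests.get(URL,
                    params={"from": "2019-10-01T00:00:00",
                            "to": "2019-10-08T00:00:00",
                            "limit": 1000},
                    auth=("admin", "admin"), verify=False)
resp.raise_for_status()

per_node_peaks = defaultdict(list)
for q in resp.json().get("queries", []):
    peak = q.get("attributes", {}).get("memory_per_node_peak")
    if peak is not None:
        # Attribute values are reported in bytes; convert to GiB.
        per_node_peaks[q["user"]].append(float(peak) / 1024 ** 3)

for user, peaks in sorted(per_node_peaks.items()):
    print(f"{user}: count={len(peaks)} avg={sum(peaks) / len(peaks):.2f}GiB "
          f"max={max(peaks):.2f}GiB")
```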
Step 2: Immediate Actions and Concerns

Every workload on every cluster is different and has a wide range of requirements, but when browsing the report there are a few high-priority items to look for.
First, are users running queries that are missing statistics (the queries_count_missing_stats column)? If you see queries running without statistics, investigate which tables lack them and make sure that computing statistics is a standard process in your environment.

Second, compare the max columns with the 99th-percentile columns. The 99th-percentile columns account for the vast majority of a user's queries (99% of them), so if any max column is 10-20% or more above its 99th-percentile counterpart, that flags anomalous or badly written queries: investigate that user's heaviest queries to see whether they are incorrect, or whether they can be improved to make better use of resources. The comparisons to make are listed here, with a small sketch of the check after the list:

- aggregate_max_gb versus aggregate_99th_gb
- per_node_max_gb versus per_node_99th_gb
- duration_max_minutes versus duration_99th_minutes
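To illustrate, here is a small sketch of that max-versus-99th check run over the script's CSV output; it assumes pandas, the column names listed above, an illustrative 20% threshold, and a "user" index column.

```python
# Minimal sketch: flag users whose max columns run well above the 99th
# percentile in the report CSV. Assumes pandas and the column names above;
# the 20% threshold and the "user" index column are illustrative choices.
import pandas as pd

df = pd.read_csv("impala_admission_report.csv", index_col="user")

pairs = [("aggregate_max_gb", "aggregate_99th_gb"),
         ("per_node_max_gb", "per_node_99th_gb"),
         ("duration_max_minutes", "duration_99th_minutes")]

for max_col, p99_col in pairs:
    # Flag rows where the max exceeds the 99th percentile by more than 20%.
    outliers = df[df[max_col] > 1.2 * df[p99_col]]
    for user, row in outliers.iterrows():
        print(f"{user}: {max_col}={row[max_col]:.1f} vs "
              f"{p99_col}={row[p99_col]:.1f} - investigate top queries")
```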
Step 3: Resource Pool Settings in Apache Impala

Based on this report, we will define the following settings:
- Max Running Queries / Max Queued Queries
- Default Query Memory Limit
- Maximum Memory
- Queue Timeout

We will walk through how to determine each setting for the resource pools you need. Once determined, each pool is created with the Create Resource Pool wizard in Cloudera Manager.

Max Running Queries / Max Queued Queries: to measure this properly we would need a separate report that records query start times and durations, in order to track the average, 99th percentile, and maximum concurrency per user. We recommend keeping this value as low as the use case allows, because it ultimately multiplies into the maximum memory you want a user or group of users to consume. For simplicity, we set Max Queued Queries to the same number as Max Running Queries.

Default Query Memory Limit: this is the per-node memory cap we want for each query. The safest input for this setting is the per_node_max_gb column in our report. The exception is when investigating a user's highest-memory queries shows that per_node_99th_gb better represents that user's well-behaved queries; in that case, use per_node_99th_gb.

Maximum Memory: this is calculated as default query memory limit × number of Impala hosts × max running queries. For example, on a 20-host cluster, if we want each query in the pool limited to 4 GiB per node and up to five queries running at a time, the maximum memory is 4 GiB × 20 × 5 = 400 GiB; a sketch of this arithmetic follows below.

Queue Timeout: this setting is driven by the concurrency, duration, and SLA of the queries. If a query must complete within 30 seconds and has been tuned to run in 20 seconds, then sitting in the queue for more than 10 seconds violates the SLA. Third-party applications running against Apache Impala may have their own query timeouts, which can interfere when we would rather return an immediate error. For long-running ETL workloads, where data skew can eventually stretch query duration, you can extend the timeout to ensure all queries are queued and run.
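To make the Maximum Memory arithmetic concrete, here is a trivial sketch; the 4 GiB, 20-host, five-query numbers mirror the example above.

```python
# Trivial sketch of the Maximum Memory formula used above:
# max memory = default query memory limit * number of Impala hosts
#              * max running queries.

def pool_max_memory_gib(query_mem_limit_gib: float,
                        impala_hosts: int,
                        max_running_queries: int) -> float:
    """Upper bound of memory the pool can hand out at full concurrency."""
    return query_mem_limit_gib * impala_hosts * max_running_queries

# 4 GiB per node, 20 hosts, 5 concurrent queries -> 400 GiB.
print(pool_max_memory_gib(4, 20, 5))
```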
Following Cloudera's admission control example, our cluster has 20 nodes with 128 GiB of Impala memory on each node (2,560 GiB for Impala in total). Right away we can see that three users (svc_account3, user1, and user4) need follow-up to determine whether their memory profiles can be improved by computing statistics, or whether several of their queries are simply badly written. We should also look into svc_account1, because their _99th and _max numbers are so far apart.

Default resource pool for users: this is our catch-all pool for anyone on the platform without a use case that justifies other resources. We reserve 25% of the cluster's resources for it.
- Maximum Memory: 640 GiB (25% of the cluster)
- Default Query Memory Limit: 3 GiB
- Max Running Queries: 10
- Max Queued Queries: 10
- Queue Timeout: 60 seconds

Default resource pool for service accounts: this is the regular pool for standard workloads generated by applications or scheduled processes.
- Maximum Memory: 1,000 GiB
- Default Query Memory Limit: 5 GiB
- Max Running Queries: 10
- Max Queued Queries: 10
- Queue Timeout: 20 minutes

Power users resource pool: this is the pool for users who need more resources. user3 may be the only user who meets the criteria for it.
- Maximum Memory: 400 GiB
- Default Query Memory Limit: 10 GiB
- Max Running Queries: 2
- Max Queued Queries: 2
- Queue Timeout: 60 minutes

svc_account2 resource pool: among the service accounts, this is the only one we found that genuinely needs a dedicated pool.
- Maximum Memory: 240 GiB
- Default Query Memory Limit: 12 GiB
- Max Running Queries: 1
- Max Queued Queries: 1
- Queue Timeout: 5 minutes

That said, we recommend creating a dedicated resource pool for every service account, to ensure its resources are protected and cannot be consumed by standard users.

Conclusion

With the guardrails of admission control in place, our customers see higher reliability and consistency across their workloads. Some care and feeding is still needed: occasionally a new use case has to go through the process of requesting and justifying resources beyond the defaults. As a reminder, every workload on every cluster is unique, and fully tuning admission control may take some trial and error.