
How to tune Hive on Spark parameters


This article explains in detail how to tune Hive on Spark parameters. The editor finds it quite practical and shares it here as a reference; I hope you get something out of it.

Preface

Hive on Spark means using Spark, instead of traditional MapReduce, as Hive's execution engine, as proposed in HIVE-7292. Hive on Spark is much more efficient than Hive on MR, but it still requires sensible parameter tuning to get the best performance. This article briefly lists some of the tuning items. To match real-world practice, the explanation assumes Spark is deployed on YARN.

Executor parameters

spark.executor.cores

This parameter sets the number of CPU cores available to each Executor. The value should not be too large, because Hive's underlying data lives on HDFS, and HDFS does not always handle highly concurrent writes well, which can easily lead to race conditions. In our practice, a value between 3 and 6 is reasonable.

Suppose our servers have 32 CPU cores per node. Leaving headroom for basic system services and components such as HDFS, the YARN NodeManager parameter yarn.nodemanager.resource.cpu-vcores is typically set to 28, i.e. YARN can use 28 cores. In that case spark.executor.cores is best set to 4, which allows up to 7 Executors per node without wasting cores. Similarly, if yarn.nodemanager.resource.cpu-vcores were 26, spark.executor.cores would best be set to 5, leaving only one core unused.

Since each Executor runs inside a YARN Container, spark.executor.cores must also not exceed the maximum number of cores a single Container can request, i.e. yarn.scheduler.maximum-allocation-vcores.
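As a concrete illustration, the 28-vcore example above might be applied per session like this (a minimal sketch; the same value can equally go into hive-site.xml or spark-defaults.conf):

-- 28 usable vcores per NodeManager: 4 cores per Executor allows 7 Executors per node
set spark.executor.cores=4;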

spark.executor.memory / spark.yarn.executor.memoryOverhead

These two parameters set the amount of on-heap and off-heap memory available to each Executor, respectively. The more heap memory, the more data an Executor can cache, making operations such as map join faster, but it also makes GC heavier. Hive officially provides an empirical formula for the total Executor memory:

yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores)

In effect, memory is distributed in proportion to the number of cores. Of the calculated total, roughly 80% to 85% goes to on-heap memory and the rest to off-heap memory.

Assuming a single node in the cluster has 128 GB of physical memory and yarn.nodemanager.resource.memory-mb (the amount of host memory a single NodeManager may use) is set to 120 GB, the total comes to 120 * 1024 * (4 / 28) ≈ 17554 MB. Split between heap and overhead as described above, spark.executor.memory ends up at about 13166 MB and spark.yarn.executor.memoryOverhead at about 4389 MB.
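Turned into session-level settings, the worked example above might look as follows (a sketch using the article's example numbers; note that spark.yarn.executor.memoryOverhead is specified in MB):

-- total per Executor: 120*1024 * (4/28) ≈ 17554 MB, split between heap and overhead
set spark.executor.memory=13166m;
set spark.yarn.executor.memoryOverhead=4389;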

As in the previous section, the sum of these two memory parameters must not exceed the maximum amount of memory a single Container can request, i.e. yarn.scheduler.maximum-allocation-mb.

spark.executor.instances

This parameter determines how many Executor instances are launched when a query runs, and it depends on the resources of each node and the number of nodes in the cluster. If we have 10 nodes of 32C/128G and follow the configuration above (i.e. each node hosts 7 Executors), then in theory spark.executor.instances could be set to 70 to maximize cluster utilization. In practice it is usually set lower (roughly half the theoretical value is recommended), because the Driver also consumes resources, and a YARN cluster often runs other services besides Hive on Spark.
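For the 10-node example, a session-level setting along these lines would be typical (35 is simply half of the theoretical 70 and is only illustrative):

-- theoretical maximum is 70 (10 nodes * 7 Executors); leave headroom for the Driver and other services
set spark.executor.instances=35;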

spark.dynamicAllocation.enabled

The fixed Executor allocation described above can be inflexible, especially when the Hive cluster provides analysis services to many users. It is therefore recommended to set spark.dynamicAllocation.enabled to true to enable dynamic Executor allocation, as sketched below.
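A possible sketch of enabling dynamic allocation follows; the minExecutors/maxExecutors bounds and the external shuffle service flag are standard Spark settings not discussed above, and the values are only examples:

set spark.dynamicAllocation.enabled=true;
-- optional bounds on how far the Executor count may scale
set spark.dynamicAllocation.minExecutors=5;
set spark.dynamicAllocation.maxExecutors=40;
-- Spark's dynamic allocation normally relies on the external shuffle service
set spark.shuffle.service.enabled=true;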

Driver parameters

spark.driver.cores

This parameter sets the number of CPU cores available to the Driver. A value of 1 is sufficient in most cases.

spark.driver.memory / spark.driver.memoryOverhead

These two parameters set the amount of on-heap and off-heap memory available to the Driver, respectively. Depending on how much resource headroom there is and the size of the job, the total is generally kept between 512 MB and 4 GB, following the same 8:2 split used for Executor memory. For example, spark.driver.memory can be set to about 819 MB and spark.driver.memoryOverhead to about 205 MB, which adds up to exactly 1 GB.
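Put together, the Driver settings from this section might look like this (a sketch using the example numbers above; the overhead value is in MB):

set spark.driver.cores=1;
-- 1 GB total for the Driver, split roughly 8:2 between heap and overhead
set spark.driver.memory=819m;
set spark.driver.memoryOverhead=205;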

Hive parameters

Most Hive parameters keep the same meaning and tuning approach as under Hive on MR, but two points deserve attention.

hive.auto.convert.join.noconditionaltask.size

We know that when one side of a join in Hive is a small table, and both the hive.auto.convert.join and hive.auto.convert.join.noconditionaltask switches are true (the default), the join is automatically converted into a more efficient map-side join. The parameter hive.auto.convert.join.noconditionaltask.size is the threshold for this map join conversion, and it defaults to 10 MB under Hive on MR.

However, under Hive on MR the table size statistic refers to the approximate size of the data as stored on disk, while under Hive on Spark it refers to the approximate size in memory. Since data on HDFS is often compressed or serialized and therefore smaller, this parameter should be increased appropriately when migrating from MR to Spark so that map joins are still converted properly. It is generally set to around 100~200 MB; if memory is plentiful, it can be larger.
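In session form, the map join settings discussed here might look like the following (the threshold is in bytes; 200 MB is simply the upper end of the range suggested above):

-- both switches already default to true; the threshold is raised because Spark measures in-memory size
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=200000000;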

hive.merge.sparkfiles

Small files are the natural enemy of HDFS, so Hive natively provides an option to merge them: hive.merge.mapredfiles under Hive on MR, which becomes hive.merge.sparkfiles under Hive on Spark, so be sure to set this parameter to true. The threshold parameters for small-file merging, namely hive.merge.smallfiles.avgsize and hive.merge.size.per.task, remain unchanged.
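A corresponding sketch for small-file merging, with threshold values close to Hive's defaults (given in bytes), could be:

set hive.merge.sparkfiles=true;
-- merge is triggered when the average output file size falls below avgsize; merged files aim for size.per.task
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=256000000;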

This is the end of the article on "how to tune Hive on Spark parameters". I hope the content above is helpful and helps you learn more. If you found the article useful, please share it so more people can see it.
