How to fix latency when Spark Streaming writes to Hive

2025-07-02 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article looks at why a Spark Streaming job that writes to Hive can fall further and further behind, and at two ways to fix it.

Background:

Hive version: 1.2.1, Spark version: 2.3.0. The real-time program logic is relatively simple: it consumes data from Kafka and writes it to a Hive table.

The data volume is on the order of hundreds of millions of records, and the Spark Streaming batch interval is 1 min. At some point batches begin to pile up: a large number of them sit in the Queued state, stuck behind a single job, and the longest delay reaches 1.7 h.

The job stays in the processing state the whole time. Writing to Hive takes only about 30 seconds, yet the job as a whole takes much longer than that to finish.

Each batch runs a few minutes late, the lag accumulates, and eventually data is delayed across the board.
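One way to see this kind of backlog building up is to log per-batch delays from inside the job. The following is a minimal sketch using Spark Streaming's StreamingListener interface. The class name BatchDelayListener, the helper isFallingBehind, and the rule of comparing both delays against the batch interval are assumptions made for illustration, not part of the original program.

```scala
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

object BatchDelayListener {
  // Hypothetical rule: a batch is "falling behind" if it waited in the queue
  // or ran for longer than one batch interval.
  def isFallingBehind(schedulingMs: Long, processingMs: Long, batchIntervalMs: Long): Boolean =
    schedulingMs > batchIntervalMs || processingMs > batchIntervalMs
}

// Prints a line for every completed batch that is falling behind.
class BatchDelayListener(batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(event: StreamingListenerBatchCompleted): Unit = {
    val info = event.batchInfo
    val scheduling = info.schedulingDelay.getOrElse(0L) // time spent queued before starting
    val processing = info.processingDelay.getOrElse(0L) // time spent running all jobs of the batch
    if (BatchDelayListener.isFallingBehind(scheduling, processing, batchIntervalMs)) {
      println(s"batch ${info.batchTime}: queued $scheduling ms, ran $processing ms, " +
        s"records ${info.numRecords}")
    }
  }
}
```

With the 1 min batch interval described above, the listener would be registered before the context is started, for example with ssc.addStreamingListener(new BatchDelayListener(60000L)), where ssc is the job's StreamingContext.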

Analysis:

The code that writes to Hive is simple:

// The preceding RDD transformations are omitted
....toDF.write.mode(SaveMode.Append).insertInto("ods.user_events")

Reading the Hive source code shows the following:

When data is written to Hive, a folder whose name starts with .hive-staging is created (since version 1.1 its default location is the target table's own folder), and the results are first stored in this temporary folder. Once execution completes, the temporary folder is renamed into place under the target table's directory.

The rename here is not simply a change to Hive metadata. A file move (mv) is performed only under certain conditions; otherwise the files are still copied.

If the source and destination directories share the same root directory, every file under the source directory is copied. Otherwise a rename is performed, which only involves NameNode metadata and moves no data.

Source code reference: https://github.com/apache/hive/blob/23db35e092ce1d09c5993b45c8b0f790505fc1a5/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java

Since Hive 1.1, temporary files are placed directly under the directory of the target table, so the final copy step becomes very slow when there are many files or a large volume of data.
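The cost difference between the two paths can be illustrated with the Hadoop FileSystem API. This is a minimal sketch, not Hive's actual code: the object name MoveVsCopy and both helper names are hypothetical, and it only contrasts a metadata-only rename with a full copy followed by deletion of the source.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object MoveVsCopy {
  // Metadata-only move: on HDFS the NameNode re-links the path and no data is rewritten.
  def moveByRename(fs: FileSystem, src: Path, dest: Path): Boolean =
    fs.rename(src, dest)

  // Data move: the contents of src are read and written again under dest,
  // then src is deleted. The cost grows with the number of files and bytes.
  def moveByCopy(fs: FileSystem, src: Path, dest: Path, conf: Configuration): Boolean = {
    val deleteSource = true
    FileUtil.copy(fs, src, fs, dest, deleteSource, conf)
  }
}
```

Both helpers leave the data under the destination, but only the second one rereads and rewrites it, which is why the copy path dominates the job time once a batch produces many files.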

Solution:

Option 1: modify the temporary directory

Set the following parameters:

hive.exec.stagingdir = /tmp/hive/.hive-staging
  Location of the temporary folder generated by Hive tasks.

hive.insert.into.multilevel.dirs = true
  When set to false, the parent directory of the insert target directory must already exist; when set to true, it is allowed not to exist.
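For a Spark job, one plausible way to pass these two settings is through the session builder, using the spark.hadoop. prefix so that they reach the Hadoop configuration. This is a sketch under that assumption: the object name StagingDirConfig and the method buildSession are hypothetical, and whether a given Spark version and deployment honors the settings should be verified, for example by checking where the .hive-staging folders actually appear. Setting them in hive-site.xml is an alternative.

```scala
import org.apache.spark.sql.SparkSession

object StagingDirConfig {
  // Values taken from Option 1 above.
  val settings: Map[String, String] = Map(
    "hive.exec.stagingdir" -> "/tmp/hive/.hive-staging",
    "hive.insert.into.multilevel.dirs" -> "true"
  )

  // Builds a Hive-enabled session with each setting passed as spark.hadoop.<key>.
  def buildSession(appName: String): SparkSession = {
    val builder = SparkSession.builder().appName(appName).enableHiveSupport()
    settings
      .foldLeft(builder) { case (b, (key, value)) => b.config(s"spark.hadoop.$key", value) }
      .getOrCreate()
  }
}
```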

Option 2: have Spark write directly into the corresponding partition directory on HDFS, and define the Hive table as an external table over that data. This approach does not depend on Hive for the write and removes an intermediate step. When using it, avoid small files and keep the number of files as low as possible.
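Option 2 might look like the sketch below. Everything specific in it is an assumption for illustration: the object name DirectPartitionWriter, a single partition column named dt, Parquet as the storage format (the external table would have to be declared STORED AS PARQUET), and a DataFrame that does not itself contain the dt column.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object DirectPartitionWriter {
  // Hypothetical layout: <tableRoot>/dt=<value>
  def partitionPath(tableRoot: String, dt: String): String =
    s"${tableRoot.stripSuffix("/")}/dt=$dt"

  // Statement that registers the directory as a partition of the external table.
  def addPartitionSql(table: String, dt: String, location: String): String =
    s"ALTER TABLE $table ADD IF NOT EXISTS PARTITION (dt='$dt') LOCATION '$location'"

  def write(spark: SparkSession, df: DataFrame, table: String,
            tableRoot: String, dt: String, numFiles: Int): Unit = {
    val location = partitionPath(tableRoot, dt)
    // coalesce caps the number of files written per batch, which limits small files.
    df.coalesce(numFiles).write.mode(SaveMode.Append).parquet(location)
    // Has no effect if the partition is already registered in the metastore.
    spark.sql(addPartitionSql(table, dt, location))
  }
}
```

Because each batch appends files straight into the final directory, no staging folder is moved or copied at the end of the job. The value of numFiles is a trade-off between write parallelism and the number of files per batch.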
