This article describes how the EMR Spark engine improved write performance by more than 10 times under a storage-compute separation architecture. The content is fairly detailed; interested readers may find it a useful reference.
Introduction
As big data architectures evolve, storage-compute separation better meets users' needs to reduce data storage costs and schedule computing resources on demand, and it is becoming the choice of more and more teams. Compared with HDFS, storing data on object storage saves cost, but object storage has much worse write performance when writing massive numbers of files.
Elastic MapReduce (EMR) is an elastic, open-source, Hadoop-ecosystem service hosted on Tencent Cloud, supporting frameworks such as Spark, HBase, Presto, Flink, Druid, and more.
Recently, while supporting an EMR customer, we encountered a typical storage-compute separation scenario. The customer uses the Spark component of EMR as the compute engine, with data stored on object storage. While helping the customer tune their workload, we found that Spark's write performance in massive-file scenarios was relatively low, dragging down the overall performance of the architecture.
After in-depth analysis and optimization, we greatly improved write performance, raising write performance to object storage by more than 10 times, accelerating business processing and earning praise from the customer.
The following sections describe how the EMR Spark compute engine improves write performance for massive-file workloads in a storage-compute separation architecture.
I. Problem background
Apache Spark is a fast, general-purpose compute engine designed for large-scale data processing, and it can be used to build large-scale, low-latency data analysis applications. Spark is an open-source, Hadoop MapReduce-like general parallel framework developed by UC Berkeley's AMP Lab, and it retains the advantages of Hadoop MapReduce.
Unlike Hadoop MapReduce, Spark integrates tightly with Scala, which lets it manipulate distributed datasets as easily as local collections. Although Spark was created to support iterative jobs on distributed datasets, it is in practice a complement to Hadoop and can run in parallel on top of the Hadoop file system or on cloud storage.
The compute engine we studied during this tuning work is the Spark component of the EMR product. Thanks to its excellent performance and other strengths, it has become the big data compute engine of choice for more and more customers.
For storage, the customer chose object storage. Object storage offers reliability, scalability, and lower cost, making it a cheaper option than the Hadoop file system HDFS; massive amounts of warm and cold data are especially well suited to object storage.
In the Hadoop ecosystem, native HDFS storage remains an indispensable choice in many scenarios, so the comparison below also includes HDFS write performance.
Coming back to the problem at hand, let's first look at a set of test data. On the Spark 2.x engine, we used SparkSQL to write 5,000 files to HDFS and to COS respectively and measured the execution time of each run.
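Purely as an illustration of how such a test could be set up (the paths, the COS bucket name, and the generated data below are placeholders, not the customer's actual job), a sketch in Scala might look like this:

```scala
import org.apache.spark.sql.SparkSession

object WriteBenchmark {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-5000-files-benchmark")
      .getOrCreate()

    // Hypothetical target paths: one on HDFS, one on COS object storage.
    val targets = Seq(
      "hdfs:///benchmark/spark_write_test",
      "cosn://example-bucket/benchmark/spark_write_test"
    )

    for (path <- targets) {
      val start = System.currentTimeMillis()
      // Repartitioning to 5000 partitions makes the job emit roughly 5000 output files.
      spark.range(0, 50000000L)
        .repartition(5000)
        .write
        .mode("overwrite")
        .parquet(path)
      val seconds = (System.currentTimeMillis() - start) / 1000
      println(s"Wrote ~5000 files to $path in ${seconds}s")
    }

    spark.stop()
  }
}
```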
The test results showed that writing to object storage took about 29 times as long as writing to HDFS; write performance on object storage is far worse than on HDFS. While observing the write process we found that network IO was not the bottleneck, so we needed to dig into exactly what the compute engine does when it produces output.
II. Analysis of the Spark data output process
1. Spark data flow
First, the main flow of data during the execution of a Spark job is as follows:
First, each task writes its result data to a temporary directory _temporary/task_[id] on the underlying file system. The directory layout at this point looks roughly like this:
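Purely as an illustration (the staging directory, task IDs, and file names below are invented), the intermediate layout under the output path is roughly:

```
/user/hive/warehouse/demo_table/.hive-staging_xxx/-ext-10000/
    _temporary/0/
        task_xxxx_m_000000/part-00000-xxxx.snappy.parquet
        task_xxxx_m_000001/part-00001-xxxx.snappy.parquet
        ...
```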
At this point the tasks on the executors have actually finished their work. Next, the driver takes over and moves these result files into the final location directory of the Hive table. This operation consists of three steps:
Step one: call the commitJob method of the OutputCommitter to merge the temporary files.
commitJob merges all the data files under each task_[id] subdirectory up into the parent directory ext-10000.
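The exact logic lives in Hadoop's FileOutputCommitter (analyzed further below); the following is a heavily simplified, paraphrased sketch of what the merge amounts to, not the actual Hadoop source:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Heavily simplified paraphrase of the version-1 commitJob merge: every committed
// task directory under _temporary is folded into the final output directory
// one entry at a time, on the driver.
def commitJobSimplified(outputDir: Path, conf: Configuration): Unit = {
  val fs = FileSystem.get(outputDir.toUri, conf)
  val jobAttemptDir = new Path(outputDir, "_temporary/0")

  // Single-threaded loop over all task_[id] subdirectories.
  for (taskDir <- fs.listStatus(jobAttemptDir) if taskDir.isDirectory) {
    for (file <- fs.listStatus(taskDir.getPath)) {
      val dest = new Path(outputDir, file.getPath.getName)
      // On HDFS a rename is a cheap NameNode metadata update; on object storage
      // it typically degrades into copying the whole object and deleting the old one.
      fs.rename(file.getPath, dest)
    }
  }
  fs.delete(new Path(outputDir, "_temporary"), true)
}
```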
Next, if the write mode is overwrite, any data already present in the table or partition is first moved to the trash (recycle bin).
After that, the data files merged in step one are moved to the location directory of the Hive table, and the whole data output is complete.
2. Locating the root cause
With the above picture of Spark's data flow, the next step is to determine whether the performance bottleneck is on the driver side or the executor side. Let's look at how long the jobs take on the executors:
We found little difference in the executor-side execution time of the jobs, yet a large difference in total elapsed time, which indicates that the time is mostly spent on the driver side.
The driver goes through three stages: commitJob, trashFiles, and moveFiles. Which of these stages accounts for the long run time?
We examined thread dumps through the Spark UI (refresh the Spark UI manually, or log in to the driver node and view thread stack traces with the jstack command) and found that all three stages are slow. Let's look at the source code of each of the three parts.
3. Source code analysis
(1) The commitJob phase
Spark uses Hadoop's FileOutputCommitter to handle the file merge. Hadoop 2.x defaults to mapreduce.fileoutputcommitter.algorithm.version=1, which traverses all the task subdirectories with a single-threaded for loop and then performs the merge-path (rename) operations; with a large number of output files this is clearly very time-consuming.
This is especially painful on object storage, where a rename is not just a metadata update: the data has to be copied to a new object.
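For reference, the committer algorithm version is a standard Hadoop setting that can be passed through Spark. Switching to version 2 shifts the merge work from the driver's commitJob into each task's commit (at the cost of weaker guarantees on failure); this is shown only as a related knob, not as the optimization described in this article:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-config-example")
  // Version 2 lets each task move its own files at task-commit time,
  // so the driver's commitJob no longer has to merge everything itself.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()
```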
(2) The trashFiles phase
The trashFiles operation is a single-threaded for loop that moves files to the trash. If a large amount of existing data needs to be overwritten, this step is very slow.
(3) The moveFiles phase
Similar to the previous stages, the moveFiles phase moves files with a single-threaded for loop.
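Both stages boil down to the same pattern. Below is a simplified, paraphrased sketch (not the actual Hive/Spark source) of what these driver-side loops look like:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Simplified shape of the trashFiles stage: existing data is moved to the
// recycle bin one path at a time, sequentially, on the driver.
def trashFilesSequentially(fs: FileSystem, files: Seq[Path], conf: Configuration): Unit = {
  for (file <- files) {
    Trash.moveToAppropriateTrash(fs, file, conf)
  }
}

// Simplified shape of the moveFiles stage: merged output files are renamed
// into the table's location directory with the same one-by-one loop.
def moveFilesSequentially(fs: FileSystem, files: Seq[Path], destDir: Path): Unit = {
  for (file <- files) {
    fs.rename(file, new Path(destDir, file.getName))
  }
}
```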
4. Summary of the problem
The bottleneck of the Spark engine when writing massive numbers of files is on the driver side.
The commitJob, trashFiles, and moveFiles phases on the driver take a long time to execute.
All three phases are slow because they process files one by one in a single-threaded loop.
On object storage, a rename has to copy the data and performs poorly, which further lengthens the time needed to write massive numbers of files.
III. Optimization results
As you can see, the community versions of big data compute engines still have performance problems when accessing object storage. The main reason is that most data platforms are built on HDFS, where renaming a file only requires a metadata update on the NameNode; that operation is very fast and rarely becomes a bottleneck.
Today, storage-compute separation in the cloud is an important lever for enterprises to reduce cost, so we modified the commitJob, trashFiles, and moveFiles code paths to process files in parallel with multiple threads, improving the performance of file write operations; the general idea is sketched below.
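The actual EMR patch is not reproduced in this article; the sketch below only illustrates the general idea of issuing such file moves from a fixed-size thread pool (the helper name and the parallelism of 16 are illustrative assumptions):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative parallel variant of the sequential move loop: renames are issued
// from a fixed-size thread pool instead of one at a time on a single thread.
def moveFilesInParallel(fs: FileSystem, files: Seq[Path], destDir: Path,
                        parallelism: Int = 16): Unit = {
  val pool = Executors.newFixedThreadPool(parallelism)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
  try {
    val futures = files.map { file =>
      Future {
        fs.rename(file, new Path(destDir, file.getName))
      }
    }
    Await.result(Future.sequence(futures), Duration.Inf)
  } finally {
    pool.shutdown()
  }
}
```

Since object-storage renames are dominated by per-object latency rather than bandwidth, running many renames concurrently hides most of that latency, which is consistent with the order-of-magnitude gains reported below.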
Using the same benchmark, writing 5,000 files to HDFS and to COS with SparkSQL, the optimized results were as follows:
In the end, write performance to HDFS improved by 41%, and write performance to object storage improved by more than 10 times!
That is how the EMR Spark engine improves write performance by more than 10 times under storage-compute separation. I hope the content above is helpful to you.