
How to understand efficient Spark DataFrame storage and management based on the Alluxio system


This article explains in detail how to understand efficient Spark DataFrame storage and management based on the Alluxio system. The editor shares it as a reference, and we hope you will gain a solid understanding of the topic after reading.

Introduction

More and more companies and organizations are deploying Alluxio together with Spark to simplify data management and improve data access performance. Qunar recently deployed Alluxio in its production environment and improved the average performance of its Spark streaming jobs by about 15x, with peaks of about 300x. Before adopting Alluxio, they found that some production Spark jobs would slow down or even fail to complete; after adopting Alluxio, those jobs finished quickly. In this article, we show how Alluxio helps Spark become more efficient; specifically, we show how to use Alluxio to store Spark DataFrames efficiently.

Alluxio and the Spark cache

Using Alluxio to store a Spark DataFrame is very simple: write the DataFrame as a file to Alluxio through the Spark DataFrame write API. The usual practice is to write the DataFrame as a Parquet file with df.write.parquet(). After the Parquet file corresponding to the DataFrame has been written to Alluxio, it can be read back in Spark with sqlContext.read.parquet(). To analyze and understand the performance difference between storing a DataFrame in Alluxio and storing it with Spark's built-in cache, we ran the following experiments.
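A minimal round-trip sketch in Scala, assuming an illustrative Alluxio URI (19998 is Alluxio's default master port; adjust the host and path for your deployment):

// Write the DataFrame to Alluxio as a Parquet file, then read it back.
// The URI below is an assumption for illustration only.
val alluxioFile = "alluxio://master_host:19998/data/df.parquet"
df.write.parquet(alluxioFile)
val dfFromAlluxio = sqlContext.read.parquet(alluxioFile)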

The relevant settings for the experiment are as follows:

Hardware configuration: a single worker installed on one node, with 61 GB of memory and an 8-core CPU

Software versions: Spark 2.0.0 and Alluxio 1.2.0, both with default configuration parameters

Deployment: Spark and Alluxio both run in standalone mode.

In this experiment, we store DataFrames using Spark's built-in cache at different storage levels and compare the results against storing the same DataFrames in Alluxio, collecting and analyzing performance measurements for both. We also vary the DataFrame size to show how the size of the stored DataFrame affects performance.
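The article does not describe its measurement harness; a minimal Scala timing sketch under that assumption (the time helper is hypothetical, introduced only for illustration):

// Hypothetical helper: run a Spark action and report elapsed seconds.
def time[T](action: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = action
  (result, (System.nanoTime() - start) / 1e9)
}

// Example usage with the aggregation used later in this article:
// val (_, seconds) = time { df.agg(sum("s1"), sum("s2")).show() }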

Storing DataFrames

A Spark DataFrame can be stored in the Spark cache using the persist() API, which can cache DataFrame data on different storage media.

The following Spark cache storage levels (StorageLevel) are used in this experiment:

MEMORY_ONLY: stores DataFrame objects in Spark JVM memory

MEMORY_ONLY_SER: stores serialized DataFrame objects in Spark JVM memory

DISK_ONLY: stores DataFrame data on local disk

Here is an example of caching a DataFrame using the persist() API:

import org.apache.spark.storage.StorageLevel._

df.persist(MEMORY_ONLY)

Another way to keep a DataFrame in memory is to write it as a file to Alluxio. Spark supports writing DataFrames in many different file formats; in this experiment, we write the DataFrame as a Parquet file.

Here is an example of writing a DataFrame to Alluxio, reusing the illustrative alluxioFile URI from the sketch above:
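// Write the DataFrame to Alluxio as a Parquet file
// (alluxioFile is the illustrative URI defined earlier).
df.write.parquet(alluxioFile)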

Querying a DataFrame stored in Alluxio

After the DataFrame has been saved (whether in Spark memory or in Alluxio), applications can read it for subsequent computation. In this experiment, we create a DataFrame with two floating-point columns, and the computation task is to calculate the sum of each column.
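A minimal Scala sketch of one way to create such a DataFrame (the column names s1 and s2 and the row count are illustrative assumptions):

import org.apache.spark.sql.functions.rand

// Build an n-row DataFrame with two random double columns, s1 and s2.
// Vary n to change the DataFrame size, as in the experiments below.
val n = 100000000L
val df = sqlContext.range(0, n)
  .withColumn("s1", rand())
  .withColumn("s2", rand())
  .drop("id")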

When a DataFrame is stored in Alluxio, Spark reads it as easily as it reads any file from Alluxio. Here is an example of reading a DataFrame from Alluxio and running the aggregation:

import org.apache.spark.sql.functions.sum

val df = sqlContext.read.parquet(alluxioFile)
df.agg(sum("s1"), sum("s2")).show()

We read the DataFrame both from the Parquet file in Alluxio and from the Spark cache at each storage level, and ran the aggregation above. The following figure shows the completion time of the aggregation under the different storage scenarios.

As the figure shows, reading the DataFrame from Alluxio gives relatively stable execution performance for the aggregation. Reading the DataFrame from the Spark cache has some performance advantage when the DataFrame is small, but performance degrades sharply as the DataFrame grows. In the experimental environment of this article, for all of Spark's built-in storage levels, aggregation performance degrades noticeably once the DataFrame reaches 20 GB.

On the other hand, compared with Spark's built-in caching, storing the DataFrame in Alluxio is slightly slower for small data. However, as the DataFrame grows, reading it from Alluxio performs better, because the time to read a DataFrame from Alluxio increases almost linearly with the data size. Thanks to this linear scalability of read and write performance, upper-layer applications can process ever larger data sets stably at memory speed.

Sharing DataFrames stored in Alluxio

Another advantage of storing DataFrames in Alluxio is that the data can be shared between different Spark applications or jobs. Once a DataFrame file has been written to Alluxio, it can be shared by different jobs, SparkContexts, or even different computing frameworks. Therefore, if a DataFrame stored in Alluxio is frequently accessed by multiple applications, they can all read the data directly from Alluxio memory without recomputing it or fetching it from the underlying external data source.
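A minimal sketch of this sharing, assuming a second application with its own SparkContext reads the same illustrative Alluxio URI used above:

import org.apache.spark.sql.functions.sum

// In a separate Spark application or job, read the DataFrame that the
// first application wrote to Alluxio; no recomputation is needed.
val sharedDf = sqlContext.read.parquet("alluxio://master_host:19998/data/df.parquet")
sharedDf.agg(sum("s1"), sum("s2")).show()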

To verify the advantage of sharing data through Alluxio, we ran the same DataFrame aggregation in the same experimental environment as above. Using a 50 GB DataFrame, we ran the aggregation in a single Spark application and recorded how long it took. Without Alluxio, the Spark application had to read the data from the source each time (a local SSD in this experiment); with Alluxio, it could read the data directly from Alluxio memory. The following figure compares the completion times of the two aggregation runs: with Alluxio, the aggregation is about 2.5x faster.

In the experiment above, the data source was a local SSD. If the DataFrame comes from a data source that is slower or less stable to access, Alluxio's advantage is even more pronounced. For example, the following figure shows the results after replacing the local SSD data source with public cloud storage.

This figure shows the average completion time of seven runs of the aggregation. The red error bars represent the minimum-to-maximum range of completion times. The results clearly show that Alluxio significantly improves the average performance of the operation, because with Alluxio caching the DataFrame, Spark reads it directly from Alluxio memory rather than from remote public cloud storage. On average, Alluxio speeds up this DataFrame aggregation by more than 10x.

On the other hand, because the data source is a public cloud system, Spark must read the data remotely across the network, and complex network conditions make read performance hard to predict. This instability shows clearly in the error bars of the figure above: without Alluxio, Spark job completion times vary over a range of more than 1100 seconds, while with Alluxio the range is only about 10 seconds. In this experiment, Alluxio reduced the variability of data reads by more than 100x.

Because network access to the public cloud storage system is unpredictable, the slowest Spark job without Alluxio took more than 1700 seconds, about twice as slow as the average. With Alluxio, the slowest Spark job was only about 6 seconds slower than the average. Measured by the slowest job execution time, Alluxio speeds up the DataFrame aggregation by more than 17x.

That concludes this discussion of efficient Spark DataFrame storage and management based on the Alluxio system. I hope the content above is helpful and gives you a deeper understanding of the topic.
