This article analyzes the storage options for Spark in action on Kubernetes. I hope you can take something away from it.
Preface
Today we will discuss one of the most important topics in the big data field: storage.
Big data has quietly become part of everyone's life. From buying a house to travel to hailing a taxi, you can see data analysis, data recommendation, and data decision-making powered by big data everywhere. For big data to support decisions more accurately, it needs data that is richer in dimensions and more complete.
Predictably, the order of magnitude of data will keep growing. With the arrival of the 5G era in particular, data throughput increases exponentially, the dimensions and sources of data multiply, and data types become ever more heterogeneous, all of which pose new challenges to big data platforms. Lower cost, larger capacity, and faster reads and writes have become the three major problems of big data storage, and they are the subject of this article.
Computing and Storage in Containerized Big Data
The separation of computing and storage has been discussed many times in the big data field. Usually we look at the question from the following perspectives:
Hardware limitations: machine bandwidth grows exponentially while disk speed stays basically the same, weakening the advantage of reading and writing local data.
Computing cost: the magnitudes of computing and storage do not match, which can waste a lot of computing power; independent computing resources save costs.
Storage cost: centralized storage can reduce storage costs while guaranteeing a higher SLA, eroding the advantages of self-built data warehouses.
These three problems become more and more prominent in the container era. In Kubernetes, Pods run on an underlying resource pool, and the storage a Pod needs is dynamically allocated and mounted through PVs and PVCs. In a sense, the container architecture itself separates computing from storage. So what changes and advantages does storage-compute separation bring to a containerized big data cluster?
Lower cost
Usually, when building a Spark big data platform on Aliyun, you first select D-series machines, deploy basic components such as HDFS and Hadoop on them, and then schedule Spark and other jobs onto the cluster through Yarn. The private network bandwidth of the D-series ranges from 3 Gbps to 20 Gbps, and 5.5 TB of local disk is bound by default. Because on the cloud the IO of cloud disks and the IO of the network are shared, while local disk IO is independent, the IO performance of D-series + local disk is better than that of a same-spec traditional model + cloud disk.
In actual production, however, we find that stored data keeps growing over time, and because data has a certain timeliness, the computing power needed per unit time does not grow with storage. This mismatch wastes cost. So what happens if we adopt the idea of storage-compute separation and use external storage such as OSS, Nas, or DFS (Aliyun's HDFS product)?
First, to shield the impact of differences in storage IO, we used remote DFS as the file storage. We then chose ecs.ebmhfg5.2xlarge (8C32G, 6 Gbps) and ecs.d1ne.2xlarge (8C32G, 6 Gbps), two popular same-spec models aimed at compute and big data scenarios respectively, and compared them.
Test results of ecs.ebmhfg5.2xlarge (8C32G):
Test results of ecs.d1ne.2xlarge (8C32G):
Through HiBench we can roughly estimate that, assuming IO performance is basically the same, the computing performance of ecs.ebmhfg5.2xlarge is about 30% higher than that of ecs.d1ne.2xlarge, while its cost is about 25% lower.
In other words, looking at computing power alone, we can choose more efficient and economical models. With storage and computing separated, we can estimate the required consumption separately along the two dimensions of storage and computing: on the model side, give more weight to compute-optimized ECS instances, while using OSS or DFS for storage at a lower cost than local storage.
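The 30% performance and 25% cost figures above can be folded into a single price-performance ratio. A minimal sketch with illustrative normalized numbers (ecs.d1ne.2xlarge as the 1.0 baseline; the figures are assumptions taken from the comparison above, not exact prices):

```scala
object PricePerf {
  // Hypothetical normalized figures: ecs.d1ne.2xlarge is the baseline.
  final case class Instance(name: String, relPerf: Double, relCost: Double) {
    // Higher is better: compute delivered per unit of money spent.
    def perfPerCost: Double = relPerf / relCost
  }

  val d1ne = Instance("ecs.d1ne.2xlarge", 1.0, 1.0)
  // ~30% more compute, ~25% lower cost, per the HiBench comparison above.
  val ebmhfg5 = Instance("ecs.ebmhfg5.2xlarge", 1.3, 0.75)

  def main(args: Array[String]): Unit =
    println(f"${ebmhfg5.name} price-performance vs baseline: " +
      f"${ebmhfg5.perfPerCost / d1ne.perfPerCost}%.2fx")
}
```

Under these assumptions, the compute-optimized model delivers roughly 1.7x the compute per unit cost, which is why separating the two dimensions pays off.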
In addition, D-series models usually have a 1:4 CPU-to-memory ratio. As big data job scenarios become richer, 1:4 is not always the best ratio. With storage and computing separated, we can choose appropriate compute resources per workload type, or even maintain several kinds of compute resources in one resource pool, improving resource utilization.
The SLA of data storage is also completely different from that of computing tasks. Storage cannot tolerate downtime or interruption, but computing tasks are already cut into subtasks, and a failed subtask can simply be retried. Going further, lower-cost resources such as spot (bidding) instances can serve as the runtime environment for computing tasks, optimizing cost even more.
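Because failed subtasks can simply be retried, running them on preemptible spot instances mostly comes down to wrapping each idempotent subtask in a retry. A minimal sketch (the helper is hypothetical, not a Spark or Aliyun API):

```scala
import scala.util.{Failure, Success, Try}

object Retry {
  // Run an idempotent subtask, retrying up to maxAttempts times in total.
  // On a spot/bidding instance, a preemption surfaces as a failure and the
  // subtask is simply run again; storage never tolerates this, compute can.
  def withRetry[T](maxAttempts: Int)(task: () => T): T =
    Try(task()) match {
      case Success(v)                    => v
      case Failure(_) if maxAttempts > 1 => withRetry(maxAttempts - 1)(task)
      case Failure(e)                    => throw e
    }
}
```

In a real scheduler the retry would also re-place the subtask on a healthy node, but the contract is the same: subtask failure is cheap, storage failure is not.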
In addition, the biggest feature of containers is elasticity. Through elasticity, a cluster can obtain tens or even hundreds of times its own computing resources in a short time and release them automatically when the task completes. Currently, Alibaba Cloud Container Service provides an autoscaler for node-level elastic scaling that can scale out 500 nodes in about a minute and a half. In the traditional storage-compute-coupled scenario, storage is a major obstacle to elasticity; once storage and computing are separated, the nearly stateless compute layer can scale freely, achieving true on-demand use.
Store more
After adopting external storage, we not only reach an almost unlimited storage level but also gain more choices. As mentioned at the beginning of this article, the big data era brings data with more dimensions and more heterogeneity, which also challenges how and where data is stored.
HDFS, HBase, Kafka and similar storage and pipelines alone cannot meet all our needs. For example, data collected from IoT devices tends to use time-series storage, while data generated upstream and downstream of applications is more likely to live in structured databases. Data sources and links keep multiplying, and so do the underlying infrastructure and dependencies of the big data platform. On the cloud, Aliyun provides a variety of storage services to cover the scenarios big data deals with.
Besides traditional HDFS, HBase, Kafka, OSS, Nas, and CPFS, these include MNS, TSDB, OAS (cold data archiving), and so on. Using managed storage services lets the big data platform focus on business development rather than on operating the underlying infrastructure, so you can store more data, more cheaply, with less effort.
Read and write faster
In one sense it is impossible for remote storage to read and write faster than local disks, because local IO can always be scaled by attaching enough disks in parallel. The question to ask is whether, once tasks have been cut up through MapReduce, the bottleneck of each subtask is still disk IO. In most cases the answer is no.
The private network bandwidth of the ECS specification we tested above already reaches 6 Gbps. If all of that network bandwidth were converted into disk IO, the resulting data throughput would be redundant relative to the computing power of 8C32G. So the faster reads and writes discussed here mean improving read/write speed under the premise that IO is already redundant.
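The "converted into disk IO" arithmetic is simple: divide the bandwidth in bits by eight. A quick sketch of the conversion:

```scala
object BandwidthToIO {
  // Convert network bandwidth (Gbps, decimal units) into an upper bound
  // on data throughput in MB/s: 6 Gbps = 6000 Mb/s, / 8 bits = 750 MB/s.
  def gbpsToMBps(gbps: Double): Double = gbps * 1000 / 8

  def main(args: Array[String]): Unit =
    println(f"6 Gbps ~ ${gbpsToMBps(6.0)}%.0f MB/s of potential throughput")
}
```

750 MB/s of sustained data flow comfortably exceeds what an 8C32G instance can process in most Spark workloads, which is the redundancy the text refers to.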
OSS is the object storage provided on Alibaba Cloud. Reads of different individual files proceed in parallel, so if your business scenario is the parallel reading of a large number of small and medium-sized files, such as reading and writing directories in Spark, IO read/write speed scales approximately linearly. If developers still prefer HDFS, Aliyun also provides an HDFS storage service with extensive storage and query optimizations, about 50% faster than traditional self-built HDFS.
Storage solution of Ali Cloud Container Service
Alibaba Cloud Container Service meets the needs of big data processing at multiple dimensions and levels. Developers can choose a storage method according to their business scenario and IO performance requirements.
Storage for large numbers of small files
OSS is the best fit for this scenario. There are two ways to operate OSS in a container: mount OSS as a file system, or use the SDK directly in Spark.
The first scheme is largely unsuitable for big data scenarios; especially with a large number of files, and without an optimization layer such as SmartFS, it introduces great latency and inconsistency. The SDK approach is direct and simple: put the corresponding Jar on the CLASSPATH, and you can process file contents in OSS directly, as in the following code.
package com.aliyun.emr.example

object OSSSample extends RunLocally {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println(
        """Usage: bin/spark-submit --class OSSSample examples-1.0-SNAPSHOT-shaded.jar
          |
          |Arguments:
          |
          |  inputPath      Input OSS object path, like oss://accessKeyId:accessKeySecret@bucket.endpoint/a/b.txt
          |  numPartitions  the number of RDD partitions.
          |""".stripMargin)
      System.exit(1)
    }
    val inputPath = args(0)
    val numPartitions = args(1).toInt
    val ossData = sc.textFile(inputPath, numPartitions)
    println("The top 10 lines are:")
    ossData.top(10).foreach(println)
  }

  override def getAppName: String = "OSS Sample"
}

In addition, for Spark SQL scenarios, Aliyun also provides OSS-Select support (https://yq.aliyun.com/articles/593910), which allows single files to be retrieved and queried through Spark SQL. Special note: when using Spark Operator to run tasks, the corresponding Jar package needs to be placed in advance on the CLASSPATH of both the Driver Pod and the Executor Pod.
The OSS approach mainly targets scenarios with a large number of files where each file is under 100 MB. Among the common storage options its data storage is the cheapest, it supports hot/cold data separation, and it suits workloads with many reads and few or no writes.
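The oss:// input path in the example above packs the credentials, bucket, endpoint, and object key into one URI. A hypothetical helper (not part of the Aliyun SDK) that splits such a path, just to make the format explicit:

```scala
object OssPath {
  // oss://accessKeyId:accessKeySecret@bucket.endpoint/objectKey
  private val Pattern = "oss://([^:/]+):([^@]+)@([^/]+)/(.+)".r

  def parse(uri: String): Option[(String, String, String, String)] = uri match {
    case Pattern(keyId, keySecret, bucketEndpoint, objectKey) =>
      Some((keyId, keySecret, bucketEndpoint, objectKey))
    case _ => None
  }
}
```

Embedding credentials in the URI is convenient for demos, but in production it is better to inject them through configuration or a Kubernetes Secret rather than job arguments.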
HDFS file storage
Aliyun has launched a new DFS service that lets you manage and access data as in the Hadoop Distributed File System (HDFS): a distributed file system with unlimited capacity and performance scaling, a single namespace, multi-sharing, high reliability, and high availability, usable without any modification to existing big data analysis applications.
The DFS service is compatible with the HDFS protocol. Developers only need to place the corresponding client Jar on the CLASSPATH of the Driver Pod and Executor Pod, and can then call it as follows.
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "dfs://f-5d68cc61ya36.cn-beijing.dfs.aliyuncs.com:10290/logdata/ab.log"
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
The DFS service mainly targets hot-data scenarios with heavy IO reads and writes. Its price is higher than OSS but lower than Nas and other structured storage. For developers used to HDFS it is the best fit; among all the storage solutions it has the best IO performance, better than local storage at the same capacity.
General file storage
For some scenarios the OSS approach is a little inconvenient because data upload and transfer depend on the SDK. Nas is then an alternative: Nas's protocol is strongly consistent, and developers can read and write data the same way as local files. It is used as follows:
1. First, create the Nas-related PV and PVC in the Container Service console.
2. Then declare the PersistentVolumeClaim in the Spark Operator definition.
apiVersion: "sparkoperator.k8s.io/v1alpha1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  restartPolicy:
    type: Never
  volumes:
    - name: pvc-nas
      persistentVolumeClaim:
        claimName: pvc-nas
  driver:
    cores: 0.1
    coreLimit: "200m"
    memory: "512m"
    labels:
      version: 2.4.0
    serviceAccount: spark
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.0
    volumeMounts:
      - name: "pvc-nas"
        mountPath: "/tmp"
Of course, developers familiar with Kubernetes can also mount it directly using dynamic storage. See the documentation:
https://www.alibabacloud.com/help/zh/doc-detail/88940.htm
Nas storage is used less often in Spark scenarios, mainly because its IO lags HDFS somewhat and its storage price is much higher than OSS. However, for workflows that need to reuse data and have modest IO requirements, Nas is very simple to use.
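That simplicity comes from the fact that the Nas mount behaves like a local POSIX file system: plain file APIs are enough, with no SDK involved. A minimal sketch, assuming the PVC is mounted at a path such as /tmp as in the manifest above:

```scala
import java.nio.file.{Files, Paths}

object NasRead {
  // Read a whole file from the mounted Nas path with ordinary file IO;
  // the mount point looks and behaves like a local directory.
  def readText(path: String): String =
    new String(Files.readAllBytes(Paths.get(path)), "UTF-8")
}
```

Since Nas is strongly consistent, a file written by one Pod is immediately visible to another Pod sharing the same mount, which is what makes simple data handoffs between workflow stages possible.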
Other storage services
In Spark Streaming scenarios we often use MNS or Kafka, and sometimes Elasticsearch, HBase, and so on. Aliyun offers corresponding managed services for these as well, and by integrating these cloud services developers can focus more on data development.