
Basic introduction and Operation tuning of Spark


This article introduces the basics of Spark and how to tune Spark jobs. Many people run into these problems in real-world cases, so let's walk through how to handle them. I hope you read it carefully and get something out of it!

Basic introduction to Spark

Before we discuss Spark tuning, let's look at some core concepts in Spark.

Action

An action is an RDD operation that returns a non-RDD result. Common actions in Spark include reduce, collect, count, first, take, takeSample, countByKey, and saveAsTextFile.
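
As a quick illustration (a minimal sketch assuming a spark-shell session where sc is the SparkContext; the data is made up), transformations are lazy and only the final action triggers a job:

val nums = sc.parallelize(1 to 100)   // create an RDD
val doubled = nums.map(_ * 2)         // transformation: nothing runs yet
val total = doubled.reduce(_ + _)     // action: a job is submitted and executed
println(total)                        // 10100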

Job

Each Spark action is broken down into a job.

Stage

A job is divided into groups of tasks, and each group of tasks is called a stage. Stages are demarcated by one of the following two types of task:

ShuffleMapTask: the tasks before a wide transformation; you can simply think of them as the work that happens before a shuffle.

ResultTask: you can simply think of it as an operation like take().
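
One way to see roughly where the shuffle (and therefore stage) boundaries fall is RDD.toDebugString, which prints the lineage of an RDD with shuffled RDDs on their own indentation level. A small sketch, assuming a spark-shell session; the input path is only illustrative:

val counts = sc.textFile("hdfs:///tmp/input.txt")   // illustrative path
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)      // wide transformation: the shuffle, and hence a new stage, happens here
println(counts.toDebugString)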

Partition

An RDD contains a fixed number of partitions, and each partition contains a number of records.

For an RDD returned by a narrow transformation (such as map or filter), each record in a partition is computed only from records in the corresponding partition of the parent RDD. Likewise, a narrow transformation does not change the number of partitions.
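
A small sketch of this (spark-shell assumed, numbers made up): narrow transformations such as map and filter keep the parent's partition count, while a shuffle can change it.

val rdd = sc.parallelize(1 to 1000, 8)                // 8 partitions
println(rdd.map(_ + 1).getNumPartitions)              // 8: map is narrow
println(rdd.filter(_ % 2 == 0).getNumPartitions)      // 8: filter is narrow
println(rdd.map(x => (x % 10, x))
           .reduceByKey(_ + _, 20)
           .getNumPartitions)                          // 20: the shuffle sets a new count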

Task

A unit of work sent to an executor for execution; within a stage, a task processes the data of only one partition.

Operation optimization

Adjusting the number of partitions in a stage often has a large effect on the execution efficiency of the program.
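
For reference, a sketch of the usual knobs (the path and the numbers are placeholders, not recommendations):

val pairs = sc.textFile("hdfs:///tmp/logs", 64)       // hint at the partition count at read time
  .map(line => (line.take(8), 1))
val counted = pairs.reduceByKey(_ + _, 200)           // set the partition count on the shuffle itself
val fewer = counted.coalesce(50)                      // shrink the count without a full shuffle
val more  = counted.repartition(400)                  // grow or rebalance with a full shuffle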

For associative, reductive operations, use reduceByKey rather than groupByKey, because groupByKey shuffles all the data, while reduceByKey shuffles only the partially reduced results.
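
For example, both of the following compute per-word counts, but reduceByKey combines values inside each partition before the shuffle, so far less data crosses the network (a sketch with made-up data):

val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))
val viaGroup  = words.groupByKey().mapValues(_.sum)   // shuffles every (word, 1) record
val viaReduce = words.reduceByKey(_ + _)              // shuffles one partial sum per key per partition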

Use aggregateByKey instead of reduceByKey when the input and output types are different.

AggregateByKey: Aggregate the values of each key, using given combine functions and a neutral "zero value". This function can return a different result type, U, than the type of the values in this RDD, V. Thus, we need one operation for merging a V into a U and one operation for merging two U's, as in scala.TraversableOnce. The former operation is used for merging values within a partition, and the latter is used for merging values between partitions. To avoid memory allocation, both of these functions are allowed to modify and return their first argument instead of creating a new U.
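
A typical case is a per-key average, where the value type (Int) differs from the aggregate type (a (sum, count) pair). A minimal sketch with made-up data:

val scores = sc.parallelize(Seq(("a", 3), ("a", 5), ("b", 10)))
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),       // merge a value into (sum, count) within a partition
  (a, b)   => (a._1 + b._1, a._2 + b._2)      // merge two (sum, count) pairs across partitions
)
val avg = sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }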

Instead of the flatMap-join-groupBy pattern, use cogroup.
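
cogroup groups both RDDs by key in a single pass, whereas the flatMap-join-groupBy chain shuffles the already-joined records a second time. A sketch with made-up data:

val clicks = sc.parallelize(Seq(("user1", "pageA"), ("user1", "pageB")))
val buys   = sc.parallelize(Seq(("user1", "itemX")))
val byUser = clicks.cogroup(buys)   // RDD[(String, (Iterable[String], Iterable[String]))]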

When joining the results of two reduceByKey operations, if both sides have the same partitioning, Spark will not shuffle during the join.
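
One way to arrange this (a sketch, with an arbitrarily chosen partition count) is to pass the same partitioner to both reduceByKey calls:

import org.apache.spark.HashPartitioner

val part   = new HashPartitioner(100)
val left   = sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(part, _ + _)
val right  = sc.parallelize(Seq(("a", 10), ("b", 20))).reduceByKey(part, _ + _)
val joined = left.join(right)   // co-partitioned, so the join itself adds no new shuffle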

When one side of a join is a dataset small enough to fit in memory, consider broadcasting it instead of performing a regular join.
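
For example, a minimal sketch of such a map-side join using a broadcast variable (the basic broadcast API is shown in the snippet that follows; the table contents here are illustrative):

val small  = Map("a" -> "apple", "b" -> "banana")     // small lookup table
val smallB = sc.broadcast(small)
val big    = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val joined = big.flatMap { case (k, v) =>
  smallB.value.get(k).map(name => (k, (v, name)))     // keep only keys found in the small side
}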

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

Resource tuning

Resources in Spark boil down to CPU and memory, and the following parameters affect how both are used.

Number of executors: the more executors, the higher the parallelism, but the less memory each executor gets.

Cores per executor: the more cores, the higher the parallelism.

The HDFS client has performance problems when there are many concurrent threads. A rough estimate is that about five parallel tasks per executor is enough to saturate the write bandwidth.

A partition count smaller than executors * cores wastes parallelism; the more partitions, the less memory each partition needs; but once the count is large enough, increasing it further no longer improves performance.
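
For reference, these knobs correspond to the following configuration keys (a sketch; the values are placeholders, not recommendations, and spark.executor.instances requires a cluster manager such as YARN):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("tuning-sketch")
  .config("spark.executor.instances", "10")    // number of executors
  .config("spark.executor.cores", "5")         // cores per executor (about 5 keeps HDFS writes efficient)
  .config("spark.executor.memory", "8g")       // memory per executor
  .config("spark.default.parallelism", "200")  // default partition count for RDD shuffles
  .getOrCreate()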

My naive take is that these should be adjusted as follows (a worked example with made-up numbers follows the list):

1. cores per executor = min(5, number of CPU cores per instance)

2. number of executors = (number of instances * CPU cores per instance) / cores per executor

3. executor.memory follows from the average number of executors per instance (each instance's memory is split among its executors); it in turn determines shuffle.memory and storage.memory

4. Estimate the total amount of data, i.e. the data size at the largest shuffle (the Spark driver's run records show the shuffle size)

5. Divide the result of step 4 by the result of step 3 to get the partition count. If it comes out very small, set the partition count to a few times executor * cores instead.
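
A worked example of one possible reading of these steps, with made-up cluster numbers (integer arithmetic is used for simplicity):

val instances        = 10                                                // machines
val coresPerInstance = 16
val memPerInstanceGb = 64

val coresPerExecutor = math.min(5, coresPerInstance)                     // step 1: 5
val executors        = instances * coresPerInstance / coresPerExecutor   // step 2: 32
val executorMemGb    = memPerInstanceGb / (executors / instances)        // step 3: about 21 GB
val shuffleSizeGb    = 500                                               // step 4: from the driver's run records
val rawPartitions    = shuffleSizeGb / executorMemGb                     // step 5: about 23
val partitions =                                                         // too small, so use a few times executor * cores
  if (rawPartitions < executors * coresPerExecutor) 3 * executors * coresPerExecutor
  else rawPartitions                                                     // here: 480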

That concludes "Basic introduction and Operation tuning of Spark". Thank you for reading. If you want to learn more about the industry, follow this site for more practical articles!
