What are the soul-searching Spark questions in big data development?


What are the soul-searching Spark questions in big data development? Many beginners are not entirely clear about them. To help, the following explains each one in detail. Anyone who needs this can learn from it, and I hope you gain something.

1. Spark's computation relies on memory. If only 10 GB of memory is available but you need to sort a 500 GB file and write out the result, how do you proceed?

1) Divide the 500 GB of data on disk into 100 chunks of 5 GB each. (Remember to leave some memory for the system!)

2) Read each 5 GB chunk into memory in turn and sort it with quicksort.

3) Write the sorted 5 GB chunk back to disk.

4) Repeat steps 2) and 3) 100 times; every chunk is now individually sorted. (All that remains is to merge them together!)

5) Read 5 GB / 100 = 0.05 GB from each of the 100 chunks into memory (100 input buffers).

6) Perform a 100-way merge, staging the result in a 5 GB in-memory output buffer. Whenever the output buffer fills to 5 GB, append it to the final file on disk and empty the buffer; whenever one of the 100 input buffers is exhausted, read the next 0.05 GB from its corresponding chunk, until all the data has been processed. A sketch of this merge step follows the list.
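Below is a minimal sketch of the 100-way merge in step 6), written in plain Scala rather than Spark. The chunk file names ("chunk-0" ... "chunk-99"), the one-record-per-line format, and the lexicographic ordering are assumptions for illustration; the buffered iterators stand in for the 0.05 GB input buffers.

import scala.collection.mutable
import scala.io.Source
import java.io.{BufferedWriter, FileWriter}

object ExternalMergeSketch {
  def main(args: Array[String]): Unit = {
    // Assumes the 100 chunk files were already sorted individually in steps 1)-4).
    val chunkFiles = (0 until 100).map(i => s"chunk-$i")

    // One buffered iterator per chunk plays the role of a 0.05 GB input buffer.
    val inputs = chunkFiles.map(f => Source.fromFile(f).getLines().buffered)

    // Min-heap keyed by the head record of each input buffer.
    implicit val byHead: Ordering[scala.collection.BufferedIterator[String]] =
      Ordering.by[scala.collection.BufferedIterator[String], String](_.head).reverse
    val heap = mutable.PriorityQueue(inputs.filter(_.hasNext): _*)

    // The BufferedWriter stands in for the 5 GB output buffer.
    val out = new BufferedWriter(new FileWriter("sorted-output"))
    while (heap.nonEmpty) {
      val smallest = heap.dequeue()                  // buffer holding the smallest head record
      out.write(smallest.next()); out.newLine()
      if (smallest.hasNext) heap.enqueue(smallest)   // refill: advance that buffer and re-insert it
    }
    out.close()
  }
}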

2. What is the difference between countByValue and countByKey?

First, look at the source code:

// PairRDDFunctions.scala
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

// RDD.scala
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
}

countByValue (RDD.scala)
- Works on a regular RDD.
- Its implementation calls countByKey.

countByKey (PairRDDFunctions.scala)
- An action on a PairRDD.
- Counts the occurrences of each key.
- The results are collected to the Driver, so it is not suitable when the result set is large.

Questions:

Can countByKey be used on a regular RDD?

Can countByValue be used on a PairRDD?

val rdd1: RDD[Int] = sc.makeRDD(1 to 10)
val rdd2: RDD[(Int, Int)] = sc.makeRDD((1 to 10).toList.zipWithIndex)
val result1 = rdd1.countByValue()   // works
val result2 = rdd1.countByKey()     // compile error: rdd1 is not a pair RDD
val result3 = rdd2.countByValue()   // works
val result4 = rdd2.countByKey()     // works

3. When does a join of two RDDs involve a Shuffle, and when does it not?

The join operation is an important benchmark for any database's performance. For Spark, the cost of a join lies in the Shuffle: shuffled data has to travel through disk and the network, so the less data is shuffled, the better the performance. Sometimes you can avoid the Shuffle in your program altogether. So under what circumstances does a Shuffle occur, and under what circumstances does it not?

3.1 Broadcast join

Broadcast join is easy to understand. Besides implementing it yourself, Spark SQL already provides it by default: the small table is distributed to all Executors. The controlling parameter is spark.sql.autoBroadcastJoinThreshold, whose default is 10 MB; tables below this threshold automatically use a broadcast join.
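Below is a minimal sketch of a broadcast join in Spark SQL. The DataFrame and column names are made up for illustration; the explicit broadcast() hint forces the behaviour even if the small side were above spark.sql.autoBroadcastJoinThreshold.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("broadcast-join").getOrCreate()
    import spark.implicits._

    val bigDf   = spark.range(0, 1000000).toDF("id")           // large side
    val smallDf = Seq((1L, "a"), (2L, "b")).toDF("id", "tag")  // small side

    // Ship smallDf to every Executor; the large side is not shuffled.
    val joined = bigDf.join(broadcast(smallDf), Seq("id"))
    joined.explain()   // the plan shows BroadcastHashJoin instead of SortMergeJoin
    spark.stop()
  }
}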

3.2 Bucket join

The RDD approach is essentially the same idea as the table approach; the difference is that the latter requires the data to be written into bucketed tables, so here we focus on the RDD approach. The principle: when two RDDs have been partitioned in advance with the same partitioner, their partition layouts match, and the join can proceed as a bucket join without a Shuffle. Spark provides no built-in operator for this on RDDs; you have to arrange it yourself when writing the program. For the table-based version, see ByteDance's core optimization practice for Spark SQL. Consider the following example.

rdd1 and rdd2 are Pair RDDs, and they hold exactly the same data.

Cases where a Shuffle must happen:

rdd1 => 5 partitions
rdd2 => 6 partitions

or, even with the same number of partitions, when the two RDDs do not share a partitioner (records with the same key are scattered across the partitions):

rdd1 => 5 partitions => (1, 0), (2, 0) || (1, 0), (2, 0) || (1, 0), (2, 0) || (1, 0), (2, 0), (1, 0) || (2, 0), (1, 0), (2, 0)
rdd2 => 5 partitions => (1, 0), (2, 0) || (1, 0), (2, 0) || (1, 0), (2, 0) || (1, 0), (2, 0), (1, 0) || (2, 0), (1, 0), (2, 0)

Case where no Shuffle happens (both RDDs pre-partitioned with the same partitioner, so each key sits in the same partition index on both sides):

rdd1 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) || (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) || empty || empty || empty
rdd2 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) || (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) || empty || empty || empty

This holds for all Shuffle operators: if the data has been partitioned in advance (partitionBy), in many cases no Shuffle is needed. A sketch of such a pre-partitioned join follows.
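Below is a minimal sketch of the no-Shuffle case for RDDs, assuming a local SparkContext; the data and partition count mirror the example above. Because both RDDs are pre-partitioned with the same HashPartitioner, the join reuses the existing layout, and only the two partitionBy calls shuffle.

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionedJoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("bucket-join").setMaster("local[*]"))

    val partitioner = new HashPartitioner(5)
    val rdd1 = sc.makeRDD(Seq((1, 0), (2, 0), (1, 0), (2, 0))).partitionBy(partitioner).cache()
    val rdd2 = sc.makeRDD(Seq((1, 0), (2, 0), (1, 0), (2, 0))).partitionBy(partitioner).cache()

    // Same partitioner on both sides: the join itself adds no extra Shuffle stage.
    val joined = rdd1.join(rdd2)
    println(joined.toDebugString)   // the lineage shows no additional ShuffledRDD for the join
    sc.stop()
  }
}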

Apart from the two approaches above, a join generally involves a Shuffle. For the details of how Spark performs joins, see: Big Data Development - Spark Join Principles Explained.

4. Transformations do not trigger jobs, with one exception.

There is one exceptional operator: sortByKey. Under the hood it runs a sampling algorithm (reservoir sampling) and then builds a RangePartitioner from the sampling results. So from the job point of view you will see two jobs: the sampling job and the job launched by the action itself. Remember the following:

sortByKey → reservoir sampling → collect
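A minimal sketch of this behaviour, assuming a live SparkContext named sc (for example in spark-shell): although sortByKey is a transformation, the first line below already launches a sampling job; the collect then launches the second job.

val pairs  = sc.makeRDD(Seq((3, "c"), (1, "a"), (2, "b")))
val sorted = pairs.sortByKey()   // job 1: reservoir sampling to compute the range boundaries
sorted.collect()                 // job 2: the action itself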

5. How are broadcast variables designed?

We all know that a broadcast variable places a copy of the data on every Executor, and we also know that the data of a broadcast variable originates from the Driver. What does that mean? If the table being broadcast lives in a Hive table, its storage is spread across many blocks, which are read by multiple Executors. The data is first pulled to the Driver and then broadcast out. The broadcast is not one big push of everything: the Executors fetch the data they need, and it then spreads through a BitTorrent-style protocol. In such a distributed peer-to-peer network, each node pulls blocks from whichever peer is closest, and every downloader is also an uploader, so not every task has to fetch the data from the Driver, which relieves the pressure on it. In addition, in Spark 1.x broadcast data was still held at the task level; now a shared lock is used and all the tasks on an Executor share one copy of the data.
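Below is a minimal sketch of using a broadcast variable for a map-side join, assuming a live SparkContext named sc. The data and names are made up for illustration: the small lookup table is collected to the Driver once, broadcast, and then shared by all the tasks on each Executor.

// Small lookup table: collected to the Driver, then broadcast to every Executor.
val smallTable: Map[Int, String] = sc.makeRDD(Seq((1, "a"), (2, "b"))).collect().toMap
val bcast = sc.broadcast(smallTable)

val big = sc.makeRDD(Seq((1, 100), (2, 200), (3, 300)))
val joined = big.mapPartitions { iter =>
  val lookup = bcast.value   // one copy per Executor, shared by its tasks
  iter.flatMap { case (k, v) => lookup.get(k).map(tag => (k, (v, tag))) }
}
joined.collect()   // keys without a match in the lookup table (here 3) are dropped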

Did reading the above help you? If you would like to learn more about related topics or read more related articles, please follow the industry information channel. Thank you for your support.
