
Five Probing Questions About Spark


This article walks through five probing questions about Spark in some detail. Interested readers can use it as a reference, and I hope it proves helpful.

1. Spark computation relies on memory. If you have only 10 GB of memory but need to sort a 500 GB file and write out the result, what do you do?

① Split the 500 GB of data on disk into 100 chunks of 5 GB each. (Careful: leave some memory for the system!)

② Read each 5 GB chunk into memory in turn and sort it with quicksort.

③ Write the sorted 5 GB chunk back to disk.

④ Loop 100 times; now all 100 chunks are individually sorted. (The remaining work is merging them!)

⑤ Read 5 GB / 100 = 0.05 GB from each of the 100 chunks into memory (100 input buffers).

⑥ Run a 100-way merge, staging the merged output in a 5 GB in-memory output buffer. Whenever the buffer fills to 5 GB, append it to the final file on disk and empty it; whenever one of the 100 input buffers is exhausted, refill it with the next 0.05 GB from its chunk, until everything has been processed. A code sketch of this scheme follows.
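The scheme above is a classic external merge sort. Below is a minimal Scala sketch of it, using in-memory sequences as stand-ins for the on-disk chunks; the real chunk I/O, the 5 GB sizes, and the object name are illustrative assumptions, not anything from Spark itself:

import scala.collection.mutable

object ExternalSortSketch {

  // Steps 1-4: split the input into chunks and sort each chunk independently.
  def sortChunks(data: Iterator[Long], chunkSize: Int): Seq[Seq[Long]] =
    data.grouped(chunkSize).map(_.sorted).toSeq

  // Steps 5-6: k-way merge of the sorted chunks with a min-heap,
  // mirroring the 100 input buffers feeding one output stream.
  def merge(chunks: Seq[Seq[Long]]): Iterator[Long] = {
    case class Head(value: Long, rest: Iterator[Long])
    val heap = mutable.PriorityQueue.empty[Head](Ordering.by[Head, Long](_.value).reverse)
    chunks.map(_.iterator).filter(_.hasNext)
      .foreach(it => heap.enqueue(Head(it.next(), it)))
    new Iterator[Long] {
      def hasNext: Boolean = heap.nonEmpty
      def next(): Long = {
        val h = heap.dequeue() // smallest head across all chunks
        if (h.rest.hasNext) heap.enqueue(Head(h.rest.next(), h.rest)) // refill from its chunk
        h.value
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq.fill(1000)(scala.util.Random.nextLong())
    val result = merge(sortChunks(data.iterator, chunkSize = 100)).toSeq
    assert(result == data.sorted) // globally sorted despite chunk-local sorting
  }
}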

2. The difference between countByValue and countByKey

First, from the source code:

// PairRDDFunctions.scala
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

// RDD.scala
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
}

countByValue (RDD.scala)

Acts on an ordinary RDD

Implemented by calling countByKey (each value becomes a key)

countByKey (PairRDDFunctions.scala)

Acts on a PairRDD

Counts occurrences of each key

Both collect their results to the driver, so neither is suitable when the result set is large.

Questions:

Can countByKey be used on an ordinary RDD?

Can countByValue be used on a PairRDD?

val rdd1: RDD[Int] = sc.makeRDD(1 to 10)
val rdd2: RDD[(Int, Int)] = sc.makeRDD((1 to 10).toList.zipWithIndex)

val result1 = rdd1.countByValue()  // OK
val result2 = rdd1.countByKey()    // does not compile: countByKey needs an RDD of pairs
val result3 = rdd2.countByValue()  // OK
val result4 = rdd2.countByKey()    // OK
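To make the semantic difference concrete, here is a small, hypothetical spark-shell session (it assumes a live SparkContext sc, and the data is made up); the results follow directly from the two definitions above:

val pairs = sc.makeRDD(Seq((1, 10), (1, 20), (2, 30)))
pairs.countByKey()   // Map(1 -> 2, 2 -> 1): occurrences of each key
pairs.countByValue() // Map((1,10) -> 1, (1,20) -> 1, (2,30) -> 1): occurrences of each whole record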

3. When does a join between two RDDs have a shuffle, and when does it not?

The join operation is an important benchmark for the performance of any data system. For Spark, join performance is really a question of shuffle: shuffled data must travel through disk and network, so the less shuffle, the better the performance, and sometimes a program can avoid the shuffle entirely. Under what circumstances is there a shuffle, and when is there none?

3.1 Broadcast join

Broadcast join is easy to understand. Apart from implementing it ourselves, Spark SQL already does it for us by default: the small table is distributed to all Executors. The controlling parameter is spark.sql.autoBroadcastJoinThreshold, which defaults to 10 MB; a table below this threshold is automatically broadcast-joined, as sketched below.
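As a sketch, this is how the threshold and the explicit broadcast hint are typically used. The table paths and the join column here are made-up placeholders; the config key and the broadcast function are standard Spark SQL:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Raise the auto-broadcast threshold from the default 10 MB to 50 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

val big   = spark.read.parquet("/path/to/big_table")   // placeholder path
val small = spark.read.parquet("/path/to/small_table") // placeholder path

// Below the threshold Spark broadcasts the small side automatically;
// the broadcast() hint forces it regardless of the estimated size.
val joined = big.join(broadcast(small), Seq("id"))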

3.2 Bucket join

The RDD approach is in fact similar to the table approach, except that the latter requires writing bucketed tables; here we focus on RDDs. The principle: when two RDDs have been partitioned in advance by the same partitioner, their partition layouts line up, and a bucket-style join can proceed without a shuffle. There is no ready-made operator that sets this up, so you arrange it yourself when writing the program. For the table-side version, see ByteDance's core optimization practice on Spark SQL. Consider the following example.

rdd1 and rdd2 are both Pair RDDs, holding exactly the same data.

Case 1: there must be a shuffle.

rdd1 => 5 partitions

rdd2 => 6 partitions

Even with equal partition counts, if the records are scattered with no common partitioner, a shuffle is still required, e.g.:

rdd1 => 5 partitions => (1, 0), (2, 0) | (1, 0), (2, 0) | empty | (1, 0), (1, 0) | (2, 0), (1, 0), (2, 0)

rdd2 => 5 partitions => (1, 0), (2, 0) | (1, 0), (2, 0) | empty | (1, 0), (1, 0) | (2, 0), (1, 0), (2, 0)

Case 2: there must be no shuffle.

rdd1 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) | (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) | empty | empty | empty

rdd2 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) | (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) | empty | empty | empty

Here both sides were laid out by the same partitioner: every (1, 0) record sits in the first partition and every (2, 0) record in the second, on both sides, so matching partitions can be joined directly.

This generalizes to all shuffle operators: if the data has been laid out in advance with partitionBy, in many cases there is no further shuffle. A minimal sketch follows.
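A minimal sketch of this co-partitioned join with plain RDDs, assuming a live SparkContext sc (the data is made up): both sides are pre-partitioned with the same HashPartitioner, so only the two partitionBy calls shuffle, and the join itself does not:

import org.apache.spark.HashPartitioner

val part = new HashPartitioner(5)

// partitionBy shuffles once; cache so the layout is reused, not recomputed.
val rdd1 = sc.makeRDD(Seq((1, "a"), (2, "b"), (3, "c"))).partitionBy(part).cache()
val rdd2 = sc.makeRDD(Seq((1, 10), (2, 20), (3, 30))).partitionBy(part).cache()

// Same partitioner on both sides => the join reuses the existing layout.
val joined = rdd1.join(rdd2)
println(joined.partitioner) // Some(HashPartitioner...) inherited; no extra shuffle stage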

Apart from the above two approaches, a join generally does come with a shuffle. For the underlying mechanics of Spark's joins, see: Big Data Development - A Detailed Explanation of the Spark Join Principle.

4. Is it true that a transformation never triggers a job?

There is one exceptional operator: sortByKey. Under the hood it runs a sampling pass (reservoir sampling) and then builds a RangePartitioner from the sample. So from the job view you will see two jobs: the one from the action you call, plus one extra. Remember the chain:

sortByKey → reservoir sampling → collect (an internal job that computes the range boundaries)
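An easy way to observe this, assuming a live SparkContext sc: run the snippet below and watch the Jobs tab in the Spark UI; you should see two jobs, one from the sampling pass triggered by sortByKey itself and one from collect:

val pairs = sc.makeRDD((1 to 1000).map(i => (scala.util.Random.nextInt(100), i)))
val sorted = pairs.sortByKey() // triggers a sampling job to compute range boundaries
sorted.collect()               // the action's own job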

5. How are broadcast variables designed?

We all know that a broadcast variable puts a copy of the data on every executor, and we also know that the data must first pass through the driver. What does that mean? If the table to be broadcast lives in a Hive table, its storage is spread across HDFS blocks read by multiple executors. The data is first pulled back to the driver and only then broadcast. And the broadcast is not a naive one-to-all push from the driver: Spark uses a BitTorrent-like protocol (TorrentBroadcast), in which chunks of the data move across a peer-to-peer network according to network distance, and every downloader is also an uploader. Executors therefore do not all fetch directly from the driver, which greatly reduces the pressure on it. In addition, whereas early Spark versions fetched a copy per task, the fetch is now guarded by a shared per-executor lock, and all tasks on an executor share the same copy of the data.
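A minimal usage sketch of the flow just described, assuming a live SparkContext sc (the lookup data and names are illustrative): the small table is gathered at the driver, broadcast once, and then read by every task from executor-local storage:

// e.g. the result of collecting a small dimension table to the driver
val lookup: Map[Int, String] = Map(1 -> "apple", 2 -> "banana")
val bc = sc.broadcast(lookup)

val ids = sc.makeRDD(Seq(1, 2, 3))
// Tasks read bc.value from their executor's local copy, not from the driver.
val resolved = ids.map(id => (id, bc.value.getOrElse(id, "unknown"))).collect()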

That is all for the five probing questions about Spark. I hope the content above helps you and teaches you something new. If you found the article worthwhile, share it so more people can see it.
