This article walks through five tough interview-style questions about Spark. The content is fairly detailed; interested readers can use it as a reference, and I hope it is helpful.
1. Spark computation relies on memory. If you only have 10 GB of memory but need to sort a 500 GB file and output the result, how do you do it?
① Split the 500 GB of data on disk into 100 chunks of 5 GB each (be careful to leave some memory for the system!).
② Read each 5 GB chunk into memory in turn and sort it with quicksort.
③ Write the sorted 5 GB chunk back to disk.
④ Repeat 100 times; now all 100 chunks are individually sorted. What remains is to merge them.
⑤ Read 5 GB / 100 = 0.05 GB from each of the 100 chunks into memory (100 input buffers).
⑥ Perform a 100-way merge and stage the merged records in a 5 GB in-memory output buffer. Whenever the output buffer fills up, append its 5 GB to the final file on disk and empty it; whenever one of the 100 input buffers is exhausted, refill it with the next 0.05 GB from its chunk, until everything has been processed (a sketch of the merge step follows this list).
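A minimal Scala sketch of the 100-way merge in step ⑥, under the assumption that each sorted chunk is already exposed as an Iterator that streams records from disk through its 0.05 GB input buffer, and that the write callback stands in for the 5 GB output buffer plus the final output file (both helpers are hypothetical):

import scala.collection.mutable

// One entry per chunk: the current smallest unconsumed record and where it came from.
case class Head(value: Long, chunkId: Int)

def mergeSortedChunks(chunks: IndexedSeq[Iterator[Long]], write: Long => Unit): Unit = {
  // Min-heap over the current head record of every chunk.
  val heap = mutable.PriorityQueue.empty[Head](Ordering.by[Head, Long](_.value).reverse)

  // Seed the heap with the first record from each of the 100 chunks.
  chunks.zipWithIndex.foreach { case (it, id) =>
    if (it.hasNext) heap.enqueue(Head(it.next(), id))
  }

  // Repeatedly emit the globally smallest record, then refill from the same chunk
  // (this is where the next 0.05 GB would be paged in from disk).
  while (heap.nonEmpty) {
    val Head(value, id) = heap.dequeue()
    write(value)  // buffered; flushed to the final file whenever 5 GB accumulates
    if (chunks(id).hasNext) heap.enqueue(Head(chunks(id).next(), id))
  }
}

The same pattern works for any k-way merge; k = 100 here simply matches the number of chunks.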
2. The difference between countByValue and countByKey
First, look at the source code:
// PairRDDFunctions.scala
def countByKey(): Map[K, Long] = self.withScope {
  self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
}

// RDD.scala
def countByValue()(implicit ord: Ordering[T] = null): Map[T, Long] = withScope {
  map(value => (value, null)).countByKey()
}
countByValue (RDD.scala)
Acts on an ordinary RDD.
Its implementation calls countByKey under the hood.
countByKey (PairRDDFunctions.scala)
Acts on a Pair RDD.
Counts the number of records per key.
Both collect their results back to the driver, so neither is suitable when the result set is large.
Question:
Can countByKey be used on an ordinary RDD?
Can countByValue be used on a Pair RDD?
val rdd1: RDD[Int] = sc.makeRDD(1 to 10)
val rdd2: RDD[(Int, Int)] = sc.makeRDD((1 to 10).toList.zipWithIndex)
val result1 = rdd1.countByValue()  // OK
val result2 = rdd1.countByKey()    // does not compile: countByKey only exists for RDD[(K, V)]
val result3 = rdd2.countByValue()  // OK
val result4 = rdd2.countByKey()    // OK
3. When does a join between two RDDs cause a shuffle, and when does it not?
Join is an important performance benchmark for any data system. For Spark, the cost of a join is largely the cost of the shuffle: shuffled data has to travel through disk and network, so the less data is shuffled, the better the performance, and programs should avoid shuffles wherever possible. So under which circumstances does a join shuffle, and when does it not?
3.1 Broadcast join
A broadcast join is easy to understand. Besides implementing it ourselves, Spark SQL already does it for us by default: the small table is shipped to every executor. The controlling parameter is spark.sql.autoBroadcastJoinThreshold, with a default of 10 MB; when a table is smaller than this threshold, a broadcast join is used automatically.
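A minimal sketch of triggering a broadcast join in Spark SQL. The DataFrames and the application name here are made up for illustration; only the broadcast hint and the spark.sql.autoBroadcastJoinThreshold setting come from the discussion above.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("broadcast-join-sketch").getOrCreate()

// Tables smaller than this threshold are broadcast automatically (default 10 MB).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")

// Hypothetical data: a large fact side and a small dimension side.
val largeDF = spark.range(0L, 1000000L).withColumnRenamed("id", "key")
val smallDF = spark.range(0L, 100L).withColumnRenamed("id", "key")

// Explicit hint: broadcast the small side so the large side is never shuffled.
val joined = largeDF.join(broadcast(smallDF), Seq("key"))
joined.explain()  // the physical plan should show a BroadcastHashJoin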
3.2 Bucket join
For RDDs this works much like it does for tables, except that the table version requires writing bucketed tables first; here we focus on the RDD side. The principle is that when two RDDs have been partitioned in advance with the same partitioner, their partition layouts line up, so the join can proceed bucket by bucket without a shuffle. There is no built-in operator that sets this up for you, so you have to arrange it yourself when writing the program. For the table-side version, see ByteDance's core optimization practices for Spark SQL. Consider the following example.
Suppose rdd1 and rdd2 are both Pair RDDs containing exactly the same data.

Cases where there must be a shuffle:
rdd1 => 5 partitions
rdd2 => 6 partitions
rdd1 => 5 partitions => (1, 0), (2, 0) | (1, 0), (2, 0) | (1, 0), (2, 0), (1, 0), (1, 0) | (2, 0), (1, 0), (2, 0) | ...
rdd2 => 5 partitions => (1, 0), (2, 0) | (1, 0), (2, 0) | (1, 0), (2, 0), (1, 0), (1, 0) | (2, 0), (1, 0), (2, 0) | ...
Either the partition counts differ, or they match but the keys are scattered arbitrarily across partitions; in both cases neither RDD carries a partitioner that Spark can reuse, so the join must shuffle.

Case where there must be no shuffle:
rdd1 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) | (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) | empty | empty | empty
rdd2 => 5 partitions => (1, 0), (1, 0), (1, 0), (1, 0), (1, 0), (1, 0) | (2, 0), (2, 0), (2, 0), (2, 0), (2, 0), (2, 0) | empty | empty | empty
Both RDDs were partitioned with the same partitioner, so records with the same key already sit in the same partition index on both sides, and the join needs no shuffle.
The same idea applies to all shuffle operators: if the data has already been pre-partitioned with partitionBy using the same partitioner, the shuffle can in many cases be avoided entirely, as the sketch below shows.
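A small sketch of the RDD-side co-partitioning trick, assuming a live SparkContext named sc; the sample data is made up.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Pre-partition both sides with the same partitioner and cache them,
// so the later join can reuse the existing layout instead of shuffling again.
val partitioner = new HashPartitioner(5)

val rdd1: RDD[(Int, Int)] = sc.makeRDD(Seq((1, 0), (2, 0), (1, 0), (2, 0)))
  .partitionBy(partitioner)
  .cache()

val rdd2: RDD[(Int, Int)] = sc.makeRDD(Seq((1, 0), (2, 0), (1, 0), (2, 0)))
  .partitionBy(partitioner)
  .cache()

// Both sides share the same partitioner, so the join stage introduces no new shuffle;
// the only shuffles in the lineage are the two partitionBy calls paid up front.
val joined = rdd1.join(rdd2)
println(joined.toDebugString)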
Apart from the two cases above, a join generally does involve a shuffle. For the details of how Spark implements joins, see: Big Data Development - A Detailed Explanation of Spark Join Principles.
4. Is it true that a transformation never triggers a job?
There is one exceptional operator: sortByKey. Under the hood it runs a reservoir-sampling pass over the data and then builds a RangePartitioner from the sample, so in the Spark UI you will see an extra job in addition to the one triggered by the action itself. Remember this chain:
sortByKey → reservoir sampling → collect (a sampling job is submitted)
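A quick way to observe this, assuming a live SparkContext named sc and the Spark UI open:

// sortByKey is a transformation, yet it already submits a sampling job
// (RangePartitioner reservoir-samples the keys) before any action is called.
val pairs = sc.makeRDD((1 to 100000).map(i => (scala.util.Random.nextInt(), i)))
val sorted = pairs.sortByKey()  // a sampling job appears in the UI at this point
sorted.count()                  // the action then submits a further job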
5. How are broadcast variables designed?
We all know that a broadcast variable places a copy of the data on every executor, and that this data always starts out on the driver. What does that mean in practice? If the table being broadcast lives in a Hive table, its storage is spread across HDFS blocks read by multiple executors. The data is first pulled back to the driver and only then broadcast out. The broadcast is not a single push from the driver to everyone: the data is distributed over a BitTorrent-like protocol, meaning blocks of the broadcast move around a peer-to-peer network according to network distance, and every downloader is also an uploader, so not every executor has to fetch the data directly from the driver, which greatly reduces the pressure on it. In addition, in early Spark versions each task kept its own copy of the broadcast data; now the copy is shared per executor and guarded by a lock, so all tasks running on the same executor share one copy. A minimal usage sketch follows.
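A minimal usage sketch, assuming a live SparkContext named sc; the lookup map is made up.

// The lookup map is built on the driver, broadcast once, and then every task on
// an executor reads the same local copy instead of shipping the map with each task.
val lookup: Map[Int, String] = Map(1 -> "a", 2 -> "b", 3 -> "c")
val bcLookup = sc.broadcast(lookup)

val mapped = sc.makeRDD(1 to 10).map(i => bcLookup.value.getOrElse(i, "unknown"))
println(mapped.collect().mkString(", "))

bcLookup.unpersist()  // release the executor-side copies when they are no longer needed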
That covers the five tough questions about Spark. I hope the content above is of some help and that you have learned something from it. If you think the article is good, feel free to share it so more people can see it.