Case Analysis of Spark Data Locality


This article introduces the basics of Spark data locality through a case analysis. Many people run into locality problems in real jobs, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

Scenario:

On the driver, before Spark assigns the tasks of each stage of an application, it works out which shard of data each task will process, that is, which partition of the RDD. Spark's task-allocation algorithm prefers locality: it wants each task to land exactly on the node that holds the data it is going to compute, so that no data has to be moved across the network. In practice, though, things do not always go that way: a task may not get the chance to run on the node where its data lives, for example because that node's computing resources are already fully occupied. In such cases Spark generally waits for a while, 3 seconds by default (not an absolute rule; different wait times can be configured for different locality levels), retrying by default up to 5 times. If it still cannot schedule the task locally, it falls back to a poorer locality level, for example assigning the task to a node close to the one holding its data, and computes there.
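To make the wait-and-downgrade idea concrete, here is a small, self-contained Scala sketch. It is not Spark's actual scheduler code: the level names match Spark's locality levels, but the single waitPerLevelMs value and the pickLocalityLevel helper are invented for illustration, with 3000 ms chosen to mirror the default spark.locality.wait.

    // Hypothetical illustration of waiting and then downgrading the locality level.
    // This is NOT Spark's real scheduler code.
    object LocalityWaitSketch {
      // Locality levels from best to worst, using Spark's names.
      val levels = Seq("PROCESS_LOCAL", "NODE_LOCAL", "NO_PREF", "RACK_LOCAL", "ANY")

      // Assumed wait per level before falling back to the next one;
      // 3000 ms mirrors the spark.locality.wait default.
      val waitPerLevelMs = 3000L

      // Given how long a task has already been waiting, return the best
      // locality level the scheduler would still insist on.
      def pickLocalityLevel(waitedMs: Long): String = {
        val idx = math.min((waitedMs / waitPerLevelMs).toInt, levels.length - 1)
        levels(idx)
      }

      def main(args: Array[String]): Unit = {
        Seq(0L, 2500L, 3500L, 7000L, 20000L).foreach { waited =>
          println(s"waited $waited ms -> schedule at ${pickLocalityLevel(waited)}")
        }
      }
    }

Running it prints how the acceptable level degrades from PROCESS_LOCAL towards ANY as the waiting time grows.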

In that second case, data transfer is unavoidable. The task asks the BlockManager on its own node for the data; the BlockManager discovers that it does not have the data locally and uses a getRemote() method to fetch it, via the TransferService (the network data-transfer component), from the BlockManager on the node that does hold the data, and the data is then shipped back over the network to the node where the task runs.
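The local-first, remote-fallback lookup can be pictured with a tiny sketch. The BlockStore trait and the lookup helper below are invented purely for illustration; they are not the real BlockManager or TransferService API, which is internal to Spark.

    // Illustrative only: a made-up local-first / remote-fallback lookup,
    // loosely modelled on the behaviour described above.
    object BlockLookupSketch {
      trait BlockStore {
        def getLocal(blockId: String): Option[Array[Byte]]    // this executor's storage
        def fetchRemote(blockId: String): Option[Array[Byte]] // goes over the network
      }

      // orElse only evaluates its argument on a local miss, so the remote
      // (network) fetch happens only when the data is not on this node.
      def lookup(store: BlockStore, blockId: String): Option[Array[Byte]] =
        store.getLocal(blockId).orElse(store.fetchRemote(blockId))
    }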

Naturally, we want to avoid anything like the second case. The best situation is that the task and its data are on the same node, so the data is read straight from the local executor's BlockManager: pure memory, or at most a little disk I/O. As soon as data has to be transferred over the network, performance is bound to degrade; heavy network transfer, like heavy disk I/O, is a performance killer.

If the task can read the data right where the data lives, that is the best case. If the resources of the machine holding the data stay occupied for more than 3 seconds, the task is placed on another machine close to the data. The task then asks its own local BlockManager for the data; if the data is not there, its BlockManager contacts the BlockManager of the machine where the data resides and fetches it from there. The data may well not be on the task's node, so network transfer is needed; but if the two executors happen to be on the same node, the situation is not too bad, because the data only has to move between processes on that one machine.

The worst case is pulling data across racks: that is very slow and hurts performance badly.

What are the data locality levels in Spark? From best to worst (a short sketch follows the list):

PROCESS_LOCAL: process-local. The code and the data are in the same process, i.e. the same executor: the task is executed by the executor whose BlockManager holds the data. This performs best.

NODE_LOCAL: node-local. The code and the data are on the same node; for example, the data is an HDFS block on that node and the task runs in an executor on the same node, or the data and the task are in different executors on the same node, so the data has to be transferred between processes.

NO_PREF: for the task it makes no difference where the data is fetched from; there is no better or worse location, for example when the data comes from a database.

RACK_LOCAL: rack-local. The data and the task are on two different nodes in the same rack, so the data has to be transferred between nodes over the network.

ANY: the data and the task may be anywhere in the cluster, not even in the same rack. This performs worst.
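These five names are exactly the values of Spark's TaskLocality enumeration (org.apache.spark.scheduler.TaskLocality), which is also what appears in the web UI and the driver log. A minimal sketch that just prints them, assuming the spark-core dependency is on the classpath:

    // Prints Spark's locality levels; assumes spark-core is available.
    import org.apache.spark.scheduler.TaskLocality

    object PrintLocalityLevels {
      def main(args: Array[String]): Unit = {
        // Should print PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY,
        // in the order they are declared in the enumeration.
        TaskLocality.values.foreach(println)
      }
    }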

The relevant parameter is spark.locality.wait, whose default is 3s.

When should this parameter be adjusted?

Observe the log, i.e. the run log of the Spark job. When testing, it is recommended to use client mode first, so that you can see the complete log directly on your local machine. The log prints lines such as "Starting task ..." together with the locality level (PROCESS_LOCAL, NODE_LOCAL, and so on); from them you can see the data locality level of most tasks.
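For reference, such a line in the driver log looks roughly like the one below (the exact wording differs between Spark versions; the host, IDs and size here are made up):

    INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 23, worker-2, executor 1, partition 0, PROCESS_LOCAL, 7890 bytes)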

If most of the tasks are PROCESS_LOCAL, there is no need to adjust anything. If you find that many tasks are at RACK_LOCAL or ANY, it is worth tuning the data-locality wait time. Adjust it iteratively: after each change, run the job again and check the log to see whether the locality level of most tasks has improved and whether the total running time of the Spark job has come down. Do not put the cart before the horse: if the locality level goes up but the job takes longer because of all the extra waiting, the adjustment is not worth keeping.
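Instead of grepping the log, the locality distribution can also be counted programmatically with a SparkListener. A minimal sketch for the spark-shell, assuming a SparkSession named spark is already in scope:

    // Counts finished tasks per locality level (sketch for spark-shell).
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import scala.collection.mutable

    val localityCounts = mutable.Map.empty[String, Int].withDefaultValue(0)

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        // taskLocality is one of PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY.
        localityCounts(taskEnd.taskInfo.taskLocality.toString) += 1
      }
    })

    // ... run the job, then inspect the distribution:
    // localityCounts.foreach { case (level, n) => println(s"$level: $n") }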

How to adjust?

Adjust spark.locality.wait: the default is 3s; try raising it to 6s or 10s. By default, the three more specific settings take the same value as the global one, i.e. 3s each: spark.locality.wait.process, spark.locality.wait.node and spark.locality.wait.rack. They can be set in code, for example new SparkConf().set("spark.locality.wait", "10").
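A minimal configuration sketch, assuming the settings are applied in the driver before the context is created; the concrete values below are just example starting points, not recommendations:

    // Sketch: longer locality waits; tune the values against your job's actual run time.
    import org.apache.spark.SparkConf

    object LocalityTuningConf {
      def build(): SparkConf =
        new SparkConf()
          .setAppName("locality-tuning-example")
          .set("spark.locality.wait", "6s")          // global wait before downgrading a level
          .set("spark.locality.wait.process", "6s")  // wait before giving up on PROCESS_LOCAL
          .set("spark.locality.wait.node", "3s")     // wait before giving up on NODE_LOCAL
          .set("spark.locality.wait.rack", "3s")     // wait before giving up on RACK_LOCAL
    }

The same settings can also be passed on the command line, for example spark-submit --conf spark.locality.wait=6s.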
