
What are the hardware requirements for Spark tuning


This article explains in detail the hardware requirements for Spark tuning. The content is quite practical, so it is shared here for reference; I hope you gain something from reading it.

I. Storage system

Because most Spark jobs will likely read their input data from an external storage system (such as the Hadoop file system or HBase), it is important to place Spark as close to that storage system as possible. The following recommendations therefore apply:

1. If possible, run Spark on the same nodes as HDFS. The simplest way is to install a Spark Standalone cluster on the same nodes as the Hadoop cluster and configure the memory usage of Spark and Hadoop so that they do not interfere with each other (on the Hadoop side, the per-task memory is set by mapred.child.java.opts, while mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum control the number of task slots). Hadoop and Spark can also be run under a common cluster manager such as Mesos or YARN. A configuration sketch follows this list.

2. If this is not possible, run Spark on different nodes in the same local area network as HDFS.

3. For low-latency data stores (such as HBase), it may be preferable to run compute jobs on nodes other than the storage nodes to avoid interference.
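As a rough sketch of the co-location described in point 1, assuming an MRv1 Hadoop deployment alongside a Spark Standalone worker (all values below are illustrative, not recommendations), the Hadoop task limits could be capped in mapred-site.xml:

<property><name>mapred.child.java.opts</name><value>-Xmx1g</value></property>
<property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>4</value></property>
<property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property>

and the Spark worker's share set in conf/spark-env.sh:

SPARK_WORKER_MEMORY=16g
SPARK_WORKER_CORES=8

Together these keep the two systems within a fixed, non-overlapping budget on the shared node.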

II. Local disks

While Spark can do much of its computation in memory, it still uses local disks to store data that does not fit in RAM, as well as intermediate results between stages, i.e. shuffle output. We recommend at least 4-8 disks per node, with no RAID required, just independent disks mounted on the node. In Linux, mount the disks with the noatime option to reduce unnecessary writes. In Spark, the spark.local.dir setting can be a comma-separated list of multiple disk directories. If you are also running HDFS, it is fine to use the same disks as HDFS.
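For example, spark.local.dir might be set in conf/spark-defaults.conf to one directory per local disk (the paths below are illustrative):

spark.local.dir /mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark,/mnt/disk4/spark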

Mounting disks with the noatime option: when mounting a file system, you can specify the standard Linux mount option noatime, which disables atime updates on that file system. The mount command is:

mount -t gfs BlockDevice MountPoint -o noatime

BlockDevice Specifies the block device on which the GFS file system resides.

MountPoint Specifies the directory where the GFS file system should be mounted.

Example:

mount -t gfs /dev/vg01/lvol0 /gfs1 -o noatime
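To keep the noatime option across reboots, the same mount can also be recorded in /etc/fstab; the entry below is illustrative and reuses the device and mount point from the example above:

/dev/vg01/lvol0 /gfs1 gfs noatime 0 0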

III. Memory

Spark runs well with anywhere from 8 GB to hundreds of GB of memory on a single machine. In all cases, we recommend allocating at most 75% of the memory to Spark, leaving the rest for the operating system and buffer cache.

How much memory you need depends on your application. To determine how much memory your application needs for a particular dataset, load part of the dataset into memory and then check its memory footprint on the Storage tab of the Spark UI.
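As a minimal sketch of this measurement in spark-shell (the input path is illustrative): cache part of the dataset, force it to materialize, and then read its in-memory size from the Storage tab.

val sample = spark.read.textFile("hdfs:///data/sample-part").cache() // cache a slice of the dataset
sample.count() // materializes the cache; its size then appears on the Storage tab of the Spark UI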

Note that memory usage is greatly affected by the storage level and serialization format; see the separate tuning article for tips on how to reduce memory usage.

Finally, note that the Java VM does not always perform well on machines with more than 200 GB of RAM. If you buy a machine with more than 200 GB of memory, you can run multiple workers on a single node. In Spark Standalone mode, you can set the number of workers per node with SPARK_WORKER_INSTANCES in the configuration file conf/spark-env.sh, and the number of cores per worker with SPARK_WORKER_CORES.
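As a sketch, conf/spark-env.sh for a hypothetical 256 GB, 32-core machine split into several workers might look like this (the numbers are illustrative):

SPARK_WORKER_INSTANCES=4 # four workers on this node
SPARK_WORKER_CORES=8 # eight cores per worker
SPARK_WORKER_MEMORY=48g # 4 x 48 GB = 192 GB, roughly 75% of the machine's RAM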

IV. Network

As a rule of thumb, when the data is in memory, the bottleneck for Spark applications is often the network. Using a 10 Gigabit or faster network is the best way to make these applications run faster. This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and SQL joins. For any given application, you can see how much data the Spark shuffle is transferring over the network in the Spark UI.
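As an illustrative example (the input path and key extraction are hypothetical), the reduce-by-key below forces a shuffle, and the resulting traffic shows up as shuffle read/write for that stage in the Spark UI:

val counts = sc.textFile("hdfs:///data/events") // illustrative input path
  .map(line => (line.split(",")(0), 1)) // key on the first comma-separated field
  .reduceByKey(_ + _) // this step shuffles data across the network
counts.count()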

V. CPU

Spark scales well to dozens of CPU cores per machine because it performs minimal sharing between threads. You should provision at least 8-16 cores per machine. Depending on the CPU cost of your workload, you may need more: once the data is in memory, most applications are bottlenecked by CPU and memory.

About "Spark tuning hardware requirements" this article is shared here, I hope the above content can be of some help to everyone, so that you can learn more knowledge, if you think the article is good, please share it to let more people see.
