Recommended hardware configuration for a Spark cluster
Computing and storage
Most Spark jobs need to read input data from an external storage system such as Cassandra, HDFS, or HBase, so keep the Spark computing engine as close to the data persistence layer as possible.
If HDFS is used for data storage, Spark can be deployed on the same cluster, with the memory and CPU allocation of Spark and Hadoop configured so that the two do not interfere with each other. Our production storage uses a Cassandra cluster: the Spark master service is deployed on a separate node, while the remaining nodes run Cassandra and a Spark worker side by side, so each Spark worker can read data locally and perform computation and aggregation quickly.
Disk
Although Spark performs a large amount of computation in memory, it may still spill data that does not fit in RAM to local disk. It is recommended that each node be configured with 4-8 separate disks, without RAID. Disk prices keep falling, so SSDs are worth considering and can greatly improve performance. In addition, on Linux, mount the disks with the noatime option to reduce unnecessary write operations. In Spark, the spark.local.dir property can be set to a comma-separated list of local disk paths.
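As a rough sketch of what this can look like in practice (the device names and mount points below are assumed, not from the original article), the disks are mounted with noatime and then listed in spark.local.dir:

    # /etc/fstab -- mount the data disks with the noatime option (assumed devices and paths)
    /dev/sdb1   /mnt/disk1   ext4   defaults,noatime   0   2
    /dev/sdc1   /mnt/disk2   ext4   defaults,noatime   0   2

    # conf/spark-defaults.conf -- point Spark's local scratch space at every data disk
    spark.local.dir    /mnt/disk1/spark,/mnt/disk2/spark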
Memory
It is recommended that no more than 75% of a machine's total memory be allocated to Spark; be sure to leave enough memory for the operating system and buffer cache. Assess how much memory is actually needed based on the characteristics of your workload.
Note that the Java virtual machine does not always behave well with more than 200 GB of memory. If a machine has more than 200 GB of RAM, you can run multiple worker JVMs per node. In Spark's standalone mode, set the number of worker processes per node with the SPARK_WORKER_INSTANCES variable in conf/spark-env.sh, and the number of CPU cores available to each worker with the SPARK_WORKER_CORES variable.
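For example, on a hypothetical node with 256 GB of RAM and 32 cores (illustrative numbers only), the memory could be split across two worker JVMs, keeping each well below the 200 GB mark and leaving roughly a quarter of the RAM for the operating system:

    # conf/spark-env.sh -- assumes a 256 GB / 32-core node (illustrative values)
    SPARK_WORKER_INSTANCES=2     # run two worker JVMs on this node
    SPARK_WORKER_CORES=16        # CPU cores each worker may use
    SPARK_WORKER_MEMORY=96g      # memory each worker may hand to executors; 2 x 96 GB is about 75% of the total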
Network
When the data is already in memory, the performance bottleneck of many Spark applications is the network transfer rate. A 10 Gbit/s or faster network is recommended.
CPU
Spark runs many aggregation-heavy tasks, so configuring more CPU cores is recommended; the performance improvement is noticeable. At least 8-16 cores per machine are recommended, and the configuration can be adjusted according to the CPU load of your Spark jobs. Once the data is in memory, the performance bottleneck of most applications is the CPU and the network.
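The cores and memory that an individual job actually consumes can also be capped at submission time and then tuned against the observed CPU load. A sketch for a standalone cluster (the master URL, resource numbers, and job script below are hypothetical):

    # Cap the total cores and per-executor memory used by one job (assumed values)
    ./bin/spark-submit \
      --master spark://master:7077 \
      --total-executor-cores 64 \
      --executor-memory 16g \
      my_aggregation_job.py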
Reference documentation
http://spark.apache.org/docs/latest/hardware-provisioning.html