
What are the hardware requirements of Spark?



In this article, the editor shares the hardware requirements of Spark. Most readers are not very familiar with this topic, so the article is offered for your reference; I hope you learn a lot from it. Let's get into it!

First, storage system

Because most Spark jobs need to read input data from an external storage system, such as the Hadoop file system (HDFS) or HBase, it is important to deploy Spark as close to the storage system as possible. The following suggestions therefore apply:

1. If possible, run Spark on the same nodes as HDFS. The easiest way is to install a Spark Standalone cluster on the same nodes as the Hadoop cluster and configure the memory usage of Spark and Hadoop so that they do not interfere with each other (for Hadoop, the memory for each task is set by mapred.child.java.opts, while mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum determine the number of tasks). Alternatively, run Hadoop and Spark under a common cluster manager such as Mesos or YARN. A configuration sketch follows after this list.

2. If that is not possible, run Spark on nodes in the same local-area network as HDFS.

3. For low-latency data stores such as HBase, it may be preferable to run the compute tasks on nodes other than the storage nodes to avoid interference.
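As a minimal sketch of point 1 (the values below are illustrative assumptions, not recommendations from this article), the Hadoop MRv1 parameters named above go in mapred-site.xml, and Spark Standalone's share of each node's memory can be capped in conf/spark-env.sh:

<property><name>mapred.child.java.opts</name><value>-Xmx1024m</value></property>
<property><name>mapreduce.tasktracker.map.tasks.maximum</name><value>4</value></property>
<property><name>mapreduce.tasktracker.reduce.tasks.maximum</name><value>2</value></property>

# conf/spark-env.sh: cap the Spark worker, leaving the rest of the RAM to Hadoop tasks and the OS
SPARK_WORKER_MEMORY=16g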

Second, local disk

Although Spark performs a large amount of its computation in memory, it still uses local disks to store data that does not fit in RAM as well as the intermediate shuffle output between stages. It is recommended that each node have 4-8 disks, without RAID, mounted as independent disks. On Linux, mount the disks with the noatime option to reduce unnecessary writes. In Spark, the spark.local.dir configuration accepts multiple disk directories separated by commas. If the node also runs HDFS, it is fine to use the same disks as HDFS.
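For example (the directory names are placeholders for this sketch), pointing spark.local.dir at one directory per physical disk in conf/spark-defaults.conf looks like:

spark.local.dir /data1/spark,/data2/spark,/data3/spark,/data4/spark

In Standalone mode the same thing can be set through the SPARK_LOCAL_DIRS environment variable in conf/spark-env.sh, which takes precedence over spark.local.dir.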

Mounting a disk with the noatime option means specifying the standard Linux mount option noatime when the file system is mounted, which disables atime updates on that file system. The mount command is:

mount -t gfs BlockDevice MountPoint -o noatime

BlockDevice specifies the block device on which the GFS file system resides. MountPoint specifies the directory where the GFS file system should be mounted.

Example: mount -t gfs /dev/vg01/lvol0 /gfs1 -o noatime
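For the more common case of an ordinary local data disk, the same option can be made permanent in /etc/fstab; this sketch assumes an ext4 file system and a mount point chosen only for illustration:

/dev/sdb1 /data1 ext4 defaults,noatime 0 2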

Third, memory

Spark runs well with anywhere from 8 GB to hundreds of GB of memory per machine. In all cases, it is recommended to allocate at most 75% of the memory to Spark and leave the rest to the operating system and the buffer cache.
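As an example of the 75% guideline (the node size is an assumption for illustration), on a 64 GB Standalone worker the cap could be expressed in conf/spark-env.sh as:

# roughly 75% of 64 GB; the rest stays with the OS and the buffer cache
SPARK_WORKER_MEMORY=48g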

How much memory you need depends on your application. To determine how much memory your application needs for a particular dataset, load part of the dataset into memory and then check its memory usage on the Storage page of the Spark UI.
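A quick way to do this check, sketched here with a placeholder HDFS path and illustrative options, is to cache a sample in an interactive spark-shell session and then look at the Storage tab while the shell is still running:

spark-shell --driver-memory 8g

scala> val sample = spark.read.textFile("hdfs:///path/to/sample").cache()
scala> sample.count()

While the shell is open, the Storage tab at http://<driver-host>:4040/storage/ shows how much memory the cached partitions occupy, which you can then scale up to the size of the full dataset.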

Note that memory usage is greatly affected by the storage level and the serialization format; see the separate tuning article for tips on how to reduce memory usage.
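One knob commonly discussed in Spark memory tuning is the serializer; as a hedged example, switching the default serializer to Kryo in conf/spark-defaults.conf typically shrinks both cached and shuffled data:

spark.serializer org.apache.spark.serializer.KryoSerializer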

Finally, note that the Java VM does not always perform well on machines with more than 200 GB of RAM. If you buy machines with more memory than that, you can run multiple workers per node. In Spark Standalone mode, set the number of workers per node with SPARK_WORKER_INSTANCES in the configuration file conf/spark-env.sh, and set the number of CPU cores per worker with SPARK_WORKER_CORES.
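A minimal conf/spark-env.sh sketch for a hypothetical node with 256 GB of RAM and 32 cores (all numbers are assumptions for illustration):

# two workers per node keeps each JVM heap well under 200 GB
SPARK_WORKER_INSTANCES=2
# cores and memory granted to each worker (2 x 96 GB is about 75% of 256 GB)
SPARK_WORKER_CORES=16
SPARK_WORKER_MEMORY=96g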

Fourth, the network

According to past experience, when the data fits in memory, the bottleneck of a Spark application is often the network. Using a 10 Gigabit or faster network is the best way to make such applications run faster. This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and SQL joins. For any given application, the Spark UI shows how much data the shuffle transfers across the network.

Fifth, CPU

Spark scales well to machines with dozens of CPU cores because it does minimal sharing between threads. Each machine should have at least 8-16 cores. Depending on the CPU cost of your workload, you may need more: once the data is in memory, most applications are bottlenecked by either CPU or the network.

That is all of the content of the article "What are the hardware requirements of Spark?". Thank you for reading! I hope it has given you a clearer understanding and been of some help; if you want to learn more, welcome to follow the Internet Technology channel.
