

Spark Streaming Performance Tuning: A Complete Guide!


1. Logs filling up the disk:

Remember to set the following three log-rolling parameters:

spark.executor.logs.rolling.strategy size

spark.executor.logs.rolling.maxSize 134217728 # value in bytes

spark.executor.logs.rolling.maxRetainedFiles
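These can be set in spark-defaults.conf or directly on the SparkConf before the context is created; a minimal sketch using the values above (the maxRetainedFiles value is an illustrative assumption):

val conf = new SparkConf()
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.maxSize", "134217728")   // bytes
  .set("spark.executor.logs.rolling.maxRetainedFiles", "5")  // illustrative: keep the 5 most recent files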

For reference, these properties are defined in the Spark 1.6 source, in the RollingFileAppender companion object (org.apache.spark.util.logging):

private[spark] object RollingFileAppender {
  val STRATEGY_PROPERTY = "spark.executor.logs.rolling.strategy"
  val SIZE_PROPERTY = "spark.executor.logs.rolling.maxSize"
  val SIZE_DEFAULT = (1024 * 1024).toString
  val RETAINED_FILES_PROPERTY = "spark.executor.logs.rolling.maxRetainedFiles"
  val DEFAULT_RETAINED_FILES = Int.MaxValue
}

2. Spark Streaming managing Kafka offsets

zookeeper.session.timeout.ms

This is usually increased to 3 to 5 times its default value.

http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/

http://www.tuicool.com/articles/vaUzquJ

Spark's own SparkCuratorUtil (simplified from the 1.6 source) creates the Curator client for ZooKeeper like this:

private[spark] object SparkCuratorUtil extends Logging {
  private val ZK_CONNECTION_TIMEOUT_MILLIS = 15000
  private val ZK_SESSION_TIMEOUT_MILLIS = 60000

  def newClient(conf: SparkConf, zkUrlConf: String): CuratorFramework = {
    val ZK_URL = conf.get(zkUrlConf)
    val zk = CuratorFrameworkFactory.newClient(ZK_URL, ZK_SESSION_TIMEOUT_MILLIS,
      ZK_CONNECTION_TIMEOUT_MILLIS, new ExponentialBackoffRetry(5000, 3))
    zk.start()
    zk
  }
}
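If you manage offsets yourself, a minimal sketch of persisting them to ZooKeeper with Curator could look like the following; the ZooKeeper address, node path, and offset value are illustrative assumptions, not part of the original article:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

object OffsetStoreSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative ZooKeeper quorum address.
    val zk = CuratorFrameworkFactory.newClient("localhost:2181", new ExponentialBackoffRetry(1000, 3))
    zk.start()

    // Hypothetical node layout: one node per topic/partition, value = last processed offset.
    val path = "/consumers/myGroup/offsets/myTopic/0"
    val offset = 12345L
    val data = offset.toString.getBytes("UTF-8")
    if (zk.checkExists().forPath(path) == null) {
      zk.create().creatingParentsIfNeeded().forPath(path, data)
    } else {
      zk.setData().forPath(path, data)
    }
    zk.close()
  }
}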

3. spark.task.maxFailures

The default is 4; raise it to around 10.

For reference, this default of 4 appears as MAX_TASK_FAILURES in Spark's own TaskSetManagerSuite (simplified excerpt from the 1.6 source):

class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logging {
  import TaskLocality._

  val conf = new SparkConf
  val LOCALITY_WAIT_MS = conf.getTimeAsMs("spark.locality.wait", "3s")
  val MAX_TASK_FAILURES = 4

  override def beforeEach() {
    super.beforeEach()
    FakeRackUtil.cleanUp()
  }

  test("TaskSet with no preferences") {
    sc = new SparkContext("local", "test")
    val sched = new FakeTaskScheduler(sc, ("exec1", "host1"))
    val taskSet = FakeTask.createTaskSet(1)
    val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, new ManualClock)
    // ...
  }
}

4. spark.streaming.kafka.maxRetries

The default is 1; set it to 3 or 5.
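Both of the retry-related settings above are ordinary Spark properties; a minimal sketch setting them on a SparkConf with the values suggested in this article:

val conf = new SparkConf()
  .set("spark.task.maxFailures", "10")          // default is 4
  .set("spark.streaming.kafka.maxRetries", "3") // default is 1; 3 or 5 is suggested above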

5. Connect Spark Streaming to Kafka using the Direct approach.
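A minimal sketch of the Direct connection using the Spark 1.6-era spark-streaming-kafka (Kafka 0.8) API; the broker list and topic name are illustrative assumptions:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectKafkaSketch").setMaster("local[2]") // local master for testing
    val ssc = new StreamingContext(conf, Seconds(5))

    // Illustrative broker list and topic; the Direct approach reads from the brokers, not ZooKeeper.
    val kafkaParams = Map[String, String]("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("myTopic")

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Count the messages in each batch as a trivial action.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}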

6. How do you apply these optimizations? Where is the entry point?

The answer is wherever Spark configuration parameters can be set:

1. In the $SPARK_HOME/conf/spark-env.sh script. The format is as follows:

export SPARK_DAEMON_MEMORY=1024m

2. Programmatically: use System.setProperty("xx", "xxx") to set the corresponding system properties, or set them on a SparkConf, before creating the SparkContext. For example:

val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")

val sc = new SparkContext(conf)

3. At spark-shell or spark-submit time.

For example, in spark-shell:

scala> System.setProperty("spark.akka.frameSize", "10240")
scala> System.setProperty("spark.rpc.askTimeout", "800")

./bin/spark-submit --name "My app" \
  --master local[4] \
  --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  MyApp.jar

spark-submit also reads configuration from the default configuration file conf/spark-defaults.conf, in the following format:

spark.master spark://iteblog.com:7077
spark.executor.memory 512m
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer

(1) Environment variables in spark-env.sh

SCALA_HOME # points to your scala installation path

MESOS_NATIVE_LIBRARY # if you want to run a cluster on Mesos

SPARK_WORKER_MEMORY # amount of memory available for jobs on each worker, e.g. 1000m or 2G (default: all of the machine's RAM minus 1 GB reserved for the operating system); the memory for each individual job is set by SPARK_MEM

SPARK_JAVA_OPTS # extra JVM options; any system property can be passed via -D

e.g. SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=1024"

SPARK_MEM # total amount of memory each node may use, in the same format as the JVM's -Xmx option (e.g. 300m or 1g). Note: this option is deprecated in favor of the system property spark.executor.memory, which is recommended for new code

SPARK_DAEMON_MEMORY # memory space allocated to the Spark master and worker daemons (default 512m)

SPARK_DAEMON_JAVA_OPTS # JVM options for the Spark master and worker daemons (default: none)

(2) System Properties

spark.akka.frameSize: controls the maximum size of communication messages in Spark (such as task results); the default is 10 MB. When processing large amounts of data, a task's output may exceed this value, so it needs to be raised according to the actual data. Errors caused by this value being too small can be diagnosed from the worker logs: after a task fails on a worker, the master's running log usually shows a "Lost TID:" message, and you can confirm the cause by checking whether the "Serialized size of result" recorded for that task in the failed worker's log file (under $SPARK_HOME/work/) exceeds 10 MB.

spark.default.parallelism: controls the default number of tasks used by Spark's distributed shuffle operations; the default is 8. If you do not adjust it, jobs over large amounts of data can easily run for a very long time, or even throw exceptions, because 8 tasks cannot handle that much data. Note that a larger value is not automatically better.

spark.local.dir: Spark's runtime temporary directory, where map output files, RDDs spilled to disk, and so on are stored. The default is /tmp; on the small cluster we first set up, the /tmp partition had only 2 GB of space, and running large amounts of data produced the exception "No space left on device".
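A minimal sketch of setting these three properties on a SparkConf; the specific values and the local directory path are illustrative assumptions, not recommendations from this article:

val conf = new SparkConf()
  .set("spark.akka.frameSize", "100")        // in MB; raise it if task results exceed the default
  .set("spark.default.parallelism", "200")   // more shuffle tasks for larger data volumes
  .set("spark.local.dir", "/data/spark-tmp") // a directory on a disk with enough free space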

How do I check which parameters are configured and actually in effect?

Through the web UI at http://master:4040/ (the Environment tab lists the effective Spark properties).
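You can also dump the effective configuration from code, for example:

// Print every Spark property currently set on the running SparkContext.
sc.getConf.getAll.sorted.foreach { case (k, v) => println(s"$k = $v") }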
