Spark Streaming Performance Tuning: The Complete Guide
1. Executor logs filling up the disk
The key parameter is spark.executor.logs.rolling.maxSize. Remember to set the following three log-rolling parameters:
spark.executor.logs.rolling.strategy size
spark.executor.logs.rolling.maxSize 134217728   # in bytes (134217728 = 128 MB)
spark.executor.logs.rolling.maxRetainedFiles
In the Spark 1.6 source code, this behavior is implemented by the RollingFileAppender class (in org.apache.spark.util.logging), which applies the size- and time-based rolling strategies configured above.
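As a hedged illustration (the retention count below is just an example value, not a recommendation), these properties can also be set on the SparkConf that launches the streaming job:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("StreamingWithRollingLogs")
  .set("spark.executor.logs.rolling.strategy", "size")
  .set("spark.executor.logs.rolling.maxSize", "134217728")  // bytes
  .set("spark.executor.logs.rolling.maxRetainedFiles", "7") // example: keep the last 7 rolled files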
2. Managing Kafka offsets from Spark Streaming (storing them in ZooKeeper)
The key parameter is zookeeper.session.timeout.ms; it is usually raised to 3-5 times the default.
References:
http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/
http://www.tuicool.com/articles/vaUzquJ
For reference, Spark's own SparkCuratorUtil creates its ZooKeeper client like this (simplified):

private[spark] object SparkCuratorUtil extends Logging {
  def newClient(conf: SparkConf, zkUrlConf: String): CuratorFramework = {
    val ZK_URL = conf.get(zkUrlConf)
    val zk = CuratorFrameworkFactory.newClient(
      ZK_URL,
      new ExponentialBackoffRetry(retryWaitMillis, maxReconnectAttempts)) // retry constants defined in the object
    zk.start()
    zk
  }
}
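Following the approach in the links above, here is a minimal sketch of committing direct-stream offsets to ZooKeeper after each batch. It assumes stream is a Kafka direct DStream (see section 5); the ZooKeeper address, consumer group name, and znode layout are placeholders, not anything prescribed by Spark:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.streaming.kafka.HasOffsetRanges

val zk = CuratorFrameworkFactory.newClient("zk1:2181", new ExponentialBackoffRetry(1000, 3))
zk.start()

stream.foreachRDD { rdd =>
  // Runs on the driver: read the offset ranges processed in this batch.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    // Hypothetical znode layout: /consumers/<group>/offsets/<topic>/<partition>
    val path = s"/consumers/my-group/offsets/${r.topic}/${r.partition}"
    val data = r.untilOffset.toString.getBytes("UTF-8")
    if (zk.checkExists().forPath(path) == null) {
      zk.create().creatingParentsIfNeeded().forPath(path, data)
    } else {
      zk.setData().forPath(path, data)
    }
  }
}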
3. spark.task.maxFailures
Defaults to 4; consider raising it to around 10.
(This retry behavior is exercised in Spark's TaskSetManagerSuite, which builds a TaskSetManager against a FakeTaskScheduler using FakeTask.createTaskSet and a ManualClock.)
4. spark.streaming.kafka.maxRetries
Defaults to 1; set it to 3 or 5.
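As a small combined sketch of sections 3 and 4 (the exact values are illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "10")           // default is 4
  .set("spark.streaming.kafka.maxRetries", "3")  // default is 1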
5. Connect Spark Streaming to Kafka using the Direct approach.
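A minimal sketch of the Direct approach against the Spark 1.x API (broker addresses and topic names are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("DirectKafkaExample")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.map(_._2).count().print()  // e.g. count the messages in each batch
ssc.start()
ssc.awaitTermination()

Because the Direct approach tracks offsets itself instead of using a receiver, the offset bookkeeping from section 2 becomes the application's responsibility.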
6. Where do you actually set all of these parameters?
The answer: wherever Spark reads its configuration. There are several places:
1. $SPARK_HOME/conf/spark-env.sh script configuration. The configuration format is as follows:
export SPARK_DAEMON_MEMORY=1024m
2. Programmatically (call System.setProperty("xx", "xxx") before creating the SparkContext, or set the values on a SparkConf):
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
3. In spark-shell or on the spark-submit command line
For example:
scala> System.setProperty("spark.akka.frameSize", "10240m")
scala> System.setProperty("spark.rpc.askTimeout", "800")

./bin/spark-submit --name "My app" \
  --master local[4] \
  --conf spark.shuffle.spill=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  MyApp.jar
spark-submit also reads configuration from the default configuration file conf/spark-defaults.conf, in the following format:
spark.master                   spark://iteblog.com:7077
spark.executor.memory          512m
spark.eventLog.enabled         true
spark.serializer               org.apache.spark.serializer.KryoSerializer
(1) Environment variables configured in spark-env.sh
SCALA_HOME # points to your scala installation path
MESOS_NATIVE_LIBRARY # if you want to run a cluster on Mesos
SPARK_WORKER_MEMORY # the total amount of memory jobs may use on this worker, in a format such as 1000m or 2G (default: all of the machine's RAM minus 1 GB reserved for the operating system); the memory for each individual job is set separately via SPARK_MEM.
SPARK_JAVA_OPTS # extra JVM options; any system property can be passed with -D
e.g.: SPARK_JAVA_OPTS+=" -Dspark.kryoserializer.buffer.mb=1024"
SPARK_MEM # the total amount of memory each node may use, in the same format as the JVM's -Xmx option (e.g. 300m or 1g). Note: this option will soon be deprecated in favor of the system property spark.executor.memory, so use that in new code.
SPARK_DAEMON_MEMORY # memory space allocated to the Spark master and worker daemons (default 512m)
SPARK_DAEMON_JAVA_OPTS # JVM options for the Spark master and worker daemons (default: none)
(2) System Properties
spark.akka.frameSize: controls the maximum size of communication messages in Spark (such as task output); it defaults to 10 MB. When processing large data sets, a task's output may exceed this value, so raise it according to your actual data. If a job fails because the value is too small, you can confirm it from the worker logs: after a task fails on a worker, the master's running log shows a "Lost TID:" message, and in the failed worker's log files (under $SPARK_HOME/work/) you can check whether the task's recorded "Serialized size of result" exceeds 10 MB.
spark.default.parallelism: controls the default number of tasks used by Spark's distributed shuffle operations; it defaults to 8. If you leave it unchanged with a large data volume, jobs can run for a very long time or even fail with exceptions, because 8 tasks cannot handle that much data. Note that bigger is not automatically better.
spark.local.dir: Spark's scratch directory at runtime; map output files, RDD blocks spilled to disk, and so on are stored here. The default is /tmp. On the small cluster we first built, the /tmp partition had only 2 GB of space, and large jobs failed with the exception "No space left on device".
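For illustration only (the values below are placeholders, not recommendations), these three properties could be set together on a SparkConf:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.akka.frameSize", "128")          // MB; raise it when task output is large
  .set("spark.default.parallelism", "100")     // more shuffle tasks for larger data
  .set("spark.local.dir", "/data/spark-tmp")   // scratch space on a partition with enough room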
How do I check which parameters are actually configured and in effect?
Through the web UI at http://master:4040/ (the Environment tab lists the Spark properties in effect).
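Alternatively, assuming an existing SparkContext named sc, the effective configuration can be printed programmatically:

sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }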