1 Preface
Measuring performance from Java/Scala code by recording start and end timestamps is not accurate. The better way is to deploy the Spark application to a cluster and read the timings from the statistics on the Spark UI; this gives much more reliable numbers, especially when you want to observe the performance improvement brought by RDD caching.
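For comparison, here is a minimal sketch of that timestamp approach (illustrative only; the file path is the one used later in this article):
// Wall-clock timing around an action -- the approach advised against above.
// Results are skewed by JVM warm-up, GC pauses and lazy evaluation.
val start = System.currentTimeMillis()
val n = sc.textFile("/Users/yeyonghao/wordcount_text.txt").count()
println(s"count = $n, elapsed = ${System.currentTimeMillis() - start} ms")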
To make it easy to view the information the Spark UI provides, the tests below use Spark Shell, so you can simply open the shell application's UI at localhost:4040 to inspect the execution details.
2 Data preparation
The test is based on the classic "hello world" of big data: the wordcount program, so first prepare a reasonable amount of data. The data I prepared is as follows:
yeyonghao@yeyonghaodeMacBook-Pro:~$ ls -lh wordcount_text.txt
-rw-r--r--  1 yeyonghao  staff   127M 10  1 14:24 wordcount_text.txt
The data set does not need to be too large, or you will wait a long time and there may not be enough memory to hold the RDD when you cache it; but it should not be too small either, or the time difference between runs will be too small to observe.
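If you do not have a suitable file at hand, a quick way to generate one is to repeat a sample line until the file reaches roughly the target size (a hypothetical generator, not part of the original article; adjust the path and line count to taste):
// Hypothetical helper: write a large file of repeated sample text locally.
import java.io.PrintWriter
val out = new PrintWriter("/Users/yeyonghao/wordcount_text.txt")
val line = "hello world hello spark hello rdd cache performance test"
for (_ <- 1 to 2500000) out.println(line)  // ~58 bytes x 2.5M lines, roughly 140MB
out.close()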
3 Test

3.1 Start Spark Shell
As follows:
yeyonghao@yeyonghaodeMacBook-Pro:~$ sudo spark-shell --driver-memory 2G
Password:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
18/10/01 14:39:36 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
18/10/01 14:39:36 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
18/10/01 14:39:38 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/10/01 14:39:38 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/10/01 14:39:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
18/10/01 14:39:39 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

3.2 Load the text data and cache the RDD
First load the data and define the transformations, as follows:
scala> val linesRDD = sc.textFile("/Users/yeyonghao/wordcount_text.txt")
linesRDD: org.apache.spark.rdd.RDD[String] = /Users/yeyonghao/wordcount_text.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> val retRDD = linesRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
retRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:29
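Broken into named steps, the one-liner above is equivalent to the following (the intermediate names are just for illustration; all three are lazy transformations, so no job runs yet):
val words = linesRDD.flatMap(_.split(" "))  // split each line into words
val pairs = words.map((_, 1))               // pair every word with a count of 1
val counts = pairs.reduceByKey(_ + _)       // sum the counts per word (this step causes a shuffle)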
Cache the RDD:
scala> retRDD.cache()
res0: retRDD.type = ShuffledRDD[4] at reduceByKey at <console>:29
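As an aside, cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY); if the cached data might not fit in memory you could choose a different storage level instead (standard Spark API, not used in this test):
import org.apache.spark.storage.StorageLevel
retRDD.persist(StorageLevel.MEMORY_ONLY)       // what cache() does
// retRDD.persist(StorageLevel.MEMORY_AND_DISK) // spill partitions that don't fit to disk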
3.3 Trigger an action for the first time and observe the result

Note that none of the operations above triggers any computation in Spark; the calculation only runs when an action operator is executed, as shown below:
scala> retRDD.count()
res1: Long = 1388678
Now open the Spark UI and observe the execution results:
Jobs interface:
Stages interface:
Storage interface:
Analysis: in the DAG diagram you can clearly see a green dot on reduceByKey, indicating that the RDD has been cached; the cached RDD accordingly also shows up on the Storage page. Note that this first action has to execute every preceding step: retRDD is computed first and only then cached, so that the next time retRDD is needed the earlier operations can be skipped, saving a great deal of time. Of course, caching the RDD also adds some overhead to this first job.
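You can also confirm the cache from the shell itself: after the first action, the RDD's debug string annotates the cached partitions (output abridged here; the exact text varies by Spark version):
// Inspect the lineage; cached partitions are annotated after the first action.
println(retRDD.toDebugString)
// expected to contain a line like (details vary by version):
//   CachedPartitions: N; MemorySize: ...; DiskSize: 0.0 B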
3.4 Execute the action again

scala> retRDD.count()
res1: Long = 1388678
Jobs interface:
Stages interface:
Storage interface:
Analysis: from these observations we can see that none of retRDD's upstream operations were executed this time; the action ran directly against the cached RDD, so the execution time improved dramatically.
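Once you are finished with the cached data you can release it explicitly, after which it disappears from the Storage page and later actions recompute from the lineage (standard API, not part of the original walkthrough):
retRDD.unpersist()  // drop the cached blocks; the next count() recomputes everything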
3.5 Run the action several times without caching the RDD (important)
Restart spark-shell and run the following:
scala> val linesRDD = sc.textFile("/Users/yeyonghao/wordcount_text.txt")
linesRDD: org.apache.spark.rdd.RDD[String] = /Users/yeyonghao/wordcount_text.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> val retRDD = linesRDD.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
retRDD: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:29

scala> retRDD.count()
res0: Long = 1388678

scala> retRDD.count()
res1: Long = 1388678

scala> retRDD.count()
res2: Long = 1388678
Jobs interface:
Stages interface for all jobs:
Storage interface:
Now look at the detailed stage view of one of the last two jobs:
You can see that this looks the same as it did after the explicit RDD cache earlier. The reason is that, within this lineage, even though the last RDD of the stage was never explicitly cached, Spark retains the shuffle output that stage produced, so subsequent jobs can skip recomputing it. This implicit reuse is not guaranteed, though: if retRDD underwent a further transformation, its data would not be kept after the action completes, since the iterative computation simply moves on to the next RDD. By contrast, when retRDD is cached explicitly it appears on the Storage page and remains in memory no matter what operations follow. That is the biggest difference between actively caching an RDD and relying on this implicit reuse.
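To make the difference concrete, here is a small experiment you could run (illustrative, not from the original article): derive a new RDD from retRDD and trigger an action; without an explicit cache, only the shuffle output of the reduceByKey stage is reused and retRDD never appears on the Storage page, whereas with retRDD.cache() it would stay there regardless of later transformations:
// Derive a new RDD from retRDD and run an action on it.
val upper = retRDD.map { case (word, n) => (word.toUpperCase, n) }
upper.count()  // reuses the reduceByKey shuffle output, but retRDD is not on the Storage page
// retRDD.cache()  // with this, retRDD would remain on the Storage page afterwards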
4 Notes
Many details cannot be shown here and call for further hands-on practice; if possible, reading the source code is also an excellent option. Still, the procedure above is a good way to verify the behavior for yourself, and working through it should give you a much better understanding of RDD persistence than the abstract concepts alone.