Analysis of a Spark cache usage example

2025-07-19 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report

This article walks through a small example of using Spark's cache on a DataFrame, based on a simple timing test.

Note: the test uses internal data files, so the data is not published here. Only the test code and the test results are shown.

The test was run in an interactive environment (Jupyter Notebook). If the code is submitted directly as a job, the results may differ.

Test procedure

Initialize Spark

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("Cache Demo") \
        .master("spark://10.206.132.113:7077") \
        .config('spark.driver.memory', '5g') \
        .config('spark.executor.memory', '5g') \
        .config("spark.cores.max", 20) \
        .getOrCreate()

Read two files for testing, and call cache() on only one of them

    import os

    # data_path is the directory holding the internal data files (not published)
    ds1 = spark.read.json(os.path.join(data_path, "data.2018-01-04"))
    ds2 = spark.read.json(os.path.join(data_path, "data.2018-01-05"))
    ds1.cache()  # cache only the first DataFrame

Note: the two data files were generated on January 4 and January 5, respectively. Their sizes are very close, roughly 3.1 GB each.

Two different data files are read so that any caching Spark does for one DataFrame cannot influence the measurements on the other.

Timing function:

    import time

    def calc_timing(ds, app_name):
        t1 = time.time()
        # filter() is lazy; first() is the action that actually runs the job
        related = ds.filter("app_name = '%s'" % app_name)
        _1stRow = related.first()
        t2 = time.time()
        print("cost time:", t2 - t1)
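For repeated measurements it can be handier to return the elapsed time than to print it. The following is a minimal sketch, not the code used in this test: `timed` is a hypothetical helper that wraps any zero-argument callable, and it uses `time.perf_counter`, which is intended for measuring short intervals.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With the DataFrames from this test, a call might look like `row, secs = timed(lambda: ds1.filter("app_name = 'DrUnzip'").first())`.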

Test results:

    calc_timing(ds1, "DrUnzip")  # cost time: 13.3130679131
    calc_timing(ds2, "DrUnzip")  # cost time: 18.0472488403
    calc_timing(ds1, "DrUnzip")  # cost time: 0.868658065796
    calc_timing(ds2, "DrUnzip")  # cost time: 15.8150720596

You can see:

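For ds1, although cache() had already been called, the first filter operation was still slow. cache() is lazy: it only marks the DataFrame, and the data is stored when an action first computes it, so the first run did not benefit from the cache.

The second time ds1 was used, it was much faster because the cached data was available.

By contrast, the execution time for ds2 barely changed between its two runs.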
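The gap between the two DataFrames is easier to see as a ratio. The short calculation below uses only the four timings reported in the test results; the `speedup` helper is an illustrative name introduced here.

```python
# Timings in seconds, copied from the test results in this article.
ds1_runs = [13.3130679131, 0.868658065796]   # DataFrame with cache() called
ds2_runs = [18.0472488403, 15.8150720596]    # DataFrame without cache()


def speedup(first, second):
    """Return how many times faster the second run was than the first."""
    return first / second


ds1_speedup = speedup(*ds1_runs)
ds2_speedup = speedup(*ds2_runs)
print("ds1 second run: %.1fx faster" % ds1_speedup)
print("ds2 second run: %.2fx faster" % ds2_speedup)
```

The second ds1 run comes out at about 15 times faster than the first, while the second ds2 run is only marginally faster, which is within what run-to-run variation could plausibly explain.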

If you check the execution time of each job in the Spark UI, you will find that reading the data file alone takes about 15 seconds.

It can therefore be inferred that, after a Spark DataFrame reads its data, running the same operation a second time does not reduce the time taken, because Spark does not keep a DataFrame in memory by default. Each action reads the data again unless the DataFrame has been cached.
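Because cache() only marks a DataFrame and the data is stored when an action computes it, one common way to pay the read cost up front is to run a full-scan action such as count() right after cache(). The sketch below assumes `df` is a PySpark DataFrame; `warm_cache` and `release_cache` are hypothetical helper names, not part of Spark and not part of the original test.

```python
def warm_cache(df):
    """Mark df for caching, then run count() so the data is computed and stored.

    Assumes df is a pyspark.sql.DataFrame; the helper name is illustrative only.
    """
    df.cache()          # lazy: nothing is stored yet
    return df.count()   # action: scans the data, which populates the cache


def release_cache(df):
    """Drop the cached data once it is no longer needed."""
    df.unpersist()
```

In this test, calling `warm_cache(ds1)` once before the timed runs would be expected to move the file-read cost out of the first calc_timing call, and `release_cache(ds1)` would free the memory afterwards.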
