This article walks through a sample analysis of how Spark's cache is used. Caching often behaves differently from what people expect in practice, so the small test below shows the behaviour step by step; I hope you find it useful.
Note: the test uses an internal data file that cannot be published here, so only the test code and the test results are shown.
The test was run in an interactive environment (Jupyter Notebook). If you submit the job directly with spark-submit, the results may differ.
Test procedure
Initialize Spark
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Cache Demo") \
    .master("spark://10.206.132.113:7077") \
    .config('spark.driver.memory', '5g') \
    .config('spark.executor.memory', '5g') \
    .config("spark.cores.max", 20) \
    .getOrCreate()
Read two files for testing, and cache one of them
import os  # data_path (defined elsewhere) points to the directory holding the test files

ds1 = spark.read.json(os.path.join(data_path, "data.2018-01-04"))
ds2 = spark.read.json(os.path.join(data_path, "data.2018-01-05"))
ds1.cache()  # cache the first DataFrame only
Note: the two data files were generated on January 4 and January 5, respectively. They are nearly the same size, about 3.1 GB each.
Two different data files are read so that any caching Spark might do on its own does not affect the experiment.
Calculation time:
import time

def calc_timing(ds, app_name):
    t1 = time.time()
    related = ds.filter("app_name = '%s'" % app_name)
    _1stRow = related.first()  # first() is an action, so the query actually runs here
    t2 = time.time()
    print("cost time:", t2 - t1)
Test results:
calc_timing(ds1, "DrUnzip")  # cost time: 13.3130679131
calc_timing(ds2, "DrUnzip")  # cost time: 18.0472488403
calc_timing(ds1, "DrUnzip")  # cost time: 0.868658065796
calc_timing(ds2, "DrUnzip")  # cost time: 15.8150720596
You can see:
For ds1, although cache() was called, the first filter operation is still slow, because the cache is lazy and had not actually been materialized at that point (see the sketch below).
The second time ds1 is queried, it is much faster because the data is now served from the cache.
By contrast, the execution times for ds2 differ very little between the two runs.
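Since cache() only marks a DataFrame for caching, the cache can be warmed explicitly with an action before any timing is done. A minimal sketch, reusing the variables defined above:

# cache() is lazy: it marks ds1 for caching but moves no data yet.
ds1.cache()
# An action such as count() scans the whole file and materializes the cache.
ds1.count()
# From now on, even the first timed filter on ds1 reads from memory.
calc_timing(ds1, "DrUnzip")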
If you check the execution time of each job in the Spark UI, you will find that reading the data files takes only about 15 seconds.
Therefore, it is reasonable to infer that once Spark has read the data into a DataFrame, running the same operation twice does not make the second run any cheaper, because Spark does not keep the DataFrame in memory by default.
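To verify this, a DataFrame exposes its caching state directly. A small sketch using the variables from the test above (the printed values are illustrative):

print(ds1.is_cached)     # True: ds1 was explicitly marked for caching
print(ds1.storageLevel)  # cache() on a DataFrame defaults to MEMORY_AND_DISK
print(ds2.is_cached)     # False: Spark keeps nothing in memory unless asked to
ds1.unpersist()          # release the cached blocks once they are no longer needed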
This concludes the sample analysis of Spark cache usage. Thanks for reading!