How to use Spark2.1.0

This article introduces the basics of how to use Spark 2.1.0. Many people run into questions like these when working on real cases, so let the editor walk you through how to handle them. I hope you read it carefully and come away with something useful!

Run spark-shell

In the article "Preparing the runtime environment for Spark 2.1.0", spark-shell was run briefly and its output was shown in a figure, which is reproduced below.

Figure 1: Executing spark-shell to enter the Scala command line

Figure 1 shows quite a lot of information; a few points are worth explaining:

After installing Spark 2.1.0, if no log4j configuration is explicitly specified, Spark uses log4j-defaults.properties under the org/apache/spark/ directory of the core module as its default log4j configuration. The log level specified by log4j-defaults.properties is WARN. Users can copy log4j.properties.template in the conf folder of the Spark installation directory to log4j.properties and add the configuration they want to it.

In addition to specifying a log4j.properties file, you can also set the log level with the sc.setLogLevel(newLevel) statement on the spark-shell command line.
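
For example, the following statement raises the log level to INFO for the current session only; it changes the running SparkContext and does not touch log4j.properties:

sc.setLogLevel("INFO")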

The address of SparkContext's Web UI is http://192.168.0.106:4040. Here 192.168.0.106 is the IP address of the machine where the author installed Spark, and 4040 is the default listening port of SparkContext's Web UI.

The specified deployment mode, or master, is local[*]. The ID of the current Application is local-1497084620457.

On the spark-shell command line you can use SparkContext through sc and SparkSession through spark. sc and spark are the variable names of SparkContext and SparkSession in the Spark REPL; the details have been analyzed in the article "Spark2.1.0 Analysis spark-shell".
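
A few quick checks at the spark-shell prompt confirm the points above (the values will of course differ on your machine):

sc.master                // the master, e.g. local[*]
sc.applicationId         // the Application ID, e.g. local-1497084620457
sc.uiWebUrl              // the address of SparkContext's Web UI
spark.sparkContext eq sc // spark and sc wrap the same SparkContext, so this returns true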

Because the default log level for Spark core is WARN, you don't see much information. Now let's copy log4j.properties.template in the conf folder of the Spark installation directory to log4j.properties with the following command:

cp log4j.properties.template log4j.properties

Then change log4j.logger.org.apache.spark.repl.Main=WARN in log4j.properties to log4j.logger.org.apache.spark.repl.Main=INFO. When we run spark-shell again, much more information is printed, as shown in figure 2.

Figure 2: Part of the information printed during Spark startup

From the startup log shown in figure 2, we can see SecurityManager, SparkEnv, BlockManagerMasterEndpoint, DiskBlockManager, MemoryStore, SparkUI, Executor, NettyBlockTransferService, BlockManager, BlockManagerMaster, and so on. What do they do? Readers who are new to Spark only need to know that these components exist for now; the details will be given in a later blog post.

Execute word count

In this section, we look at how Spark executes tasks through the familiar example of word count. After spark-shell starts and the Scala command line opens, enter the script in the following steps:

Step 1

Enter val lines = sc.textFile("../README.md", 2) to use the contents of the README.md file in the Spark installation directory as the data source for the word count example; the second argument asks for the file to be read as two partitions. The execution result is shown in figure 3.

Figure 3: Execution result of step 1

Figure 3 tells us that the actual type of lines is MapPartitionsRDD.
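
To double-check the partitioning and take a quick look at the data, you can run the following in the REPL (the sample lines depend on your README.md):

lines.getNumPartitions // should be 2, matching the second argument of textFile
lines.take(2)          // the first two lines of README.md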

Step 2

The textFile method reads the text file line by line. Enter val words = lines.flatMap(line => line.split(" ")) to split each line of text on spaces and obtain the individual words. The execution result is shown in figure 4.

Figure 4: Execution result of step 2

Figure 4 tells us that the actual type of words, obtained from lines through the flatMap transformation, is also MapPartitionsRDD.
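
You can peek at the words produced with the following (the output varies with your README.md):

words.take(10) // the first ten words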

Step 3

For each word obtained, initialize its count to 1 by typing val ones = words.map(w => (w, 1)). The execution result is shown in figure 5.

Figure 5: Execution result of step 3

Figure 5 tells us that the actual type of ones, obtained from words through the map transformation, is also MapPartitionsRDD.
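
Each element of ones is now a (word, 1) pair, which can be verified with:

ones.take(5) // the first five (word, 1) pairs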

Step 4

Enter val counts = ones.reduceByKey(_ + _) to sum the count values of each word. The execution result is shown in figure 6.

Figure 6: Execution result of step 4

Figure 6 tells us that the actual type of counts, obtained from ones through the reduceByKey transformation, is ShuffledRDD.
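
The RDD types mentioned so far can also be read off the lineage of counts (the exact IDs and indentation will differ):

counts.toDebugString // prints the lineage: the ShuffledRDD, the MapPartitionsRDDs, and the HadoopRDD that reads the file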

Step 5

Type counts.foreach(println) to print the count of each word. The execution of the job is shown in figures 7 and 8, and its output is shown in figure 9.

Figure 7: Execution process of step 5 (part 1)

Figure 8: Execution process of step 5 (part 2)

Figures 7 and 8 show a lot of information about job submission and execution; the key points are:

The ID generated by SparkContext for the submitted Job is 0.

There are four RDDs, divided between a ShuffleMapStage and a ResultStage. The ShuffleMapStage has ID 0 and attempt number 0; the ResultStage has ID 1 and attempt number 0. In Spark, if a Stage does not complete, it will be retried several times. Whether a Stage runs for the first time or is retried, it counts as a Stage attempt (StageAttempt), and each attempt has a unique attempt number (AttemptNumber).

Because the Job has two partitions, two Tasks are submitted for each of the ShuffleMapStage and the ResultStage. A Task may also be attempted multiple times, so Tasks have attempt numbers of their own. The figures show that the attempt numbers of the two Tasks in the ShuffleMapStage and the two Tasks in the ResultStage are all 0.

HadoopRDD is used to read the contents of the file.

Figure 9: Output of step 5

Figure 9 shows the output of the word count and the log information printed when the job finishes.
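
For reference, here are the five steps above collected into a single spark-shell session (the path ../README.md assumes, as before, that spark-shell was started from the bin directory of the Spark installation):

val lines = sc.textFile("../README.md", 2)         // read README.md as two partitions (MapPartitionsRDD)
val words = lines.flatMap(line => line.split(" ")) // split each line into words
val ones = words.map(w => (w, 1))                  // pair each word with an initial count of 1
val counts = ones.reduceByKey(_ + _)               // sum the counts per word (ShuffledRDD)
counts.foreach(println)                            // print each (word, count) pair

Note that counts.foreach(println) runs inside the executors; in local mode the executor shares the driver's JVM, so the output appears directly in the spark-shell console.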

That's all for "how to use Spark2.1.0". Thank you for reading. If you want to learn more about the industry, you can follow this website, where the editor will keep publishing high-quality practical articles!
