How to configure spark initialization

This article mainly explains how to configure Spark initialization. The content is simple and clear and should be easy to learn and understand; please follow along to study how to configure Spark initialization.

1. At a high level, each Spark application contains a driver program that runs the user's main function and executes various parallel operations on a cluster. Spark's main abstraction is the RDD, a collection of elements partitioned across the cluster that can be operated on in parallel. RDDs can be created from files on HDFS, from existing collections in the driver program, or by transforming existing RDDs. Users can also persist an RDD in memory for efficient reuse, and RDDs automatically recover from failures. Spark's second abstraction is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel it launches separate tasks on different nodes. Spark supports two types of shared variables: broadcast variables, which cache a value on each node, and accumulators, which support only accumulation (add) operations.
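
As a quick illustration of the two shared-variable types, here is a minimal sketch in Java. It assumes a JavaSparkContext named sc already exists (created as in section 2 below); the list contents are made up for the example.

// Assumed imports for this sketch:
// import java.util.Arrays;
// import org.apache.spark.broadcast.Broadcast;
// import org.apache.spark.util.LongAccumulator;
Broadcast<java.util.List<Integer>> allowed = sc.broadcast(Arrays.asList(1, 2, 3));  // cached once on each node
LongAccumulator misses = sc.sc().longAccumulator("misses");                          // add-only shared counter
sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).foreach(x -> {
    if (!allowed.value().contains(x)) {
        misses.add(1);    // tasks only add; the driver reads the final value
    }
});
System.out.println("values outside the broadcast set: " + misses.value());           // prints 2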

2. Spark initialization

A Spark application should first create a JavaSparkContext object:

SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

setAppName sets the application name shown on the cluster UI, and master is the URL of the YARN cluster, or the string "local" to run in local mode. If the jar package is submitted with the spark-submit command, the master can instead be specified with the --master option.

3. Create RDD

RDDs can be created by parallelizing an existing collection in the driver program, or by referencing a dataset in an external storage system such as HDFS or HBase:

List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data, 20);

An important parameter for parallelized collections is the number of partitions (slices) to cut the dataset into; each slice corresponds to one task. By default, Spark determines the number of slices automatically based on the cluster.
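
As a small check (a sketch reusing the distData RDD above), the partition count can be read back on the driver; each of these partitions becomes one task:

System.out.println("partitions: " + distData.getNumPartitions());   // prints 20 for the RDD above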

Text files can be loaded with textFile, which also accepts the number of slices:

JavaRDD<String> distFile = sc.textFile("data.txt", 20);

If the path refers to the local filesystem, the file must also be available at the same path on all worker nodes, either by copying it there or by using a network shared filesystem. All of Spark's file-based input methods also support directories, compressed files, and wildcards. By default, one block corresponds to one slice. Besides text files, Spark supports other data formats, such as JavaSparkContext.wholeTextFiles and SequenceFiles.
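
For example (a hedged sketch; the paths are made up), wholeTextFiles reads a directory of small files as (filename, content) pairs, and textFile handles wildcards and compressed files:

// Assumed import: org.apache.spark.api.java.JavaPairRDD
JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///logs/");   // one (filename, content) pair per file
JavaRDD<String> gz = sc.textFile("data/*.gz");                            // wildcards and .gz files are read directly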

4. RDDs support two kinds of operations: transformations, which create a new dataset from an existing one, and actions, which run a computation on the RDD and return a value to the driver. For example, map is a transformation that passes every element through a function and returns a new RDD with the results, while reduce is an action that aggregates all elements with a function and returns the result to the driver. Transformations are not executed immediately when called; they only run when an action needs to return a result to the driver. This design makes Spark more efficient: if an RDD is passed through map and then aggregated with reduce, only the reduced result is returned to the driver, not the larger dataset produced by map. By default an RDD is not kept in memory; the persist or cache methods can be called to persist it, and an RDD can also be persisted to disk or replicated across multiple nodes.
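
A minimal sketch of this map-then-reduce pattern, reusing the distFile RDD from section 3 (the StorageLevel import is assumed):

// import org.apache.spark.storage.StorageLevel;
JavaRDD<Integer> lineLengths = distFile.map(line -> line.length());   // transformation: nothing runs yet
lineLengths.persist(StorageLevel.MEMORY_ONLY());                      // keep the intermediate RDD in memory for reuse
int totalLength = lineLengths.reduce((a, b) -> a + b);                 // action: triggers the whole computation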

5. RDD transformation methods

map(T -> U)
filter(T -> Boolean)
flatMap(T -> Iterator<U>)
mapPartitions(Iterator<T> -> Iterator<U>)
mapPartitionsWithIndex((int, Iterator<T>) -> Iterator<U>)
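
A hedged example of a few of these transformations, on a made-up RDD of sentences:

JavaRDD<String> sentences = sc.parallelize(Arrays.asList("hello spark", "rdd basics"));
JavaRDD<String> words = sentences.flatMap(s -> Arrays.asList(s.split(" ")).iterator()); // flatMap: T -> Iterator<U>
JavaRDD<String> nonEmpty = words.filter(w -> !w.isEmpty());                             // filter: T -> Boolean
JavaRDD<Integer> lengths = nonEmpty.map(String::length);                                // map: T -> U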

Thank you for reading. This concludes "how to configure spark initialization". After studying this article, you should have a deeper understanding of how to configure Spark initialization; specific usage still needs to be verified in practice. The editor will continue to push more articles on related knowledge points, so you are welcome to follow!
