How to Create an RDD in Spark

2025-01-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com), 05/31

This article explains how to create RDDs in Spark. Many people run into difficulties with this in practice, so the sections below walk through the underlying concepts and the basic operations step by step.

One: Scala

Scala is a modern multi-paradigm programming language designed to express common programming patterns in a concise, elegant, and type-safe way. It smoothly integrates features of object-oriented and functional languages. Scala runs on the Java Virtual Machine (JVM) and is compatible with existing Java programs.

Start spark-shell by executing the following command:

hadoop@master:/mysoftware/spark-1.6.1$ spark-shell

Two: Resilient distributed datasets (RDD)

1. RDD (Resilient Distributed Dataset).

Spark is a distributed computing framework, and the RDD is its abstraction for data held in distributed memory. An RDD can be regarded as the data structure of Spark's distributed algorithms, and operations on RDDs are their core primitives: higher-level algorithms are built from this data structure and these primitives. Spark ultimately translates an algorithm into a workflow in the form of a DAG (directed acyclic graph), which it uses to schedule and dispatch distributed tasks.

An RDD partitions its data across multiple machines in the cluster. Logically it can be thought of as a distributed array in which each record can be any user-defined data structure. The RDD is Spark's core data structure: the dependencies between RDDs determine Spark's scheduling order, and a whole Spark program is composed of operations on RDDs.

2. How an RDD is created

2.1 Create it from input in the Hadoop file system (HDFS) or another Hadoop-compatible persistent storage system, such as Hive or HBase.

2.2 Transform a parent RDD into a new RDD.

2.3 Turn local (single-machine) data into a distributed RDD via parallelize or makeRDD.

scala> var textFile = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt");
textFile: org.apache.spark.rdd.RDD[String] = hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> val a = sc.parallelize(1 to 9, 3)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:27
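The creation routes listed above can also be tried without an HDFS cluster. The following is a minimal sketch, assuming Spark is on the classpath and that a local master (`local[2]`) is acceptable; the application name, the variable names, and the placeholder HDFS path in the comment are illustrative assumptions, not part of the original session.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the running context (for example the one spark-shell provides),
// or start a local one if none exists yet.
val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// From local data: parallelize splits the range into 3 partitions.
val numbers = sc.parallelize(1 to 9, 3)

// makeRDD accepts the same arguments as parallelize for this use.
val sameNumbers = sc.makeRDD(1 to 9, 3)

// From a parent RDD: a transformation yields a new RDD.
val doubled = numbers.map(_ * 2)

// From storage (hypothetical path, left commented out on purpose):
// val lines = sc.textFile("hdfs://<namenode>:9000/path/to/file.txt")

val numPartitions = numbers.partitions.length
val doubledValues = doubled.collect().toList
println(s"partitions = $numPartitions, doubled = $doubledValues")
```

In spark-shell the `getOrCreate` call simply returns the existing `sc`, so the local master setting only takes effect when the sketch runs on its own.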

3. RDDs have two kinds of operators: Transformation and Action.

Transformation: evaluated lazily. Converting one RDD into another is not performed immediately; the computation is only triggered when an Action is invoked.

Action: an Action operator triggers Spark to submit a job (Job) and outputs data from the Spark system.
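The difference between the two kinds of operators can be sketched as follows. This is an illustration under assumptions: a local master, and made-up sample words standing in for the contents of the text file used elsewhere in this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical sample data.
val words = sc.parallelize(Seq("spark", "rdd", "iceberg", "scala"))

// Transformations: these lines only record how to derive the new RDDs.
// No job has been submitted yet.
val withBerg = words.filter(_.contains("berg"))
val lengths = withBerg.map(_.length)

// Actions: each call submits a job and returns a plain Scala value.
val matchCount = withBerg.count()
val totalLength = lengths.reduce(_ + _)
println(s"matches = $matchCount, total length = $totalLength")
```

Only "iceberg" contains "berg", so the count is 1 and the summed length is 7.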

4. Important internal properties of an RDD.

4.1 Partition list: through the partition list, you can find all the partitions that make up an RDD and their addresses.

4.2 A function for computing each partition: through this function, the user-defined operation required by the RDD is applied to each data block.

4.3 A list of dependencies on parent RDDs: this allows an RDD to trace back to its parents, which supports fault tolerance among other things.

4.4 A partitioner for key-value RDDs, which controls the partitioning strategy and the number of partitions. The partition function determines how data records are distributed across partitions and nodes, which helps reduce skew.

4.5 A list of addresses for each data partition (such as the address of a block on HDFS).

If the data has replicas, the address list gives every replica address of a data block, which supports load balancing and fault tolerance.
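Most of these properties can be inspected from the public RDD API. The sketch below assumes a local master and a small made-up key-value dataset; the variable names are illustrative. The per-partition compute function itself is internal, so `mapPartitions` is used here as the user-facing way to run a function once per partition.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-properties-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical key-value data in 2 partitions.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
// reduceByKey with an explicit partition count shuffles into 4 partitions.
val summed = pairs.reduceByKey(_ + _, 4)

// Partition list.
val partitionCount = summed.partitions.length
// A function applied to each partition: count the records in each one.
val recordsPerPartition = pairs.mapPartitions(iter => Iterator(iter.size)).collect().toList
// Dependencies on parent RDDs: none for pairs, one (a shuffle) for summed.
val rootDeps = pairs.dependencies.length
val shuffleDeps = summed.dependencies.length
// Partitioner: only set for key-value RDDs that were explicitly partitioned.
val partitioner = summed.partitioner
// Preferred locations of a partition (typically empty for in-memory data).
val locations = pairs.preferredLocations(pairs.partitions(0))

println(s"partitions=$partitionCount deps=$shuffleDeps partitioner=$partitioner locations=$locations")
```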

5. Spark computing workflow

A Spark computation consists of three stages: input, transformation, and output. During the transformation stage, RDDs are converted by operators, and the functions passed to those operators transform and manipulate the data in each RDD.

Input: when a Spark program runs, data is read from an external data space (e.g. HDFS) into Spark. It enters the Spark runtime data space, where it is turned into data blocks managed by the BlockManager.

Operation: once the input has formed an RDD, you can manipulate the data by converting the RDD into new RDDs with transformation operators such as filter, and trigger Spark to submit jobs with Action operators. If the data needs to be reused, it can be cached in memory with the cache operator.

Output: when the program finishes, data leaves the Spark runtime space and is either stored in distributed storage (for example, saveAsTextFile writes to HDFS) or returned as Scala values or collections (collect returns a Scala collection, count returns a Scala Long).
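The three stages can be sketched end to end as follows. This is a minimal example under assumptions: a local master, an in-memory collection standing in for the HDFS input, and a placeholder output path that is left commented out.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("workflow-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Input: a hypothetical stand-in for sc.textFile(...); each line is "id<TAB>word".
val lines = sc.parallelize(Seq("1\tspark", "2\ticeberg", "3\tscala"))

// Operation: transformations build new RDDs; cache() marks the parsed rows for reuse.
val fields = lines.map(_.split("\t")).cache()
val longWords = fields.filter(cols => cols(1).length > 5)

// Output: actions return plain Scala values to the driver.
val recordCount = fields.count()
val longWordList = longWords.map(cols => cols(1)).collect().toList
println(s"records = $recordCount, long words = $longWordList")

// Writing to distributed storage would look like this (hypothetical path):
// longWords.map(_.mkString("\t")).saveAsTextFile("hdfs://<namenode>:9000/path/to/output")
```

Because `fields` is cached, the second job reads the parsed rows from memory instead of splitting the lines again.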

The core data model of Spark is the RDD, but RDD is an abstract class that is implemented by subclasses such as MapPartitionsRDD and ShuffledRDD (older Spark releases also had MappedRDD). Spark implements the commonly used big data operations as subclasses of RDD.
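To see which concrete subclass backs a given RDD, one can print its runtime class name. The sketch below assumes a local master and made-up data; the class names noted in the comment are what Spark 1.3 and later are expected to report, which matches the MapPartitionsRDD and ParallelCollectionRDD names visible in the shell output in this article.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-subclass-sketch").setMaster("local[2]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical key-value data.
val base = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val mapped = base.map { case (k, v) => (k, v * 10) }
val reduced = mapped.reduceByKey(_ + _)

// Expected on Spark 1.3+: ParallelCollectionRDD, MapPartitionsRDD, ShuffledRDD.
val classNames = Seq(base, mapped, reduced).map(_.getClass.getSimpleName)
classNames.foreach(println)
```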

Some of its basic operations in use:

scala> 3*7
res0: Int = 21

scala> var textFile = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt")
textFile: org.apache.spark.rdd.RDD[String] = hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt MapPartitionsRDD[1] at textFile at <console>:27

scala> textFile.count()
res1: Long = 3

scala> textFile.first()
res2: String = 1 spark

scala> textFile.filter(line => line.contains("berg")).count()
res3: Long = 1

scala> textFile.filter(line => line.contains("bergs")).count()
res4: Long = 0

scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res5: Int = 1

scala> textFile.map(line => line.split("\t").size).reduce((a, b) => if (a > b) a else b)
res6: Int = 2

That concludes this introduction to creating RDDs in Spark. Thank you for reading.
