This article mainly introduces the creation of Spark RDDs and the use of RDD operators. Many people have questions about how to create RDDs and how to use their operators, so this article walks through simple, practical examples. I hope it helps answer your questions about creating Spark RDDs and using their operators. Let's get started!
One: A brief understanding of RDDs and how RDDs process data
RDD stands for Resilient Distributed Dataset. It is a fault-tolerant, parallel data structure that lets users explicitly persist data to disk or memory and control how the data is partitioned.
The core abstraction of Spark is the RDD (resilient distributed dataset), a read-only, partitioned, distributed dataset. All or part of the dataset can be cached in memory and reused across multiple computations.
In essence, an RDD is an in-memory dataset, and when an RDD is accessed, the pointer only touches the parts relevant to the operation. For example, consider a column-oriented data structure in which one column is implemented as an Int array and another as a Float array. If only the Int field is needed, the RDD pointer can access just the Int array and avoid scanning the entire structure.
RDD operations fall into two categories: transformations and actions. No matter how many transformations are applied, the RDD does not actually execute them; computation is triggered only when an action is invoked. Internally, the RDD's underlying interface is based on iterators, which makes data access more efficient and avoids the memory cost of materializing large intermediate results.
In the implementation, each transformation returns a corresponding subclass of RDD; for example, map returns a MappedRDD while flatMap returns a FlatMappedRDD. When we call map or flatMap, the current RDD object is simply wrapped by the corresponding new RDD object.
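To make the transformation/action distinction concrete, here is a minimal sketch, not taken from this article: the sample input values and class name TestLazyEval are made up for illustration. The map call returns immediately without touching the data; work happens only when the actions count and collect run, and cache() keeps the transformed data in memory so the second action reuses it.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class TestLazyEval {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext("local", "Spark Test", conf);

        JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "rdd", "operators"));

        // Transformation: only describes the computation, nothing runs yet.
        JavaRDD<Integer> lengths = words.map(new Function<String, Integer>() {
            public Integer call(String s) throws Exception {
                return s.length();
            }
        });

        // Keep the result in memory so later actions can reuse it.
        lengths.cache();

        // Actions: these trigger the actual computation.
        System.out.println(lengths.count());    // 3
        System.out.println(lengths.collect());  // [5, 3, 9]

        sc.close();
    }
}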
Note: for the Maven project used here, the following are the dependencies in pom.xml:
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.4</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>
Two: Creating an RDD from a file in the Hadoop file system (HDFS), or in other persistent storage systems compatible with Hadoop, such as Hive and HBase
Example: compute the length of each line in an HDFS file and the total length of all lines.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

public class TestRDD1 {

    public static void main(String[] args) {
        createRDDFromHDFS();
    }

    private static void createRDDFromHDFS() {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext("local", "Spark Test", conf);
        System.out.println(sc);

        // Create an RDD from a text file stored in HDFS.
        JavaRDD<String> rdd = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt");

        // Transformation: map each line to its length.
        JavaRDD<Integer> newRDD = rdd.map(new Function<String, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(String string) throws Exception {
                System.out.println(string + " " + string.length());
                return string.length();
            }
        });

        // Action: count the number of lines.
        System.out.println(newRDD.count());

        // Action: sum the line lengths.
        int length = newRDD.reduce(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(Integer int1, Integer int2) throws Exception {
                return int1 + int2;
            }
        });
        System.out.println("sum " + length);
    }
}
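For reference, the same computation can be written more compactly with Java 8 lambda expressions, since the Spark Java API's Function and Function2 are single-method interfaces. This is only an alternative sketch, assuming Java 8 is available and that sc is the JavaSparkContext created above; the HDFS path is the same one used in TestRDD1.

// Same line-length computation as TestRDD1, using Java 8 lambdas.
JavaRDD<String> rdd = sc.textFile("hdfs://192.168.226.129:9000/txt/sparkshell/sparkshell.txt");
JavaRDD<Integer> lengths = rdd.map(line -> line.length());
System.out.println(lengths.count());             // number of lines
int total = lengths.reduce((a, b) -> a + b);     // sum of line lengths
System.out.println("sum " + total);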
Three: Creating a distributed RDD from a local collection via parallelize or makeRDD
Example: sum the elements of a list.
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class TestRDD2 {

    public static void main(String[] args) {
        createRDDFromSuperRDD();
    }

    /*
     * JavaSparkContext(String master, String appName, SparkConf conf)
     *   master  - cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4])
     *   appName - a name for your application, to display on the cluster web UI
     *   conf    - a SparkConf object specifying other Spark parameters
     */
    private static void createRDDFromSuperRDD() {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext("local", "Spark Test", conf);
        System.out.println(sc);

        // Build a local collection of sample integers.
        List<Integer> list = new ArrayList<Integer>();
        for (int i = 1; i <= 10; i++) {
            list.add(i);
        }

        // Distribute the local collection as an RDD.
        JavaRDD<Integer> rdd = sc.parallelize(list);

        // Action: sum all elements.
        int sum = rdd.reduce(new Function2<Integer, Integer, Integer>() {
            private static final long serialVersionUID = 1L;
            public Integer call(Integer int1, Integer int2) throws Exception {
                return int1 + int2;
            }
        });
        System.out.println("sum " + sum);
    }
}
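One design note: parallelize also accepts an explicit number of partitions as a second argument, which controls how the local collection is split across tasks. A minimal sketch, assuming the list and sc from the example above; the partition count of 4 is an arbitrary choice for illustration.

// Split the same local collection into 4 partitions explicitly.
JavaRDD<Integer> rdd4 = sc.parallelize(list, 4);
System.out.println(rdd4.partitions().size());   // 4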