
How to use Spark Data Sources


This article mainly explains how to use Spark Data Sources. The method introduced here is simple, fast, and practical, so friends who are interested may find it worth a look. Let's walk through how to use Spark Data Sources.

1: Data Sources (data source):

1.1 Understanding data sources.

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on as a normal RDD, or it can be registered as a temporary table.

Registering a DataFrame as a table allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using Spark data sources, and then covers the specific options available for the built-in data sources.
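As a minimal sketch of this (assuming a SQLContext named sqlContext and an already-loaded DataFrame df, as in the full examples below), registering and querying looks like:

// Register the DataFrame as a temporary table so SQL can refer to it by name.
df.registerTempTable("people");

// Run an ordinary SQL query against the registered table; the result is itself a DataFrame.
DataFrame teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13");
teenagers.show();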

1.2 Generic Load/Save Functions.

In the simplest form, the default data source (parquet, unless otherwise configured via spark.sql.sources.default) is used for all operations.
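As a sketch, the default can also be changed at runtime through SQLContext.setConf; the "json" value here is only illustrative, and the file path reuses the one from the second example below:

// Switch the default data source from parquet to json (illustrative value).
sqlContext.setConf("spark.sql.sources.default", "json");

// With no explicit format, load() now expects a JSON file.
DataFrame df = sqlContext.read().load("hdfs://192.168.226.129:9000/txt/sparkshell/people.json");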

Eg: the first way to read: via parquetFile("xxx").

First, upload users.parquet from spark-1.6.1-bin-hadoop2.6\examples\src\main\resources to HDFS.

import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class SparkSqlDemo4 {

    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);

        // Create the SQLContext; note that it is the entry point for DataFrame.
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.read().load("hdfs://192.168.226.129:9000/txt/sparkshell/users.parquet");
        df.select("name", "favorite_color").write().save("namesAndFavColors.parquet");
        // To specify the save mode explicitly:
        // df.select("name", "favorite_color").write().mode(SaveMode.Overwrite).save("namesAndFavColors.parquet");

        // The first read method: parquetFile("xxx").
        DataFrame parquetFile = sqlContext.parquetFile("namesAndFavColors.parquet");
        parquetFile.registerTempTable("users");
        DataFrame df1 = sqlContext.sql("SELECT name,favorite_color FROM users");
        df1.show();

        List<String> listString = df1.javaRDD().map(new Function<Row, String>() {
            private static final long serialVersionUID = 1L;
            public String call(Row row) {
                return "Name: " + row.getString(0) + ", FavoriteColor: " + row.getString(1);
            }
        }).collect();

        for (String string : listString) {
            System.out.println(string);
        }
    }
}

The output is as follows:

+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          null|
|   Ben|           red|
+------+--------------+

Name: Alyssa, FavoriteColor: null
Name: Ben, FavoriteColor: red

1.3 Manually Specifying Options:

You can also manually specify the data source to use, along with any additional options you want to pass to it.

Data sources are specified by their fully qualified name (that is, org.apache.spark.sql.parquet), but for built-in sources you can also use their short names (json, parquet, jdbc).

Using this syntax, a DataFrame loaded from any data source type can be converted to another type, as sketched below.
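As a small sketch of the naming rule (assuming the same sqlContext as above; the file name reuses the one from the first example), the two reads below are equivalent ways of naming the built-in parquet source:

// Fully qualified name of the built-in parquet source.
DataFrame a = sqlContext.read().format("org.apache.spark.sql.parquet").load("namesAndFavColors.parquet");

// Short name of the same source.
DataFrame b = sqlContext.read().format("parquet").load("namesAndFavColors.parquet");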

Eg: the second way to read: via parquet("xxx").

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class SparkSqlDemo5 {

    private static String appName = "Test Spark RDD";
    private static String master = "local";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf();
        conf.set("spark.testing.memory", "269522560000");
        JavaSparkContext sc = new JavaSparkContext(master, appName, conf);

        // Create the SQLContext; note that it is the entry point for DataFrame.
        SQLContext sqlContext = new SQLContext(sc);
        DataFrame df = sqlContext.read().format("json").load("hdfs://192.168.226.129:9000/txt/sparkshell/people.json");
        df.select("id", "name", "sex", "age").write().format("parquet").save("people.parquet");

        // The second read method: read().parquet("xxx").
        DataFrame parquetFile = sqlContext.read().parquet("people.parquet");
        parquetFile.registerTempTable("people");
        // The original snippet is truncated mid-query; the upper bound below is assumed for illustration.
        DataFrame df1 = sqlContext.sql("SELECT id,name,sex,age FROM people WHERE age >= 21 AND age <= 30");
        df1.show();
    }
}
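The commented-out line in the first example hinted at save modes. The sketch below (assuming the same df as above) shows how SaveMode controls what happens when the output target already exists:

import org.apache.spark.sql.SaveMode;

// Overwrite replaces existing output; the other modes are Append, Ignore, and ErrorIfExists (the default).
df.select("id", "name", "sex", "age")
        .write()
        .format("parquet")
        .mode(SaveMode.Overwrite)
        .save("people.parquet");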
