
Spark Series (10) -- Spark SQL External Data Sources


I. Introduction

1.1 Support for multiple data sources

Spark supports the six core data sources listed below, and the Spark community provides connectors for hundreds of additional data sources, covering most usage scenarios.

CSV
JSON
Parquet
ORC
JDBC/ODBC connections
Plain-text files

Note: all test files used below can be downloaded from the resources directory of this repository.

1.2 Read data format

All read APIs follow the call format below:

// format
DataFrameReader.format(...).option("key", "value").schema(...).load()

// example
spark.read.format("csv")
  .option("mode", "FAILFAST")          // read mode
  .option("inferSchema", "true")       // whether to infer the schema automatically
  .option("path", "path/to/file(s)")   // file path
  .schema(someSchema)                  // use a predefined schema
  .load()

There are three options for read mode:

permissive: when a corrupted record is encountered, sets all of its fields to null and puts the raw malformed record in a string column named _corrupt_record
dropMalformed: drops rows that contain malformed records
failFast: fails immediately when a malformed record is encountered

1.3 Write data format

// format
DataFrameWriter.format(...).option(...).partitionBy(...).bucketBy(...).sortBy(...).save()

// example
dataframe.write.format("csv")
  .option("mode", "OVERWRITE")          // write mode
  .option("dateFormat", "yyyy-MM-dd")   // date format
  .option("path", "path/to/file(s)")    // file path
  .save()
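The partitionBy, bucketBy and sortBy calls in the generic writer format above are optional. As a rough sketch (df is assumed to hold the dept data introduced later, and the output path is an assumption), a partitioned write might look like this:

// writes one subdirectory per distinct deptno value, e.g. deptno=10/, deptno=20/
df.write.format("csv")
  .mode("overwrite")
  .partitionBy("deptno")              // partition the output files by the deptno column
  .save("/tmp/csv/dept_partitioned")  // assumed output path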

There are four options for the write mode:

SaveMode.ErrorIfExists: if a file already exists at the given path, an exception is thrown; this is the default write mode
SaveMode.Append: data is written in append mode
SaveMode.Overwrite: existing data is overwritten
SaveMode.Ignore: if a file already exists at the given path, nothing is done
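The mode can be set either with the SaveMode enum or with its string form. A minimal sketch (df and the output path are assumptions):

import org.apache.spark.sql.SaveMode

// append to any existing output instead of throwing an exception
df.write.format("csv").mode(SaveMode.Append).save("/tmp/csv/dept_append")

// the equivalent string form
df.write.format("csv").mode("append").save("/tmp/csv/dept_append")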

II. CSV

CSV is a common text file format in which each line represents a record and each field in the record is separated by a comma.

2.1 Read CSV files

An example of reading with automatic type inference:

Spark.read.format ("csv") .option ("header", "false") / / whether the first line in the file is the name of the column .option ("mode", "FAILFAST") / / whether to fail quickly. Option ("inferSchema", "true") / / whether to automatically infer schema.load ("/ usr/file/csv/dept.csv") .show ()

Use a predefined schema:

import org.apache.spark.sql.types.{StructField, StructType, StringType, LongType}

// predefined data schema
val myManualSchema = new StructType(Array(
  StructField("deptno", LongType, nullable = false),
  StructField("dname", StringType, nullable = true),
  StructField("loc", StringType, nullable = true)
))

spark.read.format("csv")
  .option("mode", "FAILFAST")
  .schema(myManualSchema)
  .load("/usr/file/csv/dept.csv")
  .show()

2.2 Write to CSV files

df.write.format("csv").mode("overwrite").save("/tmp/csv/dept2")

You can also specify a custom delimiter:

df.write.format("csv").mode("overwrite").option("sep", "\t").save("/tmp/csv/dept2")

2.3 Optional configuration

To keep the main text concise, the full list of read and write configuration options is given in section 9.1 at the end of the article.

III. JSON

3.1 Read JSON files

spark.read.format("json").option("mode", "FAILFAST").load("/usr/file/json/dept.json").show(5)

Note that, by default, a single data record may not span multiple lines. You can change this by setting the multiLine option to true; its default value is false. A minimal sketch of enabling the option is shown below, followed by the two record layouts.
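As a rough illustration (reusing the file path from the example above), reading multi-line JSON might look like this:

spark.read.format("json")
  .option("multiLine", "true")   // allow a single record to span multiple lines
  .option("mode", "FAILFAST")
  .load("/usr/file/json/dept.json")
  .show(5)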

// a single-line record (supported by default)
{"DEPTNO": 10, "DNAME": "ACCOUNTING", "LOC": "NEW YORK"}

// a record spanning multiple lines (requires multiLine = true)
{
  "DEPTNO": 10,
  "DNAME": "ACCOUNTING",
  "LOC": "NEW YORK"
}

3.2 Write JSON files

df.write.format("json").mode("overwrite").save("/tmp/spark/json/dept")

3.3 Optional configuration

To keep the main text concise, the full list of read and write configuration options is given in section 9.2 at the end of the article.

IV. Parquet

Parquet is an open-source, column-oriented data storage format that provides a variety of storage optimizations. It allows individual columns to be read instead of the entire file, which both saves storage space and improves read performance. Parquet is the default file format for Spark.

4.1 Read Parquet files

spark.read.format("parquet").load("/usr/file/parquet/dept.parquet").show(5)

4.2 Write Parquet files

df.write.format("parquet").mode("overwrite").save("/tmp/spark/parquet/dept")

4.3 Optional configuration

Parquet files have their own storage rules, so there are few optional configuration options. The following two are commonly used:

Write: compression or codec. Possible values: none, uncompressed, bzip2, deflate, gzip, lz4, snappy (default: none). The compression codec to use for the output files.
Read: mergeSchema. Possible values: true, false (default: the value of spark.sql.parquet.mergeSchema). When true, the Parquet data source merges the schemas collected from all data files; otherwise the schema is taken from the summary file or, if no summary file is available, from a random data file.
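A minimal sketch of enabling schema merging for a single read (the directory path is an assumption):

spark.read.format("parquet")
  .option("mergeSchema", "true")    // merge the schemas found across all Parquet files
  .load("/tmp/spark/parquet/dept")  // assumed directory containing the Parquet files
  .show()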

More optional configurations can be found in the official documentation: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

V. ORC

ORC is a self-describing, type-aware columnar file format optimized for reading and writing large volumes of data. It is another file format commonly used in big data systems.

5.1 Read ORC files

spark.read.format("orc").load("/usr/file/orc/dept.orc").show(5)

5.2 Write ORC files

csvFile.write.format("orc").mode("overwrite").save("/tmp/spark/orc/dept")

VI. SQL Databases

Spark can also read from and write to traditional relational databases. However, Spark does not ship with database drivers by default, so before using this feature you need to copy the appropriate driver JAR into the jars directory of your Spark installation. The following examples use a MySQL database, so the corresponding mysql-connector-java-x.x.x.jar must be placed in the jars directory first.

6.1 Read data

An example of reading a full table is shown below. Here help_keyword is a dictionary table built into MySQL that has only two fields, help_keyword_id and name.

Spark.read.format ("jdbc") .option ("driver", "com.mysql.jdbc.Driver") / / driver .option ("url", "jdbc:mysql://127.0.0.1:3306/mysql") / / database address .option ("dbtable", "help_keyword") / / table name .option ("user", "root") .option ("password") "root"). Load (). Show (10)

Read data from the query results:

val pushDownQuery = """(SELECT * FROM help_keyword WHERE help_keyword_id
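The snippet above is truncated in the source. A rough sketch of how such a pushdown query is typically completed and passed to the jdbc source, reusing the connection options from the previous example (the filter condition and the alias are assumptions, not the original code):

// the subquery must be aliased so it can be supplied as the dbtable value
val pushDownQuery = """(SELECT * FROM help_keyword WHERE help_keyword_id < 20) AS help_keywords"""

spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("url", "jdbc:mysql://127.0.0.1:3306/mysql")
  .option("user", "root")
  .option("password", "root")
  .option("dbtable", pushDownQuery)   // the query result is read as if it were a table
  .load()
  .show(10)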
