This article introduces the mechanism of Spark SQL external data sources and the use of spark-sql. Many people run into questions about both in day-to-day work, so the notes below collect simple, practical operations; hopefully they help answer those questions. Please read on.
I. Data notes and miscellany
1. External Data Source API: Spark SQL's interface for reading and writing external data sources.
2. JSON has drawbacks. For example, the first read may return id:1,name:xxx and the second id:1,name:xxx,session:222, so the code has to change; data types are also a problem, because a value like id:"xxx" (a string where a number is expected) does not work at all.
3. Commonly used external data sources include file systems such as HDFS, HBase, S3 and OSS. To join HDFS data with MySQL you would normally use Sqoop to import both into Hive first, but with Spark you can read and join them directly, which is more flexible.
4. --packages: its advantage is that missing dependencies are downloaded automatically for local use; its drawback is that production clusters usually cannot reach the Internet, so Maven is useless there. In that case --jars can be used to ship the jar packages instead.
5. Data sources such as json, csv, hdfs, hive, jdbc, s3, parquet, es and redis fall into two categories: built-in and third-party (external). spark.read.load() reads parquet files by default, as sketched below.
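For illustration, a minimal sketch of the two read styles, assuming a SparkSession named spark and hypothetical file paths:

// With no format given, load() falls back to the default source, which is parquet.
val parquetDF = spark.read.load("file:///root/test/users.parquet")
// Any other built-in source is named explicitly.
val jsonDF = spark.read.format("json").load("file:///root/test/people.json")
jsonDF.printSchema()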
6. To add a third-party package, take csv as an example; available packages are listed at https://spark-packages.org.
7. The standard read and write API; a sketch follows.
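A minimal sketch of the standard write path, assuming user is a DataFrame that is already loaded and the output path is hypothetical:

user.select("name")
  .write
  .format("json")
  .mode("overwrite")   // save modes are covered in section II
  .save("file:///root/test/json_out/")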
8. Custom constraint condition
9. Other complex types such as arrays are supported, as in Hive; see the sketch below.
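A minimal sketch, assuming a hypothetical people.json whose phones field is a JSON array; Spark infers it as array<string>, much like a Hive array column:

val people = spark.read.format("json").load("file:///root/test/people.json")
people.printSchema()
people.selectExpr("name", "phones[0] AS first_phone").show()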
II. JDBC read and write issues
1. Writing fails when the output path or table already exists (the "already exists" error).
2. Solution: the official site documents the save modes. The default throws an exception when the target exists; Append writes another copy each time it runs, with the drawback that it does not guarantee identical results on every run; Overwrite clears the existing data when the target table exists; Ignore skips the write if the target already has data.
3. To inspect the contents of the files you write, turn off compression:
user.select("name").write.format("json").option("compression", "none").save("file:///root/test/json1/")
user.select("name").write.format("json").save("/root/test/json1/")
4. The mode() source code shows that both uppercase and lowercase mode names are accepted.
5. SaveMode is a Java enum type.
6. These two are equivalent: result.write.mode("default") and result.write.mode(SaveMode.ErrorIfExists).
7. Append writes another copy each time, so a re-run leaves two copies of the data. A sketch of the save modes follows.
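A minimal sketch of the four save modes, assuming result is an existing DataFrame and the target path is hypothetical:

import org.apache.spark.sql.SaveMode

result.write.mode(SaveMode.ErrorIfExists).save("file:///root/test/out/") // default: fail if the target exists
result.write.mode(SaveMode.Append).save("file:///root/test/out/")        // adds data; re-runs can leave duplicate copies
result.write.mode(SaveMode.Overwrite).save("file:///root/test/out/")     // clears the existing data first
result.write.mode(SaveMode.Ignore).save("file:///root/test/out/")        // skips the write if the target exists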
8. According to the official site, the driver property must be set when reading data over JDBC.
9. The properties are explained on the official site.
10. When reading over JDBC you can control how the data is partitioned: which column to partition on, its minimum and maximum values, how many partitions to use at most, and how many rows are fetched at a time. A sketch follows.
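A minimal sketch of a partitioned JDBC read covering items 8-10; the URL, table, credentials and bounds are hypothetical:

val jdbcDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")   // item 8: the driver property must be set
  .option("dbtable", "emp")
  .option("user", "root")
  .option("password", "root")
  .option("partitionColumn", "empno")          // numeric column to split on
  .option("lowerBound", "1")                   // smallest expected value of the partition column
  .option("upperBound", "10000")               // largest expected value of the partition column
  .option("numPartitions", "4")                // at most this many partitions read in parallel
  .option("fetchsize", "1000")                 // rows fetched per round trip
  .load()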
III. Using spark-sql
1. Add the required jar package when launching spark-sql. Note: if it cannot be loaded, you also have to add the last statement; it is a version problem.
2. spark-sql can load Hive tables directly; SparkSession has a table() method that turns a table into a DataFrame.
3. Code to load data over JDBC (see the sketch after item 4).
4. In join conditions, pay attention to the triple equals sign (===), otherwise an error is reported; also check the join condition itself. A sketch follows.
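A minimal sketch of items 2-4, assuming hypothetical Hive and MySQL tables that share a deptno column:

val hiveDF = spark.table("default.dept")                 // Hive table as a DataFrame
val mysqlDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/test")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("dbtable", "emp")
  .option("user", "root")
  .option("password", "root")
  .load()

// The triple equals builds a Column expression; a plain == yields a Boolean and the join will not compile.
val joined = mysqlDF.join(hiveDF, mysqlDF("deptno") === hiveDF("deptno"))
joined.show()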
IV. The mechanism of external data sources
1. PPD (predicate pushdown) optimization for external data sources.
2. To read an external data source more efficiently, implement three interfaces or subclasses: TableScan loads the data from the external source; BaseRelation (an abstract class, so it must be subclassed) defines the schema information; and a RelationProvider must be implemented as well, including for writing.
3. In terms of PPD optimization, TableScan corresponds to the first row of the figure above: everything is read out (a full scan).
Column pruning (PrunedScan) corresponds to the second row.
Pruning plus filtering (PrunedFilteredScan) corresponds to the third; the two diagrams differ in parameters but the function is the same.
4. The other two interfaces in the source code: one lets you write out the schema information, the other lets you write the data out; scan is for querying, insert is for writing, and the base relation loads the data source and its schema information.
5. The JDBC data source in the Spark source code implements these three interfaces/subclasses. A sketch of how the three pieces fit together follows.
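A minimal sketch (not the actual JDBC implementation): RelationProvider creates the relation, BaseRelation supplies the schema, and TableScan produces the rows. The package name and the hard-coded rows are hypothetical.

package com.example.datasource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  // RelationProvider: create the relation from the options passed by the reader.
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new DemoRelation(sqlContext)
}

class DemoRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  // BaseRelation: define the schema of the external data.
  override def schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // TableScan: load the data; a real source would read files or a remote system here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row(1, "spark"), Row(2, "sql")))
}

// Usage sketch: spark.read.format("com.example.datasource").load().show()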
This concludes the study of the mechanism of Spark SQL external data sources and the use of spark-sql. Pairing the theory with practice will help it stick, so go and try it!