2025-02-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article mainly explains what Spark SQL refers to. The content is easy to understand and clearly organized; hopefully it helps resolve your doubts. Below, the editor leads you through studying what Spark SQL means.
I. Introduction to Spark SQL
Spark SQL is a module of Apache Spark for processing structured data; it appeared after Spark 1.0. SQL is mainly found in relational databases. In big data, SQL on Hadoop began with Hive, but the large amount of intermediate data written to disk during MapReduce computation consumes a great deal of I/O and reduces running efficiency. In short, Hive is a highly stable but slow framework for offline batch processing, which is why other SQL on Hadoop tools emerged.
SQL on Hadoop
Hive -- converts submitted HQL statements into MapReduce jobs executed on YARN (metadata is important)
Impala -- an open-source, memory-based interactive SQL query engine
Presto -- a distributed SQL query engine
Shark -- translates SQL statements into Spark jobs; essentially Hive running on top of Spark, with a strong dependence on Hive and poor compatibility
Drill -- a query engine supporting SQL over files, HDFS, S3, and more
Phoenix -- a SQL engine on top of HBase
Hive on Spark is another route developed by the community; it belongs to the Hive development plan and uses Spark as Hive's execution engine. The Hive jobs mentioned earlier ran on Hadoop MapReduce; Hive is now no longer limited to one engine and can use MapReduce, Tez, Spark, and others.
II. Spark SQL characteristics
Integrated -- SQL queries can be mixed directly with Spark application code
Unified data access -- connects to a variety of data sources (Hive, Avro, Parquet, ORC, JSON, and JDBC)
Hive integration -- works with existing Hive metastores (via the hive-site configuration file) without requiring a full Hive installation
Standard connectivity -- via JDBC and ODBC; the underlying start-thriftserver also uses the Thrift protocol (HiveServer2 is likewise based on Thrift)
Spark SQL is not only SQL; it goes far beyond SQL.
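The "unified data access" idea can be sketched as a single read entry point dispatching to per-format readers. This is a plain-Python illustration of the pattern (loosely analogous to Spark's DataFrameReader), not Spark's actual API; the names `READERS` and `read` are made up for this sketch.

```python
import csv
import io
import json

# Hypothetical registry: one reader function per supported format.
READERS = {
    "json": lambda text: [json.loads(line) for line in text.splitlines() if line],
    "csv": lambda text: list(csv.DictReader(io.StringIO(text))),
}

def read(fmt, text):
    """Single entry point: dispatch to the reader registered for the format."""
    return READERS[fmt](text)
```

With one call site, adding a new source (say, Parquet) only means registering another reader, which is the essence of unified access.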
III. Advantages of Spark SQL
A: In-memory columnar storage (In-Memory Columnar Storage)
Spark SQL stores table data in memory in columnar form rather than as raw JVM objects.
Spark SQL's columnar storage keeps the values of a column of the same data type in a native array, while complex data types supported by Hive (such as array and map) are serialized and stored in a byte array. This way only one JVM object is created per column, which keeps GC fast and the data layout compact. In addition, efficient compression schemes with low CPU overhead (such as dictionary encoding and run-length encoding) can be applied to reduce memory usage. Even better, aggregations over specific columns, which are frequent in analytical queries, become much faster because the values of those columns sit together and are easier to read into memory for computation.
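The ideas above can be shown with a minimal sketch in plain Python (an illustration, not Spark's internal storage code): values of one column live in a single array, and cheap encodings shrink them further.

```python
def to_columns(rows, names):
    """Pivot row-oriented records into one array per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

def dictionary_encode(column):
    """Replace repeated values with small integer codes plus a lookup table."""
    codes, table, index = [], [], {}
    for v in column:
        if v not in index:
            index[v] = len(table)
            table.append(v)
        codes.append(index[v])
    return codes, table

def run_length_encode(column):
    """Collapse runs of identical values into [value, count] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Row-oriented input pivoted into columns, then compressed per column.
rows = [("cn", 1), ("cn", 2), ("us", 3), ("us", 4)]
cols = to_columns(rows, ["country", "id"])
codes, table = dictionary_encode(cols["country"])
```

An aggregation over `country` now only touches the `country` array, which is why column-local scans are so much cheaper than reading whole rows.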
B: Bytecode generation (Code Generation, CG)
An expensive operation in a database query is evaluating the expressions in the query statement, largely because of the JVM memory model. Take the following query:
Select a+b from table
The expression a+b in this query contains an Add; handled through the generic SQL evaluation path, it becomes an expression tree.
When the expression tree is evaluated physically, there are seven steps, as shown in the figure:
1. Calling the virtual function Add.eval() requires confirming the data types on both sides of Add
2. Calling the virtual function a.eval() requires confirming the data type of a
3. Confirm that a's data type is Int, and box it
4. Calling the virtual function b.eval() requires confirming the data type of b
5. Confirm that b's data type is Int, and box it
6. Call the Int-specific Add
7. Box the result and return it
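The cost of this tree-walking evaluation can be illustrated with a small Python sketch (an illustration only, not Spark's actual classes): each node's eval() is a dynamic dispatch with runtime type checks, mirroring the steps above, while code generation collapses the whole expression into one flat function.

```python
class Expr:
    """Base class for expression-tree nodes; eval() is a virtual call."""
    def eval(self, row):
        raise NotImplementedError

class Attribute(Expr):
    def __init__(self, name):
        self.name = name
    def eval(self, row):
        value = row[self.name]
        # Runtime type confirmation, analogous to steps 3 and 5 above.
        assert isinstance(value, int)
        return value

class Add(Expr):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def eval(self, row):
        # Two more virtual dispatches per Add node (steps 2 and 4).
        return self.left.eval(row) + self.right.eval(row)

# Interpreted path: SELECT a+b evaluated by walking the tree.
tree = Add(Attribute("a"), Attribute("b"))

# "Generated" path: the same expression as one direct function,
# with no per-node dispatch or type confirmation.
generated = lambda row: row["a"] + row["b"]
```

Both compute the same result; the generated form simply skips the per-node overhead, which is what CG buys in Spark SQL.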
C:Scala code optimization
IV. Spark SQL operating architecture
Catalyst is the core component of Spark SQL, and its quality directly affects overall performance. Because its development time was short, the dashed parts of the architecture diagram are functions planned for later versions, while the solid parts are already implemented.
Unresolved Logical Plan: the logical plan before resolution
Schema Catalog: metadata management; resolves the Unresolved Logical Plan into a Logical Plan
Logical Plan: the resolved logical plan
Optimized Logical Plan: the Logical Plan after optimization, from which physical plans are generated
Physical Plans: physical plans, possibly several; the best one is chosen based on a Cost Model
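The flow above can be sketched as a chain of plan transformations. This is a toy model in Python; the function names and plan representation are invented for illustration and are not Catalyst's actual classes.

```python
def resolve(plan, catalog):
    """Use the schema catalog to check that referenced columns exist."""
    table_columns = catalog[plan["table"]]
    assert all(c in table_columns for c in plan["columns"]), "unresolved attribute"
    return {**plan, "resolved": True}

def optimize(plan):
    """A single stand-in optimizer rule: deduplicate projected columns."""
    return {**plan, "columns": sorted(set(plan["columns"]))}

def physical_plans(plan):
    """Emit candidate physical plans, each with a made-up cost."""
    return [("full_scan", 100), ("columnar_scan", 10)]

def best_plan(candidates):
    """Cost model: pick the cheapest candidate."""
    return min(candidates, key=lambda p: p[1])

# Unresolved plan -> resolved -> optimized -> candidates -> chosen plan.
catalog = {"t": ["a", "b"]}
unresolved = {"table": "t", "columns": ["b", "a", "a"], "resolved": False}
optimized = optimize(resolve(unresolved, catalog))
chosen = best_plan(physical_plans(optimized))
```

Each stage consumes the previous stage's plan, which is the essential shape of the Catalyst pipeline described above.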
The above is the whole content of this article, "what does SparkSQL mean?". Thank you for reading! I believe everyone now has some understanding. I hope the shared content helps you; if you want to learn more, welcome to follow the industry information channel!