
Spark SQL Notes (1): An Introduction to the Overall Background of Spark SQL


[TOC]

Basic overview

1. Spark officially introduced Spark SQL after Spark 1.0. Before that, the earliest SQL engine in the Hadoop ecosystem was Hadoop's own Hive: SQL queries were translated into MapReduce (MR2) jobs, so the underlying execution model was MapReduce, with Hive serving as the query engine on top.

2. Later, Spark provided Shark, and then replaced it with Spark SQL after Shark began to restrict the overall development of the project. Shark's performance is an order of magnitude higher than Hive's, and Spark SQL's performance is in turn an order of magnitude higher than Shark's.

3. The predecessor of SparkSQL is Shark. Hive emerged to give technicians who were familiar with RDBMS but did not understand MapReduce a quick entry point, and it was the only SQL-on-Hadoop tool running on Hadoop at the time. However, MapReduce computation writes a large amount of intermediate data to disk, which consumes a lot of I/O and reduces running efficiency. To improve the efficiency of SQL-on-Hadoop, a large number of SQL-on-Hadoop tools began to appear, among which the more prominent ones are:

MapR's Drill, Cloudera's Impala, and Shark.

4. But Hive has a fatal flaw: its underlying layer is based on MR2, and MR2's shuffle is disk-based, which makes Hive's performance extremely low. Complex SQL ETL jobs commonly take several hours or even dozens of hours to run.

5. Spark therefore launched Shark. Shark and Hive are in fact closely related; much of Shark's lower layer still relies on Hive, but three modules (memory management, physical planning, and execution) were rewritten to use Spark's memory-based computing model, improving performance by several times to hundreds of times compared with Hive.

Characteristics of Spark SQL

1. However, as Spark developed, Shark's heavy dependence on Hive (for example, its use of Hive's syntax parser and query optimizer) conflicted with the ambitious Spark team's established policy of "One Stack to Rule Them All" and restricted the integration of Spark's components, so the SparkSQL project was proposed. SparkSQL abandoned the original Shark code but absorbed some of Shark's advantages, such as in-memory columnar storage (In-Memory Columnar Storage) and Hive compatibility, and redeveloped the code from scratch. Freed from the dependence on Hive, SparkSQL gained great flexibility in data compatibility, performance optimization, and component extension.

2. The characteristics of Spark SQL:

1) Support for multiple data sources: Hive, RDD, Parquet, JSON, JDBC, etc.

2) A variety of performance optimization techniques: in-memory columnar storage, bytecode generation, dynamic cost-model evaluation, and so on.

3) Component extensibility: the SQL syntax parser, analyzer, and optimizer can all be redeveloped and dynamically extended by users.

Data compatibility: not only compatible with Hive, but also able to obtain data from RDDs, Parquet files, and JSON files; future versions will even support obtaining RDBMS data and NoSQL data such as Cassandra. A minimal sketch of reading several of these sources follows.
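As a hedged illustration of this data compatibility, the sketch below uses the modern SparkSession API; all paths, table names, and credentials in it are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object DataSourcesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-data-sources")
      .enableHiveSupport() // needed only for the Hive query below
      .getOrCreate()
    import spark.implicits._

    // Parquet and JSON files are read directly into DataFrames.
    val parquetDF = spark.read.parquet("/data/events.parquet")
    val jsonDF    = spark.read.json("/data/events.json")

    // A Hive table, queried through SQL.
    val hiveDF = spark.sql("SELECT * FROM my_hive_db.events")

    // JDBC: pull a table from an RDBMS (the driver must be on the classpath).
    val jdbcDF = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()

    // An RDD can also be turned into a DataFrame.
    val rddDF = spark.sparkContext
      .parallelize(Seq((1, "a"), (2, "b")))
      .toDF("id", "name")

    Seq(parquetDF, jsonDF, hiveDF, jdbcDF, rddDF).foreach(_.printSchema())
    spark.stop()
  }
}
```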

In terms of performance optimization, in addition to techniques such as In-Memory Columnar Storage and bytecode generation, a Cost Model will be introduced to dynamically evaluate queries and obtain the best physical plan.
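As a hedged sketch of how such a cost model is fed in practice: the cost-based optimizer and ANALYZE TABLE statistics shown below arrived in later Spark versions (2.2+), the table sales is hypothetical, and the snippet is written spark-shell style:

```scala
// spark-shell style: `spark` is the SparkSession provided by the shell.

// Enable the cost-based optimizer (off by default in Spark 2.x).
spark.conf.set("spark.sql.cbo.enabled", "true")

// Collect table- and column-level statistics that feed the cost model;
// the table `sales` is hypothetical.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, qty")

// explain(true) prints the parsed, analyzed, and optimized logical plans
// plus the physical plan chosen using those statistics.
spark.sql("SELECT SUM(price * qty) FROM sales").explain(true)
```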

In terms of component extension, the SQL syntax parser, analyzer, and optimizer can all be redefined and extended.
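One concrete extension point, as a sketch rather than SparkSQL's own code: Catalyst exposes spark.experimental.extraOptimizations, through which a user-written optimizer rule can be injected. The toy rule below rewrites expr + 0 to expr; the exact shape of the Add pattern may vary slightly across Spark versions:

```scala
// spark-shell style sketch: `spark` is the shell's SparkSession.
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// A toy user-defined optimizer rule: rewrite `expr + 0` to `expr`.
object RemoveAddZero extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan =
    plan transformAllExpressions {
      case a: Add if a.right == Literal(0) => a.left
    }
}

// Inject the rule into Catalyst's optimizer via the experimental hook.
spark.experimental.extraOptimizations = Seq(RemoveAddZero)
```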

Reynold Xin, the lead of both the Shark project and the SparkSQL project, announced on June 1, 2014 that development of Shark would stop and that the team would put all of its resources into the SparkSQL project. Shark's development thus came to an end, but it evolved into two separate lines: SparkSQL and Hive on Spark.

Among them, SparkSQL continues to develop as a member of the Spark ecosystem; it is no longer limited to Hive but merely compatible with it. Hive on Spark, by contrast, is a Hive development plan that takes Spark as one of Hive's underlying engines; that is, Hive is no longer limited to a single engine and can use MapReduce, Tez, Spark, and others.

Spark SQL performance

With the emergence of Shark, the performance of SQL-on-Hadoop rose to 10 to 100 times that of Hive.

So how does SparkSQL perform without the limitations of Hive? Although the improvement is not as impressive as Shark's over Hive, it still performs very well.

A Brief Introduction to Spark SQL Performance Optimization Techniques

SparkSQL stores table data in memory using in-memory columnar storage rather than the original JVM object storage model.

Memory column storage (in-memory columnar storage)

1. This storage mode has great advantages in both space consumption and read throughput.

With the original JVM object storage method, each object usually adds 12 to 16 bytes of overhead, so 270MB of data read into memory this way occupies roughly 970MB of memory (typically 2 to 5 times the native data size). Moreover, each data record produces one JVM object; with 200-byte records, a 32GB heap holds about 160 million objects, which can take GC several minutes to process (JVM garbage collection time is linearly related to the number of objects on the heap). Clearly, this kind of in-memory storage is too expensive for Spark, which is built on in-memory computing.

2. With in-memory columnar storage, columns of native data types are stored in native arrays, and complex data types supported by Hive (such as array and map) are serialized and concatenated into a byte array. This way only one JVM object is created per column, so GC is fast and the data layout is compact. In addition, compression methods with low CPU overhead (such as dictionary encoding and run-length encoding) can be applied to further reduce memory usage. Most interestingly, aggregations over specific columns, which are frequent in analytic queries, become much faster, because each column's data is stored contiguously and is easier to read into memory for computation.
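A small sketch of the columnar cache in action (spark-shell style; the data is synthetic). CACHE TABLE materializes a view in Spark SQL's compressed, in-memory columnar format rather than as one JVM object per row:

```scala
// spark-shell style: `spark` is the shell's SparkSession.
import spark.implicits._

// A synthetic table of one million (id, score) rows.
(1 to 1000000).map(i => (i, i * 0.5)).toDF("id", "score")
  .createOrReplaceTempView("t")

// Columnar compression (dictionary encoding, run-length encoding, ...)
// is on by default: spark.sql.inMemoryColumnarStorage.compressed = true.
spark.sql("CACHE TABLE t")

// Aggregating one column only has to scan that column's batches.
spark.sql("SELECT SUM(score) FROM t").show()
```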

One expensive operation in database queries is evaluating the expressions in the query statement, largely because of the JVM's memory model.

Bytecode generation technology (byte-code generation)

Spark SQL adds a codegen module to the expression evaluation of its Catalyst module. For expressions in SQL statements, such as select num + num from t, dynamic bytecode generation can be used to optimize their performance.
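The effect can be inspected directly; a spark-shell style sketch follows. Note that in modern Spark this surfaces as whole-stage code generation, with compiled stages marked by an asterisk in the physical plan:

```scala
// spark-shell style: `spark` is the shell's SparkSession.
import spark.implicits._
import org.apache.spark.sql.execution.debug._ // adds debugCodegen()

Seq(1, 2, 3).toDF("num").createOrReplaceTempView("t")
val df = spark.sql("SELECT num + num FROM t")

// Stages compiled to bytecode are prefixed with `*` in the plan.
df.explain()

// Print the Java source Spark generates and compiles for this query.
df.debugCodegen()
```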

Optimization of Scala Code Writing

In addition, when writing its own Scala code, SparkSQL tries to avoid inefficient, GC-prone constructs. Although this increases the difficulty of writing the code, the user-facing interface remains unified and easy to use.
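A hypothetical illustration of the coding style this refers to (not actual SparkSQL internals): the imperative version avoids the intermediate collection that the functional version allocates, leaving nothing extra for GC to collect.

```scala
object HotPathSketch {
  // Functional style: allocates an intermediate Array for map's output.
  def sumDoubledFunctional(xs: Array[Int]): Long =
    xs.map(x => x.toLong * 2).sum

  // Imperative style favored in hot paths: no intermediate allocations,
  // primitives stay unboxed throughout the loop.
  def sumDoubledImperative(xs: Array[Int]): Long = {
    var acc = 0L
    var i = 0
    while (i < xs.length) {
      acc += xs(i).toLong * 2
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    val data = Array.tabulate(1000000)(identity)
    println(sumDoubledFunctional(data) == sumDoubledImperative(data))
  }
}
```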
