
How to understand SQL and Thrift


Today I would like to talk about how to understand SQL and Thrift. Many people may not know much about this topic, so to help you understand it better, I have summarized the following. I hope you get something out of this article.

1. Hive SQL & Spark SQL

This is a complicated history, essentially a "Ship of Theseus" story. In the beginning, almost all of the Spark SQL code was copied from Hive, and over time the Hive code was gradually replaced until almost no original Hive code remained.

Reference: https://en.wikipedia.org/wiki/Ship_of_Theseus

Spark initially shipped with Shark and SharkServer (a combination of Spark and Hive). At that time this combination contained a great deal of Hive code: SharkServer was essentially Hive, parsing HiveQL, optimizing in Hive, and reading Hadoop input formats, and Shark even ran Hadoop-style MapReduce tasks on the Spark engine. This was actually cool at the time, because it provided a way to use Spark without any programming; HQL was enough. Unfortunately, MapReduce and Hive did not integrate well into the Spark ecosystem, and in July 2014 the community announced that Shark development would end at Spark 1.0, as Spark began to shift toward more Spark-native SQL expressions. At the same time, the community shifted its focus to developing native Spark SQL, and provided Hive on Spark as a transition solution for existing Hive users to migrate their Hive jobs onto the Spark engine.

Reference:
https://github.com/amplab/shark/wiki/Shark-User-Guide
https://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html

Spark then introduced SchemaRDDs, DataFrames and Datasets to represent distributed datasets. On top of these, a new Spark-native optimization engine called Catalyst was introduced: a tree manipulation framework that provides the basis for all query optimization, from GraphFrames to Structured Streaming. The advent of Catalyst meant Spark could abandon MapReduce-style job execution and instead build and run Spark-optimized execution plans. In addition, Spark released a new API for building Spark-aware connectors called "DataSources". The flexibility of DataSources ended Spark's dependence on Hadoop input formats (although they are still supported). A DataSource can access the query plan generated by Spark and perform predicate pushdown and other optimizations.
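As a rough illustration of the DataFrame API and DataSource-level predicate pushdown described above, here is a minimal Scala sketch; the file path and column names are hypothetical, and the exact plan output depends on the Spark version.

import org.apache.spark.sql.SparkSession

object CatalystPushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("catalyst-pushdown-demo")
      .master("local[*]")
      .getOrCreate()

    // Parquet is read through the DataSource API rather than a Hadoop InputFormat.
    val events = spark.read.parquet("/tmp/events.parquet") // hypothetical path

    // Catalyst builds an optimized plan; for Parquet the filter typically appears
    // as a "PushedFilters" entry in the physical plan printed by explain().
    val recent = events
      .filter("event_date >= '2019-01-01'") // hypothetical column
      .select("user_id", "event_date")
    recent.explain(true)

    spark.stop()
  }
}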

The Hive parser has been replaced by the Spark parser. Spark SQL still supports HQL, but the syntax has been greatly extended: Spark SQL can now run all TPC-DS queries as well as a series of Spark-specific extensions. (There was a period during development when you had to choose between HiveContext and SqlContext, each with its own parser, but that distinction is gone; today everything starts from SparkSession.) Almost no Hive code is left in Spark now. Although the SQL Thrift Server is still built on HiveServer2 code, almost all of the internal implementation is completely Spark-native.
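A minimal sketch of what "everything starts from SparkSession" looks like in practice; the table name web_logs and the partition value are hypothetical, and enableHiveSupport() assumes a reachable Hive metastore.

import org.apache.spark.sql.SparkSession

object SparkSqlHqlDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-hql-demo")
      .enableHiveSupport() // lets Spark SQL see tables registered in the Hive metastore
      .getOrCreate()

    // HQL-compatible syntax, parsed and optimized by Spark itself and executed on the Spark engine.
    spark.sql("SELECT count(*) FROM web_logs WHERE dt = '2019-06-01'").show()

    spark.stop()
  }
}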

2. Spark Thrift Server introduction

Thrift Server is the service Spark provides for JDBC/ODBC access to Spark SQL. It is implemented on top of HiveServer2 from Hive 1.2.1, but the underlying SQL execution is handled by Spark, and the service is started with spark-submit. Through the Spark Thrift JDBC/ODBC interface you can also directly access Hive tables in the same Hadoop cluster, by configuring the Thrift service to point to the metastore service used by Hive.

Reference: http://spark.apache.org/docs/latest/sql-distributed-sql-engine.html#running-the-thrift-jdbcodbc-server
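As a sketch of how a client reaches Spark SQL through this service, the snippet below connects over JDBC using the HiveServer2 protocol; the host name, the default port 10000, and the credentials are assumptions, and the Hive JDBC driver jar must be on the classpath.

import java.sql.DriverManager

object SparkThriftJdbcDemo {
  def main(args: Array[String]): Unit = {
    // Spark Thrift Server speaks the HiveServer2 protocol, so the standard Hive JDBC driver is used.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thrift-host:10000/default", // hypothetical host; 10000 is the usual default port
      "user", "")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery("SHOW DATABASES")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}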

3. Defects of Spark Thrift

1. User impersonation is not supported; that is, Thrift Server cannot execute query statements as the user who submitted the query rather than as the user who started Thrift Server. The corresponding Hive parameter hive.server2.enable.doAs is not supported. Reference:

https://issues.apache.org/jira/browse/SPARK-5159
https://issues.apache.org/jira/browse/SPARK-11248
https://issues.apache.org/jira/browse/SPARK-21918

2. Because user impersonation is not supported (the first point above), every query runs as the same user, so there is no way to control permissions for Spark SQL.

3. It is a single point of failure: all Spark SQL queries go through the same Spark Driver on a single Spark Thrift node, and any failure causes all jobs on that one Spark Thrift node to fail, requiring a restart of Spark Thrift Server.

4. For the same reason as the third point, all queries go through one Spark Driver, which makes the Driver a bottleneck and limits the concurrency of Spark SQL jobs.

Because the main limitations above concern security (the first and second points), the enterprise edition of CDH does not package the Spark Thrift service when packaging Spark. If you want to use the Spark Thrift service in CDH, you need to package or add it yourself, and Cloudera does not officially provide support for it. You can refer to the following jira:

https://issues.cloudera.org/browse/DISTRO-817

For the defects of Spark Thrift, you can also refer to NetEase's description:

That is also why NetEase built a Thrift service called Kyuubi, which Fayson will use in later articles. Refer to:

http://blog.itpub.net/31077337/viewspace-2212906/

4. Using Spark Thrift in existing CDH5

From CDH5.10 to the latest CDH5.16.1, you can install Spark1.6 alongside the latest Spark2.x; Spark2, from Spark2.0 through the latest Spark2.4, can be installed into CDH5. For more information, refer to Cloudera's official documentation:

https://www.cloudera.com/documentation/spark2/latest/topics/spark2_requirements.html#cdh_versions

The following version combinations have been verified and are in use for running the Thrift service via a separate installation in CDH5:

1. To install the Spark1.6 Thrift service in CDH5, refer to "0079-How to enable Spark Thrift in CDH".

2. To install the Spark2.1 Thrift service in CDH5, refer to "0280-How to deploy the Spark2.1 Thrift and spark-sql clients in a CDH cluster in a Kerberos environment".

From Spark2.2 up to the latest Spark2.4, the changes are so large that the two approaches above (directly replacing jar packages) no longer work; additional dependency problems mean that using the latest Spark2.4 Thrift in CDH5 requires recompiling or modifying many more things.

After reading the above, do you have a better understanding of SQL and Thrift? If you want to learn more, please follow the industry information channel. Thank you for your support.
