
Analysis of Spark Thrift JDBCServer Application Scenarios and Practical Cases


1 Preface

The Spark Thrift JDBCServer discussed here is not the JDBC approach described in many Spark articles online, where Spark writes data out to an RDB. Rather, it refers to Spark starting a process called thriftserver that provides JDBC connections to clients, which can then use SQL statements for query and analysis.

http://spark.apache.org/docs/2.3.3/sql-programming-guide.html#running-the-thrift-jdbcodbc-server

In the analysis that follows, I first walk through a basic evolution: why Spark Thrift JDBCServer is used, and where it sits in the big data ecosystem. With that understanding, using Spark Thrift JDBCServer will feel more "natural". Readers who have already worked with MapReduce, Hive, Spark On Yarn, Spark On Yarn With Hive, Spark SQL, and so on should be able to see its position and the reasons for its existence just from the official documentation. This part therefore introduces the roles and evolution of the related big data components, so that the position of Spark Thrift JDBCServer becomes clear.

In real work scenarios you may not build these environments yourself; perhaps you only need to connect to a Spark Thrift JDBCServer to use Spark SQL's analytical capabilities, or to develop some service middleware on top of it. Even so, you still need to understand the principles, and you may want to build a simple environment to try it out. I will outline how to run Spark Thrift JDBCServer on an existing pseudo-distributed Hadoop environment and what to pay attention to.

Most of this article is explained according to my personal understanding; if there are any mistakes, criticism and corrections are welcome.

2 The Evolution and Integration of the Hadoop and Spark Ecosystems from the Perspective of SQL-Based Big Data Analysis

No matter how complex the underlying technology of a big data product or platform is, the goal is ultimately to put the product into users' hands so that they can carry out big data analysis quickly and easily, process as much data as possible as fast as possible, and better mine the value of that data. SQL is one of the best tools and languages for data analysis, so most big data products, frameworks, and technologies provide a SQL interface. Looking at today's mainstream big data frameworks, this is indeed the case: Hive, Spark SQL, Elasticsearch SQL, Druid SQL, and so on.

Here is a brief introduction

2.1 Hadoop MapReduce

MapReduce is the distributed computing framework of Hadoop, combined with Hadoop's distributed storage HDFS, which makes large-scale batch data processing possible. Through the simple interface provided by MapReduce, users can quickly build distributed applications without knowing its underlying layer, which greatly improves the efficiency of developing distributed data processing programs.
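To make that "simple interface" concrete, here is a minimal word-count sketch against the classic org.apache.hadoop.mapreduce API; this is only an illustrative example, assuming Hadoop 2.x client libraries on the classpath, and the class name and paths are made up for the sketch:

package cn.example.mr; // illustrative package name

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // HDFS input/output paths are passed on the command line, e.g. /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}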

However, because the intermediate results of MapReduce are stored on disk in the process of data processing, its processing speed is very slow. Nevertheless, MapReduce will still be a good choice for large-scale offline data processing.

2.2 SQL On MapReduce: Hive

Although developing distributed programs against the interface provided by MapReduce is relatively simple, it still requires writing code, which imposes a significant learning cost on data analysts or operations staff who have never programmed. This is why Hive appeared.

Hive, known as SQL On Hadoop or SQL On MapReduce, is a data warehouse framework built on Hadoop. Put simply, on Hive you can write SQL statements to analyze your data just as in an RDB. Hive's interpreter converts the SQL statements into MapReduce jobs and submits them to Yarn to run. In this way, as long as you can write SQL statements, you can build powerful MapReduce distributed applications.

2.3 Hive JDBC: hiveserver2

Hive provides a command-line terminal. On the machine where Hive is installed, after configuring the metadata database and pointing Hive at the Hadoop configuration files, you can run the hive command to enter Hive's interactive terminal and then simply write SQL statements, much like the terminal provided by a traditional RDB database.

We know that traditional RDB databases such as MySQL not only provide an interactive terminal but can also be connected to from code, for example from Java through JDBC. After all, in real business scenarios the programming interface is used far more often than the interactive terminal.

Hive is similar. In addition to the CLI user interface, Hive provides a JDBC user interface, but to use it you must first start the hiveserver2 service. Once the service is running, you can continue to operate Hive in a CLI-like way through the beeline tool that ships with Hive (note that Hive is now being accessed through the JDBC interface), or by writing Java code yourself.

With hiveserver2, you can connect through Java JDBC and implement more varied and more complex business logic.
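As a minimal sketch of what such a connection looks like (assuming hiveserver2 is listening on the default localhost:10000, the hive-jdbc dependency is on the classpath, and a person table exists; all names here are only illustrative, and username/password depend on your authentication setup). The same jdbc:hive2:// URL style appears again in 3.1.3 when connecting to Spark Thrift JDBCServer.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveServer2JdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver shipped with hive-jdbc
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Connect to hiveserver2 on its default port 10000
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select * from person")) {
            while (rs.next()) {
                System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
            }
        }
    }
}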

2.4 Spark

Spark is also a distributed computing engine. It abstracts the data being processed as RDDs or Datasets held in memory, and intermediate results are also stored in memory, so it can be 10 to 100 times faster than MapReduce.

Based on the interface and various operators provided by Spark, it is very easy to develop a powerful distributed data processing program.
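For comparison with the MapReduce word count above, the same job written with Spark's Java API might look like the following minimal sketch (spark-core is assumed on the classpath; the class name, the input path and the local[*] master are only illustrative):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        // local[*] just for a quick local run; on a cluster this would be yarn, etc.
        SparkConf conf = new SparkConf().setAppName("spark word count").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile("hdfs:///input/words.txt"); // illustrative path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
                    .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1)
                    .reduceByKey(Integer::sum);                                    // sum per word
            counts.take(10).forEach(t -> System.out.println(t._1() + "\t" + t._2()));
        }
    }
}

A handful of operators replaces the Mapper, Reducer and Job driver, and the intermediate results stay in memory.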

2.5 Spark SQL

When using Spark's basic functionality you still operate it through code. To make Spark more convenient to use, it also provides a SQL-oriented interface: Spark SQL.

This looks very similar to the CLI that Hive provides on top of MapReduce, but unlike Hive, you still need to write some code for table creation and metadata setup before you can use Spark SQL statements to operate on tables. With Hive, anyone who has used SQL will find things familiar: you can directly write SQL statements to create tables, write data, and analyze data, without any additional code.
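The extra code that plain Spark SQL needs before any SQL can be written might look like the following minimal sketch (spark-sql is assumed on the classpath; the bean, class names and sample data are only illustrative):

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlTempViewExample {

    // Simple Java bean from which Spark infers the schema
    public static class Person implements java.io.Serializable {
        private int id;
        private String name;
        public Person() { }
        public Person(int id, String name) { this.id = id; this.name = name; }
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark sql temp view")
                .master("local[*]")
                .getOrCreate();

        // The "table" only exists after we register it programmatically
        Dataset<Row> people = spark.createDataFrame(
                Arrays.asList(new Person(1, "xpleaf")), Person.class);
        people.createOrReplaceTempView("person");

        // Only now can SQL be run against it
        spark.sql("select id, name from person").show();

        spark.stop();
    }
}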

2.6 Spark SQL On Hive

How can the awkwardness of plain Spark SQL described above be avoided? One branch of Spark SQL is Spark on Hive: it reuses Hive's logic for HQL parsing, logical-plan generation, and plan optimization, so you can roughly think of it as only the physical execution plan being replaced, from an MR job to a Spark job. Integrating Spark SQL with Hive means obtaining the metadata of the Hive tables and then operating on the data through Spark SQL.
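In code, Spark SQL on Hive essentially boils down to enabling Hive support on the SparkSession, roughly as in the following sketch (spark-hive is assumed on the classpath and hive-site.xml must be visible to Spark; the person table reuses the earlier illustrative example):

import org.apache.spark.sql.SparkSession;

public class SparkSqlOnHiveExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark sql on hive")
                // Uses hive-site.xml (classpath / $SPARK_HOME/conf) to locate the Hive metastore
                .enableHiveSupport()
                .getOrCreate();

        // DDL and queries now go through Hive's metadata, while execution stays in Spark
        spark.sql("show tables").show();
        spark.sql("select * from person").show();

        spark.stop();
    }
}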

2.7 Spark SQL JDBC: Spark Thrift JDBCServer

Like hiveserver2, Spark Thrift JDBCServer is a Spark process that, once started, can be connected to through Java JDBC code; it is essentially a Spark Application.

Spark Thrift JDBCServer itself can also be integrated with Hive.

The use of Spark Thrift JDBCServer is based on the following and other considerations:

1. Hope to use SQL for data analysis;
2. Be able to connect through Java JDBC;
3. Based on memory computing, fast data processing;
4. Can be integrated with Hive;
5. Can schedule resources based on Yarn.

2.8 Integration of Spark, Hadoop and Hive

Spark applications are now generally deployed on Hadoop's Yarn for scheduling, although Spark itself also provides a standalone deployment mode.

When using Spark SQL, because most data is stored on HDFS and Hive itself operates on data on HDFS, Spark SQL and Hive are generally used together, as mentioned in 2.6: the metadata comes from Hive tables, while the computing engine that processes the data is Spark.

When you want to use Spark SQL's capabilities through Java JDBC, you can use Spark Thrift JDBCServer, and it itself can be integrated with Hive.

3 Spark Thrift JDBCServer practice

3.1 Spark Thrift JDBCServer Quick start

3.1.1 launch

It is very easy to use and requires almost no setup. The spark-2.3.3-bin-hadoop2.6.tgz release is used here, and the download link is as follows:

https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.6.tgz

A Chinese Apache mirror is used here; the download speed is very fast, and it is recommended: https://mirrors.tuna.tsinghua.edu.cn/apache/

After unzipping the downloaded installation package, start directly:

$ cd sbin/
$ ./start-thriftserver.sh

It listens on port 10000 by default:

$ lsof -i :10000
COMMAND  PID USER      FD   TYPE DEVICE             SIZE/OFF NODE NAME
java    1414 yeyonghao 407u  IPv6 0x3cb645c07427abbb      0t0  TCP *:ndmp (LISTEN)

As mentioned earlier, it is essentially an Application of Spark, so you can see that port 4040 is also started at this time:

$ lsof -i :4040
COMMAND  PID USER      FD   TYPE DEVICE             SIZE/OFF NODE NAME
java    1414 yeyonghao 270u  IPv6 0x3cb645c07427d3fb      0t0  TCP *:yo-main (LISTEN)

Using the jps command to view, you can see that there is a SparkSubmit process:

$ jps
901  SecondaryNameNode
1445 Jps
806  DataNode
1414 SparkSubmit
729  NameNode
1132 NodeManager
1053 ResourceManager

I have also started the pseudo-distributed environment of Hadoop here.

You might as well open a browser and take a look at the page on port 4040:

The page should look quite familiar. Note the name in the upper right corner: Thrift JDBC/ODBC Server. Starting the Thriftserver essentially submits a Spark Application! (If you have used Spark Shell, you know that Spark Shell is also a Spark Application.)

3.1.2 connecting using beeline

So how do you connect? Spark provides a beeline connection tool.

$ cd bin/
$ ./beeline

Then connect to the Thriftserver:

Beeline version 1.2.1.spark2 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
2019-07-13 15:58:40 INFO Utils:310 - Supplied authorities: localhost:10000
2019-07-13 15:58:40 INFO Utils:397 - Resolved authority: localhost:10000
2019-07-13 15:58:40 INFO HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

Various SQL operations can then be performed:

0: jdbc:hive2://localhost:10000> create table person
0: jdbc:hive2://localhost:10000> (
0: jdbc:hive2://localhost:10000>   id int,
0: jdbc:hive2://localhost:10000>   name string
0: jdbc:hive2://localhost:10000> );
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.116 seconds)
0: jdbc:hive2://localhost:10000> insert into person values (1, 'xpleaf');
+---------+
| Result  |
+---------+
+---------+
No rows selected (1.664 seconds)
0: jdbc:hive2://localhost:10000> select * from person;
+-----+---------+
| id  |  name   |
+-----+---------+
| 1   | xpleaf  |
+-----+---------+
1 row selected (0.449 seconds)

At this time, go to the 4040 page mentioned earlier and have a look:

You can see that our operations have actually been turned into Jobs in the Spark Application.

3.1.3 use Java JDBC to connect

Since it is a JDBC service, it can of course be operated through Java code.

Create a Maven project and add the following dependencies:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

The code is as follows:

package cn.xpleaf.spark;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * @author xpleaf
 * @date 2019-07-13 4:06 PM
 */
public class SampleSparkJdbcServer {

    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Connect to the Spark Thrift JDBCServer on its default port
        Connection connection = DriverManager.getConnection("jdbc:hive2://localhost:10000");
        Statement statement = connection.createStatement();
        String sql = "select * from person";
        ResultSet resultSet = statement.executeQuery(sql);

        while (resultSet.next()) {
            int id = resultSet.getInt("id");
            String name = resultSet.getString("name");
            System.out.println(String.format("id: %s, name: %s", id, name));
        }
    }
}

The running result after startup is as follows:

id: 1, name: xpleaf

3.1.4 Notes

The table created and the data written in the way shown above are kept in memory, so as soon as the thriftserver exits the data is lost. To persist the data, we integrate with Hive below.

3.2 Spark Thrift JDBCServer Integration with Hive

One obvious benefit of integrating Hive is that we can not only rely on HDFS for distributed, persistent storage of our data, but also rely on Spark's fast computing power to process the data quickly; in between, Hive acts as the "middleman", and in essence we use the metadata tables maintained by Hive.

3.2.1 Hive installation

The Hadoop environment needs to be built before installing Hive. I don't introduce how to build the Hadoop environment here. On my machine, I have built a pseudo-distributed environment of Hadoop.

$ jps
901  SecondaryNameNode
1557 RemoteMavenServer
806  DataNode
729  NameNode
1834 Jps
1547
1132 NodeManager
1053 ResourceManager

In fact, the three prerequisites for Hive installation are:

JDK    // Java environment
HADOOP // Hadoop environment
MySQL  // relational database, for persistent storage of Hive metadata

It is assumed here that all three are already in place.

The Apache mirror mentioned above can also be used to download Hive:

https://mirrors.tuna.tsinghua.edu.cn/apache/hive/hive-2.3.5/apache-hive-2.3.5-bin.tar.gz

That is, the version used here is 2.3.5.

After the download is complete, extract it to the specified directory, and then configure the relevant files.

(1) configure hive-env.sh

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home
export HADOOP_HOME=/Users/yeyonghao/app/hadoop
export HIVE_HOME=/Users/yeyonghao/app2/hive

(2) configure hive-site.xml

<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>root</value>
    </property>
    <property>
        <name>hive.querylog.location</name>
        <value>/Users/yeyonghao/app2/hive/tmp</value>
    </property>
    <property>
        <name>hive.exec.local.scratchdir</name>
        <value>/Users/yeyonghao/app2/hive/tmp</value>
    </property>
    <property>
        <name>hive.downloaded.resources.dir</name>
        <value>/Users/yeyonghao/app2/hive/tmp</value>
    </property>
</configuration>

(3) copy the mysql driver to the $HIVE_HOME/lib directory

Download directly from maven:

~/app2/hive/lib$ wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.39/mysql-connector-java-5.1.39.jar

(4) initialize the Hive metastore database

~/app2/hive/bin$ ./schematool -initSchema -dbType mysql -userName root -passWord root

After success, you can see the created hive database and related tables in mysql:

mysql> use hive;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+---------------------------+
| Tables_in_hive            |
+---------------------------+
| AUX_TABLE                 |
| BUCKETING_COLS            |
| CDS                       |
| COLUMNS_V2                |
| COMPACTION_QUEUE          |
| COMPLETED_COMPACTIONS     |
| COMPLETED_TXN_COMPONENTS  |
| DATABASE_PARAMS           |
| DBS                       |
| DB_PRIVS                  |
| DELEGATION_TOKENS         |
| FUNCS                     |
| FUNC_RU                   |
| GLOBAL_PRIVS              |
| HIVE_LOCKS                |
| IDXS                      |
| INDEX_PARAMS              |
| KEY_CONSTRAINTS           |
| MASTER_KEYS               |
| NEXT_COMPACTION_QUEUE_ID  |
| NEXT_LOCK_ID              |
| NEXT_TXN_ID               |
| NOTIFICATION_LOG          |
| NOTIFICATION_SEQUENCE     |
| NUCLEUS_TABLES            |
| PARTITIONS                |
| PARTITION_EVENTS          |
| PARTITION_KEYS            |
| PARTITION_KEY_VALS        |
| PARTITION_PARAMS          |
| PART_COL_PRIVS            |
| PART_COL_STATS            |
| PART_PRIVS                |
| ROLES                     |
| ROLE_MAP                  |
| SDS                       |
| SD_PARAMS                 |
| SEQUENCE_TABLE            |
| SERDES                    |
| SERDE_PARAMS              |
| SKEWED_COL_NAMES          |
| SKEWED_COL_VALUE_LOC_MAP  |
| SKEWED_STRING_LIST        |
| SKEWED_STRING_LIST_VALUES |
| SKEWED_VALUES             |
| SORT_COLS                 |
| TABLE_PARAMS              |
| TAB_COL_STATS             |
| TBLS                      |
| TBL_COL_PRIVS             |
| TBL_PRIVS                 |
| TXNS                      |
| TXN_COMPONENTS            |
| TYPES                     |
| TYPE_FIELDS               |
| VERSION                   |
| WRITE_SET                 |
+---------------------------+
57 rows in set (0.00 sec)

(5) Hive test

Start Hive Cli:

~/app2/hive/bin$ ./hive

Create related tables and write data:

hive> show databases;
OK
default
Time taken: 0.937 seconds, Fetched: 1 row(s)
hive> show tables;
OK
Time taken: 0.059 seconds
hive> create table person
    > (
    >   id int,
    >   name string
    > );
OK
Time taken: 0.284 seconds
hive> insert into person values (1, 'xpleaf');
...MapReduce job submission output omitted...
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  HDFS Read: 4089 HDFS Write: 79 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 17.54 seconds
hive> select * from person;
OK
1	xpleaf
Time taken: 0.105 seconds, Fetched: 1 row(s)

3.2.2 Spark Thriftserver Integration with Hive

The official documentation describes this part as follows:

Configuration of Hive is done by placing your hive-site.xml, core-site.xml and hdfs-site.xml files in conf/.

That is, you place Hive's configuration file hive-site.xml, together with Hadoop's core-site.xml and hdfs-site.xml, into Spark's conf/ directory.

After that, start the Thriftserver:

~/app2/spark/sbin$ ./start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/yeyonghao/app2/spark/logs/spark-yeyonghao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-yeyonghaodeMacBook-Pro.local.out

But if you check, you will find that it did not actually start:

$ lsof -i :10000

Check the startup log and see that the error message is as follows:

Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
    at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
    at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
    at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
    ... 91 more

That is, the MySQL driver cannot be found. You can copy the driver from Hive's lib directory to Spark's jars directory:

cp ~/app2/hive/lib/mysql-connector-java-5.1.39.jar ~/app2/spark/jars/

Then start it again, and when you look at the log, you find that it still reports an error:

Caused by: MetaException (message:Hive Schema version 1.2.0 does not match metastore's schema version 2.1.0 Metastore is not upgraded or corrupt)

The reason can be seen from the Hive jars shipped in Spark's jars directory:

~/app2/spark/jars$ ls hive-*

hive-beeline-1.2.1.spark2.jar
hive-cli-1.2.1.spark2.jar
hive-exec-1.2.1.spark2.jar
hive-jdbc-1.2.1.spark2.jar
hive-metastore-1.2.1.spark2.jar

Obviously they are all versions of hive 1.x.

But the Hive I installed is version 2.x, and there is a VERSION table in mysql that holds its version 2.1.0.

Reference: https://yq.aliyun.com/articles/624494

Here I turn off version verification in the hive-site.xml of spark:

<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
</property>

After the modification is completed, you can see the log information that started successfully:

2019-07-13 17:16:47 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1774c4e2{/sqlserver/session,null,AVAILABLE,@Spark}
2019-07-13 17:16:47 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@f0381f0{/sqlserver/session/json,null,AVAILABLE,@Spark}
2019-07-13 17:16:47 INFO ThriftCLIService:98 - Starting ThriftBinaryCLIService on port 10000 with 5...500 worker threads

Take a look at the port number:

~/app2/spark/sbin$ lsof -i :10000
COMMAND  PID USER      FD   TYPE DEVICE             SIZE/OFF NODE NAME
java    5122 yeyonghao 317u  IPv6 0x3cb645c07a5bcbbb      0t0  TCP *:ndmp (LISTEN)

3.2.3 launch beeline for testing

Here we start beeline to operate:

~/app2/spark/bin$ ./beeline
Beeline version 1.2.1.spark2 by Apache Hive

Previously we created a person table in Hive. If the integration with Hive succeeded, we should also see it here, since the same metastore is shared. View the data as follows:

beeline> !connect jdbc:hive2://localhost:10000
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
2019-07-13 17:20:02 INFO Utils:310 - Supplied authorities: localhost:10000
2019-07-13 17:20:02 INFO Utils:397 - Resolved authority: localhost:10000
2019-07-13 17:20:02 INFO HiveConnection:203 - Will try to open client transport with JDBC Uri: jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.3.3)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000> show tables;
+-----------+------------+--------------+
| database  | tableName  | isTemporary  |
+-----------+------------+--------------+
| default   | person     | false        |
+-----------+------------+--------------+
1 row selected (0.611 seconds)
0: jdbc:hive2://localhost:10000> select * from person;
+-----+---------+
| id  |  name   |
+-----+---------+
| 1   | xpleaf  |
+-----+---------+
1 row selected (1.842 seconds)

As you can see, everything works. Now take another look at port 4040:

Here we create another person2 table:

0: jdbc:hive2://localhost:10000> create table person2
0: jdbc:hive2://localhost:10000> (
0: jdbc:hive2://localhost:10000>   id int,
0: jdbc:hive2://localhost:10000>   name string
0: jdbc:hive2://localhost:10000> );
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.548 seconds)

At this point you can look at the MySQL database where the metadata is stored; the information about the tables we created is kept in the TBLS table:

mysql> select * from TBLS;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER     | RETENTION | SD_ID | TBL_NAME | TBL_TYPE      | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+
|      1 |  1563008351 |     1 |                0 | yeyonghao |         0 |     1 | person   | MANAGED_TABLE | NULL               | NULL               |
|      6 |  1563009667 |     1 |                0 | yeyonghao |         0 |     6 | person2  | MANAGED_TABLE | NULL               | NULL               |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+--------------------+--------------------+
2 rows in set (0.00 sec)

You can see that the person2 table is already recorded, indicating that the integration of Thriftserver and Hive has succeeded.

3.3 Going further: Spark Thrift JDBCServer On Yarn With Hive

Hive was already integrated in 3.2; here we additionally integrate with Yarn.

3.3.1 principles of deploying Thriftserver to Yarn

As mentioned earlier, Thriftserver is essentially a Spark Application, so we can also specify master as yarn when starting it. In effect, we deploy the Thriftserver Spark Application to Yarn and let Yarn allocate resources for it and schedule the execution of its jobs.

The official document states this as follows:

This script accepts all bin/spark-submit command line options, plus a --hiveconf option to specify Hive properties. You may run ./sbin/start-thriftserver.sh --help for a complete list of all available options. By default, the server listens on localhost:10000. You may override this behaviour via either environment variables, i.e.:

That is, the parameters accepted by the spark-submit script are also accepted by start-thriftserver.sh.

3.3.2 specify master as yarn to start Thriftserver

Now, use the following startup method:

~/app2/spark/sbin$ ./start-thriftserver.sh --master yarn
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /Users/yeyonghao/app2/spark/logs/spark-yeyonghao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-yeyonghaodeMacBook-Pro.local.out
failed to launch: nice -n 0 bash /Users/yeyonghao/app2/spark/bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server --master yarn
Spark Command: /Library/Java/JavaVirtualMachines/jdk1.8.0_181.jdk/Contents/Home/bin/java -cp /Users/yeyonghao/app2/spark/conf/:/Users/yeyonghao/app2/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --master yarn --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --name Thrift JDBC/ODBC Server spark-internal
========================================
Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
    at org.apache.spark.deploy.SparkSubmitArguments.validateSubmitArguments(SparkSubmitArguments.scala:288)
    at org.apache.spark.deploy.SparkSubmitArguments.validateArguments(SparkSubmitArguments.scala:248)
    at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:120)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:130)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
full log in /Users/yeyonghao/app2/spark/logs/spark-yeyonghao-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-yeyonghaodeMacBook-Pro.local.out

You can see the error message, the key is:

When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.

Add directly to the spark-env.sh:

HADOOP_CONF_DIR=/Users/yeyonghao/app/hadoop/etc/hadoop
YARN_CONF_DIR=/Users/yeyonghao/app/hadoop/etc/hadoop

After that, it starts without problems. You can see from the log that Thriftserver is in essence submitted to Yarn as a Spark Application, which Yarn then schedules:

2019-07-13 17:35:22 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:22 INFO Client:54 -
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: N/A
     ApplicationMaster RPC port: -1
     queue: default
     start time: 1563010521752
     final status: UNDEFINED
     tracking URL: http://192.168.1.2:8088/proxy/application_1563008220920_0002/
     user: yeyonghao
2019-07-13 17:35:23 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:24 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:25 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:26 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:27 INFO Client:54 - Application report for application_1563008220920_0002 (state: ACCEPTED)
2019-07-13 17:35:28 INFO YarnClientSchedulerBackend:54 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> 192.168.1.2, PROXY_URI_BASES -> http://192.168.1.2:8088/proxy/application_1563008220920_0002), /proxy/application_1563008220920_0002
2019-07-13 17:35:28 INFO JettyUtils:54 - Adding filter org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter to /jobs, /jobs/json, /jobs/job, /jobs/job/json, /stages, /stages/json, /stages/stage, /stages/stage/json, /stages/pool, /stages/pool/json, /storage, /storage/json, /storage/rdd, /storage/rdd/json, /environment, /environment/json, /executors, /executors/json, /executors/threadDump, /executors/threadDump/json, /static, /, /api, /jobs/job/kill, /stages/stage/kill.
2019-07-13 17:35:28 INFO YarnSchedulerBackend$YarnSchedulerEndpoint:54 - ApplicationMaster registered as NettyRpcEndpointRef(spark-client://YarnAM)
2019-07-13 17:35:28 INFO Client:54 - Application report for application_1563008220920_0002 (state: RUNNING)
2019-07-13 17:35:28 INFO Client:54 -
     client token: N/A
     diagnostics: N/A
     ApplicationMaster host: 192.168.1.2
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1563010521752
     final status: UNDEFINED
     tracking URL: http://192.168.1.2:8088/proxy/application_1563008220920_0002/
     user: yeyonghao
2019-07-13 17:35:28 INFO YarnClientSchedulerBackend:54 - Application application_1563008220920_0002 has started running.

You can check port 8088 and take a look at the information of this Application:

After we connect to the Thriftserver, the operations we perform are converted into the corresponding Jobs in the Spark Application (note that these are Jobs of a Spark Application; a Job may contain multiple Stages, and a Stage multiple Tasks; readers unfamiliar with this can first learn about Spark Core), while resource scheduling is handled by Yarn.

If you access the original port 4040, you are redirected to Yarn's proxy page for the Application, but the interface is still the familiar Spark UI, as follows:

Later, when you run operations through beeline or JDBC, you can see the Job information here:

You can also see the session information:

4 Summary: do you need to use Spark Thrift JDBCServer?

In a production environment you are more likely to see Spark SQL integrated with Hive, or the Spark Thrift JDBCServer On Yarn With Hive setup described in 3.3. In either case, the core motivations are the following:

1. Support analyzing the data with SQL;
2. Distributed data storage based on HDFS;
3. Faster data processing.

If, on top of that, you need to provide JDBC connectivity, you can consider using Spark Thrift JDBCServer.

5 More big data platforms that support SQL analysis

Beyond the platforms and frameworks mentioned above that support SQL-based analysis, Storm SQL and Flink SQL are also very popular at the moment.

In addition, big data frameworks such as Elasticsearch and Druid, which integrate data storage and analysis, are also very popular and support SQL queries.

Based on the author's long experience with Elasticsearch: although Elasticsearch started out as a full-text retrieval framework positioned against Solr, with each version iteration it has added more and more data-analysis operators, and since version 6.0 it has even added SQL query and analysis capabilities directly. Although this capability is still relatively weak, over time Elasticsearch SQL will surely become more powerful!
