

Analysis and Solutions for Hive Job Management


Many people with little experience do not know how to analyze or manage Hive jobs, so this article summarizes the problem and its possible solutions; hopefully it will help you solve this problem.

Hive Task Management

To display Hive tasks, the Hive query ID must be associated with the corresponding MapReduce job IDs; and to kill a task, every job belonging to that Hive statement must be killed.
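To make the goal concrete, here is a minimal sketch (not from the original article) of killing all YARN applications behind one Hive statement, assuming the query-to-job-id mapping has already been collected elsewhere, for example from Hive logs; the class and method names are illustrative only.

```java
// A MapReduce job "job_X_Y" runs as YARN application "application_X_Y".
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HiveQueryKiller {
  public static void killQuery(List<String> mrJobIdsOfQuery) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();
    try {
      for (String jobId : mrJobIdsOfQuery) {
        // e.g. "job_1493083994846_0010" -> "application_1493083994846_0010"
        String appId = jobId.replace("job_", "application_");
        // ApplicationId.fromString needs Hadoop 2.8+; older releases use
        // ConverterUtils.toApplicationId instead.
        yarnClient.killApplication(ApplicationId.fromString(appId));
      }
    } finally {
      yarnClient.stop();
    }
  }
}
```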

1. Related Knowledge

1.1 Basic Knowledge

Hive is a data warehouse tool based on Hadoop. It maps structured data files to database tables and provides SQL-like query functionality; in essence, it converts SQL into MapReduce programs.

1.2 Thrift Services

Thrift is a software framework developed by Facebook for building scalable, cross-language services. Hive integrates this service, allowing different programming languages to call Hive interfaces.

1.3 HiveServer/HiveServer2

Both allow remote clients to use multiple programming languages. Through HiveServer or HiveServer2, a client can operate on the data in Hive without starting the CLI, and can submit requests to Hive in languages such as Java and Python and retrieve the results (HiveServer is no longer supported as of Hive 0.15).

Both HiveServer and HiveServer2 are based on Thrift, but HiveServer is sometimes called the Thrift server, while HiveServer2 is not.

Why do you need HiveServer2 when HiveServer already exists?

This is because HiveServer cannot handle concurrent requests from more than one client. This limitation comes from the Thrift interface that HiveServer uses and cannot be fixed by modifying the HiveServer code. The problem was solved by rewriting HiveServer in the Hive 0.11.0 release, which produced HiveServer2.

HiveServer2 supports multi-client concurrency and authentication, and provides better support for open-API clients such as JDBC and ODBC.
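As an illustration of the JDBC support mentioned above, a minimal HiveServer2 JDBC client might look like this; the host T-162, port 10000, user hadoop, and the assumption of no password are placeholders modeled on the examples later in this article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // The HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://T-162:10000/default", "hadoop", "");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("show databases");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```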

1.4 Implementation Principle

1.5 User Interfaces

Hive command-line mode (CLI)

The CLI can run in two modes: as a simple client that connects directly to the Driver, or connected to hiveserver1 (hiveserver1 has been replaced by hiveserver2 and was removed in version 1.0.0). For security and other reasons, the Hive developers want Beeline to replace the CLI, or alternatively to rewrite the CLI's underlying implementation so the switch is seamless. See https://issues.apache.org/jira/browse/HIVE-10511 and https://cwiki.apache.org/confluence/display/Hive/Replacing+the+Implementation+of+Hive+CLI+Using+Beeline

Web mode of Hive (WUI)

The Hive client provides a way to access Hive services through web pages. This interface corresponds to Hive's hwi component (Hive Web Interface); the hwi service must be started before use. hwi was removed after version 2.2 because of its low usage.

Remote Service of Hive (Client)

Communicate with a separate HiveServer2 process through the Thrift protocol.

Beeline

HiveServer2 provides a new command line tool, Beeline, which is a JDBC client based on SQLLine CLI.

Beeline operates in two modes: local embedded mode and remote mode. In embedded mode it runs an embedded Hive (similar to the Hive CLI); in remote mode it communicates with a separate HiveServer2 process over the Thrift protocol.

2. Open-Source Solutions

2.1 Hive Falcon

In development work we submit Hadoop tasks, and the details of how they run are what we care about. When the business is simple, we can use the command-line tools Hadoop provides to manage tasks in YARN; when writing Hive SQL, we type statements into the Hive terminal and watch the MapReduce jobs run. In the long run this is inconvenient, and as the business grows more complex and the number of tasks increases, this workflow cannot keep up. At that point a monitoring system for Hive becomes particularly important: we need to observe the details of the MapReduce jobs behind Hive SQL and their status in YARN.

Hive Falcon monitors the tasks submitted to a Hadoop cluster and the details of their running status. The task details in YARN include task ID, submitter, task type, completion status, and so on. You can also write and run Hive SQL and see its execution details, and view the tables in the Hive warehouse along with their structures.

However, Hive Falcon cannot associate a Hive query with its YARN tasks, so it does not meet our requirements.

2.2 Zeppelin

To be confirmed

2.3 Ambari

To be confirmed

2.4 Hue

Hue's support is implemented via Thrift. Specifically, you can download the Hue source code; the relevant module is apps/beeswax, implemented in Python.

This implementation can only manage queries executed through the Hue interface; queries from other clients cannot be managed. The mapping between a query and its MapReduce jobs is obtained by analyzing logs, since Thrift has no corresponding interface. https://groups.google.com/a/cloudera.org/forum/#!topic/hue-user/wSDcTnZJqTg

Log sample

```
2017-04-26 13:54:24,068 INFO  ql.Driver (Driver.java:compile(411)) - Compiling command (queryId=hadoop_20170426135454_41352e71-5685-48e5-9e4d-c25819669666): select count(*) from t_function
2017-04-26 13:54:26,891 INFO  HiveMetaStore.audit (HiveMetaStore.java:logAuditEvent(388)) - ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=t_function
2017-04-26 13:54:37,444 INFO  ql.Driver (Driver.java:compile) - Semantic Analysis Completed
2017-04-26 13:54:37,677 INFO  ql.Driver (Driver.java:getSchema(245)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
2017-04-26 13:54:38,331 INFO  ql.Driver (Driver.java:compile) - Completed compiling command (queryId=hadoop_20170426135454_41352e71-5685-48e5-9e4d-c25819669666); Time taken: 14.712 seconds
2017-04-26 13:54:38,332 INFO  ql.Driver (Driver.java:execute(1448)) - Executing command (queryId=hadoop_20170426135454_41352e71-5685-48e5-9e4d-c25819669666): select count(*) from t_function
2017-04-26 13:54:38,334 INFO  ql.Driver (SessionState.java:printInfo(927)) - Query ID = hadoop_20170426135454_41352e71-5685-48e5-9e4d-c25819669666
2017-04-26 13:54:38,335 INFO  ql.Driver (SessionState.java:printInfo(927)) - Total jobs = 1
2017-04-26 13:54:38,516 INFO  ql.Driver (SessionState.java:printInfo(927)) - Launching Job 1 out of 1
2017-04-26 13:55:15 INFO  exec.Task (SessionState.java:printInfo(927)) - Starting Job = job_1493083994846_0010, Tracking URL = http://T-163:23188/proxy/application_1493083994846_0010/
2017-04-26 13:55:24,860 INFO  exec.Task (SessionState.java:printInfo(927)) - Kill Command = /opt/beh/core/hadoop/bin/hadoop job -kill job_1493083994846_0010
```
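As a sketch of the log-analysis approach these tools rely on, the queryId-to-job mapping can be recovered from lines like "Compiling command (queryId=...)" and "Starting Job = job_..." in the log above. The regular expressions below are assumptions tied to this log format, not code from any project.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiveLogParser {
  private static final Pattern QUERY_ID =
      Pattern.compile("Compiling command \\(queryId=([^)]+)\\)");
  private static final Pattern JOB_ID =
      Pattern.compile("Starting Job = (job_\\d+_\\d+)");

  // Returns queryId -> list of MapReduce job ids found in one log file.
  public static Map<String, List<String>> parse(String logFile) throws Exception {
    Map<String, List<String>> queryToJobs = new HashMap<String, List<String>>();
    String currentQueryId = null;
    BufferedReader reader = new BufferedReader(new FileReader(logFile));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        Matcher q = QUERY_ID.matcher(line);
        if (q.find()) {
          currentQueryId = q.group(1);
          queryToJobs.put(currentQueryId, new ArrayList<String>());
        }
        Matcher j = JOB_ID.matcher(line);
        if (j.find() && currentQueryId != null) {
          queryToJobs.get(currentQueryId).add(j.group(1));
        }
      }
    } finally {
      reader.close();
    }
    return queryToJobs;
  }
}
```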

3. Other Approaches

No open-source solution has been found that fully implements this feature. Next, we look at whether anyone else has done something similar and written about it.

3.1 Hive SQL Running-Status Monitoring

At present the data platform is built on Hadoop. To make data analysts' work easier, Hive is used to wrap Hadoop MapReduce tasks, so instead of individual MR jobs we face SQL statements. When the data platform talks to HiveServer through a JDBC-like interface, it can only sense the beginning and end of a SQL statement, and execution is usually long (two factors: data volume and SQL complexity). In some scenarios users need to know the execution progress of the statement, which raises the following questions:

(1) When a SQL statement is executed through the JDBC interface, how many MR jobs is the statement converted into, what is the JobId of each MR job, and how do we maintain the correspondence between the SQL statement and its MR jobs?

(2) How do we get the running status of an MR job through JobClient?

(3) Can the above information be obtained through HiveServer?

http://www.cnblogs.com/yurunmiao/p/4224137.html

4. Summary and Analysis

Previous research shows that, whether in open-source projects or other material online, most solutions only achieve the correspondence between Hive queries and YARN tasks under specific circumstances, and most of them work by analyzing the logs Hive produces to recover the relationship between queries and YARN tasks.

All signs suggest the situation is not encouraging: it is difficult to build a query-management tool compatible with every execution method, and we need to dig into the source code to understand it.

All query methods

```mermaid
graph LR
Hue --> Thrift
Thrift --> Driver
CLI --> Driver
HWI --> Driver
Beeline --> Driver
Beeline --> Thrift
```

HWI was officially removed after version 2.2 and its usage rate is low, so it is not supported for now.

Thrift supports multi-client concurrency and authentication and is the officially recommended path; Beeline connects only through Thrift.

Although the CLI does not go through Thrift, it is widely used in the current project, so we plan to support it.

Based on the above analysis, the access methods we want to support are as follows.

Supported query modes

```mermaid
graph LR
Hue --> Thrift
Thrift --> Driver
CLI --> Driver
Beeline --> Thrift
```

5. Source Code Analysis

The analysis above gives us our research direction and plan. It is divided into two parts: task management in CLI mode and task management in Thrift mode.

5.1 CLI-Mode Task Management

We know that in CLI mode only one query can be executed at a time, and during execution the query can be terminated with Ctrl+C. Ctrl+C sends a SIGINT interrupt signal to the process.

Reading the source code, we find that handling of the interrupt signal is set up in the processLine function, which processes a single command. See http://blog.csdn.net/lpxuan151009/article/details/7956518. Note: the source code is the hive-1.1 branch, where the hive-cli project corresponds to the CLI; the entry class is org.apache.hadoop.hive.cli.CliDriver.

```java
if (allowInterrupting) {
  // Remember all threads that were running at the time we started line processing.
  // Hook up the custom Ctrl+C handler while processing this line
  interruptSignal = new Signal("INT");
  oldSignal = Signal.handle(interruptSignal, new SignalHandler() {
    private final Thread cliThread = Thread.currentThread();
    private boolean interruptRequested;

    @Override
    public void handle(Signal signal) {
      boolean initialRequest = !interruptRequested;
      interruptRequested = true;

      // Kill the VM on second ctrl+c
      if (!initialRequest) {
        console.printInfo("Exiting the JVM");
        System.exit(127);
      }

      // Interrupt the CLI thread to stop the current statement and return
      // to prompt
      console.printInfo("Interrupting... Be patient, this might take some time.");
      console.printInfo("Press Ctrl+C again to kill JVM");

      // First, kill any running MR jobs
      HadoopJobExecHelper.killRunningJobs();
      TezJobExecHelper.killRunningJobs();
      HiveInterruptUtils.interrupt();
    }
  });
}
```

So far we have worked out, from both the process perspective and the source code, how a CLI query is terminated. The solution for this kind of task management is described in Section 6.

Execution process

```mermaid
graph LR
processCmd --> processLocalCmd
processLine --> processCmd
executeDriver --> processLine
run --> executeDriver
main --> run
```

processLocalCmd

```java
Driver qp = (Driver) proc;
PrintStream out = ss.out;
long start = System.currentTimeMillis();
if (ss.getIsVerbose()) {
  out.println(cmd);
}

qp.setTryCount(tryCount);
ret = qp.run(cmd).getResponseCode();
if (ret != 0) {
  qp.close();
  return ret;
}
```

This is where the query is actually executed, so the relevant information can be output here. The output channel can be a log, a database, or RPC.
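For instance, a hypothetical tracking hook placed before qp.run(cmd) could record the query text together with the CLI process's PID; the QUERY_TRACK marker and the PID-extraction trick below are assumptions, not Hive code.

```java
// Hypothetical tracking hook before qp.run(cmd): on HotSpot JVMs the
// runtime MXBean name is "pid@hostname", so the CLI process id can be
// recovered and recorded next to the query text (here just printed;
// a log file, database, or RPC call would do equally well).
String jvmName = java.lang.management.ManagementFactory.getRuntimeMXBean().getName();
String pid = jvmName.split("@")[0];
out.println("QUERY_TRACK pid=" + pid + " query=" + cmd);
```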

5.2 Thrift-Mode Task Management

By comparison, task management in CLI mode is relatively simple because each CLI can execute only one query, while Thrift mode supports concurrency; here we refer to the hiveserver2 implementation. Through Thrift, clients such as Hue, Beeline, and Ambari can connect to Hive to execute queries, and these query tasks need to be managed uniformly.

Thrift connection

```mermaid
graph LR
Hue --> hiveserver2
Beeline --> hiveserver2
Ambari --> hiveserver2
other --> hiveserver2
```

We find that the official Thrift interface mainly implements the operations belonging to a single connection; it does not hold cross-connection information, that is, no external interface to information across connections is provided.

The following is Thrift sample code; it establishes a connection and then queries data from Hive.

Thrift sample code

```java
TSocket tSocket = new TSocket("T-162", 10000);
tSocket.setTimeout(20000);
TTransport transport = tSocket;
TBinaryProtocol protocol = new TBinaryProtocol(transport);
TCLIService.Client client = new TCLIService.Client(protocol);
transport.open();

TOpenSessionReq openReq = new TOpenSessionReq();
TOpenSessionResp openResp = client.OpenSession(openReq);
TSessionHandle sessHandle = openResp.getSessionHandle();

TExecuteStatementReq execReq = new TExecuteStatementReq(sessHandle, "show databases");
TExecuteStatementResp execResp = client.ExecuteStatement(execReq);
TOperationHandle stmtHandle = execResp.getOperationHandle();

TFetchResultsReq fetchReq = new TFetchResultsReq(stmtHandle, TFetchOrientation.FETCH_FIRST, 100);
TFetchResultsResp resultsResp = client.FetchResults(fetchReq);
List<TColumn> res = resultsResp.getResults().getColumns();
for (TColumn tCol : res) {
  Iterator<String> it = tCol.getStringVal().getValuesIterator();
  while (it.hasNext()) {
    System.out.println(it.next());
  }
}

TCloseOperationReq closeReq = new TCloseOperationReq();
closeReq.setOperationHandle(stmtHandle);
client.CloseOperation(closeReq);

TCloseSessionReq closeConnectionReq = new TCloseSessionReq(sessHandle);
client.CloseSession(closeConnectionReq);
transport.close();
```
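Building on the sample above, cancelling the running statement needs only its operation handle; this snippet reuses the client and stmtHandle variables from the previous block.

```java
// Cancel the running statement via its operation handle; on the server
// side this ends up in CLIService.cancelOperation (shown below).
TCancelOperationReq cancelReq = new TCancelOperationReq(stmtHandle);
TCancelOperationResp cancelResp = client.CancelOperation(cancelReq);
System.out.println("Cancel status: " + cancelResp.getStatus());
```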

Looking at the source code (the hive-1.1 branch), the hive-service project corresponds to hiveserver2. The Thrift classes live mainly under the org.apache.hive.service.cli.thrift package.

Session & operation management

```mermaid
graph LR
HiveServer2 --> CLIService
CLIService --> SessionManager
SessionManager --> handleToSession
SessionManager --> OperationManager
OperationManager --> handleToOperation
```

SessionManager

```java
private final Map<SessionHandle, HiveSession> handleToSession =
    new ConcurrentHashMap<SessionHandle, HiveSession>();
private final OperationManager operationManager = new OperationManager();
```

OperationManager

```java
private final Map<OperationHandle, Operation> handleToOperation =
    new HashMap<OperationHandle, Operation>();
```

CLIService.cancelOperation

```java
@Override
public void cancelOperation(OperationHandle opHandle) throws HiveSQLException {
  sessionManager.getOperationManager().getOperation(opHandle)
      .getParentSession().cancelOperation(opHandle);
  LOG.debug(opHandle + ": cancelOperation()");
}
```

As the above API shows, you only need to find the OperationHandle of the corresponding operation and then call CLIService.cancelOperation.

Operation & handle

```mermaid
graph LR
SQLOperation --> ExecuteStatementOperation
ExecuteStatementOperation --> Operation
OperationHandle --> Handle
```

```java
public abstract class Handle {

  private final HandleIdentifier handleId;

  public Handle() {
    handleId = new HandleIdentifier();
  }

  public Handle(HandleIdentifier handleId) {
    this.handleId = handleId;
  }

  public Handle(THandleIdentifier tHandleIdentifier) {
    this.handleId = new HandleIdentifier(tHandleIdentifier);
  }

  public HandleIdentifier getHandleIdentifier() {
    return handleId;
  }

  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((handleId == null) ? 0 : handleId.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null) {
      return false;
    }
    if (!(obj instanceof Handle)) {
      return false;
    }
    Handle other = (Handle) obj;
    if (handleId == null) {
      if (other.handleId != null) {
        return false;
      }
    } else if (!handleId.equals(other.handleId)) {
      return false;
    }
    return true;
  }

  @Override
  public abstract String toString();
}
```

An operation can be located by its handle id. An interface that stops an operation by handle id then needs to be exposed through Thrift.

```
// CancelOperation()
//
// Cancels processing on the specified operation handle and
// frees any resources which were allocated.
struct TCancelOperationReq {
  // Operation to cancel
  1: required TOperationHandle op_handle
}
```

The Thrift cancel API already exists. What is still missing: a Thrift API to get all operations (for display), and a Thrift API to get all sessions (for display).
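A minimal sketch of what the "get all operations" half could look like inside OperationManager; getAllOperations() is a hypothetical addition, not hive-1.1 code (it uses the handleToOperation map shown above and needs java.util.ArrayList and java.util.List imports).

```java
// Hypothetical method added to OperationManager: snapshot all live
// operations so a new Thrift call could list them for display.
public synchronized List<Operation> getAllOperations() {
  return new ArrayList<Operation>(handleToOperation.values());
}
```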

6. Final Solution

6.1 CLI Mode

The CLI program has no communication module, and adding one would require large changes, so instead a query is cancelled by locating its process and sending the process an interrupt signal.

By changing the configuration so that query logs are split per process ID, the relationship between a query and its process ID can be recovered.

When you want to kill a query, one option is to send an interrupt signal to the specified process on the corresponding host via ssh. But this cannot be done just by sending the signal: queries may be run by many different users, and only root could guarantee the interrupt succeeds on all of them, while using root permissions raises security problems. So this can only be done through RPC.
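Whoever ends up invoking it, the basic delivery mechanism under discussion is SIGINT to the CLI process; a minimal sketch via ssh follows, assuming passwordless ssh and that the PID came from the per-process logs.

```java
// Deliver SIGINT (signal 2, the Ctrl+C signal) to a remote Hive CLI
// process; assumes passwordless ssh and sufficient permissions, which is
// exactly the limitation discussed above.
public static int interruptRemoteQuery(String host, int pid) throws Exception {
  Process p = new ProcessBuilder("ssh", host, "kill", "-2", String.valueOf(pid))
      .inheritIO()
      .start();
  return p.waitFor(); // 0 on success
}
```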

The specific RPC options currently available are Hadoop RPC or Thrift, in either a polling/heartbeat mode or a client-server collection mode.

For now we use the polling heartbeat mode: the process polls for commands over IPC at a heartbeat interval of one minute.
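A sketch of that heartbeat loop, where fetchKillCommands() stands in for the real IPC call and is purely hypothetical:

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class KillCommandPoller {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    // One-minute heartbeat, as described above: each tick asks the
    // management service for pending kill requests and applies them.
    scheduler.scheduleAtFixedRate(new Runnable() {
      @Override
      public void run() {
        for (String queryId : fetchKillCommands()) {
          System.out.println("would interrupt query " + queryId);
        }
      }
    }, 0, 1, TimeUnit.MINUTES);
  }

  // Hypothetical IPC stub; a real implementation would call the
  // management service over Hadoop RPC or Thrift.
  private List<String> fetchKillCommands() {
    return Collections.emptyList();
  }
}
```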

Workload: 4-8 working days

Because of the problems with the CLI scheme, the best solution may be to modify the driver layer.

Log format

```
hive-hadoop-hive-T-162.log.2621@T-162
hive-hadoop-hive-T-162.log.28632@T-162
hive-hadoop-hive-T-162.log.362@T-162
hive-hadoop-hive-T-162.log.5762@T-162
```

6.2 Thrift Mode

Add a Thrift API to get all operations and display all queries; call the cancel-query API to kill a query.

Workload: 4-8 working days

6.3 Driver: One for All

Since both the CLI and Thrift paths ultimately go through the driver layer, modifying the driver layer can solve every case in one place. Unlike the two approaches above, however, which only add a shell around the existing code and barely touch the core, the driver layer requires changing the logic itself to some extent, which makes it harder.

After reading the above, have you mastered the analysis and solutions for Hive job management? Thank you for reading!
