This article introduces how to write YARN applications and how to deploy Apache Drill on YARN. Many people run into these questions in real-world work, so let the editor walk you through how to handle them. I hope you read it carefully and get something out of it!
1. Summary
Drill is an open source SQL query engine under the Apache umbrella that can be used to explore big data. It was designed from the start for high-performance analysis of big data, and it supports the industry-standard query language ANSI SQL.
Prior to Drill 1.13, Drill only supported standalone cluster deployment; after a successful deployment, a daemon called Drillbit ran on each node. Starting with version 1.13, Drill can integrate with YARN to manage resources. Under YARN, Drill becomes a long-running process on YARN. When you start Drill, YARN automatically deploys the Drill software to each node, avoiding the hassle of installing Drill on every node. In addition, resource management is simplified because YARN is aware of the resources Drill uses.
All current YARN distributions provide settings for memory and CPU (which YARN calls "vcores"), and some distributions also provide settings for disks. For memory, when you deploy Drill on YARN you configure the memory Drill will use and tell YARN about it. Drill will use all available disks and CPU; you can optionally enable Linux cgroups to limit Drill's CPU usage so that it matches the vcores allocated by YARN.
To facilitate the explanation of deploying Drill under YARN, let's briefly introduce the core concepts of YARN.
2. Core concepts of YARN
YARN, whose full name is Yet Another Resource Negotiator, is Hadoop's newer resource manager. It is a general-purpose resource management system that provides unified resource management and scheduling for upper-layer applications.
2.1 Core components
ResourceManager (RM): the global resource manager, responsible for resource management and allocation across the entire system. It mainly consists of two components: the scheduler (Scheduler) and the applications manager (Applications Manager, ASM).
ApplicationMaster (AM): each application submitted by a user contains an AM. Its main functions include:
Negotiating with the RM scheduler to obtain resources (expressed as Containers)
Communicating with NMs to start / stop tasks
Monitoring the running status of all tasks, and re-requesting resources to restart a task when it fails
Container: a Container is YARN's resource abstraction; it encapsulates the multi-dimensional resources on a node, such as CPU, memory, disk, and network. When an AM requests resources from the RM, the resources the RM returns to the AM are represented as Containers. YARN assigns a Container to each task, and the task can only use the resources described by that Container. A Container is a dynamic resource-division unit generated automatically according to the application's needs (see the sketch after this list).
NodeManager (NM): NM is the resource and task manager on each node. On the one hand, it regularly reports the resource usage and Container running status of this node to RM; on the other hand, it accepts and processes various requests such as Container start / stop from AM.
Client: an instance in the cluster that submits applications to the RM and specifies the type of AM needed to execute the application.
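To make the Container abstraction concrete, here is a minimal, illustrative sketch (not from the original article) of how an AM describes the multi-dimensional resources of a desired Container when asking the RM for one, using the standard Hadoop YARN client API; the memory and vcore figures are arbitrary assumptions.

// Illustrative only: describe the resources of one desired Container.
Resource capability = Resource.newInstance(4096, 2);   // 4096 MB of memory, 2 vcores
Priority priority = Priority.newInstance(0);
// Ask for a Container with that capability on any node / any rack.
AMRMClient.ContainerRequest containerRequest =
        new AMRMClient.ContainerRequest(capability, null, null, priority);
// The RM answers such requests with Container objects; each task may only use
// the memory and CPU described by the Container it was assigned.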
2.2 YARN workflow
When a user submits a task to YARN, YARN runs it in two phases: the first phase starts the AM; in the second phase the AM creates tasks, requests resources for them, and monitors the whole run until completion. The details are as follows:
1. The user submits an application to YARN, including the AM program, the command to start the AM, and so on.
2. The RM allocates the first Container for the application (the AM container, whose ID usually ends in 000001) and communicates with the corresponding NM, asking it to start the application's AM in this Container.
3. The AM first registers with the RM so that the user can check the application's running status directly through the RM; it then requests resources for each task and monitors their status until the run finishes, i.e. it repeats steps 4-7.
4. The AM requests and receives resources from the RM by polling over the RPC protocol.
5. Once the AM obtains a resource, it communicates with the corresponding NM, asking it to start the task.
6. After the NM sets up the running environment for the task (environment variables, JAR packages, binary programs, and so on), it writes the task's startup command into a script and starts the task by running that script.
7. Each task reports its status and progress to the AM over the RPC protocol, so the AM can track each task's status at any time and restart a task when it fails.
8. When the application finishes, the AM deregisters from the RM and shuts itself down.
2.3 How to write YARN applications
Write a client
// Initialize and start a YarnClient
Configuration yarnConfig = new YarnConfiguration(getConf());
YarnClient client = YarnClient.createYarnClient();
client.init(yarnConfig);
client.start();
...
// Create an application
YarnClientApplication app = client.createApplication();
GetNewApplicationResponse appResponse = app.getNewApplicationResponse();
...
// Set the application submission context
ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
appContext.setApplicationId(appResponse.getApplicationId());
appContext.setApplicationName(config.getProperty("app.name"));
appContext.setApplicationType(config.getProperty("app.type"));
// Set the AM container launch context
ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
amContainer.setLocalResources(amLocalResources);
amContainer.setEnvironment(amEnvironment);
amContainer.setCommands(Collections.singletonList(amCommand.toString()));
// Submit the application
client.submitApplication(appContext);
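The snippet above references several values that it does not define (amLocalResources, amEnvironment, amCommand, and config, which can be an ordinary java.util.Properties object). Below is a minimal, hedged sketch of how they might be assembled with the standard YARN records API; the HDFS path, AM main class, and memory figures are illustrative assumptions and not part of the original example.

// Illustrative continuation of the client snippet above (yarnConfig, appContext, amContainer in scope).
// Localize the AM jar from an assumed HDFS location so YARN ships it to the AM container.
FileSystem fs = FileSystem.get(yarnConfig);
Path amJarPath = new Path("/apps/myapp/am.jar");                  // assumed path
FileStatus jarStatus = fs.getFileStatus(amJarPath);
LocalResource amJar = Records.newRecord(LocalResource.class);
amJar.setResource(URL.fromPath(amJarPath));                       // ConverterUtils.getYarnUrlFromPath on older Hadoop versions
amJar.setSize(jarStatus.getLen());
amJar.setTimestamp(jarStatus.getModificationTime());
amJar.setType(LocalResourceType.FILE);
amJar.setVisibility(LocalResourceVisibility.APPLICATION);
Map<String, LocalResource> amLocalResources = Collections.singletonMap("am.jar", amJar);

// Environment for the AM process (simplified).
Map<String, String> amEnvironment = new HashMap<>();
amEnvironment.put("CLASSPATH", "./*");

// Command used to launch the AM JVM (main class is an assumption).
StringBuilder amCommand = new StringBuilder()
        .append(ApplicationConstants.Environment.JAVA_HOME.$$()).append("/bin/java")
        .append(" -Xmx256m com.example.MyApplicationMaster")
        .append(" 1>").append(ApplicationConstants.LOG_DIR_EXPANSION_VAR).append("/stdout")
        .append(" 2>").append(ApplicationConstants.LOG_DIR_EXPANSION_VAR).append("/stderr");

// Tell YARN how much memory / CPU the AM container itself needs, then attach the launch context.
appContext.setResource(Resource.newInstance(512, 1));
appContext.setAMContainerSpec(amContainer);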
Write ApplicationMaster (AM)
// Initialize AMRMClientAsync
YarnConfiguration yarnConfig = new YarnConfiguration();
AMRMClientAsync amrmClientAsync = AMRMClientAsync.createAMRMClientAsync(5000, new AMRMCallbackHandler());
amrmClientAsync.init(yarnConfig);
amrmClientAsync.start();
// Initialize NMClientAsync
NMClientAsync nmClientAsync = NMClientAsync.createNMClientAsync(new NMCallbackHandler());
nmClientAsync.init(yarnConfig);
nmClientAsync.start();
// Register the ApplicationMaster (AM)
amrmClientAsync.registerApplicationMaster(thisHostName, 0, "");
...
// Add a ContainerRequest
amrmClientAsync.addContainerRequest(containerRequest);
...
// Launch a container
nmClientAsync.startContainerAsync(container, containerContext);
// Deregister the AM
amrmClientAsync.unregisterApplicationMaster(appStatus, appMessage, null);
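The AM snippet passes in two callback handlers (AMRMCallbackHandler, NMCallbackHandler) that the article does not show. As a hedged illustration only, skeletons implementing the standard AMRMClientAsync.CallbackHandler and NMClientAsync.CallbackHandler interfaces might look like the following; real handlers would start containers, track completions, and drive the re-request logic described in the workflow above.

// Illustrative skeletons only (not from the article); method bodies are intentionally minimal.
class AMRMCallbackHandler implements AMRMClientAsync.CallbackHandler {
    public void onContainersAllocated(List<Container> containers) { /* launch tasks via NMClientAsync */ }
    public void onContainersCompleted(List<ContainerStatus> statuses) { /* record exit codes; re-request on failure */ }
    public void onShutdownRequest() { /* begin orderly shutdown */ }
    public void onNodesUpdated(List<NodeReport> updatedNodes) { }
    public float getProgress() { return 0.0f; }   // reported back to the RM
    public void onError(Throwable e) { /* stop clients and abort */ }
}

class NMCallbackHandler implements NMClientAsync.CallbackHandler {
    public void onContainerStarted(ContainerId id, Map<String, ByteBuffer> allServiceResponse) { }
    public void onContainerStatusReceived(ContainerId id, ContainerStatus status) { }
    public void onContainerStopped(ContainerId id) { }
    public void onStartContainerError(ContainerId id, Throwable t) { }
    public void onGetContainerStatusError(ContainerId id, Throwable t) { }
    public void onStopContainerError(ContainerId id, Throwable t) { }
}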
This is only a brief introduction to the concept of YARN and how to write YARN applications. For details, please refer to Apache Hadoop YARN.
3. Drill-on-YARN deployment
3.1 Drill-on-YARN components
Drill distribution: Drill-on-YARN uploads this distribution to a distributed file system (such as HDFS), and YARN downloads it to each worker node (that is, each node where a Node Manager runs).
Drill site directory: a directory containing Drill configuration files and custom JAR packages. Drill-on-YARN copies it to each worker node.
Configuration: a configuration file that tells Drill-on-YARN how to manage the Drill cluster. This file is independent of the configuration file of drill itself.
Drill-on-YARN client: provides commands such as start, stop, and status monitoring.
Drill Application Master (AM): used to interact with YARN, including requesting resources, starting Drillbits, and so on. AM also provides a web interface for managing Drill clusters.
Drillbit: the Drill daemon running on each node
3.2 Deployment steps
YARN starts an application through a client; for Drill, that is the Drill-on-YARN client. The client can run on any machine, as long as both Drill and Hadoop are installed on it. When you deploy Drill with YARN, you only need to install Drill on the client machine; Drill-on-YARN automatically deploys it to the other nodes. Note that when deploying Drill without YARN you would normally put configuration files and custom code in the Drill directory, but when running under YARN it is recommended to put all configuration and custom code in a directory called site and not change anything under the Drill directory.
Next, the deployment steps are described in detail:
Deployment environment
There is no need to go over deploying the JDK, ZooKeeper, and Hadoop again; just remember to set JAVA_HOME and HADOOP_HOME (see the example after this list).
JDK8+
Zookeeper cluster
Hadoop cluster
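For example, the two environment variables can be exported as follows (the paths are illustrative assumptions; adjust them to your own installation):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0   # example path, adjust as needed
export HADOOP_HOME=/opt/hadoop             # example path, adjust as needed
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH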
Create a directory to place the downloaded Drill distribution package
export DRILL_DIR=/path/to/drill
mkdir -p $DRILL_DIR
cd $DRILL_DIR
Description: after executing the above commands, the current directory is /path/to/drill.
Download the Drill distribution package; apache-drill-1.14.0.tar.gz is used here. Extract it after downloading, and note again that the current directory is /path/to/drill.
export DRILL_NAME=apache-drill-1.14.0
tar -xzf $DRILL_NAME.tar.gz
export DRILL_HOME=$DRILL_DIR/$DRILL_NAME
Description: DRILL_NAME is important; it relates to the directory name (dir-name) used when Drill is started later.
Create a site directory and place configuration files and custom code in it
export DRILL_SITE=$DRILL_DIR/site
mkdir -p $DRILL_SITE
cp $DRILL_HOME/conf/drill-override-example.conf $DRILL_SITE/drill-override.conf
cp $DRILL_HOME/conf/drill-on-yarn-example.conf $DRILL_SITE/drill-on-yarn.conf
cp $DRILL_HOME/conf/drill-env.sh $DRILL_SITE
Description:
Custom code is usually packaged as a JAR and placed under $DRILL_SITE/jars. For example, a custom UDF can be placed in $DRILL_SITE/jars/3rdparty.
Do not copy the entire drill-override-example.conf file; copy only the configuration entries you need and then modify them.
Modify $DRILL_SITE/drill-override.conf
In general, the configurations that may need to be modified are cluster-id, zk, http, and rpc. Here only cluster-id and zk are modified:
drill.exec: {
  cluster-id: "drillbits1",
  zk: {
    connect: "11.167.47.76:2181",
    root: "drill",
    refresh: 500,
    timeout: 5000,
    retry: {
      count: 7200,
      delay: 500
    }
  }
}
Modify $DRILL_SITE/drill-on-yarn.conf
# Drillbit resource configuration
drillbit: {
  heap: "4G"               # Java heap size
  max-direct-memory: "8G"
  memory-mb: 12288         # memory per container in MB; generally heap + max-direct-memory, but a larger value is recommended
  vcores: 4                # number of CPUs
}
# Drillbit cluster configuration
cluster: [
  {
    name: "mypool"
    type: "basic"          # options are basic and labeled; basic starts drillbits in any available container on the YARN cluster, labeled starts drillbits only in containers on specifically labeled nodes
    count: 1               # number of YARN containers to start
  }
]
# Location of the Drill distribution package
drill-install: {
  client-path: "/path/to/drill/apache-drill-1.14.0.tar.gz"
  # dir-name: "drill"
}
# Distributed file system location
dfs: {
  connection: "hdfs://ip:port/"
  dir: "/user/drill"
}
# Drill-on-YARN Web interface configuration
drill.yarn: {
  http: {
    port: 8048
  }
}
# Drill-on-YARN Web interface security configuration
drill.yarn.http: {
  auth-type: "simple"
  user-name: "drill"       # note: drill-on-yarn-example.conf defaults to user_name, which is wrong; change it to user-name
  password: "drill"
}
Description:
The complete configuration is attached:
drill.yarn: {
  app-name: "Drill-on-YARN"
  dfs: {
    connection: "hdfs://11.162.91.196:9000/"
    app-dir: "/users/drill"
  }
  yarn: {
    queue: "default"
  }
  drill-install: {
    client-path: "/home/admin/drill/apache-drill-1.14.0.tar.gz"
    # dir-name: "drill"
    # library-path: "/opt/libs"
  }
  am: {
    heap: "450m"
    memory-mb: 512
    # node-label-expr: "drill-am"
  }
  http: {
    port: 8048
    # ssl-enabled: true
    auth-type: "simple"
    user-name: "drill"
    password: "drill"
    rest-key: ""
  }
  drillbit: {
    heap: "3G"
    max-direct-memory: "1G"
    code-cache: "1G"
    memory-mb: 4096
    vcores: 2
    # disks: 3
    classpath: ""
  }
  cluster: [
    {
      name: "drill-group1"
      type: "basic"
      count: 3
    }
  ]
}
Regarding heap and max-direct-memory in the Drillbit resource configuration: when deploying without YARN, they are set in $DRILL_HOME/conf/drill-env.sh, but when deploying under YARN they are set in $DRILL_SITE/drill-on-yarn.conf. However, if they are already configured in drill-env.sh, drill-env.sh takes precedence.
Although the Drillbit cluster group configuration is a list, only one group is currently supported.
dir-name: if the archive at client-path extracts to a directory named apache-drill-1.14.0, dir-name does not need to be configured; otherwise set it to the name of the extracted directory (see the snippet after these notes).
The auth-type in the Web interface security configuration supports simple and drill. simple requires a user name and password to be specified; drill uses Drill's own authentication mechanism.
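As an illustration of the dir-name note above (the values are hypothetical), if the archive extracts to a directory other than apache-drill-1.14.0, the drill-install section would look like:

drill-install: {
  client-path: "/path/to/drill/my-drill-build.tar.gz"   # hypothetical archive name
  dir-name: "my-drill-build"                             # name of the directory the archive extracts to
}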
Start
$DRILL_HOME/bin/drill-on-yarn.sh --site $DRILL_SITE start
Next, you will see the startup log
Connecting to DFS... Connected.
Using existing Drill archive in DFS: /users/drill/apache-drill-1.14.0.tar.gz
Uploading site directory /home/admin/drill/apache-drill-1.14.0/bin/../../site to /users/drill/site.tar.gz ... Uploaded.
Loading YARN Config... Loaded.
Application ID: application_1533475543014_0005
Launching Drill-on-YARN...
Tracking URL: http://dtshow011162091196.zth:8088/proxy/application_1533475543014_0005/
Application Master URL: http://11.163.210.105:8048/
As the log above shows, apache-drill-1.14.0.tar.gz and the site directory (packed into site.tar.gz) are first uploaded to HDFS, then the YARN configuration is loaded, and finally Drill is launched.
In addition to start, drill-on-yarn.sh also provides the status, stop, resize, and clean commands. For example, status outputs:
Application ID: application_1533475543014_0005
Application State: RUNNING
Host: dtshow011163210105.zth/11.163.210.105
Queue: default
User: admin
Start Time: 2018-08-19 20:51:55
Application Name: Drill-on-YARN
Tracking URL: http://dtshow011162091196.zth:8088/proxy/application_1533475543014_0005/
AM State: LIVE
Target Drillbit Count: 3
Live Drillbit Count: 3
Unmanaged Drillbit Count: 0
Blacklisted Node Count: 0
Free Node Count: 0
For more information, visit: http://11.163.210.105:8048/
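The resize and stop commands are invoked in the same way as start. A brief sketch of typical usage follows (based on the Drill-on-YARN client's documented commands; verify the exact syntax against your Drill version):

$DRILL_HOME/bin/drill-on-yarn.sh --site $DRILL_SITE resize 5   # run 5 drillbits in total
$DRILL_HOME/bin/drill-on-yarn.sh --site $DRILL_SITE stop       # stop the whole Drill cluster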
After launching successfully, you can access http://11.163.210.105:8048/, as shown below:
The username and password are the drill / drill configured earlier. This page provides the following features:
Overview of the cluster status
The complete startup configuration
A list of running Drillbits
Simple controls for resizing the cluster
A history page of stopped, killed, and failed Drillbits, which can be used to diagnose problems
Now that Drill has been successfully deployed on YARN, you can also run query tests through Drill's own Web UI, as shown below:
That's all for "how to write YARN applications" and deploying Drill on YARN. Thank you for reading. If you want to learn more, you can follow this site, where the editor will keep publishing practical articles.