What is Azkaban
This article explains what Azkaban is in detail. The editor finds it very practical and shares it with you as a reference; I hope you gain something from reading it.
1. What is Azkaban
We have all met this scenario at work: there is a large task that can be divided into several smaller tasks, divisible because the small tasks can partly run concurrently. For example, a large task A executed via shell scripts can be split into four subtasks (scripts) B, C, D and E, where B and C can run at the same time, D depends on the output of B and C, and E depends on the output of D. The usual manual practice is to open two terminals to execute B and C in parallel, execute D after both have finished, and then execute E. We have to take part in the whole run, yet the process itself is simply a directed acyclic graph: each subtask is a node in the flow of the overall task, execution can start simultaneously from all nodes with no incoming edges, and any two nodes with no path between them can run in parallel. Manual control inevitably falls short (many such tasks have to run late at night, so the usual workaround is to write scripts and set up cron), and what we really need is a workflow scheduler; a shell sketch of the manual approach follows the feature list below. Azkaban exists to do exactly this job (in practice it is mainly used to schedule tasks in the Hadoop ecosystem). It was implemented and open-sourced by LinkedIn and is mainly used to run a set of jobs and processes within a workflow in a specific order. It is configured through simple key:value pairs, with dependencies declared via a dependencies property; the dependency graph must be acyclic, otherwise the workflow is treated as invalid. Azkaban has the following features:
Web user interface
Easy workflow uploads
Easy configuration of relationships between tasks
Workflow scheduling
Authentication / authorization (controlling who may do what)
Ability to kill and restart workflows
Modular and pluggable plug-in mechanism
Project Workspace
Logging and auditing of workflows and tasks
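To make the pain point concrete, here is a minimal shell sketch of the manual orchestration described above; B.sh through E.sh are hypothetical stand-ins for the four subtasks:
# Run B and C concurrently; D needs both, E needs D.
./B.sh & pid_b=$!
./C.sh & pid_c=$!
wait "$pid_b" "$pid_c"    # block until both parallel branches finish
./D.sh                    # consumes the output of B and C
./E.sh                    # consumes the output of D
Azkaban replaces exactly this kind of hand-written glue (plus the cron entry around it) with declared dependencies.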
I think these are the features any mainstream workflow scheduler should support. Azkaban's web pages are well done and can greatly reduce administration costs. The task types it can schedule are plugin-based, which lets us implement our own plugins to meet specific requirements. In addition, it can send email when a task completes, fails or succeeds, supports SLA settings, and more; on the whole it is quite powerful.
2. Installation and deployment
Azkaban consists of three components: a MySQL server, a web server and an executor server. MySQL stores projects and execution plans (the attribute information of all tasks, execution plans, execution results and outputs) as well as per-execution state. The web server uses Jetty to provide the web service through which users manage Azkaban. The executor server is responsible for submitting and executing the actual workflows; multiple executor servers can be started, and they coordinate task execution through the MySQL database.
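Since all shared state lives in MySQL, the database has to be prepared first. A minimal sketch, assuming the schema scripts shipped in the azkaban-sql package (database name, user and password are placeholders, and the script's version suffix varies by release):
# Create the database and a user for Azkaban (names are placeholders).
mysql -u root -p <<'SQL'
CREATE DATABASE azkaban;
CREATE USER 'azkaban'@'%' IDENTIFIED BY 'azkaban';
GRANT ALL ON azkaban.* TO 'azkaban'@'%';
SQL
# Load the table definitions; the exact file name depends on the release.
mysql -u azkaban -p azkaban < create-all-sql-2.5.0.sql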
First, download each module from the official website; all of them ship as binary packages, though you can also compile from source. Download address: http://azkaban.github.io/downloads.html. For the installation procedure, refer to: http://blog.javachen.com/2014/08/25/install-azkaban/. Because the web client is accessed over https, you need to create a keystore certificate file with the command:
keytool -keystore keystore -alias jetty -genkey -keyalg RSA
Follow the prompts to enter the required information; the final "key password" can be the same as the keystore password. Then edit the Jetty properties in the web server's configuration file azkaban.properties to point at the certificate file you generated:
jetty.keystore=keystore
jetty.password=redhat
jetty.keypassword=redhat
jetty.truststore=keystore
jetty.trustpassword=redhat
You can then enter https://ip:8443 in the browser to access Azkaban (the login username and password are set in the web server's user configuration file; here we use admin).
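For completeness, a hedged sketch of starting the two servers afterwards; the start scripts below match the 2.x binary packages, but names may differ in other releases:
# Start the web server and the executor server; each AND-list runs in
# its own background subshell, so the parent shell's cwd is unchanged.
(cd azkaban-web-server && bin/azkaban-web-start.sh &)
(cd azkaban-executor-server && bin/azkaban-executor-start.sh &)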
3. Testing
Here we do a simple test. Azkaban natively supports shell commands (and therefore shell scripts and other scripting languages such as python), so a few simple shell commands are enough. We create four subtasks, each configured by a <task-name>.job file, as follows:
test.job
type=command
command=sleep 3
command.1=echo "Hello World"
start.job
type=command
command=sleep 5
command.1=echo "start execute"
sleep.job
type=command
dependencies=test,start
command=sleep 10
finish.job
type=command
dependencies=sleep
command=echo "finish"
Here the dependencies property identifies the tasks that a task depends on; there can be one or more, separated by ",". All four tasks have type command; Azkaban supports other task types as well, some of which require plugins. We then put the four job files in one directory and compress them into a zip file (see the sketch below). On the home page of Azkaban's web interface, a new workflow project is created through the "Create Project" button; after entering the required information we land on the project page, where we upload the task flow via Upload. Repeating the upload overwrites the previous flow definition, but the execution results of earlier runs are not overwritten. If the workflow configuration has a problem (such as a dependency cycle), the upload will not succeed, but no error prompt is shown. Once the compressed file has been uploaded successfully, we can view the dependency graph of the tasks in the interface.
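Packaging the archive mentioned above is a one-liner; the zip file name testflow.zip is arbitrary:
# Bundle the four job files for upload, then verify the contents.
zip testflow.zip test.job start.job sleep.job finish.job
unzip -l testflow.zip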
You can start one execution of a workflow through the "Execute Flow" button. Clicking it opens a configuration screen with "Flow View", "Notification", "Failure Options", "Concurrent" and "Flow Parameters", as well as a Schedule button in the lower left corner where timed execution of the workflow can be set up. Note that these settings have to be made for every execution; so far there seems to be no way to save them as defaults, although to repeat previous settings you can locate an earlier execution and run it again (you still pass through the configuration page, but it keeps that run's configuration). The options deserving attention are "Failure Options" and "Concurrent", which configure, respectively, what happens after a task in the workflow fails and how multiple concurrent executions of the same flow (multiple Executes) are handled. Leaving everything at its defaults and executing directly, after submission the interface reports the id of this execution (a recognizable string would arguably be better here). This id is globally unique, i.e. each execution across all projects increments to a new exec id. After the run completes, the execution results of the flow and of each subtask can be viewed through the web interface.
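Executions can also be triggered outside the UI through Azkaban's AJAX API, which returns the same execution id. A hedged sketch (host, credentials, project and flow names are placeholders for this deployment):
# Log in first; the JSON response contains a session.id token.
curl -k -X POST --data "action=login&username=admin&password=admin" \
     https://localhost:8443
# Trigger one run of flow "finish" in project "test"; the JSON
# response carries the new execid.
curl -k --get --data "session.id=<token-from-login>" \
     --data "ajax=executeFlow" --data "project=test" \
     --data "flow=finish" https://localhost:8443/executor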
Under the Graph tab, you can follow the execution of each task and see which task is currently running; Flow Log streams the workflow's log in real time; and clicking an individual subtask shows that subtask's status and its live log output. All in all, it is very convenient.
The concepts involved here are project, flow and job. A project is the whole unit of tasks to be executed; it can contain multiple flows, and each project corresponds to one uploaded .zip file. The flows are independent of one another, but there is one top-level flow, and it may reference other flows as parts of its execution (a referenced flow acts like a child job of the top-level flow, except that the job is itself a flow). Each flow contains multiple jobs that are wired together through the dependencies settings in the job files, and the terminating job of a flow serves as that flow's identity (its flow name). This is how a flow is added as a job to another flow:
jobGroup.job
type=flow
flow.name=finish
dependencies=realStart
Here finish is the identity of the previously defined flow (because it is that flow's terminating job). As a job, the embedded flow can declare dependencies of its own, such as dependencies=realStart above. The resulting task dependency graph then shows the child flow as a node of its parent.
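Putting the pieces together, a sketch of the extra job files for a parent flow that embeds the earlier flow; realStart is the hypothetical job name used in the dependency above:
# realStart.job - an ordinary command job that runs before the child flow.
cat > realStart.job <<'EOF'
type=command
command=echo "real start"
EOF
# jobGroup.job - embeds the flow whose terminating job is "finish".
cat > jobGroup.job <<'EOF'
type=flow
flow.name=finish
dependencies=realStart
EOF
# Repackage everything, including the four .job files from section 3.
zip embedded.zip *.job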
I think it is designed this way to keep flows separate from one another and make them easy to reuse.
4. User management
Azkaban has the concepts of users and user groups. The configuration of users, groups and permissions is stored in the file azkaban-users.xml, and authentication is implemented by azkaban.user.XmlUserManager. This is configured in azkaban.properties (under the web server's conf directory):
Parameter                Default
user.manager.class       azkaban.user.XmlUserManager
user.manager.xml.file    azkaban-users.xml
Three kinds of entries can be configured in azkaban-users.xml: user, group and role. A user entry takes username, password, roles and groups, configuring the user's name, password, permissions and group membership, respectively. A group entry takes name and roles, configuring the group's name and the permissions that group holds. A role entry defines permission information: its name and permissions attributes give the rule's name and the permissions it grants. The permissions supported by Azkaban are listed below (an example azkaban-users.xml follows the table):
Permission       Meaning
ADMIN            Can do anything, including adding and modifying other users' permissions
READ             Can only view the contents and log information of each project
WRITE            Can upload and modify the properties of tasks in created projects, and can delete any project
EXECUTE          Allows the user to execute any task flow
SCHEDULE         Allows the user to add and remove scheduling information for any task flow
CREATEPROJECTS   Allows the user to create new projects even when project creation is disabled
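A minimal azkaban-users.xml sketch combining these elements (the user names, passwords and the readonly role are illustrative, not shipped defaults):
# Write a minimal user/role configuration for XmlUserManager.
cat > conf/azkaban-users.xml <<'EOF'
<azkaban-users>
  <user username="admin" password="admin" roles="admin"/>
  <user username="reader" password="reader" roles="readonly"/>
  <role name="admin" permissions="ADMIN"/>
  <role name="readonly" permissions="READ"/>
</azkaban-users>
EOF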
Note that permissions here are not refined per project: the permissions a user holds allow the same operations under every project. Also, the relationship between user permissions and group permissions is not very clear; if the user group is meant to be the unit of permission assignment (that is, all users in a group share the same permissions), then specifying permissions again on each user seems redundant.
This is the end of this article on "what is Azkaban". I hope the content above was helpful and that you learned something from it.