Recently, quite a few students learning ETL and its tools have complained to us: they use Kettle too, and the starting point is obviously the same, so why do others get ETL done so quickly and so well while they keep stumbling into the same pitfalls?
In fact, an open-source tool like Kettle already covers most of the functionality needed in daily work, and a company's basic needs can be met by simply deploying it. In actual use, however, you will find that Kettle on its own is like a smartphone with only the built-in calling and texting features: without the right "apps" for different jobs, it is not much different from an old phone that can only make calls.
Today we will start with a simple evaluation and comparison of one of the more popular kinds of "app", the scheduling tool, to help you quickly unlock new ways of doing ETL with open-source tools.
I. Why do you need a scheduling system?
Let's start with some basics.
We all know that big data computation, analysis, and processing are generally composed of multiple task units (Hive, Spark SQL, Spark, shell scripts, etc.), each of which implements a specific piece of data processing logic.
There are often strong dependencies between these task units: a downstream task can only run after its upstream task has succeeded. For example, if an upstream task produces result A when it finishes, and the downstream task needs result A to produce result B, then the downstream task must not start until the upstream task has run successfully and produced its result.
To guarantee correct results, these tasks must be executed in an orderly and efficient way according to their upstream and downstream dependencies. A fairly basic approach is to estimate how long each task takes, work out each task's start and end time from the required order, and keep the whole pipeline running by starting each task at its scheduled time.
A complete data analysis job is executed at least once, and this kind of scheduling is perfectly adequate for low-frequency processing with small data volumes and simple dependencies. In enterprise-level scenarios, however, many more tasks need to run every day; with a large number of tasks, working out start times by hand costs a lot of time, and if an upstream task overruns its estimated time or fails, this approach breaks down completely and wastes manpower and resources over and over again. For an enterprise data development process, therefore, a complete and efficient workflow scheduling system plays a vital role.
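To make the dependency idea concrete, here is a minimal Python sketch (not taken from any particular scheduler) that runs tasks in topological order, so a downstream task only starts after every upstream task it depends on has succeeded; the task names and commands are hypothetical.

```python
from graphlib import TopologicalSorter  # available in Python 3.9+
import subprocess

# Hypothetical workflow: each task maps to the set of upstream tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_users": set(),
    "build_result_a": {"extract_orders", "extract_users"},  # needs both extracts
    "build_result_b": {"build_result_a"},                    # downstream of result A
}

# Hypothetical commands standing in for real task units (Hive, Spark SQL, shell, ...).
commands = {
    "extract_orders": "echo extracting orders",
    "extract_users": "echo extracting users",
    "build_result_a": "echo building result A",
    "build_result_b": "echo building result B from A",
}

# Run the tasks strictly in dependency order; check=True aborts the run if an
# upstream task fails, which a purely time-based schedule cannot guarantee.
for task in TopologicalSorter(dependencies).static_order():
    print(f"running {task}")
    subprocess.run(commands[task], shell=True, check=True)
```

A real scheduling system adds retries, calendars, alerting, and monitoring on top of exactly this kind of dependency graph.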
II. Comparison of scheduling tools
After getting started with ETL work, the first thing most students encounter is crontab, the Linux facility for running programs on a schedule. It is easy to use and runs stably, and the cron service starts by default once the operating system is installed. But that simplicity comes with drawbacks: when there are too many tasks they become unmanageable, the crontab lives on a single machine and cannot easily be backed up, and if the machine goes down nothing runs. So we will not dwell on crontab here; instead we focus on a horizontal evaluation of the more mature workflow scheduling tools: Apache Oozie, Azkaban, and Digital Cloud.
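For reference, the "estimate the start times" approach usually looks something like the crontab sketch below; the script paths are hypothetical, and the comments spell out why it breaks as soon as an upstream job overruns or fails.

```
# Naive crontab scheduling: start times are guessed from estimated run times.
# Upstream task, assumed to finish in about 30 minutes.
0 1 * * * /opt/etl/extract.sh
# Downstream task, blindly starts at 01:30 whether or not extract.sh finished.
30 1 * * * /opt/etl/transform.sh
# Final load, blindly starts at 02:00; any overrun or failure upstream corrupts the result.
0 2 * * * /opt/etl/load.sh
```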
1. Oozie
Oozie (the name means "elephant trainer", i.e., the one who drives MapReduce) is an open-source framework built around a workflow engine. It must be deployed in a Java servlet container to run, and it is mainly used for timed scheduling and for scheduling multiple tasks according to their logical execution order.
Oozie download address: https://oozie.apache.org
It has the following functional features:
Unified scheduling of common Hadoop tasks: launching MapReduce jobs, HDFS operations, shell scripts, Hive jobs, and so on.
Complex dependencies, time triggers, and event triggers are expressed in XML, which is supposed to improve development efficiency (your mileage may vary; XML strikes many people as verbose and inefficient). A minimal sketch of such a workflow definition follows this list.
A group of tasks is represented as a DAG and displayed graphically, so the flow is easy to follow.
Supports scheduling many task types and can handle most Hadoop jobs.
Workflow definitions support EL constants and functions, giving them rich expressive power.
Can send an email notification when a job finishes.
Whereas Azkaban is operated mainly through its web UI, Oozie can be operated through the web UI, a REST API, and a Java API.
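To illustrate the XML style referred to in the list above, here is a rough, minimal sketch of what an Oozie workflow definition with a single shell action can look like; the workflow name, script, and property names are placeholders, and the exact schema versions depend on your Oozie release.

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="run-extract"/>
    <action name="run-extract">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>extract.sh</exec>
            <file>${appPath}/extract.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Extract failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The ok/error transitions are how the DAG edges and failure handling are expressed, and ${wf:errorMessage(...)} is an example of the EL functions mentioned above.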
2. Azkaban
Azkaban is a batch workflow task scheduler open-sourced by LinkedIn, used to run a set of jobs and processes in a specific order within a workflow. Azkaban defines a key-value file format for declaring dependencies between tasks and provides an easy-to-use web interface for maintaining and tracking your workflows (a minimal example of this file format appears after the feature list below).
Azkaban download address: https://azkaban.github.io/downloads.html
It has the following functional features:
Compatible with any Hadoop version
Easy-to-use web interface
Simple workflow uploads
Easy configuration of dependencies between tasks
Workflow scheduling
Modular and pluggable plug-in mechanism
Authentication / authorization
Ability to kill and restart workflows
Email alerts on failure and success
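As a rough illustration of the key-value job format mentioned above, the sketch below defines two Azkaban jobs, each normally kept in its own .job file, where the load step declares a dependency on the extract step; the job names and commands are hypothetical.

```
# extract.job
type=command
command=sh /opt/etl/extract.sh

# load.job  (runs only after "extract" succeeds)
type=command
command=sh /opt/etl/load.sh
dependencies=extract
```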
3. Digital Cloud
Digital Cloud 4.0, built on Qilan Technology's products, is deployed in the cloud and provides a one-stop big data tool platform and community for individuals, business owners, and independent data application developers, with a basic package that is free forever. Through the platform, individuals and enterprises can integrate and develop data from their own multi-source business systems without having to worry about the complex installation, tedious configuration, and daily operation and maintenance of the underlying big data storage and compute engines, turning data into assets and easily building their own cloud-based data platform for their own business scenarios.
Digital Cloud product introduction page: dtcloud.dtwave.com
Digital Cloud online registration: shuqi.dtwave.com
The scheduling features of Digital Cloud are as follows:
Adaptive scheduling for more than 20 data sources: MySQL, Oracle, Hive, HBase, Redis, MongoDB, ODPS, PostgreSQL, Elasticsearch, API, etc.
Modular and pluggable plug-in mechanism
Support for visual workflow configuration
Support for task alarms by email, phone call, and SMS
Multiple scheduling modes: normal scheduling, dry run, and paused scheduling
Support for task priority configuration
Simple scheduling-cycle configuration: just a few mouse clicks
Support for assembling workflows out of other workflows
Support for workflow test runs
In the workflow view you can inspect the code, view run logs, rerun a task, mark it successful and rerun downstream tasks, rerun downstream tasks, and so on
Quick location of failed tasks
(Comparison of Oozie, Azkaban, and Digital Cloud features)
III. A quick summary
Apache Oozie is a heavyweight task scheduling system with comprehensive functionality, but it is troublesome to deploy and configure, and the jump from crontab to Oozie is steep. Azkaban sits somewhere between Oozie and crontab, but it is less robust than Oozie when failures occur: Azkaban loses the state of all running workflows, while Oozie can pick up where it left off. Compared with these two tools, Digital Cloud removes the burden of complex configuration and deployment, is easy to extend, and offers additional workflow features that make development and operations easier.
(Advantages of the Digital Cloud product)
Of course, Digital Cloud is not just a full-featured workflow scheduling tool. As a one-stop big data platform it also covers many more functions, whether you are doing simple ETL work or building a complex data platform. The basic edition is free forever, and whatever problem you run into, customer support is there to help: a tool whose experience beats raw open-source products a hundred times over. Are you sure you don't want to give it a try?
For more details, please click the link to learn more: dtcloud.dtwave.com