This article shares how to use the task scheduling tool Airflow. It is very practical, so it is shared here in the hope that you will get something out of it after reading.
Airflow is an Apache incubator project: a very elegant open source scheduling platform written in pure Python. With 8,971 stars on GitHub, it is a very popular scheduling tool. Airflow uses a DAG (directed acyclic graph) to define a workflow, which makes configuring job dependencies very convenient; it is no exaggeration to say it is more convenient than other task scheduling tools.
Airflow has the following natural advantages:
1. Flexible and easy to use. Airflow itself is written in Python, and workflows are defined in Python too. With Python's glue-language nature there is no task that cannot be scheduled, and with open source code there is no problem that cannot be solved: you can modify the source to meet personalized needs, and, more importantly, the code is human-readable.
2. Powerful. It ships with Operators for 15 different job types out of the box (shell scripts, Python, MySQL, Oracle, Hive and so on), covering both traditional database platforms and big data platforms. Anything the official Operators do not cover, you can implement with your own custom Operators.
3. Elegant. Job definitions are simple and clear, and the Jinja template engine makes it easy to parameterize script commands (a short sketch follows this list). The web interface is also very readable; whoever uses it knows.
4. Very easy to extend. It provides base classes for extension and several executors to choose from. Among them, CeleryExecutor uses a message queue to orchestrate multiple worker nodes; deploy workers in a distributed fashion and Airflow can scale almost without limit.
5. A rich set of command-line tools. You can test, deploy, run, clear, rerun and gather statistics on tasks by typing commands in a terminal, without even opening a browser. When you think about how many clicks it takes to deploy a small job through a UI, Airflow really is friendly.
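To make the Jinja templating in point 3 concrete, here is a minimal sketch of a parameterized BashOperator, written against the Airflow 1.x module layout; the DAG name, schedule and command are made-up examples:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('templating_demo', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

# {{ ds }} is rendered by the Jinja engine into the run's execution date (YYYY-MM-DD),
# so a single job definition is parameterized per run
report = BashOperator(
    task_id='daily_report',
    bash_command='echo "building report for {{ ds }}"',
    dag=dag,
)

Tying in with point 5, a command such as airflow test templating_demo daily_report 2018-01-02 renders and runs this one task from the terminal without involving the scheduler.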
Airflow is free. We can put routine checks, scheduled scripts (the kind otherwise run from crontab), ETL processing, monitoring and other tasks under Airflow for centralized management, and often do not even have to write monitoring scripts: when a job fails, its log is automatically emailed to the designated people (sketched below), which solves production problems at low cost and high efficiency. However, Chinese documentation is scarce and mostly not comprehensive enough, so it is not very easy to get started quickly. You need some knowledge of Python, you should read the official documentation repeatedly, and you need to understand the scheduling principles. This series of posts goes from shallow to deep, gradually getting more detailed, trying to lift the veil of Airflow for you.
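A minimal sketch of the failure notification just described, assuming SMTP is already configured in airflow.cfg; the DAG name, recipients and retry settings are illustrative:

from datetime import datetime, timedelta
from airflow import DAG

# default_args are inherited by every task in the DAG
default_args = {
    'owner': 'etl',
    'email': ['oncall@example.com'],    # hypothetical recipients
    'email_on_failure': True,           # send an email when a task fails
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('nightly_etl', default_args=default_args,
          start_date=datetime(2018, 1, 1), schedule_interval='0 2 * * *')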
Components
From a user's point of view, scheduling work has the following functions:
1. System configuration ($AIRFLOW_HOME/airflow.cfg)
2. Job management ($AIRFLOW_HOME/dags/xxxx.py)
3. Run monitoring (webserver)
4. Alerting (email or SMS)
5. Log viewing (webserver or $AIRFLOW_HOME/logs/*)
6. Batch run time analysis (webserver)
7. Background scheduling service (scheduler)
Except for SMS, Airflow covers all of the functions above, and we can even configure a database connection in the Airflow webserver and write SQL queries against it for more flexible statistical analysis.
In addition to the above components, we also need to know some concepts
DAG
Linux's crontab and Windows's Task Scheduler can configure scheduled tasks or interval tasks, but they cannot configure dependencies between jobs. In Airflow, the DAG is what manages job dependencies; DAG stands for directed acyclic graph. Figure 1 below is a simple DAG.
Figure 1: DAG example
In Airflow this kind of DAG is implemented by writing Python code, and writing a DAG is very simple. The official project provides many examples; after installation, start the webserver and you can view the source of the sample DAGs (which are really just Python programs that define DAG objects), and with a few small changes one of them becomes your own DAG. The dependencies of the DAG in Figure 1 above can be declared in just a few lines of code.
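The original snippet is not reproduced here, so the following is a minimal sketch of how such dependencies are usually declared with the >> operator; the task names and structure are illustrative assumptions rather than a copy of Figure 1:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

dag = DAG('figure1_style_dag', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

extract = DummyOperator(task_id='extract', dag=dag)
clean = DummyOperator(task_id='clean', dag=dag)
load = DummyOperator(task_id='load', dag=dag)

# extract runs first, then clean, then load
extract >> clean >> load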
Very concise, and very readable.
Operator-Operators
A DAG defines a workflow, and Operators define the jobs that are actually executed. Airflow provides many Operators for specifying the jobs we need to perform:
BashOperator - executes a bash command or script.
SSHOperator - executes bash commands or scripts on a remote host (same principle as the paramiko module).
PythonOperator - executes a Python function.
EmailOperator - sends an email.
HTTPOperator - sends an HTTP request.
MySqlOperator, SqliteOperator, PostgresOperator, MsSqlOperator, OracleOperator, JdbcOperator, etc. - execute SQL tasks.
DockerOperator, HiveOperator, S3FileTransformOperator, PrestoToMysqlOperator, SlackOperator and more; you get the idea.
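As a small sketch of how one of these Operators is used (Airflow 1.x import paths; the DAG and function below are made up):

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def greet(name, **context):
    # provide_context=True passes run information such as the execution date ('ds')
    print('hello %s, run for %s' % (name, context['ds']))

dag = DAG('python_operator_demo', start_date=datetime(2018, 1, 1), schedule_interval='@daily')

say_hello = PythonOperator(
    task_id='say_hello',
    python_callable=greet,
    op_kwargs={'name': 'airflow'},
    provide_context=True,
    dag=dag,
)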
Beyond the Operators above, it is also easy to write custom Operators to meet the needs of more specialized tasks.
How to use these Operators will be covered in later posts; please stay tuned.
Time zone-timezone
Versions of Airflow before 1.9 use the local time zone to define a task's start date, and the timing of the crontab expression in schedule_interval is also based on the local time zone. Airflow 1.9 and later default to the UTC time zone, so that Airflow's scheduling is independent of where it runs and the confusion caused by different machines using different time zones is avoided. If your scheduled tasks are concentrated in one time zone, or run on different machines that share the same time zone, you either need to convert the task start time and the cron expression to UTC, or use the local time zone directly. The stable 1.9 release does not yet support time zone configuration; later versions add it to meet the need to use local time zones.
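A minimal sketch of the conversion just described, assuming the local time zone is UTC+8 (Beijing) with no daylight saving; the DAG name is illustrative:

from datetime import datetime, timedelta
from airflow import DAG

# The job should first run at 2018-01-01 08:00 local (UTC+8) time.
# Under the UTC default of airflow 1.9 it must be declared as 00:00 UTC.
local_start = datetime(2018, 1, 1, 8, 0)
utc_start = local_start - timedelta(hours=8)

# '0 0 * * *' fires at 00:00 UTC each day, i.e. 08:00 Beijing time
dag = DAG('tz_demo', start_date=utc_start, schedule_interval='0 0 * * *')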
Web Server-webserver
The webserver is Airflow's user interface. It displays the DAG views, controls starting and stopping jobs, clears job state for reruns, shows run statistics, shows logs, manages users and data connections, and so on. Not running the webserver does not affect the scheduling of Airflow jobs.
Scheduler-scheduler
The scheduler is responsible for reading the DAG files and computing their schedule times. When a trigger condition is met, it hands an instance to the executor to run the corresponding job. The scheduler must run continuously; if it is not running, no batches are run.
Work Node-worker
When the executor is CeleryExecutor, you also need to start one or more workers.
Executor-executor
Executors include SequentialExecutor, LocalExecutor and CeleryExecutor.
SequentialExecutor is a sequential executor and uses SQLite as the metadata database by default. Because of SQLite, it does not support concurrent execution between tasks; it is typically used in test environments and needs no additional configuration. LocalExecutor is a local executor; it cannot use SQLite as the metadata database, but it can use mainstream databases such as MySQL, PostgreSQL, DB2 or Oracle. It supports concurrent execution between tasks, is often used in production environments, and requires configuring the database connection URL.
CeleryExecutor is a Celery executor, so Celery must be installed. Celery is a distributed asynchronous task scheduling tool based on message queues, and it requires additionally starting worker nodes. Use CeleryExecutor to run jobs on remote nodes.
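As a rough sketch of the corresponding configuration (key names as in the Airflow 1.x airflow.cfg; the MySQL connection string is a made-up example), the executor and the metadata database connection are set in the [core] section, and CeleryExecutor additionally needs the message broker configured in the [celery] section:

[core]
executor = LocalExecutor
sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow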
Today's content is summarized in a mind map (not reproduced here).
The above is how to use the task scheduling tool Airflow. Some of these points are things you may see or use in everyday work; I hope you can learn more from this article.