AirFlow introduction


I. What is AirFlow?

Airflow is a platform for orchestrating, scheduling, and monitoring workflows. It was open-sourced by Airbnb and is now incubated at the Apache Software Foundation. Airflow organizes a workflow as a DAG (directed acyclic graph) composed of tasks, and its scheduler executes those tasks on a set of workers according to the specified dependencies. At the same time, Airflow provides rich command-line tools and an easy-to-use web interface for viewing and operating on workflows, along with a monitoring and alerting system.

Airflow's scheduling is expressed with crontab syntax and can take the place of the crontab command. Compared with crontab, Airflow lets you see intuitively how tasks are executing and what the logical dependencies between them are, set email reminders when a task fails, and view task execution logs.

By contrast, managing jobs with crontab has the following disadvantages:

1. With many tasks being scheduled and executed, it is difficult to sort out the dependencies between them.

2. It is not easy to see which task is currently executing.

3. When a task fails, it is not easy to view the execution log, i.e. it is inconvenient to locate the failing task and the cause of the error.

4. It is not easy to see the start and end times of each task execution within the scheduling flow, which matters a great deal for optimizing task jobs.

5. It is not easy to keep records of historical scheduled task executions, which matters for optimizing jobs and troubleshooting errors.

1. Analysis of advantages and disadvantages

Without Airflow: scheduling code has to be added by hand and is complex to debug; the result is single-purpose and lacks overall scheduling capability. There is no graphical view of the overall schedule, which makes adding new tasks, troubleshooting, and similar operations difficult, especially when tasks are numerous and the structure is complex. Real-time task monitoring code must be written to report task status. Most management and inspection of tasks has to be done in code or on the command line, which is not efficient. Scheduling must be separated from business code manually.

With Airflow: the framework handles scheduling; it is easy to use, more stable, and full-featured. Built-in tree and graph views clearly show the topological structure of tasks, which makes managing and viewing tasks convenient. The real-time status of tasks is reported back to the web interface. Common operations are converted into a graphical interface, which is efficient and clear. Scheduling is separated from business code, reducing coupling and making operations and iteration easier.

In addition to the above advantages, a disadvantage in engineering practice is that distributed deployment is troublesome and error-prone.

II. Jobs and tasks in AirFlow

1. DAG

Summary: DAG (Directed Acyclic Graph) is a directed acyclic graph. In Airflow, a DAG defines a complete job, and all Tasks within the same DAG share the same scheduling time.

Parameters:

dag_id: uniquely identifies the DAG, which makes later management easier.
default_args: default parameters. If a parameter is not configured on a job in the current DAG instance, the corresponding value from the DAG instance's default_args is used.
schedule_interval: configures the execution cycle of the DAG, using crontab syntax.
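As a rough illustration, a DAG with these parameters might be declared as follows (a minimal sketch: the dag_id, dates, and schedule are made up for the example, not taken from this article):

from datetime import datetime, timedelta

from airflow import DAG

# Default parameters applied to every task in this DAG unless overridden.
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_dag",              # uniquely identifies the DAG
    default_args=default_args,         # fallback parameters for tasks
    start_date=datetime(2019, 6, 2),   # scheduling starts after this time
    schedule_interval="0 12 * * *",    # crontab syntax: every day at 12:00
)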

2. Task

Summary: a Task is a concrete job step and depends on a DAG, i.e. it must live inside one. Dependencies between Tasks are configured within the DAG (cross-DAG dependencies are also possible, but not recommended: they make DAG graphs less intuitive and cause problems for dependency management).

Parameters:

dag: pass a DAG instance so that the job belongs to that DAG.
task_id: an identifier (name) for the task, which makes later management easier.
owner: the owner of the task, again for easier management.
start_date: the start time of the task, i.e. the task will only be scheduled after this point in time.
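A minimal sketch of creating a task with these parameters (the task_id and command are illustrative, and dag refers to the DAG defined in the earlier sketch):

from airflow.operators.bash_operator import BashOperator

t1 = BashOperator(
    task_id="print_date",   # identifier (name) for the task
    bash_command="date",
    owner="airflow",        # owner of the task
    dag=dag,                # attach the task to its DAG
)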

III. Scheduling time in AirFlow

1. start_date

In the configuration, this is the time from which the job starts to be scheduled. When speaking of execution, it is the scheduling start time.

2. schedule_interval

The execution cycle of the schedule.

3. execution_date

The execution time. It is called the execution time in Airflow, but it is not the actual time the job runs.

[Key point]

So, the first scheduling works like this: take the first point in time at or after the job's configured start_date that satisfies schedule_interval. The recorded execution_date is that point in time, and the first run itself is triggered one schedule_interval later.

[Example]

Suppose we configure a job with a start_date of June 2, 2019 and a schedule_interval of 12:00 daily. The first run will then be triggered at 12:00 on June 3, 2019. So execution_date does not literally mean the time the job executes as scheduled; the actual execution time is the next point in time after the recorded execution_date that satisfies schedule_interval.
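A small sketch of this timing, using the hypothetical dates from the example above:

from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id="timing_example",            # illustrative name
    start_date=datetime(2019, 6, 2),
    schedule_interval="0 12 * * *",     # every day at 12:00
)
# First run actually starts at:   2019-06-03 12:00
# execution_date of that run:     2019-06-02 12:00
# i.e. each run fires once the schedule interval it covers has ended.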

IV. The core concepts of AirFlow

DAGs: a directed acyclic graph (Directed Acyclic Graph) that organizes all the tasks that need to be run according to their dependencies and describes the order in which all tasks execute.

Operators: Airflow has many built-in operators, such as BashOperator for executing a bash command, PythonOperator for calling an arbitrary Python function, EmailOperator for sending mail, HTTPOperator for sending HTTP requests, and SqlOperator for executing SQL commands. Users can also define their own custom Operators, which is very convenient. An Operator can be understood as a class Airflow provides for an operation the user needs to perform.

Tasks: a Task is an instance of an Operator.

Task Instance: because a Task is scheduled repeatedly, each run of the task is a distinct task instance. A task instance has its own state, including "running", "success", "failed", "skipped", "up for retry", and so on.

Task Relationships: there can be dependencies between different Tasks in a DAG, as the sketch below shows.
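A minimal sketch tying these concepts together: two built-in operators instantiated as tasks, with a dependency declared between them (the task ids and the Python function are made up for the example; dag is a DAG instance as in the earlier sketches):

from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def greet():
    # Arbitrary Python function invoked by PythonOperator.
    print("hello from airflow")

t1 = BashOperator(task_id="extract", bash_command="echo extracting", dag=dag)
t2 = PythonOperator(task_id="transform", python_callable=greet, dag=dag)

# Task relationship: t2 runs only after t1 succeeds.
t1 >> t2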

V. The components of AirFlow

1. Webserver

Provides the web service, and periodically spawns child processes to scan the DAGs in the configured directory and update the database.

Webserver provides the following features:

Abort, resume, and trigger tasks.
Monitor running tasks and resume interrupted tasks from the breakpoint.
Execute ad-hoc commands or SQL statements to query task status, logs, and other details.
Configure connections, including but not limited to databases, SSH, and more.

The webserver daemon uses a Gunicorn server (the rough equivalent of Tomcat in the Java world) to handle concurrent requests, and you can control the number of processes handling concurrent requests by modifying the workers value in the {AIRFLOW_HOME}/airflow.cfg file.

For example:

workers = 4  # start 4 gunicorn workers (processes) to handle web requests

2. Scheduler

The task scheduling service. It periodically polls the tasks' schedules to decide whether to trigger execution, generates tasks from the DAGs, and submits them to the message middleware queue (Redis or RabbitMQ).

3. Celery worker

Distributed across different machines, Celery workers are the nodes that actually execute tasks. They pick up tasks by listening to the message middleware (Redis or RabbitMQ).

The worker daemon needs to be started only when Airflow's executor is set to CeleryExecutor. CeleryExecutor is recommended for production environments:

executor = CeleryExecutor

4. Flower

Monitors whether worker processes are alive, starts or stops worker processes, and shows the running tasks.

The default port is 5555; you can enter http://hostip:5555 in the browser address bar to access Flower and monitor the Celery message queue.
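As a quick illustration, the four daemons above are started with the Airflow 1.x command line roughly like this (8080 is the default web port; adjust as needed):

airflow webserver -p 8080   # web UI
airflow scheduler           # task scheduling service
airflow worker              # celery worker; only with executor = CeleryExecutor
airflow flower              # flower monitoring UI on port 5555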

VI. AirFlow and the concept of ETL (data warehouse technology)

ETL is the abbreviation of Extract-Transform-Load and describes the process of extracting, transforming, and loading data from a source to a destination. The term ETL is most commonly used for data warehouses, but its object is not limited to data warehouses.

Airflow was designed to handle ETL tasks well, but its excellent design can be used to solve all kinds of task dependency problems.

1. What is task dependency?

Usually, in a large system such as an operations and maintenance system, a data analysis system, or a test system, we have all kinds of dependency requirements.

For example:

Time dependency: a task needs to wait for a certain point in time to be triggered.
External system dependency: a task depends on data in MySQL, data in HDFS, and so on; these different external systems must be accessed by calling their interfaces.
Machine dependency: a task can only run in the environment of a particular machine, perhaps because it has more memory, or because only that machine has a special library file.
Inter-task dependency: task A must start only after task B completes, and the two tasks affect each other.
Resource dependency: some tasks consume a lot of resources, so tasks using the same resource need to be limited. For example, if one data-conversion task takes 10 GB and the machine has 30 GB in total, at most two such tasks can run at once; similar tasks should queue up.
Permission dependency: a certain task can only be started by a user with certain permissions.

You may think these belong inside the task program's own logic, but I believe this logic can be abstracted as task control logic and decoupled from the actual task execution logic.

2. How to understand Crontab

Now let's take a look at the most commonly used dependency management system: crontab.

In all kinds of systems there are always some scheduled tasks to handle, and whenever that is the case, the first thing we think of is crontab.

Indeed, crontab handles the need to execute tasks on a regular schedule very well. But for crontab, executing a task is simply calling a program; whatever logic lives inside the program is outside crontab's jurisdiction (it follows KISS well).

So we can think of it abstractly:

crontab is a dependency management system that manages only time dependencies.

3. The way Airflow handles dependencies

Airflow's core concept is the DAG (directed acyclic graph). A DAG consists of one or more Tasks, and it solves the inter-task dependency described above: Task B can execute only after Task A completes, and the dependency relationships among multiple Tasks can be expressed and refined well by the DAG.

Airflow fully supports crontab expressions, and times can also be expressed directly with Python's datetime, with time differences expressed by datetime's timedelta. This solves the time dependency of tasks, as the sketch below shows.
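A small sketch of the two ways of expressing a schedule mentioned above (the dag ids and dates are illustrative):

from datetime import datetime, timedelta

from airflow import DAG

# crontab expression: every day at 12:00
dag_a = DAG(dag_id="cron_style", start_date=datetime(2019, 6, 2),
            schedule_interval="0 12 * * *")

# timedelta: every 6 hours
dag_b = DAG(dag_id="delta_style", start_date=datetime(2019, 6, 2),
            schedule_interval=timedelta(hours=6))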

Under CeleryExecutor, Airflow can launch Workers as different users, and different Workers can listen on different Queues, which solves the user permission dependency. Workers can also be started on several different machines, which solves the machine dependency. A sketch follows.
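A minimal sketch (the queue name and command are illustrative) of pinning a task to the workers that listen on a particular queue:

from airflow.operators.bash_operator import BashOperator

t_gpu = BashOperator(
    task_id="train_model",
    bash_command="python train.py",
    queue="gpu_queue",   # served only by workers started with: airflow worker -q gpu_queue
    dag=dag,             # a DAG instance as in the earlier sketches
)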

Airflow can assign any Task to an abstract Pool, and each Pool can specify a number of Slots. Every running Task occupies one Slot; when all Slots are taken, the remaining tasks wait. This solves the resource dependency, as sketched below.
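A minimal sketch (the pool name is illustrative; the pool itself and its slot count are created beforehand in the web UI or CLI) of limiting concurrency through a Pool:

from airflow.operators.bash_operator import BashOperator

t_heavy = BashOperator(
    task_id="heavy_conversion",
    bash_command="bash run_conversion.sh",
    pool="conversion_pool",  # e.g. a pool with 2 slots: at most 2 such tasks run at once
    dag=dag,
)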

Airflow also has a Hook mechanism (personally, I don't think it should be called Hook) for establishing connections with external data systems such as MySQL, HDFS, or the local file system (the file system is also considered an external system). By extending the Hook interface, any external system can be connected, which solves the external system dependency. A sketch follows.
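A minimal sketch of using a built-in hook (the connection id, table, and query are illustrative; the connection is configured beforehand under Airflow's Connections):

from airflow.hooks.mysql_hook import MySqlHook

def count_rows():
    # Connects with the credentials stored under the given connection id.
    hook = MySqlHook(mysql_conn_id="my_mysql_conn")
    return hook.get_records("SELECT COUNT(*) FROM some_table")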

VII. The scheduling modes of AirFlow

Architecture charts of the different Executors. Airflow can execute tasks in several ways, including SequentialExecutor, LocalExecutor, and CeleryExecutor, of which LocalExecutor and CeleryExecutor are the most commonly used. The architectures of the three execution modes are described below:

1. System architecture based on CeleryExecutor.

This uses a Celery-style system architecture (the officially recommended mode; Mesos deployment is also supported). In the original architecture diagram, Turing is an external system and the GDags service helps assemble DAGs; both can be ignored here.

1. The master node's web UI manages DAGs, logs, and other information. The Scheduler is responsible for scheduling and supports only a single node; starting the Scheduler on multiple nodes may cause failures.

2. Workers are responsible for executing the tasks in a specific DAG, which allows different tasks to be executed in different environments.

2. System architecture diagram based on LocalExecutor.

Another way to run Airflow is to assign each DAG to a single machine for execution. If tasks are not complex and the task environment is the same everywhere, this mode makes capacity expansion and management convenient, and there is no master single point of failure.

3. System architecture diagram based on SequentialExecutor.

SequentialExecutor executes tasks sequentially in a single process and is usually used only for testing.

But sometimes configuring job dependencies and scheduling execution cycles alone cannot meet some complex requirements.

4. Other task scheduling modes

1) Skip DAG Runs that are no longer the latest (the job fails and recovers only after some time has passed).

2) When a DAG Run is already executing, skip the current DAG Run (the job's execution time is so long that it overruns into the next job's start).

3) Alternatives to Sensor. (Airflow has a class of Operators called Sensors. A Sensor detects whether a pre-set condition has been met; once it has, the Sensor job becomes Success so that downstream jobs can execute. The disadvantage is that if the upstream job runs for 3 hours, the Sensor occupies a worker for those 3 hours without releasing it, wasting resources. A sketch follows.)
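A minimal sketch of the Sensor pattern described above, using the built-in ExternalTaskSensor from Airflow 1.x (the dag and task ids are illustrative):

from airflow.operators.sensors import ExternalTaskSensor

wait_upstream = ExternalTaskSensor(
    task_id="wait_for_upstream",
    external_dag_id="upstream_dag",   # the DAG we depend on
    external_task_id="final_task",    # the upstream task whose success we wait for
    poke_interval=60,                 # poll once a minute; occupies a worker while waiting
    dag=dag,
)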

VIII. The original structure of AirFlow
