Introduction to Airflow
Airflow is a workflow engine open-sourced by Airbnb, with more than 2,000 stars on GitHub. It is written in Python and supports both Python 2 and Python 3. Traditional workflow systems usually define DAGs in text files (JSON, XML, etc.) and have a scheduler parse those files into concrete task objects. Airflow takes a different approach: DAGs are written directly in Python, which removes the expressive limitations of text formats and makes DAG definitions simple. Beyond that, Airflow's permission design, rate-limiting design, and Hook/Plugin design are all interesting, with good functionality and extensibility. That said, the code quality of the project is mediocre: in many places function names do not match their implementations, which makes the code hard to follow, and there are many flags and duplicated definitions, apparently the result of an imperfect initial design and later hacks that were never refactored. On the whole, though, the system scores well on readability and extensibility.
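To make the "DAGs as Python code" point concrete, here is a minimal sketch of a DAG definition in the Airflow 1.x style; the DAG id, schedule, and bash commands are illustrative placeholders, not from the original article.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# The DAG and its dependencies are plain Python objects,
# not a JSON/XML file that a scheduler must parse.
dag = DAG(
    dag_id="example_dag",              # illustrative name
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> load  # set the dependency: extract runs before load
```

Because the definition is ordinary Python, dependencies can be built with loops, conditionals, or any other language feature, which is exactly the expressiveness a static text format lacks.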
Webserver UI of Airflow
DAGs
The On/Off toggle on the left controls the running state of a DAG: Off means paused and On means running. Note: all DAGs are in the Off state when they are first deployed.
If a DAG name is not clickable, the DAG may have been deleted or not yet loaded. If it has not been loaded, click the refresh button on the right. Note: because several webservers may be deployed, a single refresh may not flush every webserver's cache, so you may need to refresh several times.
Recent Tasks displays the status of the Task Instances (which can be understood as execution records of individual jobs) in the most recent DAG Run (which can be understood as an execution record of the whole DAG). If the most recent DAG Run is still running, it shows the Task Instance statuses of both the most recently completed DAG Run and the running one.
Last Run displays the most recent execution date. Note: the execution date is not the actual time the run started; the details are covered in the DAG configuration discussion below. Hover over the info icon to the right of the execution date to see the start date, which is the actual run time. The start date generally corresponds to the next scheduled time after the execution date, i.e. one schedule interval later.
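As a hedged illustration of the execution date vs. start date distinction (the dates and schedule here are made up for the example):

```python
from datetime import datetime, timedelta

schedule_interval = timedelta(days=1)      # a @daily schedule
execution_date = datetime(2019, 1, 1)      # the label shown as Last Run
earliest_start = execution_date + schedule_interval

# The run labeled 2019-01-01 covers that day's interval, so it
# actually starts once the interval has ended, i.e. shortly
# after 2019-01-02 00:00 -- which is what "start date" shows.
print(earliest_start)  # 2019-01-02 00:00:00
```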
Job action box
In both the tree view and the graph view of a DAG, you can click a Task Instance to pop up its modal box and perform Task Instance operations. Note: the selected Task Instance belongs to the corresponding DAG Run.
There is a funnel (filter) symbol to the right of the job name; when clicked, the DAG view shows only that job and the jobs it depends on. This feature is very helpful when the DAG is large.
Task Instance Details displays the details of the Task Instance, from which you can determine its current state and why it is in that state. For example, if a Task Instance stays in the no status state and does not enter the queued or running state for a long time, the Dependency and Reason fields in Task Instance Details explain why.
Rendered displays the command after the Task Instance's template has been rendered.
The Run instruction can execute the current job directly.
The Clear instruction clears the current Task Instance's state, and clearing any Task Instance sets the current DAG Run's state back to running. Note: if the cleared Task Instance is in the running state, Airflow attempts to kill the command the Task Instance is executing; the instance enters the shutdown state and is marked failed once the kill completes (or up_for_retry if its retries are not used up). Clear has five additional options, all of which can be combined, from left to right:
Past: also clears the corresponding Task Instance in all past DAG Runs.
Future: also clears the corresponding Task Instance in all future DAG Runs. Note: only Task Instances in DAG Runs that have already been generated are cleared.
Upstream: also clears all Task Instances upstream of this one in the same DAG Run.
Downstream: also clears all Task Instances downstream of this one in the same DAG Run.
Recursive: when this Task Instance is a sub-DAG, recursively clears all Task Instances in that sub-DAG. Note: this option is ignored if the Task Instance is not a sub-DAG.
The Mark Success instruction marks the current Task Instance's state as success. Note: if the Task Instance is in the running state, Airflow attempts to kill the command it is executing; the instance enters the shutdown state and is marked failed once the kill completes (or up_for_retry if its retries are not used up).
Tree View
A tree representation of the DAG across time. If the pipeline is delayed, you can quickly see where the wrong step occurred and identify blocked processes.
Graph View
The graph view is probably the most comprehensive representation. It visualizes your DAG's dependencies and the current state of a specific run.
Task Duration
The durations of different tasks over the past N runs. From this view you can spot outliers and quickly understand how long the DAG takes across multiple runs.
Gantt Chart
The Gantt chart lets you analyze task duration and overlap. You can quickly identify system bottlenecks and which specific DAGs spend a lot of time running.
Code View
Transparency is everything. Although your pipeline code is under source control, this is a quick way to view the DAG's code and get more context.
Task Instance Context Menu
From any of the pages above (tree view, graph view, Gantt chart, etc.), you can always click a task instance to open this rich context menu, which takes you to more detailed metadata and lets you perform actions.
View the log
Task Instances lists all task instances; Logs records the operations performed on all DAGs.
Connections
Connection information for external systems is stored in the Airflow metadata database and managed in the UI (Menu -> Admin -> Connections). A conn_id is defined there so that connection details never need to be hard-coded anywhere.
Multiple connections with the same conn_id can be defined; in that case, when a hook obtains the connection via BaseHook's get_connection method, Airflow picks one at random, which allows basic load balancing and fault tolerance when combined with retries.
Airflow can also reference connections through environment variables of the operating system, but only in URI format. If you need to attach extra information to a connection, use the Web UI.
If a connection with the same conn_id is defined in both the Airflow metadata database and an environment variable, Airflow references only the one in the environment variable (for example, given conn_id postgres_master, Airflow first looks for AIRFLOW_CONN_POSTGRES_MASTER in the environment and uses it directly, before searching the metadata database).
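A minimal sketch of the environment variable form, assuming an illustrative Postgres URI (host, credentials, and database name are placeholders):

```python
import os

# Connections supplied via environment variables use the naming
# convention AIRFLOW_CONN_<CONN_ID in upper case> and must be a URI.
# The values below are placeholders, not a real server.
os.environ["AIRFLOW_CONN_POSTGRES_MASTER"] = (
    "postgres://user:password@db.example.com:5432/mydb"
)
```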
Many hooks have a default conn_id, so an Operator that uses such a hook does not need to supply an explicit connection ID. For example, the default conn_id of PostgresHook is postgres_default.
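As a hedged sketch in the Airflow 1.x style (the import path is version-dependent and the SQL is a placeholder; a matching connection must already be configured for this to run):

```python
from airflow.hooks.postgres_hook import PostgresHook

# Uses the default conn_id "postgres_default" unless one is given.
hook = PostgresHook()  # or PostgresHook(postgres_conn_id="postgres_master")

# Placeholder query: fetch rows through the resolved connection.
rows = hook.get_records("SELECT 1")
```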
XComs
XComs let tasks exchange messages, allowing more nuanced forms of control and shared state. The name is an abbreviation of "cross-communication". XComs are principally defined by a key, a value, and a timestamp, but they also track the task/DAG that created them and the attributes governing when they should be visible. Any object that can be pickled can be used as an XCom value, so users should make sure the objects are of an appropriate size.
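A minimal sketch of pushing and pulling an XCom with PythonOperator in the Airflow 1.x style (the task ids, key, and value are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

dag = DAG("xcom_example", start_date=datetime(2019, 1, 1),
          schedule_interval=None)

def push(**context):
    # Push a small, pickle-able value under an explicit key.
    context["ti"].xcom_push(key="row_count", value=42)

def pull(**context):
    # Pull the value pushed by the upstream task.
    count = context["ti"].xcom_pull(task_ids="push_task", key="row_count")
    print("row_count =", count)

push_task = PythonOperator(task_id="push_task", python_callable=push,
                           provide_context=True, dag=dag)
pull_task = PythonOperator(task_id="pull_task", python_callable=pull,
                           provide_context=True, dag=dag)

push_task >> pull_task
```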
Variables
Variables are a generic way to store and retrieve arbitrary content or settings as simple key/value pairs in Airflow. Variables can be listed, created, updated, and deleted from the UI (Admin -> Variables), from code, or from the CLI. In addition, JSON settings files can be bulk-uploaded through the UI. While pipeline code and most constants should be defined in code and kept under source control, it is useful to be able to access and modify certain variables or configuration items through the UI.
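A short sketch of the Variable API (key names and values are placeholders):

```python
from airflow.models import Variable

# Store and retrieve a simple string value.
Variable.set("data_path", "/tmp/data")
path = Variable.get("data_path")

# JSON values can be serialized on write and deserialized on read.
Variable.set("job_config", {"retries": 3}, serialize_json=True)
config = Variable.get("job_config", deserialize_json=True)
```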