Introduction to Airflow
Airflow is a workflow project open-sourced by Airbnb, with more than 2,000 stars on GitHub. It is written in Python and supports both Python 2 and Python 3. Traditional workflow tools usually define DAGs in text files (JSON, XML, etc.), which a scheduler then parses into concrete task objects. Airflow takes a different approach: DAGs are defined directly in Python code, which removes the expressiveness limits of text files and makes DAG definitions much simpler. In addition, Airflow's permission design, rate-limiting design, and Hook/Plugin design are all interesting, with good functionality and extensibility. The code quality of the project is only average, though: in many places function names do not match their implementations, which makes the code hard to follow, and there are many flags and duplicated definitions, apparently the result of early design shortcuts that were never refactored. On the whole, however, the readability and extensibility of the system are good.
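Since DAGs are plain Python files, here is a minimal sketch of what such a definition looks like. The dag_id, schedule, and bash commands below are illustrative assumptions, not something used later in this article; the imports follow the 1.10-era API installed below.

```python
# A minimal illustrative DAG; dag_id, schedule_interval and the shell commands
# are example values, not part of the installation that follows.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_hello",
    default_args=default_args,
    start_date=datetime(2019, 12, 1),
    schedule_interval="@daily",
)

# Two simple shell tasks chained with the >> dependency operator.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)

print_date >> say_hello  # say_hello runs after print_date succeeds
```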
Installation of Airflow-related software
Python 3.6.5 installation
Install dependent programs
[root@node01 ~]# yum -y install zlib zlib-devel bzip2 bzip2-devel ncurses ncurses-devel readline readline-devel openssl openssl-devel openssl-static xz lzma xz-devel sqlite sqlite-devel gdbm gdbm-devel tk tk-devel gcc
Download Python
You can browse https://www.python.org/ftp/python/ to see the available Python versions; here we install Python-3.6.5.tgz. Download the Python source package with the following command:
[root@node01 ~]# wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tgz
Extract the Python source package
[root@node01 ~]# tar -zxvf Python-3.6.5.tgz
[root@node01 ~]# cd Python-3.6.5
Install Python
[root@node01 Python-3.6.5]# ./configure --prefix=/usr/local/python3
[root@node01 Python-3.6.5]# make
[root@node01 Python-3.6.5]# make install
[root@node01 Python-3.6.5]# mv /usr/bin/python /usr/bin/python2                      # back up the original python executable
[root@node01 Python-3.6.5]# ln -s /usr/local/python3/bin/python3 /usr/bin/python     # link the new python3 binary into the command path
[root@node01 Python-3.6.5]# ln -s /usr/local/python3/bin/pip3 /usr/bin/pip           # link pip3 into the command path
View version information
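The original does not list a command for this step; as a small added check (not from the source), the version can also be confirmed from the interpreter itself:

```python
# Run this with the newly linked interpreter (e.g. /usr/bin/python).
import sys

print(sys.version)      # expected to start with "3.6.5"
print(sys.executable)   # the path the interpreter was launched from, e.g. /usr/bin/python
```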
Modify yum-related configuration files
Change the #!/usr/bin/python at the beginning of the following two files to #!/usr/bin/python2.7:
[root@node01 Python-3.6.5]# vim /usr/bin/yum
[root@node01 Python-3.6.5]# vim /usr/libexec/urlgrabber-ext-down
Pip3 installation
Python 3.6.5 already ships with pip3, so no additional installation is required.
If you are on Python 2.x, you need to install pip separately.
MySQL 5.7.28 installation
Extract the MySQL archive (the file is over 600 MB; the binary "no-install" tarball can be downloaded directly from the official website)
[root@node01 ~]# tar zxvf mysql-5.7.28-linux-glibc2.12-x86_64.tar.gz
[root@node01 ~]# mv mysql-5.7.28-linux-glibc2.12-x86_64 /opt/mysql
[root@node01 ~]# cd /opt/mysql/
Create users and groups needed for MySQL to run
[root@node01 mysql]# groupadd mysql
[root@node01 mysql]# useradd -M -s /sbin/nologin -g mysql mysql
Create a directory to store data
[root@node01 mysql]# mkdir data
[root@node01 mysql]# chown -R mysql:mysql *
Write a configuration file
[root@node01 mysql]# vim /etc/my.cnf
[client]
port = 3306
socket = /opt/mysql/data/mysql.sock

[mysqld]
port = 3306
socket = /opt/mysql/data/mysql.sock
basedir = /opt/mysql
datadir = /opt/mysql/data
user = mysql
bind-address = 0.0.0.0
server-id = 1
init-connect = 'SET NAMES utf8'
character-set-server = utf8
Initialize the database
[root@node01 mysql]# bin/mysqld --initialize --basedir=/opt/mysql --datadir=/opt/mysql/data --user=mysql
Add MySQL as a system service
[root@node01 mysql]# cp support-files/mysql.server /etc/init.d/mysqld
[root@node01 mysql]# cp /opt/mysql/bin/* /usr/sbin/
[root@node01 mysql]# chmod +x /etc/init.d/mysqld
[root@node01 mysql]# chkconfig --add mysqld
[root@node01 mysql]# service mysqld start
Log in to MySQL using the temporary password printed during initialization.
[root@mysql_node01 mysql]# mysql -uroot -p
Change the root password, otherwise no further operations are allowed.
mysql> alter user root@"localhost" identified by "123456";
Query OK, 0 rows affected (0.00 sec)

Test whether databases can be created and dropped:
mysql> create database abctest;
Query OK, 1 row affected (0.00 sec)
mysql> use abctest;
Database changed
mysql> drop database abctest;
Query OK, 0 rows affected (0.00 sec)
Authorize the airflow user to connect:
mysql> grant all on *.* to 'airflow'@'%' identified by '123456a';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
Create the database to be used by airflow, with the name airflow
mysql> create database airflow;
Query OK, 1 row affected (0.00 sec)

Enable timestamps so that the current time is added when data is updated:
mysql> set @@global.explicit_defaults_for_timestamp=on;
Query OK, 0 rows affected (0.00 sec)

Redis installation
Extract Redis (if you still need the source, download it directly from the official website)
[root@node01 ~]# tar zxvf redis-5.0.0.tar.gz
[root@node01 ~]# cd redis-5.0.0/
Install Redis
[root@node01 redis-5.0.0]# make && make test
[root@node01 redis-5.0.0]# mkdir /usr/local/redis               # create the Redis runtime directory
[root@node01 redis-5.0.0]# cp redis.conf /usr/local/redis/      # copy the configuration file into the runtime directory
[root@node01 redis-5.0.0]# cp src/redis-server /usr/sbin/       # copy the redis-server program
[root@node01 redis-5.0.0]# cp src/redis-cli /usr/sbin/          # copy the redis-cli program
[root@node01 redis-5.0.0]# cp src/redis-sentinel /usr/sbin/     # copy the redis-sentinel program
Edit the Redis configuration file
[root@node01 redis-5.0.0]# vim /usr/local/redis/redis.conf
bind 0.0.0.0                              # address to bind for external access
daemonize yes                             # run as a daemon
pidfile /usr/local/redis/redis_6379.pid   # specify the pid file
logfile "/usr/local/redis/redis.log"      # specify the log file
dir /usr/local/redis                      # specify the working directory
appendonly yes                            # enable persistence
Start Redis
[root@node01 redis-5.0.0]# redis-server /usr/local/redis/redis.conf

Connect to Redis to test it:
[root@node01 ~]# redis-cli
127.0.0.1:6379> keys *
(empty list or set)

RabbitMQ installation
Install rabbitmq and erlang (the packages can be downloaded directly from the official websites)
[root@node01 ~]# yum localinstall erlang-20.3-1.el7.centos.x86_64.rpm rabbitmq-server-3.7.5-1.el7.noarch.rpm -y
Copy the configuration file template, then modify it (to enable the default login user)
[root@node01 ~]# cp /usr/share/doc/rabbitmq-server-3.7.5/rabbitmq.config.example /etc/rabbitmq/rabbitmq.config
[root@node01 ~]# vim /etc/rabbitmq/rabbitmq.config
Enable the management plug-in and start the service:
[root@node01 ~]# rabbitmq-plugins enable rabbitmq_management              # enable the management plug-in
[root@node01 ~]# chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie # set cookie file ownership (otherwise the service cannot start)
[root@node01 ~]# systemctl start rabbitmq-server.service                  # start rabbitmq
[root@node01 ~]# systemctl enable rabbitmq-server.service                 # enable start on boot
Open the management page at http://ip:15672 and log in with the default user guest and password guest.
Click Admin -> Add a user, set the username and password, and give the user the administrator tag.
Add the virtual host / for the admin user.
Click the admin user.
Select the corresponding Virtual Host and click Set permission to apply it.
Check again to confirm that the permissions have been set successfully.
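As an optional, added sanity check (not part of the original steps), the broker can also be tested from Python with the pika client. The host and the guest/guest credentials below are assumptions for illustration; guest only works remotely if the default login user was enabled in rabbitmq.config as described above.

```python
# Verify that RabbitMQ accepts AMQP connections on the default port 5672.
# Requires: pip install pika. Host and credentials are illustrative assumptions.
import pika

credentials = pika.PlainCredentials("guest", "guest")
params = pika.ConnectionParameters(host="localhost", port=5672,
                                   virtual_host="/", credentials=credentials)

connection = pika.BlockingConnection(params)
print("RabbitMQ connection open:", connection.is_open)
connection.close()
```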
Airflow single node deployment
The single-node deployment of airflow can be completed by running all daemons on the same machine, as shown in the following figure.
Architecture diagram
> 1. Install dependencies (needed for airflow[mysql])
[root@node01 ~]# yum install mariadb-devel MySQL-python -y
[root@node01 ~]# mkdir /usr/local/mysql        # create the directory
[root@node01 ~]# find / -name mysql.h      # locate mysql.h; two files are found because node01 already has MySQL installed. Installing airflow[mysql] will look for /usr/local/mysql/include/mysql.h.
/usr/include/mysql/mysql.h
/opt/mysql/include/mysql.h
[root@node01 ~]# ln -s /opt/mysql /usr/local/mysql      # link the MySQL installation directory to /usr/local/mysql
[root@node01 airflow]# ln -s /opt/mysql/lib/libmysqlclient.so.20 /usr/lib64/libmysqlclient.so.20
> 2. Add environment variables
vim /etc/profile
export PATH=$PATH:/usr/local/python3/bin/
export AIRFLOW_HOME=~/airflow
> 3. Install airflow and related applications
[root@node01 ~]# pip install --upgrade pip                 # update pip
[root@node01 ~]# pip install --upgrade setuptools          # update setuptools
[root@node01 ~]# pip install apache-airflow                # install the Airflow main program
[root@node01 ~]# pip install apache-airflow[mysql]         # install the MySQL connector
[root@node01 ~]# pip install apache-airflow[celery]        # install the Celery connector
[root@node01 ~]# pip install redis                         # install the Redis connector
[root@node01 ~]# pip install --upgrade kombu               # update kombu (Celery needs a recent version)
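Before moving on, a quick added check (not in the original) is to confirm that the packages installed above import cleanly under the new Python:

```python
# Importing airflow for the first time typically also generates the default
# $AIRFLOW_HOME/airflow.cfg, which the next steps initialize and edit.
import airflow
import celery
import redis

print("airflow:", airflow.__version__)
print("celery:", celery.__version__)
print("redis:", redis.__version__)
```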
> 4. Initialize the airflow database
[root@node01 ~]# airflow initdb      # alternatively, simply running the airflow command will generate the configuration file
> 5. After initialization, Airflow-related files are generated; inspect the contents of the airflow directory.
![airflow directory contents](https://img-blog.csdnimg.cn/20191227093123311.png)
> 6. Modify the configuration file
vim /root/airflow/airflow.cfg
Time zone setting
default_timezone = Asia/Shanghai
Do not load the example DAGs
load_examples = False
Default port on which the webserver starts
web_server_port = 8080
Executor to use
executor = CeleryExecutor
Message broker for Celery
broker_url = redis://192.168.255.16:6379/0
Result storage backend
(you can also use Redis here: result_backend = redis://redis:Kingsoftcom_123@172.19.131.108:6379/1)
result_backend = db+mysql://airflow:123456a@192.168.255.16:3306/airflow
Database connection
sql_alchemy_conn = mysql://airflow:123456a@192.168.255.16:3306/airflow
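As an optional added check before re-running initdb, the connection strings above can be exercised directly from Python. The hosts and credentials are simply the example values from this configuration; SQLAlchemy and the MySQL/Redis drivers are already pulled in by the pip installs above.

```python
# Sanity-check the metadata database and the Celery broker configured in airflow.cfg.
# The URLs mirror the example values used above and are assumptions for your environment.
import redis
from sqlalchemy import create_engine, text

SQL_ALCHEMY_CONN = "mysql://airflow:123456a@192.168.255.16:3306/airflow"
BROKER_URL = "redis://192.168.255.16:6379/0"

# MySQL: uses the MySQLdb driver installed via apache-airflow[mysql].
engine = create_engine(SQL_ALCHEMY_CONN)
with engine.connect() as conn:
    print("MySQL reachable, version:", conn.execute(text("SELECT VERSION()")).scalar())

# Redis: ping the broker database.
broker = redis.Redis.from_url(BROKER_URL)
print("Redis reachable:", broker.ping())
```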
> 7. Enable timestamp and add the current time when the data is updated
mysql -uairflow -p
mysql> set @@global.explicit_defaults_for_timestamp=on;
Query OK, 0 rows affected (0.00 sec)
> 8. Regenerate the database file
airflow initdb
> 9. Start
[root@node01 airflow]# airflow webserver -p 8080 -D
[root@node01 airflow]# airflow scheduler -D
[root@node01 airflow]# airflow flower -D
[root@node01 airflow]# airflow worker -D       # fails by default because the worker may not be started by the root user
> 10. Allow the worker to be started by the root user
> Add export C_FORCE_ROOT=True to the /etc/profile file and reload it with source /etc/profile.
> 11. Check whether the single-node setup works
> Open http://192.168.255.11:8080 and http://192.168.255.11:5050 respectively and check that everything is OK.
![airflow web UI](https://img-blog.csdnimg.cn/20191227093149183.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
![flower web UI](https://img-blog.csdnimg.cn/20191227093156143.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
### Airflow multi-node (cluster) deployment
> Clusters and high availability are generally used in scenarios that require high stability, such as financial trading systems. Apache Airflow supports clustered, highly available deployments: the Airflow daemons can run on multiple machines. The architecture is shown in the following figure:
#### Architecture diagram
![multi-node architecture](https://img-blog.csdnimg.cn/20191227093258334.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
##### Multi-node benefits
> High availability: if a worker node crashes or goes offline, the cluster can still be controlled and the tasks on the other worker nodes continue to run.
> Distributed processing: if your workflow contains memory-intensive tasks, it is best to run them on multiple machines so they complete faster.
##### Expanding worker nodes
###### Horizontal scaling
> You can distribute the load by adding more worker nodes to the cluster and pointing the new nodes at the same metadata database. Since a worker does not need to register with any daemon in order to execute tasks, worker nodes can be added without downtime or a service restart, that is, at any time.
###### Vertical scaling
> You can scale the cluster vertically by increasing the number of worker processes on a single worker node. To do this, modify the value of celeryd_concurrency in Airflow's configuration file {AIRFLOW_HOME}/airflow.cfg, for example:
celeryd_concurrency = 30
> You can increase the number of concurrent processes to meet actual needs, based on factors such as the nature of the tasks running on the cluster and the number of CPU cores.
##### Expanding master nodes (high availability)
> You can also add more master nodes to the cluster to scale out the services running on the master. In particular, you can scale out the webserver daemon, either to prevent too many HTTP requests from landing on one machine or to provide higher availability for the webserver service. Note, however, that only one scheduler daemon may run at a time: if more than one scheduler is running, a task may be executed multiple times, and these repeated runs may cause problems in your workflow.
> The following figure shows the architecture with expanded master nodes:
![expanded master node architecture](https://img-blog.csdnimg.cn/20191227093318702.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
> Seeing this, some people may ask: since two schedulers cannot run at the same time, if something goes wrong on the node running the scheduler, won't tasks stop running altogether?
> > Answer: this is a very good question, and there is already a solution. Deploy the scheduler on two machines but run the scheduler daemon on only one of them; as soon as the machine running the scheduler fails, start the scheduler on the other machine immediately. The third-party component airflow-scheduler-failover-controller can be used to make the scheduler highly available.
> > Install and run failover on both masters to achieve real high availability. The specific steps are as follows:
> 1. Download failover
git clone https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
> 2. Use pip for installation
cd airflow-scheduler-failover-controller
pip install -e .
> 3. Initialize failover
scheduler_failover_controller init
> Note: initialization appends configuration entries to airflow.cfg, so Airflow must already be installed and initialized first.
> 4. Change the failover configuration
scheduler_nodes_in_cluster = host1,host2
> Note: the host names can be obtained with the scheduler_failover_controller get_current_host command.
> 5. Configure passwordless SSH login between the machines where failover is installed. Once configured, verify it with the following command:
scheduler_failover_controller test_connection
> 6. Start failover
scheduler_failover_controller start
> Note: when failover monitors and starts the scheduler, the start will fail; you should edit /root/airflow/airflow.cfg and change
airflow_scheduler_start_command = export AIRFLOW_HOME=/root/airflow;;nohup airflow scheduler >> ~/airflow/logs/scheduler.logs &
to
airflow_scheduler_start_command = airflow scheduler -D
> The more robust architecture is therefore as follows:
![highly available architecture](https://img-blog.csdnimg.cn/20191227093337590.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
##### High availability of the queue service and the metadata database (Metastore)
> Whether the queue service is highly available depends on the message queue used, such as RabbitMQ or Redis.
> For building a RabbitMQ cluster and configuring mirrored mode, see: http://blog.csdn.net/u010353408/article/details/77964190
> Whether the metadata database (Metastore) is highly available depends on the database used, such as MySQL.
> For the specific steps to set up MySQL master-slave replication, see: http://blog.csdn.net/u010353408/article/details/77964157
#### Airflow cluster deployment
##### Prerequisites
> The nodes run the following daemons:
* master1 runs: webserver, scheduler
* master2 runs: webserver
* worker1 runs: worker
* worker2 runs: worker
> The queue service is up and running (RabbitMQ, Redis, etc.).
* For details on how to install RabbitMQ, see: http://site.clairvoyantsoft.com/installing-rabbitmq/
* If you are using RabbitMQ, it is recommended to deploy it as a highly available cluster and to configure a load balancer for the RabbitMQ instances.
##### Steps
> 1. Install Apache Airflow on every machine that needs to run a daemon.
> 2. Modify the {AIRFLOW_HOME}/airflow.cfg file to ensure that all machines use the same configuration file.
> 1. Change the executor to CeleryExecutor
executor = CeleryExecutor
> 2. Specify the metadata database (Metastore)
sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
> 3. Set the broker
> If you use RabbitMQ:
broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/
> If you use Redis:
broker_url = redis://{REDIS_HOST}:6379/0        # use database 0
> 4. Set the result storage backend
celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
# You can also use Redis: celery_result_backend = redis://{REDIS_HOST}:6379/1
> 3. Deploy your workflows (DAGs) on master1 and master2. > 4. On master1, initialize Airflow's metadata database
$ airflow initdb
> 5. In master1, start the corresponding daemon
$ airflow webserver
$ airflow scheduler
> 6. In master2, start Web Server
$ airflow webserver
> 7. Start worker in worker1 and worker2
$ airflow worker
> 8. Put a load balancer in front of the webservers. > You can use nginx, an AWS load balancer, or similar to distribute requests across the webserver instances. With every component clustered or made highly available, the apache-airflow system is now very hard to bring down.
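As a final added example (host names are placeholders, not from the original), you can poll each webserver instance, or the load balancer address, to confirm that the UI responds:

```python
# Poll the webserver on each master node (or the load balancer address).
# Requires: pip install requests. Host names are illustrative placeholders.
import requests

for host in ("master1", "master2"):
    url = "http://{}:8080/".format(host)
    try:
        resp = requests.get(url, timeout=5)
        print(url, "->", resp.status_code)
    except requests.RequestException as exc:
        print(url, "-> unreachable:", exc)
```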