Introduction to Airflow
Airflow is a workflow project open-sourced by Airbnb, with more than 2,000 stars on GitHub. It is written in Python and supports both Python 2 and Python 3. Traditional workflow tools usually define DAGs in text files (JSON, XML, etc.), which a scheduler then parses into concrete task objects. Airflow takes a different approach: DAGs are defined directly in Python code, which removes the expressiveness limits of text files and makes DAG definitions much simpler. In addition, Airflow's permission design, rate-limiting design, and Hook/Plugin design are all interesting, with good functionality and extensibility. The code quality of the project is only average, though: in many places function names do not match their implementations, which makes the code hard to follow, and there are many flags and duplicated definitions, apparently the result of early design shortcuts that were never refactored. On the whole, however, the readability and extensibility of the system are good.
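Since DAGs are plain Python files, here is a minimal sketch of what such a definition looks like. The dag_id, schedule, and bash commands below are illustrative assumptions, not something used later in this article; the imports follow the 1.10-era API installed below.

```python
# A minimal illustrative DAG; dag_id, schedule_interval and the shell commands
# are example values, not part of the installation that follows.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_hello",
    default_args=default_args,
    start_date=datetime(2019, 12, 1),
    schedule_interval="@daily",
)

# Two simple shell tasks chained with the >> dependency operator.
print_date = BashOperator(task_id="print_date", bash_command="date", dag=dag)
say_hello = BashOperator(task_id="say_hello", bash_command="echo hello", dag=dag)

print_date >> say_hello  # say_hello runs after print_date succeeds
```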
Installation of Airflow-related software
Python 3.6.5 installation
Install dependent programs
[root@node01 ~]# yum -y install zlib zlib-devel bzip2 bzip2-devel ncurses ncurses-devel readline readline-devel openssl openssl-devel openssl-static xz lzma xz-devel sqlite sqlite-devel gdbm gdbm-devel tk tk-devel gcc
Download Python
You can browse https://www.python.org/ftp/python/ to see the available Python versions; here we install Python-3.6.5.tgz. Download the Python source package with the following command:
[root@node01 ~]# wget https://www.python.org/ftp/python/3.6.5/Python-3.6.5.tgz
Extract the Python source package
[root@node01 ~]# tar -zxvf Python-3.6.5.tgz
[root@node01 ~]# cd Python-3.6.5
Install Python
[root@node01 Python-3.6.5]# ./configure --prefix=/usr/local/python3
[root@node01 Python-3.6.5]# make
[root@node01 Python-3.6.5]# make install
[root@node01 Python-3.6.5]# mv /usr/bin/python /usr/bin/python2                      # back up the original python executable
[root@node01 Python-3.6.5]# ln -s /usr/local/python3/bin/python3 /usr/bin/python     # link the new python3 binary into the command path
[root@node01 Python-3.6.5]# ln -s /usr/local/python3/bin/pip3 /usr/bin/pip           # link pip3 into the command path
View version information
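The original does not list a command for this step; as a small added check (not from the source), the version can also be confirmed from the interpreter itself:

```python
# Run this with the newly linked interpreter (e.g. /usr/bin/python).
import sys

print(sys.version)      # expected to start with "3.6.5"
print(sys.executable)   # the path the interpreter was launched from, e.g. /usr/bin/python
```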
Modify yum-related configuration files
Change the #!/usr/bin/python at the beginning of the following two files to #!/usr/bin/python2.7:
[root@node01 Python-3.6.5]# vim /usr/bin/yum
[root@node01 Python-3.6.5]# vim /usr/libexec/urlgrabber-ext-down
Pip3 installation
Python 3.6.5 already ships with pip3, so no additional installation is required.
If you are on Python 2.x, you need to install pip separately.
MySQL 5.7.28 installation
Extract the MySQL archive (the file is over 600 MB; the binary "no-install" tarball can be downloaded directly from the official website)
[root@node01 ~]# tar zxvf mysql-5.7.28-linux-glibc2.12-x86_64.tar.gz
[root@node01 ~]# mv mysql-5.7.28-linux-glibc2.12-x86_64 /opt/mysql
[root@node01 ~]# cd /opt/mysql/
Create users and groups needed for MySQL to run
[root@node01 mysql]# groupadd mysql
[root@node01 mysql]# useradd -M -s /sbin/nologin -g mysql mysql
Create a directory to store data
[root@node01 mysql]# mkdir data
[root@node01 mysql]# chown -R mysql:mysql *
Write a configuration file
[root@node01 mysql]# vim /etc/my.cnf
[client]
port = 3306
socket = /opt/mysql/data/mysql.sock

[mysqld]
port = 3306
socket = /opt/mysql/data/mysql.sock
basedir = /opt/mysql
datadir = /opt/mysql/data
user = mysql
bind-address = 0.0.0.0
server-id = 1
init-connect = 'SET NAMES utf8'
character-set-server = utf8
Initialize the database
[root@node01 mysql]# bin/mysqld --initialize --basedir=/opt/mysql --datadir=/opt/mysql/data --user=mysql
Add MySQL as a system service
[root@node01 mysql]# cp support-files/mysql.server /etc/init.d/mysqld
[root@node01 mysql]# cp /opt/mysql/bin/* /usr/sbin/
[root@node01 mysql]# chmod +x /etc/init.d/mysqld
[root@node01 mysql]# chkconfig --add mysqld
[root@node01 mysql]# service mysqld start
Log in to MySQL using the temporary password printed during initialization.
[root@mysql_node01 mysql]# mysql -uroot -p
Change the root password, otherwise no further operations are allowed.
mysql> alter user root@"localhost" identified by "123456";
Query OK, 0 rows affected (0.00 sec)

Test whether databases can be created and dropped:
mysql> create database abctest;
Query OK, 1 row affected (0.00 sec)
mysql> use abctest;
Database changed
mysql> drop database abctest;
Query OK, 0 rows affected (0.00 sec)
Authorize the airflow user to connect:
mysql> grant all on *.* to 'airflow'@'%' identified by '123456a';
Query OK, 0 rows affected, 1 warning (0.00 sec)
mysql> flush privileges;
Query OK, 0 rows affected (0.00 sec)
Create the database to be used by airflow, with the name airflow
mysql> create database airflow;
Query OK, 1 row affected (0.00 sec)

Enable timestamps so that the current time is added when data is updated:
mysql> set @@global.explicit_defaults_for_timestamp=on;
Query OK, 0 rows affected (0.00 sec)

Redis installation
Extract Redis (if you still need the source, download it directly from the official website)
[root@node01 ~]# tar zxvf redis-5.0.0.tar.gz
[root@node01 ~]# cd redis-5.0.0/
Install Redis
[root@node01 redis-5.0.0]# make && make test
[root@node01 redis-5.0.0]# mkdir /usr/local/redis               # create the Redis runtime directory
[root@node01 redis-5.0.0]# cp redis.conf /usr/local/redis/      # copy the configuration file into the runtime directory
[root@node01 redis-5.0.0]# cp src/redis-server /usr/sbin/       # copy the redis-server program
[root@node01 redis-5.0.0]# cp src/redis-cli /usr/sbin/          # copy the redis-cli program
[root@node01 redis-5.0.0]# cp src/redis-sentinel /usr/sbin/     # copy the redis-sentinel program
Edit the Redis configuration file
[root@node01 redis-5.0.0]# vim /usr/local/redis/redis.conf
bind 0.0.0.0                              # address to bind for external access
daemonize yes                             # run as a daemon
pidfile /usr/local/redis/redis_6379.pid   # specify the pid file
logfile "/usr/local/redis/redis.log"      # specify the log file
dir /usr/local/redis                      # specify the working directory
appendonly yes                            # enable persistence
Start Redis
[root@node01 redis-5.0.0]# redis-server /usr/local/redis/redis.conf

Connect to Redis to test it:
[root@node01 ~]# redis-cli
127.0.0.1:6379> keys *
(empty list or set)

RabbitMQ installation
Install rabbitmq and erlang (the packages can be downloaded directly from the official websites)
[root@node01 ~]# yum localinstall erlang-20.3-1.el7.centos.x86_64.rpm rabbitmq-server-3.7.5-1.el7.noarch.rpm -y
Copy the configuration file template, then modify it (to enable the default login user)
[root@node01 ~]# cp /usr/share/doc/rabbitmq-server-3.7.5/rabbitmq.config.example /etc/rabbitmq/rabbitmq.config
[root@node01 ~]# vim /etc/rabbitmq/rabbitmq.config
Enable the management plug-in and start the service:
[root@node01 ~]# rabbitmq-plugins enable rabbitmq_management              # enable the management plug-in
[root@node01 ~]# chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie # set cookie file ownership (otherwise the service cannot start)
[root@node01 ~]# systemctl start rabbitmq-server.service                  # start rabbitmq
[root@node01 ~]# systemctl enable rabbitmq-server.service                 # enable start on boot
Open the management page at http://ip:15672 and log in with the default user guest and password guest.
Click Admin -> Add a user, set the username and password, and give the user the administrator tag.
Add the virtual host / for the admin user.
Click the admin user.
Select the corresponding Virtual Host and click Set permission to apply it.
Check again to confirm that the permissions have been set successfully.
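As an optional, added sanity check (not part of the original steps), the broker can also be tested from Python with the pika client. The host and the guest/guest credentials below are assumptions for illustration; guest only works remotely if the default login user was enabled in rabbitmq.config as described above.

```python
# Verify that RabbitMQ accepts AMQP connections on the default port 5672.
# Requires: pip install pika. Host and credentials are illustrative assumptions.
import pika

credentials = pika.PlainCredentials("guest", "guest")
params = pika.ConnectionParameters(host="localhost", port=5672,
                                   virtual_host="/", credentials=credentials)

connection = pika.BlockingConnection(params)
print("RabbitMQ connection open:", connection.is_open)
connection.close()
```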
Airflow single node deployment
The single-node deployment of airflow can be completed by running all daemons on the same machine, as shown in the following figure.
Architecture diagram
> 1. Install dependencies (needed for airflow[mysql])
[root@node01 ~]# yum install mariadb-devel MySQL-python -y
[root@node01 ~]# mkdir /usr/local/mysql        # create the directory
[root@node01 ~]# find / -name mysql.h      # locate mysql.h; two files are found because node01 already has MySQL installed. Installing airflow[mysql] will look for /usr/local/mysql/include/mysql.h.
/usr/include/mysql/mysql.h
/opt/mysql/include/mysql.h
[root@node01 ~]# ln -s /opt/mysql /usr/local/mysql      # link the MySQL installation directory to /usr/local/mysql
[root@node01 airflow]# ln -s /opt/mysql/lib/libmysqlclient.so.20 /usr/lib64/libmysqlclient.so.20
> 2. Add environment variables
vim /etc/profile
export PATH=$PATH:/usr/local/python3/bin/
export AIRFLOW_HOME=~/airflow
> 3. Install airflow and related applications
[root@node01 ~]# pip install --upgrade pip                 # update pip
[root@node01 ~]# pip install --upgrade setuptools          # update setuptools
[root@node01 ~]# pip install apache-airflow                # install the Airflow main program
[root@node01 ~]# pip install apache-airflow[mysql]         # install the MySQL connector
[root@node01 ~]# pip install apache-airflow[celery]        # install the Celery connector
[root@node01 ~]# pip install redis                         # install the Redis connector
[root@node01 ~]# pip install --upgrade kombu               # update kombu (Celery needs a recent version)
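Before moving on, a quick added check (not in the original) is to confirm that the packages installed above import cleanly under the new Python:

```python
# Importing airflow for the first time typically also generates the default
# $AIRFLOW_HOME/airflow.cfg, which the next steps initialize and edit.
import airflow
import celery
import redis

print("airflow:", airflow.__version__)
print("celery:", celery.__version__)
print("redis:", redis.__version__)
```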
> 4. Initialize the airflow database
[root@node01 ~]# airflow initdb      # alternatively, simply running the airflow command will generate the configuration file
> 5. After initialization, Airflow-related files are generated; inspect the contents of the airflow directory.
![airflow directory contents](https://img-blog.csdnimg.cn/20191227093123311.png)
> 6. Modify the configuration file
vim /root/airflow/airflow.cfg
Time zone setting
default_timezone = Asia/Shanghai
Do not load the example DAGs
load_examples = False
Default port on which the webserver starts
web_server_port = 8080
Executor to use
executor = CeleryExecutor
Message broker for Celery
broker_url = redis://192.168.255.16:6379/0
Result storage backend
(you can also use Redis here: result_backend = redis://redis:Kingsoftcom_123@172.19.131.108:6379/1)
result_backend = db+mysql://airflow:123456a@192.168.255.16:3306/airflow
Database connection
sql_alchemy_conn = mysql://airflow:123456a@192.168.255.16:3306/airflow
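As an optional added check before re-running initdb, the connection strings above can be exercised directly from Python. The hosts and credentials are simply the example values from this configuration; SQLAlchemy and the MySQL/Redis drivers are already pulled in by the pip installs above.

```python
# Sanity-check the metadata database and the Celery broker configured in airflow.cfg.
# The URLs mirror the example values used above and are assumptions for your environment.
import redis
from sqlalchemy import create_engine, text

SQL_ALCHEMY_CONN = "mysql://airflow:123456a@192.168.255.16:3306/airflow"
BROKER_URL = "redis://192.168.255.16:6379/0"

# MySQL: uses the MySQLdb driver installed via apache-airflow[mysql].
engine = create_engine(SQL_ALCHEMY_CONN)
with engine.connect() as conn:
    print("MySQL reachable, version:", conn.execute(text("SELECT VERSION()")).scalar())

# Redis: ping the broker database.
broker = redis.Redis.from_url(BROKER_URL)
print("Redis reachable:", broker.ping())
```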
> 7. Enable timestamp and add the current time when the data is updated
mysql -uairflow -p
mysql> set @@global.explicit_defaults_for_timestamp=on;
Query OK, 0 rows affected (0.00 sec)
> 8. Regenerate the database file
airflow initdb
> 9. Start
[root@node01 airflow]# airflow webserver -p 8080 -D
[root@node01 airflow]# airflow scheduler -D
[root@node01 airflow]# airflow flower -D
[root@node01 airflow]# airflow worker -D       # fails by default because the worker may not be started by the root user
> 10. Allow the worker to be started by the root user
> Add export C_FORCE_ROOT=True to the /etc/profile file and reload it with source /etc/profile.
> 11. Check whether the single-node setup works
> Open http://192.168.255.11:8080 and http://192.168.255.11:5050 respectively and check that everything is OK.
![airflow web UI](https://img-blog.csdnimg.cn/20191227093149183.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
![flower web UI](https://img-blog.csdnimg.cn/20191227093156143.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
### Airflow multi-node (cluster) deployment
> Clusters and high availability are generally used in scenarios that require high stability, such as financial trading systems. Apache Airflow supports clustered, highly available deployments: the Airflow daemons can run on multiple machines. The architecture is shown in the following figure:
#### Architecture diagram
![multi-node architecture](https://img-blog.csdnimg.cn/20191227093258334.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
##### Multi-node benefits
> High availability: if a worker node crashes or goes offline, the cluster can still be controlled and the tasks on the other worker nodes continue to run.
> Distributed processing: if your workflow contains memory-intensive tasks, it is best to run them on multiple machines so they complete faster.
##### Expanding worker nodes
###### Horizontal scaling
> You can distribute the load by adding more worker nodes to the cluster and pointing the new nodes at the same metadata database. Since a worker does not need to register with any daemon in order to execute tasks, worker nodes can be added without downtime or a service restart, that is, at any time.
###### Vertical scaling
> You can scale the cluster vertically by increasing the number of worker processes on a single worker node. To do this, modify the value of celeryd_concurrency in Airflow's configuration file {AIRFLOW_HOME}/airflow.cfg, for example:
celeryd_concurrency = 30
> You can increase the number of concurrent processes to meet actual needs, based on factors such as the nature of the tasks running on the cluster and the number of CPU cores.
##### Expanding master nodes (high availability)
> You can also add more master nodes to the cluster to scale out the services running on the master. In particular, you can scale out the webserver daemon, either to prevent too many HTTP requests from landing on one machine or to provide higher availability for the webserver service. Note, however, that only one scheduler daemon may run at a time: if more than one scheduler is running, a task may be executed multiple times, and these repeated runs may cause problems in your workflow.
> The following figure shows the architecture with expanded master nodes:
![expanded master node architecture](https://img-blog.csdnimg.cn/20191227093318702.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
> Seeing this, some people may ask: since two schedulers cannot run at the same time, if something goes wrong on the node running the scheduler, won't tasks stop running altogether?
> > Answer: this is a very good question, and there is already a solution. Deploy the scheduler on two machines but run the scheduler daemon on only one of them; as soon as the machine running the scheduler fails, start the scheduler on the other machine immediately. The third-party component airflow-scheduler-failover-controller can be used to make the scheduler highly available.
> > Install and run failover on both masters to achieve real high availability. The specific steps are as follows:
> 1. Download failover
git clone https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
> 2. Use pip for installation
cd airflow-scheduler-failover-controller
pip install -e .
> 3. Initialize failover
scheduler_failover_controller init
> Note: initialization appends configuration entries to airflow.cfg, so Airflow must already be installed and initialized first.
> 4. Change the failover configuration
scheduler_nodes_in_cluster = host1,host2
> Note: the host names can be obtained with the scheduler_failover_controller get_current_host command.
> 5. Configure passwordless SSH login between the machines where failover is installed. Once configured, verify it with the following command:
scheduler_failover_controller test_connection
> 6. Start failover
scheduler_failover_controller start
> Note: when failover monitors and starts the scheduler, the start will fail; you should edit /root/airflow/airflow.cfg and change
airflow_scheduler_start_command = export AIRFLOW_HOME=/root/airflow;;nohup airflow scheduler >> ~/airflow/logs/scheduler.logs &
to
airflow_scheduler_start_command = airflow scheduler -D
> The more robust architecture is therefore as follows:
![highly available architecture](https://img-blog.csdnimg.cn/20191227093337590.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2dxajU0MDgyMTYx,size_16,color_FFFFFF,t_70)
##### High availability of the queue service and the metadata database (Metastore)
> Whether the queue service is highly available depends on the message queue used, such as RabbitMQ or Redis.
> For building a RabbitMQ cluster and configuring mirrored mode, see: http://blog.csdn.net/u010353408/article/details/77964190
> Whether the metadata database (Metastore) is highly available depends on the database used, such as MySQL.
> For the specific steps to set up MySQL master-slave replication, see: http://blog.csdn.net/u010353408/article/details/77964157
#### Airflow cluster deployment
##### Prerequisites
> The nodes run the following daemons:
* master1 runs: webserver, scheduler
* master2 runs: webserver
* worker1 runs: worker
* worker2 runs: worker
> The queue service is up and running (RabbitMQ, Redis, etc.).
* For details on how to install RabbitMQ, see: http://site.clairvoyantsoft.com/installing-rabbitmq/
* If you are using RabbitMQ, it is recommended to deploy it as a highly available cluster and to configure a load balancer for the RabbitMQ instances.
##### Steps
> 1. Install Apache Airflow on every machine that needs to run a daemon.
> 2. Modify the {AIRFLOW_HOME}/airflow.cfg file to ensure that all machines use the same configuration file.
> 1. Change the executor to CeleryExecutor
executor = CeleryExecutor
> 2. Specify the metadata database (Metastore)
sql_alchemy_conn = mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
> 3. Set the broker
> If you use RabbitMQ:
broker_url = amqp://guest:guest@{RABBITMQ_HOST}:5672/
> If you use Redis:
broker_url = redis://{REDIS_HOST}:6379/0        # use database 0
> 4. Set the result storage backend
celery_result_backend = db+mysql://{USERNAME}:{PASSWORD}@{MYSQL_HOST}:3306/airflow
# You can also use Redis: celery_result_backend = redis://{REDIS_HOST}:6379/1
> 3. Deploy your workflows (DAGs) on master1 and master2. > 4. On master1, initialize Airflow's metadata database
$ airflow initdb
> 5. In master1, start the corresponding daemon
$ airflow webserver
$ airflow scheduler
> 6. In master2, start Web Server
$ airflow webserver
> 7. Start worker in worker1 and worker2
$ airflow worker
> 8. Put a load balancer in front of the webservers. > You can use nginx, an AWS load balancer, or similar to distribute requests across the webserver instances. With every component clustered or made highly available, the apache-airflow system is now very hard to bring down.
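As a final added example (host names are placeholders, not from the original), you can poll each webserver instance, or the load balancer address, to confirm that the UI responds:

```python
# Poll the webserver on each master node (or the load balancer address).
# Requires: pip install requests. Host names are illustrative placeholders.
import requests

for host in ("master1", "master2"):
    url = "http://{}:8080/".format(host)
    try:
        resp = requests.get(url, timeout=5)
        print(url, "->", resp.status_code)
    except requests.RequestException as exc:
        print(url, "-> unreachable:", exc)
```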