
Opportunities and Challenges Faced by the Apache Flink® Ecosystem


Author: Jian Feng

Introduction

Before we talk about the ecosystem, let's define what an ecosystem is. In a given field, an ecosystem is the set of components derived from and built around a core component; these components use the core component directly or indirectly and help it accomplish larger or more specialized tasks. The Flink ecosystem is the ecosystem with Flink at its core. Flink is the compute layer in the big data stack: it only does computation, not storage. In practice, however, you will find that Flink alone is often not enough. Where is your data read from? Where will the results be stored after Flink computes them, and how will they be consumed? How do you use Flink to accomplish a specialized task in a vertical domain? These tasks, which involve upstream and downstream systems or higher-level abstractions, require a strong ecosystem.

The Current State of the Flink Ecosystem

Having defined what an ecosystem is, let's look at the current state of the Flink ecosystem. Overall, the Flink ecosystem is still at a relatively early stage. At present it consists mainly of connectors for various upstream and downstream systems and support for various cluster managers.

As of now, the connectors Flink supports include Kafka, Cassandra, Elasticsearch, Kinesis, RabbitMQ, JDBC, HDFS, and so on, covering essentially all the major data sources. For cluster support, Flink currently runs on Standalone and YARN deployments. Given this state of the ecosystem, Flink is still used mainly for stream data computation; using it for other scenarios (machine learning, interactive analysis) is more complicated, and the user experience leaves plenty of room for improvement. This is precisely the challenge, and the opportunity, that the Flink ecosystem faces.
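As a concrete illustration of how such connectors are used, here is a minimal sketch of consuming a Kafka topic with Flink's DataStream API in Scala. The broker address, topic name, and group id are placeholders, and the connector class name varied across Flink and Kafka versions in this era (e.g. FlinkKafkaConsumer011 for older Kafka clients):

    import java.util.Properties

    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    object KafkaSourceExample {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Broker address, topic, and group id below are illustrative placeholders.
        val props = new Properties()
        props.setProperty("bootstrap.servers", "localhost:9092")
        props.setProperty("group.id", "demo")

        // Consume a Kafka topic as an unbounded stream of strings.
        val stream = env.addSource(
          new FlinkKafkaConsumer[String]("my-topic", new SimpleStringSchema(), props))

        stream.print()
        env.execute("Kafka connector demo")
      }
    }

Each connector follows this same pattern: a source or sink class, a (de)serialization schema, and connection properties.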

Challenges and Opportunities for the Flink Ecosystem

Flink aims to be a unified batch and stream computing platform for big data, but much of its potential remains untapped. To unlock that potential, it needs a strong ecosystem, which we can look at along two dimensions:

The horizontal dimension. The horizontal part of the ecosystem is mainly about building end-to-end solutions: connectors to upstream and downstream data sources, integration with machine learning frameworks downstream, integration with BI tools, tooling that makes it easy to submit and operate Flink jobs, and notebooks that provide a better interactive analysis experience.

The vertical dimension. The vertical dimension is about making the Flink computing engine more abstract so that it adapts to a variety of computing scenarios: unified batch and stream computation, the higher-level Table API abstraction, the complex event processing engine (CEP), a higher-level machine learning framework (Flink ML), adaptation to various cluster frameworks, and so on. A sketch of one of these vertical capabilities, CEP, follows the figure below.

The following figure depicts the whole Flink ecosystem along the horizontal and vertical dimensions.

[Figure: the Flink ecosystem along the horizontal and vertical dimensions — https://cdn.xitu.io/2019/5/5/16a86b23181258b2?w=720&h=540&f=jpeg&s=41473]
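As a taste of the vertical dimension, here is a hedged sketch of Flink's CEP library in Scala, flagging two consecutive high-temperature readings. The event type, field names, and threshold are invented for illustration:

    import org.apache.flink.cep.scala.CEP
    import org.apache.flink.cep.scala.pattern.Pattern
    import org.apache.flink.streaming.api.scala._

    object CepSketch {
      // The event type and threshold below are illustrative assumptions.
      case class Reading(deviceId: String, temperature: Double)

      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val readings = env.fromElements(
          Reading("a", 20.0), Reading("a", 90.5), Reading("a", 95.1))

        // Two consecutive high-temperature readings form a warning pattern.
        val warning = Pattern.begin[Reading]("first").where(_.temperature > 80)
          .next("second").where(_.temperature > 80)

        CEP.pattern(readings.keyBy(_.deviceId), warning)
          .select(matched => s"warning on device ${matched("first").head.deviceId}")
          .print()

        env.execute("CEP sketch")
      }
    }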

Next, I will walk through several of the major ecosystem efforts one by one.

Flink's Integration with and Support for Hive

Apache Hive is a top-level Apache project with a history of nearly ten years. The project initially wrapped SQL around the MapReduce engine: instead of writing complex MapReduce jobs, users only need to write simple, familiar SQL statements, which are translated into one or more MapReduce jobs. As the project evolved, Hive's computing engine became pluggable; Hive now supports the MR, Tez, and Spark engines. Apache Hive has become the de facto data warehouse standard in the Hadoop ecosystem, and many companies' data warehouses have been running on Hive for years.

Since Flink is a unified batch and stream computing framework, integrating it with Hive is a natural step: for example, using Flink for real-time ETL to build a real-time data warehouse, and then using Hive SQL for real-time queries.

The Flink community has created FLINK-10556 to better integrate with and support Hive. Its main features are:

Allow Flink to access Hive's metadata

Allow Flink to access Hive table data

Make Flink compatible with Hive data types

Allow Flink to use Hive UDFs

Allow Hive SQL (including DML and DDL) to be used in Flink

The Flink community is implementing these features step by step. If you want to try them in advance, you can use Alibaba's open-source Blink. Open-source Blink connects Flink and Hive at both the metadata and the data layer: users can query Hive data directly with Flink SQL and switch freely between the Hive engine and the Flink engine. To bridge the metadata, Blink restructured the implementation of Flink catalogs and added two kinds of catalog: FlinkInMemoryCatalog, backed by memory, and HiveCatalog, which bridges to the Hive MetaStore. With the HiveCatalog, a Flink job can read Hive's metadata. To bridge the data, Blink implements HiveTableSource, so that a Flink job can directly read data from ordinary and partitioned tables in Hive. With this version, users can therefore use Flink SQL to read existing Hive metadata and data for processing. In the future, Alibaba will keep improving Hive compatibility on Flink, including support for Hive-specific queries, data types, Hive UDFs, and so on. These improvements will be contributed back to the Flink community.
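To make this concrete, here is a minimal sketch of querying a Hive table from Flink SQL through a Hive-backed catalog. It uses the HiveCatalog API as it later stabilized in the open-source Flink Table API, so exact class names and constructors may differ from the Blink build; the catalog name, conf directory, and table are placeholders:

    import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
    import org.apache.flink.table.catalog.hive.HiveCatalog

    object HiveQueryExample {
      def main(args: Array[String]): Unit = {
        val tableEnv = TableEnvironment.create(
          EnvironmentSettings.newInstance().inBatchMode().build())

        // Bridge Flink to the Hive MetaStore; the catalog name, default
        // database, and conf directory here are illustrative.
        val hive = new HiveCatalog("myhive", "default", "/opt/hive/conf")
        tableEnv.registerCatalog("myhive", hive)
        tableEnv.useCatalog("myhive")

        // Query an existing Hive table directly with Flink SQL.
        tableEnv.executeSql("SELECT id, name FROM some_hive_table LIMIT 10").print()
      }
    }

Once the catalog is registered, Hive tables behave like any other Flink table, which is what makes switching between the two engines possible.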

Flink's Support for Interactive Analysis

Besides stream processing, batch processing is another major application scenario, and interactive analysis is a large category within batch processing, especially important for data analysts and data scientists.

For interactive analysis, Flink itself needs further improvement to meet the performance requirements of that workload. For example, today it is impossible to share data among multiple jobs within the same Flink application; each job's DAG is independent. FLINK-11199 aims to solve this problem and thus provide friendlier support for interactive analysis.

In addition, we need an interactive analysis platform so that data analysts and data scientists can use Flink more efficiently. Apache Zeppelin has done a lot of work in this regard. Apache Zeppelin is also a top-level Apache project: it provides an interactive development environment and supports Scala, Python, SQL, and other languages. Zeppelin is also naturally extensible and supports multiple big data engines, such as Spark, Hive, and Pig. Alibaba has done a lot of work to make Zeppelin support Flink better. Users can write Flink code (Scala or SQL) directly in Zeppelin; instead of packaging a job locally and submitting it manually with the bin/flink script, they can submit jobs directly from Zeppelin and see the results, either as text or as visualizations, which matters especially for SQL results. Here are the key points of Zeppelin's Flink support; a sketch of such a note paragraph follows the list:

Three run modes supported: Local, Remote, and YARN

Support for running Scala, batch SQL, and stream SQL

Support for visualizing static and dynamic tables

Automatic association of the job URL

Support for cancelling jobs

Support for Flink job savepoints

Support for advanced ZeppelinContext features, such as creating controls

Three built-in tutorial notes: Streaming ETL, Flink Batch Tutorial, and Flink Stream Tutorial
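As a hedged sketch of what this looks like in practice, the following Zeppelin note paragraph runs a small batch word count through the Flink interpreter. It assumes the %flink interpreter and the benv variable (the batch execution environment) that the interpreter injects into the note, as in the later open-source Zeppelin Flink integration:

    %flink
    // benv is the batch ExecutionEnvironment the Flink interpreter injects
    val counts = benv
      .fromElements("to be", "or not", "to be")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .groupBy(0)
      .sum(1)
    counts.print()

Running the paragraph submits the job and renders the result inline, with no local packaging or bin/flink invocation.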

Some of these changes are in Flink and some in Zeppelin. Before all of them are pushed back to the Flink and Zeppelin communities, you can use the Zeppelin Docker image (the Blink open-source documentation has details on how to download and install it) to test these features. To make it easy to try, this version of Zeppelin ships with three built-in Flink tutorial notes: one for streaming ETL, and two basic examples for Flink batch and Flink stream. For more information on how to use it, refer to the following two links:

https://flink-china.org/doc/blink/ops/zeppelin.html

https://flink-china.org/doc/blink/quickstart/zeppelin_quickstart.html

Flink's Support for Machine Learning

Flink is one of the most important computing engines in the big data architecture. At present its main application scenario is traditional data computation and processing, that is, traditional BI (real-time data warehouses, real-time statistical reports, and so on). The 21st century is seeing an explosion of AI, and more and more enterprises and industries are adopting AI technology to transform themselves. As a big data computing engine, Flink is indispensable in this shift. Although Flink was not built for machine learning, it still has an indispensable role to play there. In the future, there are three major things Flink can do in the machine learning field:

Building machine learning pipelines

Supporting traditional machine learning algorithms

Integrating with other deep learning frameworks

Machine learning has two main phases, training and prediction, but they are only a small part of the whole. Before training, we have to do data cleaning, transformation, normalization, and so on; after training, we have to evaluate the model. The same holds for the prediction phase. In a complex machine learning system, how well each step is glued together is particularly important for the system's robustness and extensibility. FLINK-11095 is the community's effort to this end.
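To illustrate why such an abstraction helps, here is a hedged sketch of the Estimator/Transformer pattern that pipeline APIs of this kind are typically built on. The trait names follow the common ML-pipeline convention and are illustrative; they are not the exact interfaces proposed in FLINK-11095:

    import org.apache.flink.table.api.Table

    // A Transformer maps one Table to another (cleaning, normalization, scoring);
    // an Estimator is fitted on training data and yields a Transformer (the model).
    trait Transformer {
      def transform(input: Table): Table
    }

    trait Estimator {
      def fit(training: Table): Transformer
    }

    // Chaining stages keeps preprocessing, training, and evaluation together
    // as one robust, extensible unit.
    class Pipeline(stages: Seq[Transformer]) extends Transformer {
      override def transform(input: Table): Table =
        stages.foldLeft(input)((data, stage) => stage.transform(data))
    }

Because a Pipeline is itself a Transformer, the whole chain can be reused for both training-time preprocessing and serving-time prediction.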

Flink already has a flink-ml module that implements some traditional machine learning algorithms, but it still needs further improvement.

The Flink community is also actively working on deep learning support. Alibaba has an internal TensorFlow on Flink project: users run TensorFlow inside a Flink job, use Flink for data processing, and then send the processed data to TensorFlow's Python processes for deep learning training. At the language level, Flink is working on Python support. Currently Flink offers only Java and Scala APIs. Both languages are JVM-based, which suits systems-oriented big data engineers but not the data analysts and data scientists doing analysis and machine learning, who generally prefer higher-level languages such as Python and R. The Flink community has begun to discuss Python support, because Python has grown especially fast in recent years, driven largely by AI and deep learning: the popular deep learning libraries, such as TensorFlow, PyTorch, and Keras, all provide Python APIs. With Python support, users will be able to string together the whole machine learning pipeline in one language, improving development efficiency.

Submission, Operation, and Maintenance of Flink Jobs

In a development environment, a Flink job is typically submitted by running the shell command bin/flink run. In a real production environment, however, this approach raises many problems: how to track and manage job status, how to retry when a job fails, how to start multiple Flink jobs concurrently, how to modify submission parameters conveniently, and so on. These problems can all be solved by human intervention, but human intervention is the most dangerous thing in a production environment; everything that can be automated should be automated, and such a tool is indeed missing from the Flink ecosystem. Alibaba already has one internally; it has been running stably in production for a long time and has proved to be a reliable tool for submitting and maintaining Flink jobs. Alibaba is preparing to open-source it, stripping out some internally dependent components, and expects to do so in the first half of 2019.

In general, the Flink ecosystem today has many problems but also many opportunities. The Apache Flink community keeps working to build a stronger ecosystem so that the engine's computing power can be brought into full play, and everyone interested is welcome to take part. Let's work together to build a healthy and powerful Flink ecosystem.

For more information, please visit the Apache Flink Chinese Community website.
