Opportunity and Challenge of Flink big data Computing


Author: Wang Shaoyi (Dasha)

This article is based on Wang Shaoyi's talk at the Flink China Meetup on August 11, 2018.

Wang Shaoyi, nicknamed "Dasha", holds a Ph.D. in Computer Engineering from the University of California, San Diego, and is an Apache Flink Committer. He is currently in charge of the Flink platform and ecosystem work at Alibaba.

This article covers the following topics:

Core technology of stream computing

Flink was created by data Artisans in Germany. Early Flink focused mainly on batch computing, but Spark already had a clear advantage in batch processing and head-on competition made little sense, so the project changed direction and built stream computing on the Chandy-Lamport algorithm. Once complete, this elegantly solved the problems of low latency and state management.

Low latency, fast fault tolerance

Low latency is where Flink started, and it also guarantees fast fault tolerance. In big data computing, jobs always fail eventually, so they must be able to recover quickly. If normal latency is very low but recovery after a job failure takes several minutes, that is unacceptable.
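The checkpoint-and-recover idea behind fast fault tolerance can be illustrated with a toy operator. This is only a conceptual sketch, not Flink's implementation: all class and method names here are illustrative, and real Chandy-Lamport checkpointing coordinates snapshots across operators with barriers flowing through the data streams.

```python
# Conceptual sketch (not Flink's implementation): an operator that counts
# events, periodically snapshots its state, and restores from the last
# snapshot after a failure. All names here are illustrative.

class CountingOperator:
    def __init__(self):
        self.count = 0          # in-flight state
        self.snapshot = 0       # last checkpointed state

    def process(self, event):
        self.count += 1

    def checkpoint(self):
        # In Chandy-Lamport-style checkpointing, a barrier flowing with
        # the data triggers a consistent snapshot of operator state.
        self.snapshot = self.count

    def recover(self):
        # After a failure, roll back to the last checkpoint; the source
        # replays the events that arrived after it.
        self.count = self.snapshot

op = CountingOperator()
for e in range(5):
    op.process(e)
op.checkpoint()              # snapshot now holds 5
for e in range(3):
    op.process(e)            # count reaches 8, not yet checkpointed
op.recover()                 # failure: roll back to the snapshot of 5
```

Because only the delta since the last snapshot must be replayed, recovery time stays proportional to the checkpoint interval rather than to the job's total history.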

General API, easy to use

Once Flink had these basic capabilities, it began to think about a general-purpose API, starting with some Java and Scala APIs. This matters because the API is open not just to developers but to all users. How to meet users' needs and support them more easily is the core of stream computing.

Elasticity, high performance

Elasticity and high performance are constant themes in big data. Ensuring that the engine runs without problems on thousands of machines makes scalability very important; even Spark ran into many problems in its early days once it reached a certain scale, though Blink has since solved these problems well. In terms of performance, Flink has a clear advantage not only in stream computing but also in batch processing.

Unifying stream and batch

Flink's early interfaces were very weak, as were early Spark's, so the stream computing community began to discuss what stream SQL should look like. Two schools of thought emerged: one held that Streaming SQL is a different kind of SQL from Batch SQL; the other held that it is exactly the same as Batch SQL.

Why say exactly the same? A basic difference between stream and batch computing is that both are computations, but stream computing needs to show results early, so results are emitted in advance, and data arriving later corrects the earlier results. So the big difference is that results are emitted early and then revised to ensure the data ends up correct.

How to deal with this problem:

First, what to compute is expressed through the API and is entirely the user's semantics.

The other two points, when to emit a result and when to correct it, have nothing to do with the description in the SQL itself.

So traditional ANSI SQL can fully describe stream computing, and the semantics of Flink SQL is exactly what ANSI SQL users expect.
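The emit-early-then-correct behavior described above can be sketched in a few lines. This is an illustrative model, not Flink's API: a streaming GROUP BY count emits a result for every input row and retracts the previous one, so the final emitted value matches what batch SQL would compute over the full input.

```python
# Illustrative sketch: a streaming per-key count that emits early results
# and retracts them as new data arrives. The final state equals the batch
# answer over the whole input.

def streaming_count(stream):
    counts, changelog = {}, []
    for key in stream:
        if key in counts:
            # Correct the previously emitted result for this key.
            changelog.append(("retract", key, counts[key]))
        counts[key] = counts.get(key, 0) + 1
        # Emit the (possibly still premature) new result.
        changelog.append(("insert", key, counts[key]))
    return counts, changelog

final, changelog = streaming_count(["a", "b", "a", "a"])
# final == {"a": 3, "b": 1}, the same answer batch SQL would give;
# the changelog records every early emission and its later retraction.
```

The SQL text never mentions retraction; it only defines what to compute. When to emit and when to retract is the engine's business, which is exactly why ANSI SQL suffices for streams.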

High performance

Advanced analysis

Easy to develop

Works out of the box

Low latency


We are talking about big data here, not just stream computing. Functional users care more about ease of use: how to do good analysis, how to develop more effectively, and how to get started more easily. Many of them have never studied computer science and may come from statistics, biology, construction, or finance. The question is how they can develop more quickly.

If the boss says he wants to deploy Flink today and gives you 50 machines, and by the next day your deployment is done and your jobs are running, the boss will be impressed and think your KPI is excellent. So out-of-the-box usability makes it easier for users to build what they need.

Traditional batch computing has always pursued performance, and stream computing's demand for performance is now growing as well.

I. The present situation and future of Flink

Knowing what users want, let's take a look at the status quo of Flink.

Flink is widely used in ultra-low-latency stream computing scenarios, but it also delivers very high batch performance, offers a unified API for stream and batch, and does well in both performance and ease of use.

Beyond what is already known, let's look at some of the things Flink can do: stream computing is very mature, and there is batch computing and AI computing, including TF on Flink for training and prediction. There is also a large IoT space; Hadoop Summit emphasized that among all kinds of data, streaming or batch, IoT data will ultimately be the largest. Not every company touches IoT yet, but it is definitely a big part of the future.

1. Alibaba's Blink

Blink 1.0 is effectively the enterprise version of Flink, focused on stream computing.

Blink 2.0 is a unified engine that supports both streaming and batch processing, with great improvements in other areas such as AI, and it far outperforms Spark in batch performance. This same version is what is being contributed back to the community.

2. Architecture of the Flink SQL Engine

Let's look at the Flink SQL engine: starting from the Query API, through Query Optimization, queries are translated to DataStream or DataSet operators and then run by the Runtime on the cluster. This architecture is built around DataStream and DataSet, and several big problems are visible:

It was never designed to be unified. After Query Optimization, translation produces two separate pipelines, one for DataStream and one for DataSet, and the code below them cannot be reused.

Look also at batch computing: there is another Optimized Plan under DataSet. These two layers of optimization make unification very difficult.

3. Architecture of the Blink SQL Engine

Blink replaces the entire SQL engine with the one shown above. From the API at the top down to the Query Processor, which includes the Query Optimizer and Query Executor, the code is greatly reduced and reused. A job only needs to indicate whether it runs in Batch Mode or Stream Mode; with the same SQL, it gets the same result.

Starting from the API, the Logical Plan is translated by the Optimizer into a Physical Plan, much like writing a DataStream program. Everything before the Optimizer is exactly the same for batch and stream: the SQL and the Logical Plan are identical. That is, what is in the user's mind is exactly the same for batch and stream.

II. Challenges and opportunities in optimizing stream computing

After Optimizer, streams and batches are a little different.

What batch and stream share are the simple optimizations: filters, predicates, projection, and join reordering.

The difference is that we do not support sort in stream computing, because every new record forces earlier results to be updated. It is as if everyone in the room were weighed and ranked, and whenever someone steps out, the weights change, many people's rankings shift, and a lot of results have to be revised. So forget about things like sort on a stream. But because streams rely on state, the questions become how to make state access very fast, how to reduce retraction, and how to use the user's SLA to optimize with MicroBatch.

Once stream computing becomes SQL, it has to run the standard SQL benchmarks, TPC-H and TPC-DS. Take TPC-H Q13: it is tested with a Customer table and an Orders table, and requires a join and a count.

This calculation is easy in batch computing: both tables are fully available, the optimizer knows the Customer table is small, broadcasts it as a hash table to every node, and streams the Orders table past it. Performance is high because the largest table, Orders, just streams through without being held in memory.

How does stream computing handle this? Since you do not know what data will arrive, both sides must be kept in state. The Customer table on the left needs only one row per key, so it uses ValueState; but each customer has many orders, so the Orders side needs MapState, which costs a lot of computation and performs poorly.

How to optimize it? The SQL we use has a natural advantage here: the Optimizer. The SQL engine has a rule for algebraic optimization between the count aggregate above and the join below. Regardless of what the data looks like, algebraically the middle plan and the rightmost plan produce the same result, so we can aggregate both sides first: turn each customer's orders into a single count row on the Orders side, pre-processing the data. The Orders table is thereby compressed to the same size as Customer, saving a lot of join overhead, and the state shrinks from a huge MapState to a lightweight ValueState. Performance improved 25x, which is why SQL makes sense.
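The algebraic rewrite above, pushing the aggregation below the join, can be checked on a toy schema. The tables and column names below are simplified stand-ins, not the real TPC-H schema: counting orders per customer either by joining first and then aggregating, or by pre-aggregating Orders and joining the compact result, returns the same answer.

```python
# Sketch of the aggregate-below-join rewrite on a toy schema (not the
# real TPC-H tables). Both plans compute orders-per-customer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (c_id INTEGER);
    CREATE TABLE orders   (o_id INTEGER, o_c_id INTEGER);
    INSERT INTO customer VALUES (1), (2);
    INSERT INTO orders   VALUES (10, 1), (11, 1), (12, 2), (13, 1);
""")

# Plan A: join first, then count. Every order row flows into the join,
# so a streaming engine would need MapState for the Orders side.
plan_a = conn.execute("""
    SELECT c_id, COUNT(o_id) FROM customer
    LEFT JOIN orders ON c_id = o_c_id
    GROUP BY c_id ORDER BY c_id
""").fetchall()

# Plan B: pre-aggregate Orders down to one row per customer, then join.
# The join input is now the same size as Customer: ValueState suffices.
plan_b = conn.execute("""
    SELECT c_id, COALESCE(o.cnt, 0) FROM customer
    LEFT JOIN (SELECT o_c_id, COUNT(*) AS cnt
               FROM orders GROUP BY o_c_id) AS o ON c_id = o.o_c_id
    ORDER BY c_id
""").fetchall()

assert plan_a == plan_b == [(1, 3), (2, 1)]
```

Because the two plans are algebraically equivalent, the optimizer can pick Plan B without looking at the data, which is exactly what makes the state blow-up avoidable in the streaming case.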

For some stream-specific optimizations, if the user's SLA is known, mini-batch can be configured to buffer updates for a short while.
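The mini-batch idea can be sketched as follows. This is an illustrative model, not Flink's configuration or API: instead of updating keyed state once per record, a small batch is buffered in memory and one combined update per key is applied, trading a little latency (bounded by the SLA) for far fewer state accesses.

```python
# Illustrative mini-batch sketch: buffer per-key increments and flush
# them to state once per batch, counting how often state is touched.

def mini_batch_count(stream, batch_size, state):
    state_accesses = 0
    buffer = {}

    def flush():
        nonlocal state_accesses
        for key, delta in buffer.items():
            state[key] = state.get(key, 0) + delta
            state_accesses += 1      # one state update per key per batch
        buffer.clear()

    for i, key in enumerate(stream, 1):
        buffer[key] = buffer.get(key, 0) + 1   # cheap in-memory update
        if i % batch_size == 0:
            flush()
    flush()                                    # drain the final partial batch
    return state_accesses

state = {}
accesses = mini_batch_count(["a", "a", "b", "a"], batch_size=4, state=state)
# One batch of 4 records touches state only twice (keys "a" and "b"),
# instead of 4 times for per-record updates.
```

With batch_size=1 this degenerates to per-record state access, so the knob directly expresses the latency-versus-throughput trade-off the SLA allows.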

To count across the whole network, the naive plan sends the red and purple records in the left figure above to a single node for counting. Without preprocessing, that node's load is too high, which quickly causes backpressure. The better way is to chain the red and purple nodes with their upstream operators to do preprocessing, which splits one aggregation into two: first a local count, then a global sum.
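The local-global split described above can be sketched in a few lines. This is a conceptual model, not Flink's operator chaining API: each upstream task pre-counts its own slice (the local phase), and the single downstream task only sums a few partial results (the global phase) instead of receiving every raw record.

```python
# Sketch of local-global aggregation: pre-count per upstream task,
# then sum the partial counts at the single global node.

def local_count(partition):
    # Chained with the upstream operator: no extra network hop needed.
    return len(partition)

def global_sum(partials):
    # The formerly hot node now sees one number per upstream task.
    return sum(partials)

partitions = [["x"] * 1000, ["x"] * 2000, ["x"] * 500]  # three upstream tasks
total = global_sum(local_count(p) for p in partitions)
# The global node received 3 partial counts instead of 3500 records.
```

The split works because count decomposes into count-then-sum; the same trick applies to any aggregate that can be combined from partial results, such as sum, min, or max.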

Of course, the scheme above does not always work. Take count distinct, which must group by color but also be distinct on some column, so different rows cannot simply be pre-aggregated. For local-global, in addition to chaining there is also a shuffle variant, equivalent to shuffling twice, which is what we call "breaking up" in stream computing: the first shuffle partitions by the distinct key, the second by the group-by key. The SQL engine does all of this for you automatically.
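The two-shuffle COUNT(DISTINCT) plan can be sketched as follows. This is an illustrative model, not the engine's actual plan: the first shuffle partitions by (group, value) so duplicates of the same value land on the same task and can be deduplicated there; the second shuffle partitions by group alone and simply counts the surviving distinct values.

```python
# Sketch of the two-phase "break up" plan for COUNT(DISTINCT value)
# GROUP BY group. Phase 1 shuffles by the distinct key, phase 2 by the
# group key.

def count_distinct_two_phase(rows, num_tasks=4):
    # Phase 1: shuffle by (group, value); a set per task deduplicates
    # locally, so no single task must hold all values of one group.
    phase1 = [set() for _ in range(num_tasks)]
    for group, value in rows:
        phase1[hash((group, value)) % num_tasks].add((group, value))

    # Phase 2: shuffle the deduplicated pairs by group and count them.
    result = {}
    for part in phase1:
        for group, _value in part:
            result[group] = result.get(group, 0) + 1
    return result

rows = [("u1", "a"), ("u1", "a"), ("u1", "b"), ("u2", "a")]
distinct = count_distinct_two_phase(rows)   # {"u1": 2, "u2": 1}
```

Duplicates are eliminated in the first phase regardless of which task they hash to, which is why the skew of a hot group is spread across all tasks instead of overwhelming one.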

III. Integrating into the open source community and participating in open source development

Beyond code, the open source community also values documentation, ecosystem, community, and product contributions: anything that helps the open source product counts. What matters more is your activity in the community and the problems you solve for it.

As a user, you can ask questions, answer questions on the mailing list, do testing, report issues, and so on.

As a developer, you can review code, contribute your own ideas and large refactorings, and help answer other users' questions.

Mailing lists:

Developers ask questions and communicate.

Users ask questions and communicate.

JIRA: https://issues.apache.org/jira/browse/FLINK

This is how the community works: bugs, features, and improvements are proposed there, and every code contribution is associated with a JIRA issue.

Wiki: https://cwiki.apache.org/confluence/display/FLINK

There are many documents there, including many FLIPs, which of course are waiting for your contribution.

So how to participate in the development?

You need to present your ideas in the community and collect some suggestions.

Each part of the code has PMC members or committers responsible for it; you can contact them and ask them to help review your change.

Small problems can be handled through JIRA alone, but more significant improvements require a FLIP.

After finishing, you contribute the code. Make sure the quality is high and add plenty of test cases; when you open a pull request, many people will review your code, and if there are no problems it will be merged.

For more information, please visit the Apache Flink Chinese Community website.
