Typical misuses of Apache Flink: an analysis by example

Updated 2025-07-03. Source: SLTechnology News&Howtos, Shulou (Shulou.com), 06/01 report.

This article analyzes typical misuses of Apache Flink by example and describes a better practice for each, in the hope of giving readers who face these problems a simpler and more workable path.

Abstract: The following is compiled from the Chinese highlights session of the Flink Forward Global online conference. Following the whole life cycle of a project (project start, requirement analysis, development, testing, launch, and operations and maintenance), it introduces typical misuses in Apache Flink practice and gives a better practice for each. The first misuse in Flink practice is not following an iterative development process.

1. Project start

Before starting development, choose the right entry point. The following are often the worst ways to start:

a) starting with a challenging use case (a combination of end-to-end exactly-once, large state, complex business logic, and a strict real-time SLA)
b) having no stream processing experience
c) not training the team
d) not taking advantage of the community

Plan the entry point carefully. Start with simple tasks and proceed step by step; build up some knowledge of big data and stream processing; take part in training where possible; and make good use of community resources. With this approach, a suitable entry point can be found quickly.

How? The community provides plenty of training, including the courses on the Flink Forward and Ververica websites. The community also runs a Chinese mailing list, which you can use to solve the problems at hand. Stack Overflow is another good place to ask questions, but search the existing questions first and make sure your own question is clearly formulated.

Mailing lists:

user@flink.apache.org / user-zh@flink.apache.org

Stack Overflow:

www.stackoverflow.com

2. Design and analysis

Common mistakes in solution design usually come from not thinking the requirements through, for example:

a) not considering data consistency and delivery guarantees
b) not considering business upgrades and application improvements
c) not considering the scale of the business
d) not thinking deeply about the actual business needs

Analyze the requirements carefully and consider the actual delivery situation seriously. For consistency and delivery guarantees, a few questions can guide the decision, as shown in the following figure:

The first question: do you care about losing data?

If not, you can do without checkpoints.

The second question: do you care about the correctness of the results?

In many scenarios, such as the financial sector, correctness matters a great deal, while monitoring and other simple scenarios need only approximate summary statistics. If you do not care about the correctness of the results, configure at-least-once mode and use a replayable data source. If correctness matters and the downstream tolerates duplicate records, set exactly-once mode and use a replayable data source. If the downstream requires that each record be delivered exactly once, with no duplicates even when the results are correct, the sink is further restricted: use exactly-once mode with a replayable data source and a sink that supports transactions.
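The questions above can be sketched as a small decision helper. This is only an illustration of the reasoning, not Flink code: the `DeliveryPlan` type and the `choose_delivery_plan` function are hypothetical names introduced here, and the returned fields simply restate the recommendations in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DeliveryPlan:
    """Hypothetical summary of the settings the text recommends."""
    checkpointing: bool        # enable checkpoints at all?
    mode: Optional[str]        # "at-least-once", "exactly-once", or None
    replayable_source: bool    # source must be able to replay records
    transactional_sink: bool   # sink must support transactions


def choose_delivery_plan(care_about_loss: bool,
                         care_about_correctness: bool,
                         downstream_tolerates_duplicates: bool) -> DeliveryPlan:
    # Question 1: if losing data is acceptable, checkpoints are not needed.
    if not care_about_loss:
        return DeliveryPlan(False, None, False, False)
    # Question 2: if results need not be exact, at-least-once is enough.
    if not care_about_correctness:
        return DeliveryPlan(True, "at-least-once", True, False)
    # Exact results, but duplicates downstream are fine: exactly-once mode.
    if downstream_tolerates_duplicates:
        return DeliveryPlan(True, "exactly-once", True, False)
    # No duplicates allowed downstream: the sink must also be transactional.
    return DeliveryPlan(True, "exactly-once", True, True)


# A financial job whose downstream cannot accept duplicates.
print(choose_delivery_plan(True, True, False))
```

Each stricter answer adds a requirement, which is why the strictest configuration is not chosen by default.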

Analyzing the business this way makes it clear how to use Flink and helps avoid pitfalls.

Once the analysis is done, what is the ultimate goal? Why make this choice at all instead of simply picking the strongest guarantee from the start?

Because there is no universal "best": the core factor is latency, and the guarantee must be balanced against the latency and accuracy the business needs.

After all the requirements have been analyzed, you still need to think about whether the application will need to be upgraded. For a typical Flink job, there are several things to consider. First, Flink jobs are generally stateful, so an upgrade has to go through the savepoint mechanism: the state is saved to remote storage and then restored into the new job. Upgrades are needed in many scenarios; to list a few:

a) upgrading the cluster version
b) fixing business bugs
c) changing the business logic (topology)

In more complex scenarios the job topology changes, as shown in the following figure:

Here an operator has to be added and a sink removed. For such a change, state recovery must be considered. When Flink finds that an operator is missing from the new job, so that its state cannot be restored, it throws an exception and the upgrade fails. In that case, the allowNonRestoredState option tells Flink to skip such state.

An operator that is new in the new job is simply initialized with empty state. To make sure the job starts successfully and state recovery is not affected, set a uid on every operator in the DataStream API. If the structure of the state changes, schema evolution is supported for Avro types and POJOs, but not for Kryo. Finally, avoid changing the type of any key wherever possible, because that affects the shuffle and the correctness of the state.

Resource usage is another factor that must be considered. Here is a way to estimate memory and network I/O usage, assuming the FS state backend, where all runtime state is held in memory. Allocating resources improperly can cause serious problems such as OOM errors.
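As a rough illustration of such an estimate, the sketch below multiplies out the obvious factors. All function names and numbers are hypothetical assumptions for the example, not values from the talk; real jobs also carry object and serialization overhead, which the optional `overhead_factor` only crudely approximates.

```python
def estimate_state_bytes(num_keys, bytes_per_key, overhead_factor=1.0):
    """Total keyed state held in memory, in bytes."""
    return int(num_keys * bytes_per_key * overhead_factor)


def estimate_shuffle_bytes_per_sec(records_per_sec, bytes_per_record, num_shuffles=1):
    """Network traffic caused by repartitioning records, in bytes per second."""
    return records_per_sec * bytes_per_record * num_shuffles


def per_subtask(total, parallelism):
    """Even share of a total across parallel subtasks (assumes no key skew)."""
    return total / parallelism


# Hypothetical job: 500 million keys with 40 bytes of state each,
# 1 million records/s of 2 KiB each, one shuffle, parallelism 50.
state = estimate_state_bytes(500_000_000, 40)
traffic = estimate_shuffle_bytes_per_sec(1_000_000, 2048)
print(f"state per subtask:   {per_subtask(state, 50) / 1e9:.2f} GB")
print(f"shuffle per subtask: {per_subtask(traffic, 50) / 1e6:.2f} MB/s")
```

If the per-subtask state does not fit comfortably in the memory given to each task, that is the kind of misallocation that ends in an OOM error.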

After the resource assessment, event time and out-of-order events also have to be considered. Here is a concrete example:

In this example, which time window to choose and when to trigger the computation cannot be specified by a one-sentence requirement. Flink can be used properly only when the requirements are analyzed carefully in light of both the characteristics of stream processing and the actual business.
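To make the trade-off concrete, here is a simplified, Flink-free simulation of a tumbling event-time count window with a bounded out-of-orderness watermark. It is a sketch under stated assumptions: the watermark advances after every record (Flink emits watermarks periodically), late records are simply dropped, and the function name and the sample timestamps are made up for the example.

```python
def tumbling_counts(timestamps, window_size, max_out_of_orderness):
    """Count records per tumbling window, given timestamps in arrival order.

    Returns (fired, still_open, late):
      fired      - window start -> count, for windows the watermark has passed
      still_open - window start -> count, for windows not yet triggered
      late       - timestamps that arrived after their window had fired
    """
    open_windows, fired, late = {}, {}, []
    max_ts = float("-inf")
    watermark = float("-inf")
    for ts in timestamps:
        start = ts - ts % window_size
        if start + window_size <= watermark:
            late.append(ts)          # window already fired: record is late
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        # The watermark trails the largest timestamp seen so far.
        max_ts = max(max_ts, ts)
        watermark = max_ts - max_out_of_orderness
        # Fire every open window whose end the watermark has reached.
        for s in sorted(open_windows):
            if s + window_size <= watermark:
                fired[s] = open_windows.pop(s)
    return fired, open_windows, late


arrivals = [1, 4, 12, 9, 14, 8, 25]   # event timestamps, out of order
for delay in (0, 3):
    fired, still_open, late = tumbling_counts(arrivals, 10, delay)
    print(f"delay={delay}: fired={fired} open={still_open} late={late}")
```

With these sample arrivals, tolerating 3 time units of disorder drops one record as late, while tolerating none drops two; the price of the larger tolerance is that each window fires later. Which side of that trade-off is right depends on the business.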

Note also that Flink is a unified engine for stream and batch processing, and not every business case suits stream processing, nor every one batch processing. Analyze which of the two fits your scenario.

3. Development

3.1 Choice of API

When choosing between the DataStream API and the Table API/SQL: if you need fine control over state and over the behavior of each piece of state, use the DataStream API; for simple data extraction and relational-algebra operations, the Table API/SQL is enough. Some scenarios can only be handled with the DataStream API:

a) changing state during an upgrade
b) not losing late data
c) changing the behavior of the program at run time

3.2 Data types

During development, there are two common misuses of data types:

a) using deeply nested, complex data types
b) using arbitrary types in a KeySelector

The right approach is to keep state types as simple as possible and not to use types in a KeySelector that Flink cannot recognize automatically.

3.3 Serialization

Because of the serialization cost, the simpler the data type the better; prefer POJOs and Avro SpecificRecords. It also helps to use your IDE's local debugging tools to see where the performance bottleneck is.

Serializer                    Ops/s
PojoSerializer                813
Kryo                          294
Avro (Reflect API)            114
Avro (SpecificRecord API)     632

Figure 5 shows an inefficient pipeline. Filter and project first, so that no work is spent on data that is not needed.
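The effect of filtering and projecting early can be shown without Flink. In the sketch below, `expensive_enrich` is a hypothetical stand-in for any costly step (parsing, a lookup, serialization between operators), and the event fields are invented for the example; it assumes the filter does not depend on the output of the costly step.

```python
def expensive_enrich(event, calls):
    """Stand-in for a costly step; records each call so the cost is visible."""
    calls.append(event["user"])
    return {**event, "amount_cents": event["amount"] * 100}


def enrich_then_filter(events, calls):
    # Inefficient: every event is enriched, most are thrown away afterwards.
    enriched = [expensive_enrich(e, calls) for e in events]
    return [
        {"user": e["user"], "amount_cents": e["amount_cents"]}
        for e in enriched
        if e["type"] == "purchase"
    ]


def filter_then_enrich(events, calls):
    # Filter first, then project to the two fields the job actually needs.
    needed = (
        {"user": e["user"], "amount": e["amount"]}
        for e in events
        if e["type"] == "purchase"
    )
    return [
        {"user": e["user"], "amount_cents": expensive_enrich(e, calls)["amount_cents"]}
        for e in needed
    ]


events = [
    {"user": "a", "type": "purchase", "amount": 3, "payload": "x" * 1000},
    {"user": "b", "type": "view", "amount": 0, "payload": "x" * 1000},
    {"user": "c", "type": "view", "amount": 0, "payload": "x" * 1000},
    {"user": "d", "type": "purchase", "amount": 7, "payload": "x" * 1000},
]
slow_calls, fast_calls = [], []
slow = enrich_then_filter(events, slow_calls)
fast = filter_then_enrich(events, fast_calls)
print(len(slow_calls), len(fast_calls), slow == fast)
```

Both pipelines produce the same result, but the second runs the costly step only on the records that survive the filter and never carries the unused `payload` field along.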

3.4 Concurrency

Two misuses and the problems they tend to cause:

Sharing static variables between tasks

This easily causes bugs, deadlocks, and race conditions, and adds synchronization overhead.

Spawning threads in user functions

This makes checkpointing complex and error-prone.

If you are tempted to use threads: to speed up a job, adjust the parallelism and resources or use asynchronous I/O; to trigger scheduled tasks, use Flink's built-in timers.

3.5 Windows

Avoid custom windows like the one in figure 6; a KeyedProcessFunction makes the implementation simpler and more stable.

Also avoid sliding windows like the one in figure 7, where every record is assigned to 500,000 windows, which is expensive in both computing resources and business latency.
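The number of windows a record lands in follows directly from the window size and the slide. The sketch below computes it; the function name is hypothetical, and the size and slide in the first call are assumptions picked only to reproduce the 500,000 figure, since the parameters in the original figure are not given here.

```python
from datetime import timedelta


def windows_per_record(size: timedelta, slide: timedelta) -> int:
    """Upper bound on how many sliding windows contain a single record."""
    if size <= timedelta(0) or slide <= timedelta(0):
        raise ValueError("size and slide must be positive")
    full, rest = divmod(size, slide)   # int quotient, timedelta remainder
    return full + (1 if rest else 0)


# Hypothetical configuration: a very long window that slides every second.
print(windows_per_record(timedelta(seconds=500_000), timedelta(seconds=1)))
# A more typical configuration: 10-minute window sliding every 5 minutes.
print(windows_per_record(timedelta(minutes=10), timedelta(minutes=5)))
```

Every incoming record has to update each of those windows, so the per-record work and the window state grow in proportion to size divided by slide.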

3.6 Queryable State

Queryable State is still being improved. It can be used for monitoring and queries, but some issues need attention before it goes into production. For example, the RocksDB state backend supports thread-safe access, whereas the FS state backend does not. Performance and consistency also need attention.

3.7 Use of the DataStream API

For a scenario like figure 8, the DataStreamUtils#reinterpretAsKeyedStream method avoids shuffling repeatedly on the same key.

For a scenario like figure 9, put the initialization logic in the open method of a RichFunction.

4. Testing

Besides system tests and UDF unit tests, also run Mini Cluster tests: running the end-to-end job on a Mini Cluster on a local machine surfaces some problems early.

There are also test harnesses, which help with testing stateful tasks because they give precise control over watermarks, the event time of elements, and so on. See:

https://github.com/knaufk/flink-testing-pyramid

5. Going live

Many things can make a job's behavior jitter. One is that the business itself fluctuates. Others are normal phenomena such as timers firing, checkpoint alignment, and GC. Catch-up scenarios behave differently too: a job working through a backlog looks different at the start than once it has caught up. Do not be alarmed in these cases; learn to recognize them, then determine whether what you see is normal or unexpected.

When monitoring in production, note that too many metrics put heavy pressure on the JVM. Do not report at the subtask level, which costs a lot of resources.

For configuration, do not start with the RocksDB state backend; the FS state backend is cheaper to deploy and faster. Minimize the use of network file systems. Keep the SlotSharingGroups configuration at its default as far as possible, to avoid breaking the underlying mechanism and wasting resources.

6. Maintenance

Flink is a fast-moving project and every release contains many bug fixes, so it is important to upgrade in a timely manner.

7. Supplement: PyFlink/SQL/Table API

TableEnvironment or StreamTableEnvironment? TableEnvironment is recommended. (piecewise optimization)

Not setting a state TTL makes state grow without bound; setting a state TTL that does not match the business requirements causes data correctness problems.

Job upgrades are not supported: for example, adding a COUNT or SUM makes the job's state incompatible.

When parsing JSON, calling a UDF repeatedly seriously hurts performance. Replace it with a UDTF.

In a multi-stream JOIN, join the small tables first and the large tables afterwards. Flink currently has no metadata about the tables, so the optimizer cannot reorder joins automatically.
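Why the order matters can be seen with a toy hash join in plain Python. The tables, column names, and the `hash_join` helper are hypothetical and stand in for Flink's streaming joins only loosely; the point is the size of the intermediate result, which a streaming join would also have to keep as state.

```python
def hash_join(left, right, key):
    """Inner equi-join of two lists of dicts on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Two small tables and one large table, all joined on "id".
small_a = [{"id": i, "a": f"a{i}"} for i in range(10)]       # ids 0..9
small_b = [{"id": i, "b": f"b{i}"} for i in range(5, 15)]    # ids 5..14
big_c = [{"id": i % 20, "c": i} for i in range(1000)]        # 1000 rows

# Plan 1: large table first.
big_first_mid = hash_join(big_c, small_a, "id")
big_first = hash_join(big_first_mid, small_b, "id")

# Plan 2: small tables first, large table last.
small_first_mid = hash_join(small_a, small_b, "id")
small_first = hash_join(small_first_mid, big_c, "id")

print("intermediate rows:", len(big_first_mid), "vs", len(small_first_mid))
print("final rows:", len(big_first), "vs", len(small_first))
```

Both plans return the same rows, but joining the small tables first keeps the intermediate result far smaller. Since the optimizer cannot pick this order itself, the query has to be written that way.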
