This article walks through common Flink interview questions. The explanations are kept simple and clear so they are easy to learn and understand.
1. Application architecture
Question: how does your company submit real-time jobs, and how many JobManagers are there?
Answer:
1. We submit jobs in yarn-session mode. A new Flink cluster is created for each submission, i.e. a dedicated yarn-session per job, so jobs are isolated from one another and easy to manage, and the cluster goes away once the job finishes. The online command script is as follows:
bin/yarn-session.sh -n 7 -s 8 -jm 3072 -tm 32768 -qu root.*.* -nm *-* -d
This requests 7 TaskManagers with 8 slots each and 32768 MB of memory per TaskManager, and allocates 3072 MB to the JobManager (-qu sets the YARN queue, -nm the application name, and -d runs the session detached).
2. The cluster has only one JobManager by default, which is a single point of failure. To guard against this we configure high availability: our company typically runs one active JobManager and two standby JobManagers, coordinated through ZooKeeper.
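For reference, a minimal ZooKeeper HA setup in flink-conf.yaml might look like the sketch below; the quorum hosts, storage path, and cluster id are placeholders, not values from any real deployment:

high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
high-availability.cluster-id: /flink-cluster

Leader election runs through ZooKeeper, while the JobManager metadata needed for recovery is persisted to the storage directory.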
2. Pressure testing and monitoring
Question: how do you do stress testing and monitoring?
Answer: the pressure we encounter generally comes from the following sources:
First, if data is produced faster than the downstream operators can consume it, backpressure arises. Backpressure can be monitored visually in the Flink Web UI (localhost:8081), so it is noticed as soon as it appears. It is often caused by an unoptimized sink operator, in which case some targeted optimization usually suffices: for example, when writing to ElasticSearch, switch to bulk writes and increase the ElasticSearch queue size.
Second, the maximum watermark delay: if it is set too large, it can put pressure on memory. Set the maximum delay smaller and route late elements to a side output stream, updating the results later (see the sketch after the fourth point). Alternatively, use a state backend such as RocksDB, which stores state off-heap, although I/O becomes slower, so this is a trade-off.
Third, if a sliding window is long and the slide is very short, Flink's performance degrades sharply. We mainly use a time-slicing approach so that each element is stored in only one "overlapping window" pane, which reduces state writes during window processing. (Link to details: Flink sliding window optimization.)
Fourth, we use RocksDB as the state backend and have not run into problems with state size blowing up.
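To illustrate the second point above, here is a minimal Java sketch that keeps the watermark delay small and routes late elements to a side output. The stream of (key, timestamp) tuples and the 5-second/1-minute parameters are assumptions made up for the example:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

import java.time.Duration;

public class LateDataSideOutput {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (key, eventTimeMillis) records; in production this would be a Kafka source.
        DataStream<Tuple2<String, Long>> events = env.fromElements(
                Tuple2.of("a", 1000L), Tuple2.of("a", 2000L), Tuple2.of("b", 1500L));

        // Late records are routed here instead of being dropped.
        final OutputTag<Tuple2<String, Long>> lateTag =
                new OutputTag<Tuple2<String, Long>>("late-events") {};

        SingleOutputStreamOperator<Tuple2<String, Long>> summed = events
                // Keep the watermark delay small: tolerate 5 seconds of out-of-orderness.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(
                                        Duration.ofSeconds(5))
                                .withTimestampAssigner((e, ts) -> e.f1))
                .keyBy(e -> e.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .sideOutputLateData(lateTag)   // divert late elements to the side output
                .sum(1);

        // The late stream can be used to patch or re-emit results downstream.
        DataStream<Tuple2<String, Long>> lateStream = summed.getSideOutput(lateTag);
        lateStream.print("late");
        summed.print("result");

        env.execute("late-data-side-output");
    }
}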
3. Why use Flink
Question: why use Flink instead of Spark?
Answer: the main consideration is that Flink has better support for low-latency, high-throughput streaming-data scenarios; in addition, Flink handles out-of-order data well and guarantees exactly-once state consistency.
4. Understanding checkpoints
Question: how do you understand Flink's checkpoints?
Answer: checkpoints are the core of Flink's fault-tolerance mechanism. Based on the configured interval, Flink periodically takes snapshots of the state of every operator/task in the stream and persists that state data. If the Flink program crashes unexpectedly, it can selectively recover from these snapshots when the job is rerun, correcting data anomalies caused by the failure. Snapshots can be stored in memory, in a file system, or in RocksDB.
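As a hedged illustration, turning checkpoints on via the DataStream API looks roughly like the following; the interval, pause, and timeout values are arbitrary examples, not recommendations:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Snapshot the state of every operator/task every 60 seconds.
        env.enableCheckpointing(60_000L);
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // Leave at least 30 s between the end of one checkpoint and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000L);
        // Abort any checkpoint that takes longer than 10 minutes.
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // ... build the job topology and call env.execute() here ...
    }
}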
5. The guarantee of exactly-once
Question: if the downstream storage does not support transactions, how can Flink guarantee exactly-once?
Answer: end-to-end exactly-once places high demands on the sink and can be implemented in two ways: idempotent writes and transactional writes. Whether idempotent writes are possible depends on the business logic, so transactional writes are the more general approach. Transactional writes in turn come in two flavors: write-ahead log (WAL) and two-phase commit (2PC).
If the external system does not support transactions, the write-ahead-log approach can be used: buffer the result data as state, and write it to the sink system in one batch when the checkpoint-complete notification arrives.
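A simplified sketch of that write-ahead-log idea, buffering in state and flushing on the checkpoint-complete notification. It deliberately glosses over per-checkpoint batching, which a production sink must handle so that only data covered by a completed checkpoint is written, and writeToExternalSystem is a hypothetical stand-in; Flink itself ships base classes such as TwoPhaseCommitSinkFunction for the 2PC variant:

import org.apache.flink.api.common.state.CheckpointListener;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

// WAL-style sink: records are buffered in Flink state and only pushed to the
// external system once the checkpoint that contains them has completed.
public class WalSink extends RichSinkFunction<String>
        implements CheckpointedFunction, CheckpointListener {

    private transient ListState<String> walState;    // checkpointed buffer
    private final List<String> pending = new ArrayList<>();

    @Override
    public void invoke(String value, Context context) {
        pending.add(value);                          // buffer instead of writing out
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        walState.update(pending);                    // persist the buffer with the checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        walState = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("wal", String.class));
        if (ctx.isRestored()) {
            for (String v : walState.get()) {
                pending.add(v);                      // replay unflushed records after a failure
            }
        }
    }

    @Override
    public void notifyCheckpointComplete(long checkpointId) {
        // The checkpoint is durable, so it is now safe to write out the batch.
        for (String v : pending) {
            writeToExternalSystem(v);                // hypothetical external write
        }
        pending.clear();
    }

    private void writeToExternalSystem(String v) {
        System.out.println("WRITE: " + v);           // stand-in for a real client call
    }
}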
6. State mechanism
Question: describe Flink's state mechanism.
Answer: many of Flink's built-in operators, including sources and sinks, are stateful. In Flink, state is always associated with a specific operator. Flink snapshots the state of every task via checkpoints, which guarantees state consistency during failure recovery. Storage of both state and checkpoints is managed by the state backend, and several state-backend configurations are available.
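For example, a minimal stateful operator using keyed ValueState might look like this sketch; it assumes it runs after a keyBy, so each key sees its own counter:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts events per key; the running count lives in keyed state, so it is
// snapshotted by checkpoints and restored automatically on failure.
public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = count.value();                // null on a key's first event
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        out.collect(next);
    }
}

Applied as stream.keyBy(...).flatMap(new CountPerKey()), the state is scoped per key and managed entirely by the configured state backend.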
7. Deduplication over massive key sets
Question: how do you deduplicate? Consider a real-time scenario: on Singles' Day, with a sliding window 1 hour long and a slide of 10 seconds, how do you compute the UV of 100 million users?
Answer: an in-memory set such as a Scala Set, or a Redis set, is clearly not feasible, because hundreds of millions of keys will not fit. So consider deduplicating with a Bloom filter instead.
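A minimal sketch of the idea using Guava's BloomFilter; the expected-insertions and false-positive-rate values are examples, and in a real Flink job the filter and counter would live in window state or an external store such as Redis or HBase:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;

// Space-bounded UV counting: the Bloom filter answers "probably seen before"
// with a tunable false-positive rate, avoiding exact set storage.
public class BloomUvCounter {

    private final BloomFilter<CharSequence> seen;
    private long uv = 0;

    public BloomUvCounter(long expectedUsers, double fpp) {
        this.seen = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), expectedUsers, fpp);
    }

    public void add(String userId) {
        // mightContain can yield a false positive (slightly undercounting UV)
        // but never a false negative, and memory stays bounded.
        if (!seen.mightContain(userId)) {
            seen.put(userId);
            uv++;
        }
    }

    public long getUv() {
        return uv;
    }

    public static void main(String[] args) {
        BloomUvCounter counter = new BloomUvCounter(100_000_000L, 0.01);
        counter.add("user-1");
        counter.add("user-2");
        counter.add("user-1");                       // duplicate, not counted again
        System.out.println(counter.getUv());         // prints 2
    }
}

At 10^8 expected keys and a 1% false-positive rate the filter needs on the order of 100 MB, versus many gigabytes for an exact set, at the price of slightly undercounting.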
8. Checkpoints in Flink versus Spark
Question: what are the differences and advantages of Flink's checkpoint mechanism compared with Spark's?
Answer: Spark Streaming's checkpoint is merely a checkpoint of data and metadata used for driver failure recovery. Flink's checkpoint mechanism is much more sophisticated: it uses lightweight distributed snapshots to capture the state of each operator as well as the data in flight.
9. Watermark mechanism
Question: please explain the Watermark mechanism of Flink in detail.
Answer: when processing stream data with EventTime, you run into the problem of out-of-order data. Stream processing goes from the event, through the Source, and on to the operators, which takes a certain amount of time. Although in most cases records arrive at an operator in event-time order, out-of-order arrival caused by network delays and the like cannot be ruled out; in particular, with Kafka, order cannot be guaranteed across multiple partitions. Therefore a window computation cannot wait indefinitely; there must be a mechanism that guarantees that after a specific time the window is triggered and computed. That mechanism is the watermark, which is used to handle out-of-order events.
During Flink's window processing, if you can be sure all the data has arrived, you can run the window computation (aggregation, grouping, and so on) over all of the window's data; if it has not all arrived, you keep waiting for the rest before processing. This is where the watermark mechanism comes in: it measures the progress of data processing (i.e. how complete the arrived data is), ensuring either that the event data has (entirely) reached the Flink system, or that correct and continuous results are still produced as expected in the presence of disorder and lateness.
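To make this concrete, here is a sketch of a periodic watermark generator with a bounded out-of-orderness of 5 seconds (an assumed value); it mirrors what the built-in WatermarkStrategy.forBoundedOutOfOrderness already provides:

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

// Tracks the highest timestamp seen so far and periodically emits a watermark
// that lags it by a fixed out-of-orderness bound.
public class BoundedOutOfOrdernessGenerator<T> implements WatermarkGenerator<T> {

    private static final long MAX_OUT_OF_ORDERNESS = 5_000L;  // tolerate 5 s of disorder
    private long maxTimestampSeen = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS + 1;

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Declares: "no events with timestamp <= this value are expected anymore",
        // which is what eventually triggers the window computation.
        output.emitWatermark(new Watermark(maxTimestampSeen - MAX_OUT_OF_ORDERNESS - 1));
    }
}

It would be plugged in via WatermarkStrategy.forGenerator(ctx -> new BoundedOutOfOrdernessGenerator<>()), together with a timestamp assigner.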
10. How to implement exactly-once
Question: how is exactly-once semantics implemented in Flink, and how is state stored?
Answer: Flink relies on its checkpoint mechanism to implement exactly-once semantics. To achieve end-to-end exactly-once, the external source and sink must also meet certain conditions: the source must be replayable, and the sink must support transactional or idempotent writes. State storage is managed by the state backend, which is configurable in Flink.
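For example, configuring RocksDB as the state backend with checkpoints on HDFS might look like the sketch below; the HDFS path is a placeholder, and newer Flink versions express the same setup with EmbeddedRocksDBStateBackend plus a checkpoint-storage option:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Working state lives off-heap in RocksDB; snapshots go to HDFS.
        // The second argument enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

        env.enableCheckpointing(60_000L);
        // ... build the job topology and call env.execute() here ...
    }
}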
11. CEP
Question: in Flink CEP programming, where is data kept while the expected events of a pattern have not yet arrived?
Answer: in stream processing, CEP naturally has to support EventTime, so it must also support late data, i.e. the watermark-based handling of lateness. CEP treats not-yet-matched event sequences much like late data: in Flink CEP's processing logic, both partially matched sequences and late records are buffered in a Map data structure in state. So if we bound the pattern-matching window at 5 minutes, up to 5 minutes of data is held in memory, which in my view is one of the bigger costs to memory.
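As an illustration of such a bounded pattern window, a hedged Flink CEP sketch; OrderEvent and its create/pay types are invented for the example:

import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OrderTimeoutPattern {

    // Minimal event type for the example.
    public static class OrderEvent {
        public String orderId;
        public String type;                          // "create" or "pay"
        public OrderEvent() {}
        public OrderEvent(String orderId, String type) {
            this.orderId = orderId;
            this.type = type;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<OrderEvent> orders = env.fromElements(
                new OrderEvent("o1", "create"), new OrderEvent("o1", "pay"));

        // "create" followed by "pay" for the same order within 5 minutes; partial
        // matches stay in CEP state until the 5-minute window expires.
        Pattern<OrderEvent, OrderEvent> pattern = Pattern
                .<OrderEvent>begin("create")
                .where(new SimpleCondition<OrderEvent>() {
                    @Override
                    public boolean filter(OrderEvent e) { return "create".equals(e.type); }
                })
                .followedBy("pay")
                .where(new SimpleCondition<OrderEvent>() {
                    @Override
                    public boolean filter(OrderEvent e) { return "pay".equals(e.type); }
                })
                .within(Time.minutes(5));            // bounds how long partial matches are kept

        PatternStream<OrderEvent> matches =
                CEP.pattern(orders.keyBy(e -> e.orderId), pattern);

        matches.select(new PatternSelectFunction<OrderEvent, String>() {
            @Override
            public String select(Map<String, List<OrderEvent>> match) {
                return match.get("pay").get(0).orderId;   // id of the paid order
            }
        }).print();

        env.execute("cep-order-timeout");
    }
}

The within(Time.minutes(5)) clause is exactly the bound discussed above: it is what limits how long unmatched sequences occupy memory.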
12. Three kinds of temporal semantics
Question: what are Flink's three time semantics, and what are the application scenarios of each?
Answer:
Event Time: the most common time semantics in practical applications. It is the time at which the event was created and is usually used together with watermarks.
Processing Time: the local system time of the machine running each time-based operator. Applicable scenarios: when there is no event time available, or when latency requirements are so strict that waiting for event time is not an option.
Ingestion Time: the time at which data enters Flink. Applicable scenario: with multiple Source operators, each Source operator assigns the Ingestion Time from its own local system clock, and that timestamp is then used by all subsequent time-based operations on the record.
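For completeness, selecting among the three semantics on the execution environment looks like the sketch below; note that setStreamTimeCharacteristic has been deprecated since Flink 1.12, where event time is the default:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TimeSemanticsSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Choose one of the three semantics for all time-based operators.
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        // Alternatives: TimeCharacteristic.ProcessingTime, TimeCharacteristic.IngestionTime

        // With EventTime, streams additionally need a WatermarkStrategy assigned.
    }
}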
13. Handling data peaks
Question: how does a Flink program handle sudden data peaks?
Answer: put a high-capacity Kafka cluster in front as a message-queue buffer and use it as the data source, letting Flink consume from it at its own pace; this smooths the peak, though at some cost to real-time latency.
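A minimal sketch of such a Kafka-buffered source; the broker address, topic, and group id are placeholders, and newer Flink versions would use the KafkaSource builder instead of FlinkKafkaConsumer:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaBufferedSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka1:9092");   // placeholder address
        props.setProperty("group.id", "flink-consumer");

        // Kafka absorbs the traffic peak; Flink drains the topic at its own rate.
        DataStream<String> stream = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props));

        stream.print();
        env.execute("kafka-buffered-ingest");
    }
}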
Thank you for reading. That covers the common Flink interview questions; after studying this article you should have a deeper understanding of them, and the specifics still need to be verified in practice.