2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
Background
Message reports are mainly used to track the distribution of message tasks: for example, for a single push task, the total number of APP users the message was sent to, the number of devices it successfully reached, and how many users tapped the notification and opened the APP. Through the message report we can see at a glance the flow of each push, the delivery success rate, user clicks on the message, and so on.
When providing a message push service, in order to better understand the daily push situation, Getui aggregates statistics along different dimensions and generates message reports. The number of messages sent every day is huge, reaching the tens of billions, and the offline statistics system we originally used could no longer meet business needs. As our business capabilities continued to grow, we chose Flink as the data processing engine to support real-time statistics over this massive volume of push data.
This article describes why we chose Flink, Flink's important features, and our optimized approach to real-time computation.
Architecture of offline computing platform
In the initial stage of the message report system, we used offline computing: Spark was the compute engine, the raw data was stored in HDFS, and the aggregated data was stored in Solr, HBase and MySQL:
When querying, there are three main filter dimensions:
- AppId
- delivery time
- taskGroupName
The list of taskIds can be queried by any of these dimensions; the corresponding results are then fetched from HBase by taskId, giving the delivery, display and click metrics for each task. When we considered converting this to real-time statistics, a series of difficulties arose:
- The volume of raw data is huge, reaching tens of billions of records per day, so high throughput must be supported.
- Real-time queries must be supported.
- Multiple data streams must be joined.
- The integrity and accuracy of the data must be guaranteed.
Why Flink
What is Flink?
Flink is a distributed processing engine for both streaming and batch data. It is implemented mainly in Java and today is driven primarily by contributions from the open source community.
The main scenario Flink targets is streaming data. Flink began as a research project at the Technical University of Berlin; it was accepted by the Apache Incubator in 2014 and quickly became one of the top-level projects of the ASF (Apache Software Foundation).
Scheme comparison
To implement real-time statistics for Getui's message reports, we initially considered Spark Streaming as our real-time compute engine, but after weighing the differences among Spark Streaming, Storm and Flink, we decided to use Flink:
For the above business pain points, Flink can meet the following needs:
- Flink transfers data through pipelined channels, which allows it to achieve high throughput.
- Flink is stream processing in the true sense, with lower latency, which meets the real-time requirements of our message report statistics.
- Flink's powerful window functions support incremental aggregation of data, and join operations can be performed within a window.
- Our message reports feed into billing settlement, where errors are not tolerated, so we rely on Flink's exactly-once mechanism to guarantee that data is neither consumed twice nor missed.
Important features of Flink
Let's look at some of Flink's important features and how they are implemented:
1) Low latency and high throughput
The main reason Flink is so fast is its stream processing model.
Flink uses the Dataflow model, which differs from the Lambda architecture. A Dataflow program is a pure graph of nodes, where a node can run batch computation, stream computation, or a machine learning algorithm. Stream data flows between nodes and is processed by applying each node's processing function to records in real time. Nodes are connected via Netty, with keepalive maintained between peers; network buffers are the key to natural backpressure.
After logical and physical optimization, the logical Dataflow graph differs little from the physical topology at runtime. This is a pure streaming design, so latency and throughput are theoretically optimal.
Simply put, once a record has been processed, it is serialized into a buffer and immediately transmitted over the network to the next node, which continues processing it.
2) Checkpoint
Flink implements checkpoints through distributed snapshots and can thus support exactly-once semantics.
Distributed snapshotting is based on the algorithm published by Chandy and Lamport in 1985, which generates a consistent snapshot of the current state of a distributed system without losing information or recording duplicates.
Flink uses a variant of the Chandy-Lamport algorithm to periodically generate state snapshots of the running streaming topology and store them in persistent storage (for example HDFS or an in-memory file system). The checkpoint interval is configurable.
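As a loose illustration of the snapshot idea (not Flink's actual implementation, and with class and method names invented for the example), the sketch below keeps a per-key running count as operator state, copies it on each checkpoint, and restores the last completed copy on recovery, so replayed records between the checkpoint and a failure are counted exactly once:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of periodic state snapshots. An operator keeps a
// running count per key; a "checkpoint" is a copy of that state, and
// recovery restores the last completed snapshot.
public class SnapshotSketch {
    private Map<String, Long> state = new HashMap<>();
    private Map<String, Long> lastCheckpoint = new HashMap<>();

    // Process one record: update the running count for its key.
    public void process(String key) {
        state.merge(key, 1L, Long::sum);
    }

    // Take a snapshot. In a real system this copy would be written
    // to durable storage such as HDFS.
    public void checkpoint() {
        lastCheckpoint = new HashMap<>(state);
    }

    // After a failure, roll the state back to the last snapshot;
    // the upstream source then replays records from that point.
    public void recover() {
        state = new HashMap<>(lastCheckpoint);
    }

    public long count(String key) {
        return state.getOrDefault(key, 0L);
    }
}
```

The key property is that state and replay position are restored together, which is what lets replayed records be neither lost nor double-counted.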
3) Backpressure
Backpressure exists to cope with short-term data spikes.
In older versions of Spark Streaming, backpressure was achieved by limiting the maximum consumption rate. For Receiver-based streams, the spark.streaming.receiver.maxRate parameter limits the maximum number of records each receiver can accept per second.
For data received via the Direct approach, the spark.streaming.kafka.maxRatePerPartition parameter limits the maximum number of records each job reads per Kafka partition per second.
However, this is very inconvenient: before actually going live, the cluster has to be load tested to determine suitable values for these parameters.
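For reference, the two limits above are plain Spark configuration properties; the values shown here are purely illustrative:

```properties
# Receiver-based streams: cap on records per second per receiver (example value)
spark.streaming.receiver.maxRate=10000
# Direct (Kafka) streams: cap on records per second per Kafka partition (example value)
spark.streaming.kafka.maxRatePerPartition=2000
```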
The basic building blocks of the Flink runtime are operators and streams. Each operator consumes an intermediate stream, transforms it, and produces a new stream.
The best analogy for this mechanism is that Flink uses efficient distributed bounded blocking queues as buffers. Just as a blocking queue in Java connects a producer thread to a consumer thread, once the queue reaches its capacity limit, a relatively slow receiver naturally slows down the sender.
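Following that analogy, here is a minimal plain-Java sketch (not Flink code; the class and method names are invented for this example) of how a bounded blocking queue throttles a fast producer to a slow consumer's pace with no pre-tuned rate parameter:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// A bounded queue sits between a fast producer and a slow consumer.
// When the queue is full, put() blocks, so the producer is naturally
// slowed to the consumer's pace: backpressure without rate tuning.
public class BackpressureSketch {
    public static long run(int records, int capacity) {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(capacity);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < records; i++) {
                    buffer.put(i); // blocks whenever the buffer is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < records; i++) {
                Thread.sleep(1); // simulate a slow downstream operator
                sum += buffer.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum;
    }
}
```

No record is dropped: the producer simply waits, which is the behavior the configured rate limits in Spark Streaming try to approximate in advance.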
Real-time calculation of message report
After optimization, the architecture is upgraded to the following:
As you can see, we have made the following optimizations:
Flink replaced Spark for real-time computation of the message reports, and ES replaced Solr.
For real-time computing of Flink, we mainly focus on the following four aspects:
- exactly-once semantics, ensuring data is consumed exactly once
- powerful state management
- flexible time windows
- unified stream and batch processing
In order to meet the needs of our real-time statistical reports, we mainly rely on the incremental aggregation function of Flink.
First, we set Event Time as the time characteristic of the window to ensure that only the day's data is counted; and since we update the day's message report incrementally every minute, we assign a 1-minute time window.
Then we use .aggregate(AggregateFunction, WindowFunction) for incremental aggregation: the AggregateFunction pre-aggregates data as it arrives, reducing the storage pressure on state. The incrementally aggregated data is then written to ES and HBase.
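As a rough plain-Java sketch of this incremental pattern (it only imitates the spirit of Flink's AggregateFunction contract and is not the real API; all names here are illustrative), each event folds into a small accumulator keyed by taskId and its 1-minute window start, so only running counts, not raw events, are held in state:

```java
import java.util.HashMap;
import java.util.Map;

// Per-minute incremental aggregation: one running count per
// (taskId, window start) pair, updated as each event arrives.
public class MinuteWindowAgg {
    static final long WINDOW_MS = 60_000L;
    // key: taskId + "@" + windowStart, value: running count
    private final Map<String, Long> accumulators = new HashMap<>();

    // Align an event timestamp to the start of its 1-minute window.
    static long windowStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % WINDOW_MS);
    }

    // Fold one event into its window's accumulator (the "add" step).
    public void add(String taskId, long eventTimeMs) {
        accumulators.merge(taskId + "@" + windowStart(eventTimeMs), 1L, Long::sum);
    }

    // Read the aggregate for one window (the "getResult" step; this is
    // what would be written out to ES/HBase when the window fires).
    public long getResult(String taskId, long windowStartMs) {
        return accumulators.getOrDefault(taskId + "@" + windowStartMs, 0L);
    }
}
```

Because only the accumulator is stored, state size grows with the number of active windows rather than with the number of raw events.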
The process is as follows:
For queries, we filter by taskId, date and other dimensions: we first obtain the set of taskIds from ES, then query HBase by taskId to get the statistical results.
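This two-step lookup can be sketched in plain Java with Maps standing in for the ES index and the HBase table; the structure and names are placeholders for the real clients, not actual APIs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Two-step query path: an index lookup (standing in for ES) returns
// the taskIds matching a filter key, then each taskId is fetched from
// a key-value store (standing in for HBase) by point lookup.
public class ReportQuerySketch {
    public static List<Long> query(Map<String, List<String>> index,
                                   Map<String, Long> kvStore,
                                   String filterKey) {
        List<Long> metrics = new ArrayList<>();
        for (String taskId : index.getOrDefault(filterKey, List.of())) {
            Long metric = kvStore.get(taskId); // point lookup by rowkey
            if (metric != null) {
                metrics.add(metric);
            }
        }
        return metrics;
    }
}
```

The index answers the "which tasks" question cheaply, while the key-value store serves the per-task metrics by rowkey.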
Summary
By using Flink, we have realized real-time statistics of message push data and can view metrics such as message delivery, display and clicks in real time. At the same time, Flink's powerful state management helps guarantee the stability of the service. Going forward, Getui will continue to optimize the message push service and introduce Flink to other business lines to meet the requirements of more real-time business scenarios.
© 2024 shulou.com SLNews company. All rights reserved.