How to Build a Real-Time Data Warehouse Based on Flink + Kafka

This article describes how to build a real-time data warehouse based on Flink + Kafka. The editor finds it practical and shares it here; I hope you gain something from reading it. Let's take a look.

I. Background introduction

(1) General architecture of a streaming platform

At present, the general architecture of a streaming platform includes a message queue, a computing engine, and storage, as shown in the figure below. Logs from clients or web pages are collected into the message queue; the computing engine processes the data in the message queue in real time; and the real-time results are written to the real-time storage system in Append or Update mode.

Our message queue of choice is Kafka, and in the beginning our computing engine was Spark Streaming. As Flink's advantages as a stream computing engine became more and more obvious, we finally settled on Flink as our unified real-time computing engine.

(2) Why did we choose Kafka?

Kafka is an early message queue, but it is a very stable one with a large user base, and NetEase is one of its users. The main reasons we chose Kafka as our messaging middleware are:

High throughput, low latency: hundreds of thousands of QPS with millisecond-level latency (see the producer configuration sketch after this list)

High concurrency: supports thousands of clients reading and writing simultaneously

Fault tolerance, high availability: supports data replication and tolerates node loss

Scalability: supports online scale-out without affecting current online business.
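
As a concrete, hedged illustration of the throughput and durability knobs behind these bullets (not configuration taken from our platform), here is a minimal Kafka producer sketch in Java; the broker addresses, topic name, and payload are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ThroughputTunedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092,kafka2:9092");  // placeholder brokers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Throughput/latency trade-off: batch records for up to 5 ms and compress the batches.
        props.put("linger.ms", "5");
        props.put("batch.size", "65536");
        props.put("compression.type", "lz4");
        // Durability: wait for all in-sync replicas, so losing a node does not lose data.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("ods_user_log", "u1", "{\"action\":\"play\"}"));
        }
    }
}
```

The linger/batch/compression settings trade a few milliseconds of latency for much better batching efficiency, while acks=all leans on Kafka's replication so that node loss does not mean data loss.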

(3) Why did we choose Flink?

Apache Flink is an open-source big data stream computing engine that has become increasingly popular in recent years. It supports both batch and stream processing. The main factors in choosing Flink as our stream computing engine are:

High throughput, low latency, high performance

Highly flexible streaming windows

Exactly-once semantics for stateful computation

Lightweight fault-tolerant mechanism

Support for event time and out-of-order events (illustrated in the windowing sketch after this list)

A unified engine for stream and batch processing.
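
As a minimal sketch of event-time windows over out-of-order events (not code from our platform), the following Flink DataStream job assigns watermarks with a 10-second out-of-orderness bound and computes a one-minute tumbling-window sum per user; the tuple fields and values are made up.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

import java.time.Duration;

public class EventTimeWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // (user, eventTimestampMillis, playCount) -- toy data standing in for client logs.
        env.fromElements(
                Tuple3.of("u1", 1_000L, 1),
                Tuple3.of("u1", 4_000L, 1),
                Tuple3.of("u2", 2_000L, 1))
           // Tolerate events arriving up to 10 seconds out of order.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<Tuple3<String, Long, Integer>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                        .withTimestampAssigner((event, ts) -> event.f1))
           .keyBy(event -> event.f0)
           // One-minute tumbling windows driven by event time, not processing time.
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .sum(2)
           .print();

        env.execute("event-time-window-sketch");
    }
}
```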

(4) Kafka + Flink stream computing system

Given the strong performance of Kafka as message middleware and Flink as a stream computing engine, we built a stream computing platform on Kafka and Flink, as shown in the figure below: real-time logs generated by APP, web, and other sources are collected into Kafka and then handed to Flink for real-time computation such as ETL, global aggregation, and window aggregation.

(5) Current status of Kafka usage at NetEase Cloud Music

At present, we have 10+ Kafka clusters, each serving a different purpose: some are business clusters, some mirror clusters, some computing clusters, and so on. The total number of Kafka nodes has reached 200+, the peak QPS of a single Kafka cluster exceeds 4 million, and the number of real-time tasks based on Kafka + Flink at NetEase Cloud Music has reached 500+.

II. Flink + Kafka platform design

Based on the above, we wanted to build a Kafka + Flink platform to reduce users' development and operations costs. In fact, we started building a Flink-based real-time computing platform in 2018, in which Kafka plays an important role; this year we refactored it to make Flink and Kafka even easier for users to use.

Based on Flink 1.0, we carried out a refactoring of the Magina version. At the API level, we provide Magina SQL and the Magina SDK, which operate through DataStream and SQL. These SQL statements are converted into a LogicalPlan by a custom Magina SQL Parser, and the LogicalPlan is then turned into physical execution code; during this process, the catalog connects to the metadata management center to obtain metadata. When using Kafka, we register the Kafka metadata in the metadata center, and real-time data is accessed in the form of flow tables. In Magina, our work around Kafka covers three main parts:

Cluster catalog

Topic flow-table registration

Message Schema.

Users can register different table or catalog information in the metadata management center, or create and maintain Kafka tables in the DB, and then simply use the corresponding tables as needed. The figure below shows the main logic for referencing a Kafka flow table; a DDL sketch follows.
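
The Magina catalog and SQL parser are internal, but conceptually registering a Kafka Topic as a flow table together with its message schema looks like the standard Flink SQL Kafka DDL below; the table name, columns, topic, and broker addresses are all hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaFlowTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // Register a Kafka topic as a flow table with an explicit message schema and watermark.
        tEnv.executeSql(
                "CREATE TABLE user_play_log (" +
                "  user_id STRING," +
                "  song_id STRING," +
                "  play_time TIMESTAMP(3)," +
                "  WATERMARK FOR play_time AS play_time - INTERVAL '10' SECOND" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'ods_user_play_log'," +
                "  'properties.bootstrap.servers' = 'kafka1:9092'," +
                "  'properties.group.id' = 'magina-demo'," +
                "  'scan.startup.mode' = 'latest-offset'," +
                "  'format' = 'json'" +
                ")");

        // Once registered, the topic can be queried like any other flow table.
        tEnv.executeSql("SELECT user_id, COUNT(*) FROM user_play_log GROUP BY user_id").print();
    }
}
```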

III. Application of Kafka in the real-time data warehouse

(1) Evolution through problem solving

While using Kafka in the real-time data warehouse, we encountered a variety of problems and tried different solutions along the way.

In the platform's early days there were only two clusters for real-time computing plus one collection cluster, and the data volume of each Topic was very large; different real-time tasks consumed the same large Topic, so the I/O pressure on the Kafka cluster was extremely high.

As a result, we found during use that the pressure on Kafka was unusually high, with frequent delays and I/O spikes.

We wanted to split the large Topics in real time to solve these problems. Based on Flink 1.5, we designed the data distribution scheme shown in the figure below, which became the prototype of the real-time data warehouse. By distributing a large Topic into several small Topics, the pressure on the cluster was greatly reduced and performance improved significantly. At first, however, static distribution rules were used, and adding rules later required restarting the task, which had a large impact on the business; we then moved toward dynamic rules for the data distribution task (a rough sketch follows).
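
The sketch below is a rough stand-in for such a distribution job, not our actual distribution service, and it uses the newer KafkaSource/KafkaSink API rather than the Flink 1.5-era connectors: it reads the single large topic and routes records to smaller topics with a hard-coded static rule, which is exactly the kind of rule we later wanted to make dynamic. Topic names, broker addresses, and the routing rule are made up.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TopicSplitSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The single large source topic (names and addresses are placeholders).
        KafkaSource<String> bigTopic = KafkaSource.<String>builder()
                .setBootstrapServers("kafka1:9092")
                .setTopics("ods_all_logs")
                .setGroupId("log-dispatcher")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> logs =
                env.fromSource(bigTopic, WatermarkStrategy.noWatermarks(), "ods_all_logs");

        // A crude static routing rule standing in for the platform's distribution rules:
        // play events go to one small topic, everything else to another.
        logs.filter(v -> v.contains("\"type\":\"play\"")).sinkTo(topicSink("ods_play_log"));
        logs.filter(v -> !v.contains("\"type\":\"play\"")).sinkTo(topicSink("ods_other_log"));

        env.execute("topic-split-sketch");
    }

    // Helper that writes a string stream to a fixed target topic.
    private static KafkaSink<String> topicSink(String topic) {
        return KafkaSink.<String>builder()
                .setBootstrapServers("kafka1:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic(topic)
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();
    }
}
```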

After solving the problems of the platform's early stage, Kafka faced new problems as the platform was upgraded:

Although the clusters had been expanded, the number of tasks kept increasing, so the pressure on the Kafka clusters continued to rise.

Problems occasionally arose from rising cluster pressure, and consumption tasks easily interfered with one another.

Users consumed different Topics without landing intermediate data, which easily led to repeated consumption.

It was difficult to migrate tasks between Kafka clusters.

To solve the above problems, we carried out Kafka cluster isolation and data tiering as shown in the figure below. Simply put, the clusters are divided into a DS cluster, a log collection cluster, and a distribution cluster. Data is passed through the distribution service to Flink for processing and, after cleaning, enters the DW cluster; while being written to DW it is also synchronized to the mirror cluster. During this process, Flink is also used for real-time statistics and joining, and the resulting ADS data is written to the online ADS cluster and the statistical ADS cluster. This process ensures that tasks with high real-time requirements are not affected by statistical reporting.

However, after splitting the work across different clusters, we inevitably faced new questions:

How do we perceive the state of a Kafka cluster?

How do we quickly analyze abnormal Job consumption?

To answer these two questions, we built a Kafka monitoring system. The monitoring covers the following two dimensions, so that when an exception occurs we can pin down the details of the problem:

Cluster overview monitoring: the number of Topics and running tasks in each cluster, as well as each Topic consumption task's data volume, data inflow, total inflow, and average message size.

Metric monitoring: the Flink task and its corresponding Topic, GroupID, cluster, start time, input bandwidth, InTPS, OutTPS, consumption delay, and Lag (a lag-probing sketch follows this list).
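
For the Lag dimension specifically, consumer-group lag can be derived from Kafka's AdminClient by comparing committed offsets with log-end offsets. The probe below is a generic sketch of that calculation, not our monitoring system; the broker address and group id are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // placeholder broker list
        String groupId = "magina-job-42";                // hypothetical consumer group

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets of the consumer group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag = log-end offset minus committed offset, reported per partition.
            committed.forEach((tp, meta) -> {
                long lag = ends.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```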

(2) Application of Flink + Kafka under the Lambda architecture

Stream-batch unification is a very popular concept at the moment, and many companies are exploring applications in this area; the commonly used architectures are the Lambda architecture and the Kappa architecture. Stream-batch unification requires unifying both storage and the computing engine. Since our current infrastructure has no unified storage, we could only choose the Lambda architecture.

The figure below shows the concrete practice of the Lambda architecture based on Flink and Kafka at Cloud Music: the upper layer is real-time computing and the lower layer is offline computing; horizontally the diagram is divided by computing engine, and vertically by the layers of the real-time data warehouse.

IV. Problems & improvements

In practice we also ran into many problems; the two main ones were:

Repeated consumption of the Kafka Source under multi-Sink

Computing delays caused by traffic surges on the same switch.

(1) Repeated consumption of the Kafka Source under multi-Sink

The Magina platform supports multiple Sinks, which means any intermediate result can be written to different storage systems during a job. The problem is that, for a given intermediate result, inserting different parts of it into different storage systems produces multiple DAGs; even though these are temporary results, this causes the Kafka Source to be consumed repeatedly, which is a great waste of performance and resources.

So we wondered whether the repeated consumption of temporary intermediate results could be avoided. Before version 1.9, we rebuilt the StreamGraph and merged the DAGs of the three DataSources; from version 1.9 onwards, an optimization that merges queries and Sources is provided, but we found that multiple Source references to the same table are merged only if they occur in the same data update, and are not merged immediately otherwise. So after version 1.9 we added a buffer to modifyOperations to solve this problem (a SQL-level sketch of source reuse follows).
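
At the SQL level, the general way to get a single source scan for multiple sinks is to submit all INSERT statements together so the planner builds one job graph. The sketch below shows this with Flink's StatementSet; it is only an analogue of what Magina does internally, the table names, topic, and broker addresses are hypothetical, and the sinks use the print connector so the example stays self-contained.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class MultiSinkSingleScanSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

        // One Kafka-backed source table and two sink tables; all names/addresses are placeholders.
        tEnv.executeSql("CREATE TABLE user_play_log (user_id STRING, song_id STRING) WITH (" +
                "'connector'='kafka','topic'='ods_user_play_log'," +
                "'properties.bootstrap.servers'='kafka1:9092'," +
                "'properties.group.id'='multi-sink-demo'," +
                "'scan.startup.mode'='latest-offset','format'='json')");
        tEnv.executeSql("CREATE TABLE ads_online (user_id STRING, song_id STRING) WITH ('connector'='print')");
        tEnv.executeSql("CREATE TABLE ads_stats (song_id STRING) WITH ('connector'='print')");

        // Submitting both INSERTs in one StatementSet lets the planner produce a single job graph,
        // so the Kafka source is scanned once instead of once per sink.
        StatementSet set = tEnv.createStatementSet();
        set.addInsertSql("INSERT INTO ads_online SELECT user_id, song_id FROM user_play_log");
        set.addInsertSql("INSERT INTO ads_stats SELECT song_id FROM user_play_log");
        set.execute();
    }
}
```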

(2) Computing delays caused by traffic surges on the same switch

This is a recent problem, and it may occur not only on the same switch but also in the same machine room. We deployed many machines under the same switch: some hosted Kafka clusters and some hosted Hadoop clusters. On Hadoop we run offline Spark and Hive jobs as well as real-time Flink jobs, and Flink also consumes Kafka for real-time computation. While running, we found that a certain task experienced an overall delay, and inspection turned up no anomaly other than a traffic surge through the switch at a certain point in time. Further investigation showed that the surge came from offline computing, and because of the bandwidth limit of the shared switch, it affected Flink's real-time computation.

To solve this problem, we considered isolating offline clusters from real-time clusters and optimizing the switch or machine deployment: for example, putting offline clusters on their own switch and giving the Kafka and Flink clusters a separate switch, so that they do not affect each other at the hardware level.

V. Q & A

Q1: Is the Kafka data in the real-time data warehouse reliable?

A1: The answer depends largely on how data accuracy is defined; different standards lead to different answers. First define under what conditions the data counts as reliable, and then build a good fault-tolerance mechanism into the processing.

Q2: As learners, how can we get exposure to the problems these companies encounter? How do we accumulate that experience?

A2: I think learning is problem-driven: when you encounter a problem, think it through and solve it, and in the process accumulate experience and discover your own gaps.

Q3: When working with Kafka, how do you handle abnormal data? Is there a detection mechanism?

A3: At runtime we have a distribution service. During distribution it detects, according to certain rules, which data is abnormal and which is normal, and routes the abnormal data to a dedicated abnormal Topic for later querying; users can then inspect that data in the abnormal Topic by the relevant metrics and keywords. A side-output sketch of this pattern follows.
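
Our distribution service itself is not shown here, but the general Flink pattern for routing abnormal records to a dedicated Topic is a side output. The sketch below uses a toy validity rule, placeholder topic and broker names, and prints the normal stream instead of writing it anywhere real.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class AbnormalTopicSketch {
    // Records failing a simple validity check are routed to this side output.
    private static final OutputTag<String> ABNORMAL = new OutputTag<String>("abnormal") {};

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> raw = env.fromElements(
                "{\"user\":\"u1\",\"action\":\"play\"}",   // normal record
                "not-json-at-all");                        // abnormal record

        SingleOutputStreamOperator<String> normal = raw.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String value, Context ctx, Collector<String> out) {
                // Hypothetical rule: anything that does not look like JSON is abnormal.
                if (value.startsWith("{")) {
                    out.collect(value);
                } else {
                    ctx.output(ABNORMAL, value);
                }
            }
        });

        normal.print("normal");

        // Abnormal records go to a dedicated topic for later inspection (names are placeholders).
        normal.getSideOutput(ABNORMAL).sinkTo(KafkaSink.<String>builder()
                .setBootstrapServers("kafka1:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("dirty_user_log")
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build());

        env.execute("abnormal-topic-sketch");
    }
}
```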

The above is the practice of building a real-time data warehouse based on Flink + Kafka. The editor believes it touches on knowledge points you may see or use in daily work, and hopes you can learn more from this article.
