How to optimize the data skew of keyby window in flink

2025-04-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Today I will talk about how to optimize data skew in Flink's keyed (keyBy) windows, a topic that may not be well understood by many people. To help you understand it better, I have summarized the following; I hope you get something out of this article.

In big data processing, data skew is a very common problem. Let's briefly discuss how to deal with skew in streaming data with Flink.

Let's first look at a SQL query that may produce data skew:

SELECT TUMBLE_END(proc_time, INTERVAL '1' MINUTE) AS winEnd, plat, COUNT(*) AS pv
FROM source_kafka_table
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE), plat

This SQL counts the pv (page views) of each platform of a website. The data consumed from Kafka is first grouped by platform, and then the aggregate function COUNT computes the pv. If one platform generates far more data than the others (for example, our WeChat Mini Programs traffic is much larger than that of the other app platforms), then after grouping, all of that platform's records are routed to a single operator instance, and data skew occurs because that instance cannot keep up.
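A minimal plain-Python sketch of why this happens (the platform names and record counts here are made-up assumptions, not the article's real data): keyBy hashes each key to one subtask, so every record of the hot key lands on the same instance.

```python
import random
from collections import Counter

# Hypothetical traffic mix (made-up numbers): the Mini Programs platform
# dominates, as in the article's example.
events = ["wechat_mini"] * 8000 + ["ios"] * 1000 + ["android"] * 1000
random.shuffle(events)

PARALLELISM = 4

# keyBy routes every record with the same key to the same subtask, so the
# subtask that owns the hot key receives most of the traffic.
subtask_of = {plat: hash(plat) % PARALLELISM for plat in set(events)}
subtask_load = Counter(subtask_of[plat] for plat in events)

# The subtask holding "wechat_mini" processes at least 8000 of the
# 10000 records, no matter how the keys hash.
print(sorted(subtask_load.values(), reverse=True))
```

However the three keys happen to hash, one subtask always carries at least the 8000 hot-key records, which is exactly the imbalance the Flink UI screenshot below shows.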

Looking at the Flink web UI, you will see something like the following.

[Figure: Flink UI showing one subtask receiving far more records than the others]

For this simple kind of data skew, we can append a random number to the grouping key to break it up, compute the pv of each scattered group, and then wrap an outer layer that aggregates the scattered results back together, which resolves the skew.

The optimized SQL is as follows:

SELECT winEnd, split_index(plat1, '_', 0) AS plat2, SUM(pv) FROM (
    SELECT TUMBLE_END(proc_time, INTERVAL '1' MINUTE) AS winEnd, plat1, COUNT(*) AS pv FROM (
        -- innermost layer: scatter the grouping key plat with a random number
        SELECT plat || '_' || CAST(CAST(RAND() * 100 AS INT) AS STRING) AS plat1, proc_time
        FROM source_kafka_table
    ) GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE), plat1
) GROUP BY winEnd, split_index(plat1, '_', 0)

In the innermost layer of this SQL, the grouping key plat is scattered with a random-number suffix; the middle layer then computes the pv of each scattered group (plat1 in the SQL); finally, the outermost layer strips the suffix with split_index and sums the scattered pv values back together.
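The two-layer aggregation above can be sketched in plain Python (not the Flink API; the platform names and counts are made-up, and I deliberately use names without '_' so splitting on '_' is unambiguous, just as the SQL's split_index assumes):

```python
import random
from collections import Counter

# SALT_RANGE mirrors the RAND() * 100 in the article's SQL.
SALT_RANGE = 100
events = ["miniapp"] * 9000 + ["ios"] * 500 + ["android"] * 500

# Layer 1: append a random salt so the hot key spreads over many groups;
# count pv per salted key (like COUNT(*) ... GROUP BY plat1).
phase1 = Counter(f"{plat}_{random.randrange(SALT_RANGE)}" for plat in events)

# Layer 2: strip the salt (like split_index(plat1, '_', 0)) and sum the
# partial counts to recover the true pv per platform.
phase2 = Counter()
for salted_key, pv in phase1.items():
    phase2[salted_key.split("_", 1)[0]] += pv

print(dict(phase2))
```

The final totals equal the unsalted counts, because each original record is counted exactly once in some salted group and every salted group is summed back into its base key.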

Note: for the innermost SQL, the range of the random number appended to the grouping key should be neither too large nor too small. If it is too large, there will be too many groups, which increases the pressure on checkpoints; if it is too small, it will not spread the key out enough. In my test, with about one billion records per day, a parallelism of 5, and a random-number range of 100, the job ran normally.
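A back-of-envelope way to see the trade-off: the per-window group count (and thus state size) grows with distinct keys times the salt range. The parallelism of 5 and range of 100 come from my test above; the key count of 10 here is an assumption for illustration.

```python
# Per-window groups = distinct keys x salt range; each subtask owns
# roughly groups / parallelism of them when they spread evenly.
num_plats = 10   # assumed number of distinct platform keys
parallelism = 5  # from the test setup described above

for salt_range in (10, 100, 1000):
    groups = num_plats * salt_range
    per_subtask = groups / parallelism
    print(salt_range, groups, per_subtask)
```

A range of 100 keeps the group count modest while still giving the hot key a hundred ways to spread; a range of 1000 multiplies checkpointed window state tenfold for little extra balancing benefit.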

After this modification, we can see that the load on each subtask is basically uniform.

[Figure: Flink UI showing records roughly evenly distributed across subtasks]

That covers how to optimize data skew in Flink's keyBy windows. I hope the above has given you a better understanding; thank you for reading.
