2025-04-06 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/01 Report --
Today I'd like to talk about how to optimize data skew in Flink keyBy windows, a topic that many people find confusing. To make it easier to follow, I have summarized the main points below, and I hope you take something away from this article.
In big data processing, data skew is a very common problem. Let's briefly look at how to deal with skew in streaming data with Flink.
First, let's look at a SQL query that can produce data skew.
SELECT TUMBLE_END(proc_time, INTERVAL '1' MINUTE) AS winEnd, plat, count(*) AS pv
FROM source_kafka_table
GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE), plat
This query counts the pv (page views) of each platform of a website. The data consumed from Kafka is first grouped by platform (plat), and the aggregate function count then computes the pv. If one platform generates far more data than the others (for example, our WeChat Mini Programs platform produces much more data than the other app platforms), then all of that platform's records are routed to a single operator instance. That instance cannot keep up with the load, and data skew occurs.
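A quick way to see why this happens is to simulate the partitioning. The sketch below is plain Python, not Flink's actual key-group hash partitioner, and the platform names and record counts are made up for illustration:

```python
from collections import Counter

PARALLELISM = 5  # number of parallel subtasks for the aggregation operator

# Hypothetical traffic: the WeChat Mini Programs platform dominates.
records = ["wechat"] * 9000 + ["ios"] * 500 + ["android"] * 500

# keyBy routes every record with the same key to the same subtask, so the
# per-subtask load is at least as lopsided as the key distribution itself.
load = Counter(hash(plat) % PARALLELISM for plat in records)
print(sorted(load.values(), reverse=True))
```

Whichever subtask the hot key hashes to receives at least 9,000 of the 10,000 records, no matter how the other keys land, which matches the uneven subtask load seen in the Flink UI.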
If you look at the Flink web UI, you will see something like the following.
[Image: Flink UI, subtask load before optimization]
For this simple kind of data skew, we can append a random number to the grouping key to break it up, compute the pv of each salted group, and then wrap an outer layer that aggregates the scattered results back together. This resolves the skew.
The optimized SQL is as follows:

SELECT winEnd, split_index(plat1, '_', 0) AS plat2, sum(pv)
FROM (
    SELECT TUMBLE_END(proc_time, INTERVAL '1' MINUTE) AS winEnd, plat1, count(*) AS pv
    FROM (
        -- innermost layer: break up the grouping key (plat) with a random number
        SELECT plat || '_' || cast(cast(RAND() * 100 AS int) AS string) AS plat1, proc_time
        FROM source_kafka_table
    )
    GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE), plat1
)
GROUP BY winEnd, split_index(plat1, '_', 0)
In the innermost layer of this query, the grouping key plat is scattered with a random number. The middle layer then computes the pv of each salted group (plat1), and the outermost layer strips the random suffix and sums the scattered pv values back into per-platform totals.
Note: the range of the random number appended to the grouping key should be neither too large nor too small. If it is too large, there will be too many groups, which increases checkpoint pressure; if it is too small, the hot key is not spread out enough. In my tests, with about a billion records a day, a parallelism of 5, and random numbers in the range of 100, the job processed the data normally.
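The two-stage aggregation can be sketched in plain Python to check that salting preserves the final counts. The key names and record counts here are illustrative, and SALT_RANGE mirrors the range of 100 used in the SQL:

```python
import random
from collections import Counter

SALT_RANGE = 100  # too large: too many groups (checkpoint pressure); too small: no spreading

records = ["wechat"] * 9000 + ["ios"] * 500

# Stage 1: count per salted key plat || '_' || rand, as the inner query does.
stage1 = Counter(f"{plat}_{random.randrange(SALT_RANGE)}" for plat in records)

# Stage 2: strip the salt (split_index(plat1, '_', 0)) and sum the partial pv.
stage2 = Counter()
for salted, pv in stage1.items():
    stage2[salted.split("_")[0]] += pv

print(stage2)  # Counter({'wechat': 9000, 'ios': 500})
```

The hot key is now spread over up to 100 salted groups in stage 1, so no single subtask has to count all of its records, while stage 2 still recovers the exact per-platform totals.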
After this modification, we can see that the data handled by each subtask is basically uniform.
[Image: Flink UI, subtask load after optimization]
After reading the above, do you have a better understanding of how to optimize data skew in Flink keyBy windows? If you want to learn more, please follow the industry information channel. Thank you for your support.