
Spark big data computation: BUG handling

2025-02-23 Update | From: SLTechnology News & Howtos

Shulou (Shulou.com) 06/03 Report


Resources before the program was modified:

Driver: 1

Workers: 2 machines

Memory requested by the program: 1 GB

Memory allocation:

1. 20% for program execution

2. 20% for shuffle

3. 60% for RDD caching

Size of a single TweetBean: 3 KB

1. Memory overflow

Cause: the program queries all the TweetBeans and unions them entirely in memory. When a campaign carries a large volume of data, e.g. 5,000,000 records, 5,000,000 × 3 KB ≈ 15 GB, which far exceeds the memory limit.

Solution: first split the work by data volume, so that no single task holds enough data to overflow memory. Put all the resulting task shards into a task list. Iterate over the task list; whenever the records accumulated from the tasks exceed 200,000, merge them and repartition the merged data into 16 RDD partitions for processing. Continue looping until the task list is exhausted.
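The batching loop described above can be sketched as follows. This is an illustrative skeleton, not the original program: the names `fetch_records` and `process_batch` are hypothetical stand-ins for querying a task's TweetBeans and handing a merged batch to Spark (e.g. `sc.parallelize(buffer, 16)`).

```python
BATCH_LIMIT = 200_000   # records accumulated before a merge/flush
NUM_PARTITIONS = 16     # RDD partitions per merged batch

def run_in_batches(tasks, fetch_records, process_batch):
    """Accumulate records from task shards; flush once BATCH_LIMIT is reached."""
    buffer = []
    for task in tasks:
        buffer.extend(fetch_records(task))
        if len(buffer) >= BATCH_LIMIT:
            # e.g. rdd = sc.parallelize(buffer, NUM_PARTITIONS) in the real job
            process_batch(buffer, NUM_PARTITIONS)
            buffer = []
    if buffer:  # flush whatever remains after the last task
        process_batch(buffer, NUM_PARTITIONS)
```

Because at most one batch of roughly 200,000 records is held in driver memory at a time, the 15 GB union never materializes at once.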

Why fetch in batches of 200,000: 200,000 × 3 KB ≈ 600 MB, while the memory available for program execution across the two machines = 2 (machines) × 2 GB (memory requested by the program) × 0.2 (fraction reserved for program execution) ≈ 800 MB. That is enough to hold 200,000 records without overflowing memory.

2. Running slowly

Cause: across the two machines, the memory available for shuffle = 2 (machines) × 1 GB (memory requested by the program) × 0.2 (fraction allocated to shuffle) = 400 MB.

200,000 (records processed per batch) × 3 KB (size of a single TweetBean) ≈ 600 MB. Each shuffle batch is therefore larger than the available shuffle memory, so the data is flushed to disk, and reading it back is slow.

Solution: increase the memory available to the program for shuffle, as follows:

Memory requested by the program: 2 GB

Memory allocation:

1. 20% for program execution

2. 60% for shuffle

3. 20% for RDD caching
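For reference, fractions like these map onto the knobs of Spark's legacy static memory manager. A minimal sketch, assuming PySpark and a pre-1.6 Spark (Spark 1.6+ replaced these settings with the unified `spark.memory.fraction` model, where they are ignored); the app name is hypothetical:

```python
# Sketch: legacy static memory manager settings (Spark < 1.6).
# In Spark >= 1.6 these are superseded by the unified model
# (spark.memory.fraction / spark.memory.storageFraction).
from pyspark import SparkConf

conf = (
    SparkConf()
    .setAppName("tweet-job")                     # hypothetical app name
    .set("spark.executor.memory", "2g")          # raised from 1g to 2g
    .set("spark.shuffle.memoryFraction", "0.6")  # 60% for shuffle
    .set("spark.storage.memoryFraction", "0.2")  # 20% for RDD caching
)
```

With the shuffle fraction at 0.6, the available shuffle memory becomes 2 × 2 GB × 0.6 ≈ 2.4 GB, comfortably above the ~600 MB per batch, so shuffle data stays in memory instead of spilling to disk.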
