2025-01-25 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 06/03 Report --
A brief introduction to the Prime_DSC_MentionCalcSpark system
Function: reads text data from an HBase data source, filtered by the conditions (siteId, startTime, endTime, campaignId, folder), and, given a set of submitted keywords, outputs the number of times each keyword is mentioned in the text.
Problem: computation over large volumes of data takes too long. The optimizations applied are as follows:
Constructing the TweetBean: instead of converting each HBase result into a TweetBean via reflection, call the TweetBean's setXXX methods directly. With 50,000 rows, conversion via reflection takes about 60 s, while calling the setters directly takes about 20 s.
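The contrast between the two bean-filling paths can be sketched as follows. This is a minimal illustration, not the system's real code: the TweetBean here has a single invented field, and the row is a plain Map standing in for an HBase Result.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal TweetBean; the field name is illustrative, not the real schema.
class TweetBean {
    private String content;
    public void setContent(String c) { this.content = c; }
    public String getContent() { return content; }
}

public class BeanFillDemo {
    // Slow path: resolve and invoke the setter reflectively for every row.
    static TweetBean fillByReflection(Map<String, String> row) throws Exception {
        TweetBean bean = new TweetBean();
        for (Map.Entry<String, String> e : row.entrySet()) {
            String setter = "set" + Character.toUpperCase(e.getKey().charAt(0))
                    + e.getKey().substring(1);
            Method m = TweetBean.class.getMethod(setter, String.class);
            m.invoke(bean, e.getValue());
        }
        return bean;
    }

    // Fast path: call the known setter directly, no per-row method lookup.
    static TweetBean fillDirect(Map<String, String> row) {
        TweetBean bean = new TweetBean();
        bean.setContent(row.get("content"));
        return bean;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> row = new HashMap<>();
        row.put("content", "hello spark");
        System.out.println(fillByReflection(row).getContent()); // prints "hello spark"
        System.out.println(fillDirect(row).getContent());       // prints "hello spark"
    }
}
```

The reflective lookup (`getMethod` plus `invoke`) is repeated for every row and every field, which is where the extra ~40 s at 50,000 rows comes from; the direct call is resolved once at compile time.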
Column pruning: instead of reading all fields from HBase, read only the fields the calculation requires. With 50,000 rows, reading all fields takes about 60 s, while reading only the required fields takes about 25 s.
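In the hbase-client API this pruning is expressed on the Scan object. The following is a configuration sketch only (it is not run here, and the column family "cf" and the qualifier names are assumptions, not the system's real schema):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanConfig {
    // Build a Scan that requests only the columns the keyword count needs,
    // instead of transferring every column of every row.
    static Scan prunedScan() {
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("content"));     // assumed qualifier
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("publishTime")); // assumed qualifier
        scan.setCaching(500); // rows fetched per RPC round trip
        return scan;
    }
}
```

Restricting the scan to named columns cuts both the data read from region servers and the bytes shipped over the network, which is what the 60 s → 25 s improvement reflects.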
When fetching DC data from UC, replace the map function with mapPartitions, so that data can be fetched from HBase in batches and only one HBase connection is needed per partition instead of one per record.
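The connection-cost difference between the two styles can be simulated without a cluster. FakeHBaseConnection below is a stand-in for a real client, and the record keys are invented; the point is only how many connections each style opens.

```java
import java.util.Iterator;
import java.util.List;

// Contrasts map-style (one connection per record) with mapPartitions-style
// (one connection per partition iterator).
public class PartitionBatchingDemo {

    static class FakeHBaseConnection {
        String fetch(String key) { return "data:" + key; } // pretend remote lookup
    }

    // map-style: a new connection is opened for every single record.
    static int perRecordConnections(List<String> keys) {
        int opened = 0;
        for (String k : keys) {
            FakeHBaseConnection conn = new FakeHBaseConnection();
            opened++;
            conn.fetch(k);
        }
        return opened;
    }

    // mapPartitions-style: one connection serves the whole partition iterator.
    static int perPartitionConnections(Iterator<String> partition) {
        FakeHBaseConnection conn = new FakeHBaseConnection();
        while (partition.hasNext()) conn.fetch(partition.next());
        return 1;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "b", "c", "d");
        System.out.println(perRecordConnections(keys) + " vs "
                + perPartitionConnections(keys.iterator())); // prints "4 vs 1"
    }
}
```

In Spark the same shape appears as `rdd.mapPartitions(iter -> ...)`: the function receives the whole partition's iterator, so expensive setup such as a connection is paid once per partition rather than once per element.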
When storing the calculation results, use the foreachPartition function. While traversing the partition's Iterator, do not write each result individually inside the loop; instead, maintain a buffer outside the loop and write the results in batches.
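The buffering pattern inside foreachPartition looks roughly like this. It is a sketch: the sink here only counts flushes where a real one would issue a bulk put, and the batch size of 100 is an assumed value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Batched writes: buffer results outside the per-element loop and flush in chunks.
public class BatchedWriteDemo {

    // Traverse a partition's iterator, flushing the buffer whenever it fills,
    // plus once at the end for the tail. Returns the number of flushes (i.e.
    // the number of bulk write calls a real sink would receive).
    static int writePartition(Iterator<String> results, int batchSize) {
        List<String> buffer = new ArrayList<>();
        int flushes = 0;
        while (results.hasNext()) {
            buffer.add(results.next());
            if (buffer.size() >= batchSize) {
                flushes++;        // a real sink would issue one bulk put here
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            flushes++;            // flush the remaining tail
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<String> results = Collections.nCopies(250, "result");
        // 250 results at batch size 100 -> 3 bulk writes instead of 250 single writes
        System.out.println(writePartition(results.iterator(), 100)); // prints 3
    }
}
```

The saving is the same as in the mapPartitions case: per-element round trips to the store are collapsed into a handful of bulk operations per partition.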
Make rational use of Spark cluster resources: the more resources available, the greater the cluster's computing power. A reasonable relationship between machine resources and task parallelism is: number of tasks = total CPU cores × 2 (or 3), so set the number of RDD partitions to the cluster's CPU core count × 2.
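As a worked example of the sizing rule above (the machine counts are illustrative, not the real cluster):

```java
public class PartitionSizing {
    // Rule of thumb from the text: partitions = total cluster CPU cores * factor,
    // where factor is 2 (or 3).
    static int partitionsFor(int totalClusterCores, int factor) {
        return totalClusterCores * factor;
    }

    public static void main(String[] args) {
        // e.g. an assumed cluster of 10 machines with 8 cores each:
        int cores = 10 * 8;
        System.out.println(partitionsFor(cores, 2)); // prints 160
    }
}
```

This value would then be passed where the RDD is created or repartitioned, e.g. as the numPartitions argument to repartition.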
The parallelism of reading from HBase is tied to the number of regions in the table. By default a new table has only one region; a region splits only once it grows past the split threshold, and with a large threshold a great deal of data can accumulate in a single region. When querying a table with, say, 5 regions, 5 threads scan the 5 regions in parallel; but if one region holds 10 times as much data as the others, reading it takes roughly 10 times as long, and that region delays the whole task. The fix is to pre-split the table into regions and apply hash/MD5 salting to the rowkey so that data is distributed evenly across the regions, allowing the reads to be spread concurrently and evenly.
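A common way to implement the salting step is to prefix each rowkey with a few hex characters derived from its MD5 digest, so writes spread uniformly over pre-split regions. This is a sketch of that idea, not the system's actual key scheme; the one-byte (256-bucket) prefix is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Salt a rowkey with an MD5-derived bucket prefix so that lexicographically
// adjacent keys land in different pre-split regions.
public class SaltedRowkey {
    static String saltedKey(String rowkey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(rowkey.getBytes(StandardCharsets.UTF_8));
            // First digest byte (as two hex chars) is the bucket; the original
            // key is kept after the prefix so it can still be recovered.
            return String.format("%02x_%s", digest[0] & 0xff, rowkey);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }

    public static void main(String[] args) {
        // Hypothetical rowkey; the format is invented for illustration.
        System.out.println(saltedKey("site1_20240603_tweet42"));
    }
}
```

The table would then be pre-split on the 256 prefix boundaries (or fewer, e.g. 16 one-hex-char buckets), so each region receives an even share of the keys and a scan can fan out evenly across them.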