This article presents a comparative case analysis of an item-based collaborative filtering recommendation algorithm implemented on MapReduce and on Spark. It walks through the problem and a corresponding solution in detail, in the hope of helping readers facing the same problem find a simpler, more practical approach.
MapReduce provides strong support for big data mining, but complex mining algorithms often require multiple MapReduce jobs to complete, and the redundant disk reads/writes and repeated resource requests between jobs cause serious performance problems for MapReduce-based implementations. Spark, a rising star in big data processing, benefits from its strengths in iterative and in-memory computing: it can automatically schedule complex computing tasks and avoid writing intermediate results to disk and re-requesting resources, which makes it well suited to data mining algorithms. Tencent's TDW Spark platform has been deeply customized on top of the latest community Spark release and substantially improved in performance, stability, and scale, providing strong support for big data mining tasks.
The following compares implementations of the item-based collaborative filtering recommendation algorithm on TDW Spark and on MapReduce. Compared with MapReduce, TDW Spark reduces execution time by 66% and computing cost by 40%.
Algorithm introduction
The development of the Internet has led to an information explosion. Faced with massive amounts of information, how to select and filter it and show users the content that interests them most has become an urgent problem. By connecting users and information, a recommendation system helps users obtain useful information on the one hand, and puts information in front of the users who are interested in it on the other, achieving a win-win for information providers and users.
Collaborative filtering recommendation (Collaborative Filtering Recommendation) is the most classic and most widely used class of recommendation algorithms. By analyzing user interests, the algorithm finds users in the population who are similar to a given user and combines those similar users' evaluations of an item to predict how much the given user will like that item. Collaborative filtering can be subdivided into three categories:
User-based CF: user-based collaborative filtering, which estimates the similarity between users from their ratings of different Items and makes recommendations according to user similarity.
Item-based CF: item-based collaborative filtering, which estimates the similarity between Items from users' ratings of different Items and makes recommendations according to Item similarity.
Model-based CF: model-based collaborative filtering (Model-based Collaborative Filtering), which first learns a model from historical data and then uses that model for prediction and recommendation.
Problem description
Input data format: Uid,ItemId,Rating (user Uid's rating of ItemId).
Output data: for each ItemId, the top N ItemIds with the highest similarity.
Due to space constraints, only the Item-based collaborative filtering algorithm is used to solve this example.
Algorithmic logic
The basic assumption of the item-based collaborative filtering algorithm is that two similar Items are more likely to be liked by the same user. The algorithm therefore first computes each user's preference for Items, then computes the similarity between Items from those preferences, and finally finds the top N Items most similar to each Item. The detailed steps are as follows:
Calculate user preferences: rating scales can vary greatly across users, so each user's ratings are first binarized; for example, an Item rated above the user's own average rating is marked as liked (1), otherwise as not liked (0).
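As an illustration, the sketch below binarizes a small in-memory list of (uid, item_id, rating) triples against each user's own mean rating; the data layout and function name are assumptions for this example only, not the original implementation.

from collections import defaultdict

def binarize(ratings):
    """Mark a rating as 1 ("liked") if it exceeds the user's own mean, else 0."""
    totals = defaultdict(lambda: [0.0, 0])          # uid -> [sum of ratings, count]
    for uid, _, rating in ratings:
        totals[uid][0] += rating
        totals[uid][1] += 1
    means = {uid: s / n for uid, (s, n) in totals.items()}
    return [(uid, item, 1 if rating > means[uid] else 0)
            for uid, item, rating in ratings]

# toy usage
prefs = binarize([("u1", "i1", 5), ("u1", "i2", 2), ("u2", "i1", 5), ("u2", "i3", 3)])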
Calculate Item similarity: the Jaccard coefficient is used to measure the similarity of two Items. Jaccard similarity measures how similar two sets are: the size of their intersection divided by the size of their union. The calculation is divided into the following three steps (a sketch follows the list).
1) Count likers per Item: count the number of users who gave each Item a positive (liked) mark.
2) Count Item pairs: for every pair of associated Items, count the number of users who liked both.
3) Compute Item similarity: compute the similarity of every pair of related Items.
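A plain-Python sketch of these three steps over the binarized triples from the previous example, where the Jaccard similarity of two Items is the number of users who liked both divided by the number of users who liked either. Variable names are illustrative.

from collections import defaultdict
from itertools import combinations

def jaccard_similarities(prefs):
    # 1) collect the set of users who liked each Item
    likers = defaultdict(set)
    for uid, item, flag in prefs:
        if flag == 1:
            likers[item].add(uid)
    # 2) for each pair of Items, count users who liked both
    # 3) similarity = |intersection| / |union| for every related pair
    sims = {}
    for a, b in combinations(sorted(likers), 2):
        both = len(likers[a] & likers[b])
        if both:
            sims[(a, b)] = both / len(likers[a] | likers[b])
    return sims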
Find the top N most similar Items. In this step the Item similarities also need to be normalized and integrated, and then the top N Items most similar to each Item are selected. This is divided into the following three steps (see the sketch after the list).
1) Normalize the Item similarities.
2) Integrate the Item similarity scores.
3) For each Item, take the top N Items with the highest similarity.
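A minimal sketch of the top-N step, assuming `sims` is the pair-to-score mapping from the previous example and using each Item's maximum score as the normalization factor; the article does not spell out its exact normalization and integration formulas, so these are illustrative choices.

from collections import defaultdict
import heapq

def top_n_similar(sims, n=10):
    by_item = defaultdict(list)
    for (a, b), s in sims.items():           # expand each unordered pair in both directions
        by_item[a].append((b, s))
        by_item[b].append((a, s))
    result = {}
    for item, neighbours in by_item.items():
        max_s = max(s for _, s in neighbours)                 # normalize per Item
        normed = [(other, s / max_s) for other, s in neighbours]
        result[item] = heapq.nlargest(n, normed, key=lambda x: x[1])
    return result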
Implementation scheme based on MapReduce
With the MapReduce programming model, each step requires its own MapReduce job, for a total of seven MapReduce jobs. Each job contains a Map phase and a Reduce phase: the Map phase reads data from HDFS and its output key-value pairs are sent through Shuffle to the Reduce phase as input, and the Reduce phase writes the processed key-value pairs back to HDFS. The process is shown in figure 1.
Seven MapReduce jobs mean seven rounds of HDFS reads and writes, and their inputs and outputs are chained together, as shown in figure 2.
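To give a feel for what one of these seven jobs involves, here is a hedged Hadoop Streaming style mapper/reducer pair for the "count likers per Item" step; the comma-separated input layout and the role-selection flag are assumptions for illustration, not taken from the original implementation.

import sys

def mapper():
    # input lines: uid,item_id,flag (flag is the binarized preference)
    for line in sys.stdin:
        uid, item, flag = line.rstrip("\n").split(",")
        if flag == "1":
            print(f"{item}\t1")

def reducer():
    # input is sorted by key; sum the 1s per Item
    current, count = None, 0
    for line in sys.stdin:
        item, one = line.rstrip("\n").split("\t")
        if item != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = item, 0
        count += int(one)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # a single script used for both roles; "map" or "reduce" passed as an argument
    (mapper if sys.argv[1:] == ["map"] else reducer)()

Each such job writes its output back to HDFS before the next job can read it, which is exactly the overhead the Spark version avoids.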
Compared with MapReduce, Spark optimizes job execution time and resource usage in the following ways.
DAG programming model. With Spark's DAG programming model, the seven MapReduce jobs can be collapsed into one Spark job. Spark automatically splits the job into eight Stages, each containing multiple Tasks that run in parallel, and data is passed between Stages through Shuffle. In the end HDFS only needs to be read and written once, eliminating six rounds of HDFS reads and writes and reducing HDFS I/O by 70%.
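To illustrate how the chained jobs collapse into a single DAG, here is a hedged PySpark sketch of the first stages of the pipeline (binarization and pair counting) expressed as one job; the HDFS paths and field layout are assumptions for illustration only, not the TDW code.

from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName="item-cf")

ratings = sc.textFile("hdfs:///input/ratings")                 # the single HDFS read
parsed = ratings.map(lambda l: l.split(",")) \
                .map(lambda t: (t[0], (t[1], float(t[2]))))     # (uid, (item, rating))

# binarize against each user's mean rating
means = parsed.mapValues(lambda iv: (iv[1], 1)) \
              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])) \
              .mapValues(lambda t: t[0] / t[1])                 # (uid, mean)
liked = parsed.join(means) \
              .filter(lambda kv: kv[1][0][1] > kv[1][1]) \
              .map(lambda kv: (kv[0], kv[1][0][0]))             # (uid, item) liked pairs

# count users who liked both Items of every pair
pairs = liked.groupByKey() \
             .flatMap(lambda kv: [((a, b), 1)
                                  for a, b in combinations(sorted(kv[1]), 2)]) \
             .reduceByKey(lambda a, b: a + b)

# the similarity and top-N stages would continue in the same job;
# only the final result is written out
pairs.saveAsTextFile("hdfs:///output/item-pairs")               # the single HDFS write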
After a Spark job starts, it requests the Executor resources it needs once, and the Tasks of all Stages run as threads inside shared Executors. Compared with the MapReduce approach, the number of resource requests Spark makes is reduced by nearly 90%.
Spark introduces the RDD (Resilient Distributed Dataset) model: intermediate data is held as RDDs, and RDDs are distributed across the memory of the worker nodes, which reduces the number of disk reads and writes during computation. RDDs also provide a Cache mechanism; for example, after rdd3 above is cached, both rdd4 and rdd7 can read rdd3's data from memory, avoiding the problem in the MapReduce version where MR2 and MR3 read the same data repeatedly.
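Continuing the PySpark sketch above, caching can be illustrated in the spirit of the rdd3/rdd4/rdd7 example: the names mirror the text and stand in for the article's RDDs rather than coming from the original code.

# cache the intermediate (item, uid) pairs so both downstream branches reuse
# the in-memory copy instead of recomputing it from HDFS
rdd3 = liked.map(lambda kv: (kv[1], kv[0])).cache()

rdd4 = rdd3.countByKey()                          # first branch reads the cached data
rdd7 = rdd3.groupByKey().mapValues(len)           # second branch reuses the same cache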
Effect comparison
The test uses resources of the same size. In MapReduce mode there are 200 Maps and 100 Reduces, each with 4 GB of memory. Because Spark no longer needs Reduce resources, and in MapReduce the main logic and resource consumption fall on the Map side, the Spark tests use 200 and 400 Executors respectively, each with 4 GB of memory. The test results are shown in the table below; the input contains about 3.8 billion records.
Comparing the first and second rows of the results table, Spark's running time and cost are significantly lower than MapReduce's. The DAG model cuts HDFS reads and writes by 70%, and the cache mechanism avoids re-reading duplicate data; these two optimizations reduce both the running time and the cost of the job, while the reduced number of resource scheduling requests improves its running efficiency.
Comparing the second and third rows, doubling the number of Executors reduces the job's running time by about 50% but increases the cost by about 25%. This shows that adding Executor resources effectively shortens running time, but the speedup is not perfectly linear. The reason is that individual Tasks do not all take the same time; for example, some Tasks process more data than others. Tasks that finish late at the end of a Stage delay the start of the next Stage, while the job keeps holding all of its Executors, so some Executors sit idle and the cost rises.
Most data mining workloads have complex processing logic, and the traditional MapReduce/Pig frameworks suffer serious performance problems on such tasks. For these tasks, exploiting Spark's iterative and in-memory computing advantages can greatly reduce running time and computing cost. TDW currently maintains Spark clusters at a scale of thousands of machines and will continue to improve resource utilization, stability, and ease of use to better support the business.
This concludes the case analysis comparing collaborative filtering recommendation algorithms on MapReduce and Spark. I hope the above content is of some help to you.