Collaborative filtering algorithm based on Spark MLlib platform-Movie recommendation system
It has been a while since I last wrote an article. A project I am working on recently calls for financial product recommendations, so this is a good occasion to look back and review how collaborative filtering algorithms are applied in recommendation systems.
When it comes to recommendation systems, you may immediately think of collaborative filtering. This article implements a simple movie-recommendation application on the Spark MLlib platform. It covers three parts:
Overview of collaborative filtering algorithms
Model-based collaborative filtering application: movie recommendation
Analysis of a real-time recommendation architecture
I. Overview of collaborative filtering algorithms
My study of these algorithms is not very deep yet, so here is only a brief introduction to how they work.
In general, collaborative filtering algorithms can be divided into:
1) User-based (UserCF)
2) Item-based (ItemCF)
3) Model-based (ModelCF)
Model-based approaches can be further divided into:
1) Nearest-neighbor model: distance-based collaborative filtering
2) Latent factor model (e.g. SVD): based on matrix factorization
3) Graph model: social-network graph model
This article uses model-based collaborative filtering built on matrix factorization.
1. User-based (UserCF): based on user similarity
User-based collaborative filtering measures the similarity between users from the ratings different users give, and makes recommendations based on that similarity. Simply put, it recommends to a user the items liked by other users whose interests are similar to his.
For example:
As shown in the figure, there are three users A, B, and C and four items a, b, c, and d, and we need to recommend items to user A. Since user A and user C have both bought items a and c, we consider user A and user C to be very similar. User C has also bought item d, so we recommend item d to user A.
The basic idea of UserCF is quite simple: based on users' preferences for items, find the neighboring users, then recommend the items those neighbors like to the current user.
Computationally, each user's preferences over all items form a vector, and similarity is computed between these vectors. After finding the K nearest neighbors, the current user's score for items he has not yet rated is predicted from the neighbors' similarity weights and their preferences, and a ranked item list is produced as the recommendation. A minimal sketch of this computation follows.
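To make the computation concrete, here is a minimal, self-contained sketch of user-based CF in plain Scala. It is illustrative only: the tiny rating table, the choice of cosine similarity, and K = 2 neighbors are my assumptions, not part of the article's code.

// Minimal user-based CF sketch (illustrative; data and names are made up)
object UserCFSketch {
  type Ratings = Map[String, Map[String, Double]] // user -> (item -> score)

  // Cosine similarity between two users' rating vectors
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val common = a.keySet intersect b.keySet
    if (common.isEmpty) 0.0
    else {
      val dot = common.toSeq.map(i => a(i) * b(i)).sum
      val na = math.sqrt(a.values.map(v => v * v).sum)
      val nb = math.sqrt(b.values.map(v => v * v).sum)
      dot / (na * nb)
    }
  }

  // Score the items the target user has not rated, weighted by the K nearest neighbors
  def recommend(ratings: Ratings, user: String, k: Int = 2): Seq[(String, Double)] = {
    val neighbors = (ratings - user).toSeq
      .map { case (u, r) => (u, cosine(ratings(user), r)) }
      .sortBy(-_._2).take(k)
    val seen = ratings(user).keySet
    neighbors
      .flatMap { case (u, sim) => ratings(u).collect { case (i, s) if !seen(i) => (i, sim * s) } }
      .groupBy(_._1).toSeq
      .map { case (i, weighted) => (i, weighted.map(_._2).sum) }
      .sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    // The example from the text: users A and C both rated items a and c, so they are similar
    val ratings: Ratings = Map(
      "A" -> Map("a" -> 5.0, "c" -> 4.0),
      "B" -> Map("b" -> 3.0),
      "C" -> Map("a" -> 4.0, "c" -> 5.0, "d" -> 4.0))
    println(recommend(ratings, "A")) // item d surfaces via the similar user C
  }
}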
2. Item-based (ItemCF): based on item similarity
Item-based collaborative filtering measures the similarity between items from the ratings different users give them, and makes recommendations based on that similarity. Simply put, it recommends to the user items similar to those he liked before.
For example:
As shown in the figure, there are three users A, B, and C and three items a, b, and c, and we need to recommend items to user C. User A has bought items a and c, user B has bought items a, b, and c, and user C has bought item a. Since both user A and user B bought items a and c together, items a and c appear to be very similar. User C has bought item a, so we recommend item c to user C.
The principle of ItemCF is similar to UserCF, except that neighbors are computed among the items themselves rather than from the user's point of view: similar items are found based on users' preferences for items, and those similar items are then recommended according to the user's historical preferences.
Computationally, all users' preferences for a given item form a vector, and similarity between items is computed from these vectors to obtain the similar items. The current user's score for items he has not yet rated is then predicted from his historical preferences, and a ranked item list is produced as the recommendation; see the sketch below.
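Here is the matching item-based sketch in plain Scala, again illustrative only: the rating table mirrors the figure's example, and cosine similarity over item vectors is my assumption.

// Minimal item-based CF sketch (illustrative; data and names are made up)
object ItemCFSketch {
  // Cosine similarity between two items' rating vectors (user -> score)
  def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
    val common = a.keySet intersect b.keySet
    if (common.isEmpty) 0.0
    else {
      val dot = common.toSeq.map(u => a(u) * b(u)).sum
      dot / (math.sqrt(a.values.map(v => v * v).sum) * math.sqrt(b.values.map(v => v * v).sum))
    }
  }

  def main(args: Array[String]): Unit = {
    // The example from the text: user -> (item -> score)
    val ratings = Map(
      "A" -> Map("a" -> 5.0, "c" -> 4.0),
      "B" -> Map("a" -> 4.0, "b" -> 3.0, "c" -> 5.0),
      "C" -> Map("a" -> 4.0))
    // Invert to item -> (user -> score): each item becomes a vector of all users' scores
    val byItem = ratings.toSeq
      .flatMap { case (u, m) => m.map { case (i, s) => (i, u, s) } }
      .groupBy(_._1)
      .map { case (i, ts) => i -> ts.map(t => t._2 -> t._3).toMap }
    // Items most similar to item a, which user C has bought
    val sims = (byItem - "a").toSeq
      .map { case (i, v) => (i, cosine(byItem("a"), v)) }
      .sortBy(-_._2)
    println(sims) // item c ranks above item b, so recommend c to user C
  }
}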
3. Model-based (ModelCF)
Model-based collaborative filtering trains a recommendation model from sample user-preference data, then predicts and computes recommendations from real-time user-preference data.
This article uses the model based on matrix factorization.
Spark MLlib currently supports model-based collaborative filtering, in which users and items are described by a small set of latent factors, and these factors are used to predict the missing entries of the rating matrix. MLlib uses the alternating least squares (ALS) algorithm to learn these latent factors.
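For reference, the regularized least-squares objective that ALS minimizes can be written as follows (this is the standard formulation; the notation here is mine, not taken from the Spark docs):

$$\min_{X,\,Y} \sum_{(u,i)\in\Omega} \left(r_{ui} - x_u^{\top} y_i\right)^2 + \lambda\left(\sum_u \|x_u\|^2 + \sum_i \|y_i\|^2\right)$$

where $r_{ui}$ is the known rating of user $u$ for item $i$, $x_u$ and $y_i$ are the latent-factor vectors of user $u$ and item $i$, $\Omega$ is the set of observed ratings, and $\lambda$ is the regularization parameter (the lambda passed to ALS.train in the code below). ALS alternates between solving for all $x_u$ with the $y_i$ fixed and solving for all $y_i$ with the $x_u$ fixed; each step is an ordinary least-squares problem.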
If you are interested, you can read this part of the Spark source code (org.apache.spark.mllib.recommendation.ALS).
II. Model-based collaborative filtering application: movie recommendation
This section implements a simple application that recommends movies to users.
1. Test data description
The test data consists of four files (see the dataset's README for a detailed description):
1) User data file (users.dat)
UserID::Gender::Age::Occupation::Zip-code
2) Movie data file (movies.dat)
MovieID::Title::Genres
3) Rating data file (ratings.dat)
UserID::MovieID::Rating::Timestamp
4) Test data (personalRatings.txt)
UserID::MovieID::Rating::Timestamp
Here, the first three data files are used for model training, and the fourth data file is used to test the model.
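For illustration, individual lines in the MovieLens 1M files look like this (example values):

1::F::1::10::48067                                  (users.dat)
1::Toy Story (1995)::Animation|Children's|Comedy    (movies.dat)
1::1193::5::978300760                               (ratings.dat)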
2. Implementation code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import scala.io.Source

object MovieLensALS {
  def main(args: Array[String]) {
    // Keep unnecessary logs off the terminal
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.eclipse.jetty.server").setLevel(Level.OFF)

    // Set up the running environment
    val sparkConf = new SparkConf().setAppName("MovieLensALS").setMaster("local[5]")
    val sc = new SparkContext(sparkConf)

    // Load the personal ratings produced by the rater (the generated file personalRatings.txt)
    val myRatings = loadRatings(args(1))
    val myRatingsRDD = sc.parallelize(myRatings, 1)

    // Sample data directory
    val movieLensHomeDir = args(0)

    // Load the sample ratings; the Timestamp column modulo 10 is used as the key,
    // giving pairs of (Int, Rating)
    val ratings = sc.textFile(movieLensHomeDir + "/ratings.dat").map { line =>
      val fields = line.split("::")
      // format: (timestamp % 10, Rating(userId, movieId, rating))
      (fields(3).toLong % 10, Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble))
    }

    // Load the movie lookup table (movie ID -> movie title)
    val movies = sc.textFile(movieLensHomeDir + "/movies.dat").map { line =>
      val fields = line.split("::")
      // format: (movieId, movieName)
      (fields(0).toInt, fields(1))
    }.collect().toMap

    // Count the users, movies, and ratings
    val numRatings = ratings.count()
    val numUsers = ratings.map(_._2.user).distinct().count()
    val numMovies = ratings.map(_._2.product).distinct().count()
    println("Got " + numRatings + " ratings from " + numUsers + " users on " + numMovies + " movies.")

    // Split the ratings by key into training (60%, plus the personal ratings),
    // validation (20%), and test (20%) sets.
    // These RDDs are reused many times, so persist them in memory.
    val numPartitions = 4
    val training = ratings.filter(x => x._1 < 6).values
      .union(myRatingsRDD).repartition(numPartitions).persist()
    val validation = ratings.filter(x => x._1 >= 6 && x._1 < 8).values
      .repartition(numPartitions).persist()
    val test = ratings.filter(x => x._1 >= 8).values.persist()

    val numTraining = training.count()
    val numValidation = validation.count()
    val numTest = test.count()
    println("Training: " + numTraining + ", validation: " + numValidation + ", test: " + numTest)

    // Train models under different parameter combinations and evaluate them
    // on the validation set to obtain the model with the best parameters
    val ranks = List(8, 12)
    val lambdas = List(0.1, 10.0)
    val numIters = List(10, 20)
    var bestModel: Option[MatrixFactorizationModel] = None
    var bestValidationRmse = Double.MaxValue
    var bestRank = 0
    var bestLambda = -1.0
    var bestNumIter = -1
    for (rank <- ranks; lambda <- lambdas; numIter <- numIters) {
      val model = ALS.train(training, rank, numIter, lambda)
      val validationRmse = computeRmse(model, validation, numValidation)
      if (validationRmse < bestValidationRmse) {
        bestModel = Some(model)
        bestValidationRmse = validationRmse
        bestRank = rank
        bestLambda = lambda
        bestNumIter = numIter
      }
    }

    // Evaluate the best model on the test set
    val testRmse = computeRmse(bestModel.get, test, numTest)
    println("The best model was trained with rank = " + bestRank + ", lambda = " + bestLambda
      + " and numIter = " + bestNumIter + "; its RMSE on the test set is " + testRmse + ".")

    // Compare the best model against a naive baseline that always predicts the mean rating
    val meanRating = training.union(validation).map(_.rating).mean()
    val baselineRmse =
      math.sqrt(test.map(x => (meanRating - x.rating) * (meanRating - x.rating)).mean())
    val improvement = (baselineRmse - testRmse) / baselineRmse * 100
    println("The best model improves the baseline by " + "%1.2f".format(improvement) + "%.")

    // Recommend the 10 highest-predicted movies that user 0
    // (the id used in personalRatings.txt) has not rated yet
    val myRatedMovieIds = myRatings.map(_.product).toSet
    val candidates = sc.parallelize(movies.keys.filter(!myRatedMovieIds.contains(_)).toSeq)
    val recommendations = bestModel.get
      .predict(candidates.map((0, _)))
      .collect()
      .sortBy(-_.rating)
      .take(10)
    var i = 1
    println("Movies recommended for you:")
    recommendations.foreach { r =>
      println("%2d".format(i) + ": " + movies(r.product))
      i += 1
    }

    sc.stop()
  }

  /** Compute the root mean squared error between predicted and actual ratings. */
  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
    val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
    val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
      .join(data.map(x => ((x.user, x.product), x.rating))).values
    math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
  }

  /** Load the personal rating file personalRatings.txt. */
  def loadRatings(path: String): Seq[Rating] = {
    val lines = Source.fromFile(path).getLines()
    val ratings = lines.map { line =>
      val fields = line.split("::")
      Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
    }.filter(_.rating > 0.0).toList
    if (ratings.isEmpty) {
      sys.error("No ratings provided.")
    } else {
      ratings
    }
  }
}
3. Run the program
1) Set parameters and run the program
The program takes two input parameters: the first is the data directory, and the second is the personal-ratings test file. A possible invocation is sketched below.
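For example, assuming the application has been packaged into a jar (the jar name and paths here are placeholders, not from the article):

spark-submit --class MovieLensALS movielens-als.jar /path/to/ml-1m /path/to/personalRatings.txt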
2) Program output: the model-training process
From the output, there are 6,040 users, 3,706 distinct movies, and 1,000,209 ratings in total. The data is split into three parts: 60% for training, 20% for validation, and 20% for testing the model. The output then lists the root mean squared error (RMSE) of the model under each parameter combination; the combination with the smallest validation RMSE (0.8665911...) is chosen as the optimal parameter set, and the optimal model is built from it. The 20% test split is then used to measure the quality of this model: its test RMSE is 0.86493444, an improvement of 22.32% over the naive baseline.
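For reference, the RMSE reported here is computed exactly as in the computeRmse function above:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{(u,i)\in T}\left(\hat{r}_{ui} - r_{ui}\right)^{2}}$$

where $T$ is the evaluation set of $n$ ratings, $\hat{r}_{ui}$ is the rating the model predicts for user $u$ and movie $i$, and $r_{ui}$ is the actual rating.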
One note on the data split (60% / 20% / 20%): it is better to divide the data randomly rather than by the last digit of the timestamp, which makes the results more convincing.
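A minimal sketch of such a random split, assuming the (key, Rating) RDD named ratings from the code above; the seed is fixed only to make the split reproducible:

// Random 60/20/20 split instead of splitting on timestamp % 10
val Array(trainingSet, validationSet, testSet) =
  ratings.values.randomSplit(Array(0.6, 0.2, 0.2), seed = 42L)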
3) Program output: the movie-recommendation result
Finally, the program recommends 10 movies that the user has never seen.
4. Summary
With that, a simple model-based movie-recommendation application is complete.
III. Real-time recommendation architecture analysis
The application above implements only batch, user-oriented recommendation, which has limited value in practice; to deliver real value, recommendations should ideally be produced in real time or near real time.
Here is a brief introduction to an architecture for real-time recommendation:
This framework is taken from Taobao's Spark on YARN real-time architecture. Some personal observations follow:
The architecture diagram is divided into three layers: offline, near-line, and online.
Offline part: mainly builds the model. The raw data is processed and cleaned by ETL to obtain the target data, and a suitable algorithm is trained on that business data to obtain the best model.
Near-line part: HBase is mainly used to store user-behavior information, and a model-mixing system combines the processing results of the explicit-feedback and implicit-feedback models and delivers the final recommendations to users.
Online part: there are two kinds of feedback, explicit and implicit. In my personal understanding, explicit feedback covers actions such as adding an item to the shopping cart or buying it, while implicit feedback covers signals such as how long a user lingers on a product and which items he clicks. To achieve real-time or near-real-time processing, Spark Streaming is used to process the event data (for example in a Flume + Kafka + Spark Streaming pipeline); a sketch of this online part follows.
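Here is a minimal sketch of that online part under stated assumptions: click events arrive on a Kafka topic as plain userId strings, the offline ALS model was saved with model.save, and the topic name, ZooKeeper address, and model path are all hypothetical. It requires the spark-streaming-kafka dependency (Spark 1.x era, matching the MLlib code above).

import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object OnlineRecommender {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("OnlineRecommender").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(10))
    // Model trained offline and saved with model.save(sc, "hdfs:///models/als") (path is hypothetical)
    val model = MatrixFactorizationModel.load(ssc.sparkContext, "hdfs:///models/als")
    // Each Kafka message value is assumed to be a userId string
    val events = KafkaUtils
      .createStream(ssc, "zk-host:2181", "recommender", Map("user-events" -> 1))
      .map(_._2)
    events.foreachRDD { rdd =>
      // Re-score each user active in this batch with the offline model
      // (foreachRDD runs on the driver, so the model can be used directly)
      rdd.map(_.toInt).distinct().collect().foreach { userId =>
        model.recommendProducts(userId, 10)
          .foreach(r => println(s"user $userId -> movie ${r.product}"))
      }
    }
    ssc.start()
    ssc.awaitTermination()
  }
}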
These are just my personal understandings, with their shortcomings; I welcome any advice.