
How to Use Spark to Learn the Matrix Factorization Recommendation Algorithm


Many newcomers are not clear about how to use Spark to learn the matrix factorization recommendation algorithm. To help you solve this problem, this article explains it in detail below; anyone who needs it can learn from it, and I hope you gain something.

In our earlier article on applying matrix factorization to collaborative filtering recommendation algorithms, we summarized the principles of matrix factorization in recommendation. Here we use Spark to study the matrix factorization recommendation algorithm from a practical point of view.

1. Overview of the Spark recommendation algorithm

In Spark MLlib, the only recommendation algorithm implemented is collaborative filtering based on matrix factorization. The implementation is based on the FunkSVD algorithm, which factorizes the rating matrix $M$ of $m$ users and $n$ items into two low-dimensional matrices:

$$M_{m \times n} = P_{m \times k}^T Q_{k \times n}$$

where $k$ is the dimension of the low-rank factorization and is generally much smaller than $m$ and $n$. If you are not familiar with the FunkSVD algorithm, you can review the corresponding principles first.

2. Introduction to the Spark recommendation algorithm library

In Spark MLlib, the implemented FunkSVD algorithm provides interfaces for Python, Java, Scala, and R. Since the previous practice articles are all based on Python, the introduction and usage later in this article also use the Python interface of MLlib.

The Python interfaces of the Spark MLlib recommendation algorithm all live in the pyspark.mllib.recommendation package, which contains three classes: Rating, MatrixFactorizationModel, and ALS. Although there are three classes, the underlying algorithm is still only FunkSVD. The purpose of each of these classes is described below.

The Rating class is relatively simple: it just encapsulates the three values user, item, and rating. In other words, a Rating is only a (user, item, rating) triple and has no functional interface.
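As a minimal sketch (Python 2 syntax, matching the Spark 1.6 examples later in this article; the values are arbitrary), a Rating can be constructed and read like this:

from pyspark.mllib.recommendation import Rating

# a Rating is just a named (user, product, rating) triple
r = Rating(196, 242, 3.0)
print r.user, r.product, r.rating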

ALS is responsible for training our FunkSVD model. We refer to the alternating least squares method (ALS) here because Spark optimizes the FunkSVD matrix factorization objective with ALS. The ALS class has two training functions. The first, train, fits the model directly on our rating matrix. The second, trainImplicit, is slightly more complex: it fits the model on implicit feedback data and, compared with train, takes one additional parameter specifying the implicit feedback confidence threshold. For example, we can transform the rating matrix into an implicit feedback matrix, converting each rating into a confidence weight according to some feedback rule. Since implicit feedback rules generally depend on the specific problem and data, the remainder of this article only discusses factorization of an ordinary rating matrix.
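For reference, here is a sketch of the two training calls as exposed by pyspark.mllib in Spark 1.6; the ratings RDD is a stand-in for your own data:

from pyspark.mllib.recommendation import ALS

# explicit ratings: train(ratings, rank, iterations=5, lambda_=0.01, ...)
model = ALS.train(ratings, rank=10, iterations=5, lambda_=0.01)

# implicit feedback: trainImplicit adds the confidence parameter alpha
model_implicit = ALS.trainImplicit(ratings, rank=10, iterations=5, lambda_=0.01, alpha=0.01)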

The MatrixFactorizationModel class is the model we obtain by training with the ALS class, and it is what we use to make predictions. Common predictions include: the rating a given user would give a given item, the N items a given user likes most, the N users who most like a given item, the top N items for every user, and the top N users for every item.

We will have examples of the use of these classes later.

3. Important class parameters of Spark recommendation algorithm

Here we summarize the important parameters of the ALS training model.

1) ratings: the RDD corresponding to the rating matrix; this is the input we must supply. For implicit feedback, it is the implicit feedback matrix corresponding to the rating matrix.

2) rank: the dimension of the low-rank factorization, that is, the dimension $k$ in $P_{m \times k}^T Q_{k \times n}$. This value affects the quality of the matrix factorization; the larger it is, the more time and memory the algorithm takes. It usually needs to be tuned, and a value between 10 and 200 is typical.

3) iterations: the number of iterations when solving the matrix factorization with alternating least squares. This value depends on the dimensions and sparsity of the rating matrix. Generally it does not need to be large, for example 5 to 20. The default value is 5.

4) lambda: in the Python API this is written lambda_, because lambda is a reserved word in Python. This value is the regularization coefficient of the FunkSVD factorization; it mainly controls the model's degree of fitting and improves its generalization ability. The larger the value, the stronger the regularization penalty. Large recommendation systems generally need parameter tuning to find a suitable value.

5) alpha: this parameter is useful only when training with implicit feedback via trainImplicit. It specifies the implicit feedback confidence threshold; the larger the value, the more strongly we assume there is no correlation between a user and the items the user has not rated. It generally also needs tuning to find a suitable value.

As the description above shows, using the ALS algorithm is quite simple. Note that the main parameters to tune are the factorization dimension rank and the regularization hyperparameter lambda. For implicit feedback, the implicit feedback confidence threshold alpha also needs tuning. A rough illustration of such a parameter scan is sketched below.
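This is only a sketch: it assumes rates_data is the RDD of Rating objects built in section 4 and, for brevity, scores on the training data, whereas real tuning should evaluate on held-out data:

from pyspark.mllib.recommendation import ALS

for rank in [10, 20, 50]:
    for lam in [0.01, 0.02, 0.1]:
        model = ALS.train(rates_data, rank=rank, iterations=5, lambda_=lam)
        # predictAll takes an RDD of (user, product) pairs
        preds = model.predictAll(rates_data.map(lambda r: (r.user, r.product)))
        preds = preds.map(lambda r: ((r.user, r.product), r.rating))
        truth = rates_data.map(lambda r: ((r.user, r.product), r.rating))
        mse = truth.join(preds).map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
        print rank, lam, mse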

4. Spark recommendation algorithm example

Let's use a specific example to illustrate the use of the Spark matrix decomposition recommendation algorithm.

Here we use the MovieLens 100K data, which can be downloaded from the GroupLens MovieLens datasets page.

After unzipping the data, we use only the rating data in the u.data file. Each row of this dataset has four columns, corresponding to user ID, item ID, rating, and timestamp. Because my machine is rather underpowered, the examples below use only the first 100 rows of data, so if you use the full dataset your predictions will differ from mine.

First of all, you need to make sure you have installed Hadoop and Spark (version no lower than 1.6) and set the environment variables. Generally we work in an IPython notebook (Jupyter notebook), so it is convenient to set up a notebook-based Spark environment. Of course, it does not matter if you skip the notebook Spark environment; you just need to set the environment variables before each run.

If you do not have a notebook-based Spark environment, you need to run code like the following first. Of course, if you have already set one up, you can skip it.
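A minimal sketch of such setup code; every path below is a placeholder that must point at your own installation:

import os
import sys

# placeholder paths: point them at your own Spark installation
os.environ['SPARK_HOME'] = "/opt/spark-1.6.1-bin-hadoop2.6"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], "python"))

from pyspark import SparkContext

# create a context only if your notebook has not already provided sc
sc = SparkContext(appName="matrix-factorization-demo")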

Before running the algorithm, it is recommended to print the Spark context as follows. If the object's memory address is printed normally, the Spark runtime environment is ready:

print sc

First, we read the u.data file into memory and print the first line of data to check whether it was read successfully. Note that when copying the code, the data path should point to your own u.data directory. The code is as follows:
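A minimal sketch of this step; the path is a placeholder for your own u.data location:

# read the raw MovieLens ratings file, one line per rating
user_data = sc.textFile("C:/temp/ml-100k/u.data")
print user_data.first()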

The output is as follows:

u'196\t242\t3\t881250949'

You can see that the fields are separated by \t. We need to split each line's string into an array and keep only the first three columns, dropping the timestamp column. The code is as follows:
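A sketch of the splitting step, building on user_data above:

# split each line on tabs and keep only user, item, and rating
rates = user_data.map(lambda x: x.split("\t")[0:3])
print rates.first()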

The output is as follows:

[u'196', u'242', u'3']

At this point we have the RDD corresponding to the rating matrix, but the data are still strings, while Spark needs an RDD of Rating objects. So we now convert the RDD's data type as follows:
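A sketch of the conversion, building on rates above:

from pyspark.mllib.recommendation import Rating

# turn each string triple into a typed Rating object
rates_data = rates.map(lambda x: Rating(int(x[0]), int(x[1]), int(x[2])))
print rates_data.first()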

The output is as follows:

Rating(user=196, product=242, rating=3.0)

As you can see, our data is now an RDD of Rating objects, and we can finally train on the prepared data. We set the matrix factorization dimension to 20, the number of iterations to 5, and the regularization coefficient to 0.02. In practical applications, a suitable factorization dimension and regularization coefficient need to be chosen by cross-validation; we simplify here because this is only an example. The code is as follows:
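A sketch of the training call with these settings:

from pyspark.mllib.recommendation import ALS

# rank=20 latent factors, 5 ALS iterations, regularization 0.02
model = ALS.train(ratings=rates_data, rank=20, iterations=5, lambda_=0.02)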

After training the model, we can finally make predictions with the recommendation system.

First, the simplest prediction: for example, predicting user 38's rating of item 20. The code is as follows:

print model.predict(38, 20)

The output is as follows:

0.311633491603

It can be seen that the score is not high.

Now let's predict the 10 items user 38 likes most. The code is as follows:

print model.recommendProducts(38, 10)

The output is as follows:

[Rating(user=38, product=95, rating=4.995227969811873), Rating(user=38, product=304, rating=2.5159673379104484), Rating(user=38, product=1014, rating=2.165428673820349), Rating(user=38, product=322, rating=1.7002266119079879), Rating(user=38, product=111, rating=1.2057528774266673), Rating(user=38, product=196, rating=1.0612630766055788), Rating(user=38, product=23, rating=1.0590775012913558), Rating(user=38, product=327, rating=1.0335651317559753), Rating(user=38, product=98, rating=0.9677333686628911), ...]

These are the 10 items user 38 is predicted to like most, with the corresponding predicted scores listed from high to low.

Next, let's predict the 10 users to whom item 20 is most worth recommending. The code is as follows:

print model.recommendUsers(20, 10)

The output is as follows:

[Rating(user=115, product=20, rating=2.9892138653406635), Rating(user=25, product=20, rating=1.7558472892444517), Rating(user=7, product=20, rating=1.523935609195585), Rating(user=286, product=20, rating=1.3746309116764184), Rating(user=222, product=20, rating=1.313891405211581), Rating(user=135, product=20, rating=1.254412853860262), Rating(user=186, product=20, rating=1.2194811581542384), Rating(user=72, product=20, rating=1.1651855319930426), Rating(user=241, product=20, rating=1.0863391992741023), ...]

Now let's take a look at the three most recommended items for each user, with the following code:

print model.recommendProductsForUsers(3).collect()

Since the output is very long, it is not reproduced here.

The three most recommended users for each item can be found in the same way:

print model.recommendUsersForProducts(3).collect()

Again, because the output is very long, it is not reproduced here.

Did reading the above content help you? If you want to learn more about related knowledge or read more related articles, please follow the industry information channel. Thank you for your support.
