Shulou (Shulou.com), SLTechnology News&Howtos, 06/01 report. Updated 2025-03-28.
This article describes how similarity calculation is implemented in Mahout and summarizes the main similarity measures it provides.
Recommendation systems in practical use are generally based on collaborative filtering, which needs to calculate the similarity between users or between items. Data sources differ in volume and data type, so different similarity measures are needed to get good recommendation performance. Mahout provides a large number of similarity components, each implementing a different calculation method. The following figures show how these components relate to each other:
Figure 1. Item similarity calculation components
Figure 2. User similarity calculation components
Here are some key similarity calculation methods:
Pearson correlation
Class name: PearsonCorrelationSimilarity
Principle: a statistic that reflects the degree of linear correlation between two variables.
Range: [-1, 1]. The larger the absolute value, the stronger the correlation. Negative correlation is of little use for recommendation.
Notes: (1) the number of overlapping items is not taken into account; (2) if there is only one overlapping item, the similarity cannot be computed, because the calculation divides by n - 1; (3) if all overlapping values are equal, the similarity cannot be computed, because the standard deviation, which is used as a divisor, is 0.
This similarity is neither the best choice nor the worst. It is mentioned so often in early research mainly because it is easy to understand. Using the Pearson linear correlation coefficient assumes that the paired data are drawn from a normal distribution and that the values are evenly spaced, at least logically. Mahout extends the Pearson calculation with an enumeration-type parameter (Weighting) so that the number of overlapping items also influences the computed similarity.
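The two failure cases in the notes above are easy to see in a small sketch. The following Python code is an illustration of the formula only, not Mahout's Java implementation. The function name `pearson` and the choice to return `None` for undefined results are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists of co-rated values.

    Returns None when the value is undefined: fewer than two
    overlapping ratings, or one side has zero variance.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        return None  # only one overlapping item: nothing to correlate
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return None  # all values equal on one side: divisor would be 0
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sxx * syy)
```

Here `x` and `y` hold the ratings that two users gave to the items they have both rated, in the same item order.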
Euclidean distance similarity
Class name: EuclideanDistanceSimilarity
Principle: a similarity defined from the Euclidean distance d between two points.
Range: [0, 1]. The higher the value, the smaller d is, that is, the closer the two points and the greater the similarity.
Description: like Pearson similarity, this measure does not take the number of overlapping items into account. Mahout likewise adds an enumeration-type parameter (Weighting) so that the overlap count influences the computed similarity.
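One common way to turn a distance into a similarity in the range described above is 1 / (1 + d). The sketch below uses that mapping as an assumption. Mahout's own class may normalize the distance differently, and the function name is hypothetical.

```python
import math

def euclidean_similarity(x, y):
    """Map the Euclidean distance between x and y to a similarity.

    Uses 1 / (1 + d), an assumed mapping for illustration:
    d = 0 gives 1.0, and the value approaches 0 as d grows.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return 1.0 / (1.0 + d)
```

Identical rating vectors give the maximum similarity, and the value shrinks as the vectors move apart.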
Cosine similarity
Class names: PearsonCorrelationSimilarity and UncenteredCosineSimilarity
Principle: the cosine of the angle between two vectors in multi-dimensional space, each drawn from a fixed reference point (the origin) to one of the two data points.
Range: [-1, 1]. The larger the value, the smaller the angle, the closer the two points, and the greater the similarity.
Description: mathematically, if the attribute values of the two items are centered (the mean is subtracted), cosine similarity and Pearson similarity are identical. Mahout's Pearson implementation performs this centering, so Pearson similarity is also the cosine similarity of the centered data. In addition, newer versions of Mahout provide the UncenteredCosineSimilarity class, which computes cosine similarity on uncentered data.
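The relationship between cosine and Pearson similarity can be checked numerically. The sketch below is illustrative Python, not Mahout code. The helper names `cosine` and `center` are assumptions for this example.

```python
import math

def cosine(x, y):
    """Uncentered cosine of the angle between vectors x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return None  # a zero vector has no direction
    return dot / (norm_x * norm_y)

def center(v):
    """Subtract the mean from every component of v."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def centered_cosine(x, y):
    """Cosine after centering both vectors; this equals Pearson correlation."""
    return cosine(center(x), center(y))
```

For example, `cosine([1, 2, 3], [1, 3, 2])` and `centered_cosine([1, 2, 3], [1, 3, 2])` differ, because centering removes each vector's mean before the angle is measured.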
Spearman rank correlation coefficient
Class name: SpearmanCorrelationSimilarity
Principle: the Spearman rank correlation coefficient is usually regarded as the Pearson linear correlation coefficient computed on the ranks of the variables rather than on their raw values.
Range: [-1.0, 1.0]. It is 1.0 when the two rankings agree completely and -1.0 when they are exactly reversed.
Description: the calculation is very slow because it requires a large amount of sorting. For the datasets found in recommendation systems, the Spearman rank correlation coefficient is generally not a suitable similarity measure.
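The "Pearson on ranks" idea, and the sorting step that makes it expensive, can be sketched as follows. This is illustrative Python, not Mahout's implementation, and it assumes all values within a list are distinct (ties are not averaged). The function names are hypothetical.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # the costly sort
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the two rank lists."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        return None  # undefined for fewer than two overlapping items
    rx, ry = ranks(x), ranks(y)
    mean = (n + 1) / 2.0  # mean of the ranks 1..n
    sxy = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    sxx = sum((a - mean) ** 2 for a in rx)
    syy = sum((b - mean) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)
```

Because only the order of the values matters, `[1.0, 2.5, 4.0]` and `[10, 20, 30]` are perfectly correlated under this measure.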
Manhattan distance
Class name: CityBlockSimilarity
Principle: an implementation of Manhattan distance which, like Euclidean distance, measures the spatial distance between multi-dimensional data points.
Range: [0, 1], consistent with Euclidean distance similarity. The higher the value, the smaller the distance and the greater the similarity.
Description: requires less computation than Euclidean distance and therefore performs better.
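The lower cost comes from replacing squares and a square root with absolute differences. The sketch below again assumes the 1 / (1 + d) mapping from distance to similarity, which is an illustration and not necessarily what CityBlockSimilarity does internally.

```python
def manhattan_similarity(x, y):
    """Map the Manhattan (city block) distance to a similarity.

    The distance is the sum of absolute coordinate differences;
    the 1 / (1 + d) mapping is an assumption for illustration.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    d = sum(abs(a - b) for a, b in zip(x, y))
    return 1.0 / (1.0 + d)
```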
Tanimoto coefficient
Class name: TanimotoCoefficientSimilarity
Principle: also known as the generalized Jaccard coefficient, it is an extension of the Jaccard coefficient. For two sets of preferred items A and B, it is T(A, B) = |A ∩ B| / (|A| + |B| - |A ∩ B|).
Range: [0, 1]. It is 1 when the two sets overlap completely and 0 when they do not overlap at all. The closer it is to 1, the more similar they are.
Description: suited to preference data without rating values.
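For preference data without ratings, each user can be represented as the set of items they like. A minimal Python sketch of the set form of the coefficient follows. The function name and the `None` result for two empty sets are assumptions for this example.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets of item IDs."""
    a, b = set(a), set(b)
    intersection = len(a & b)
    union = len(a) + len(b) - intersection
    if union == 0:
        return None  # both sets empty: the ratio is undefined
    return intersection / union
```

For example, users who like items {1, 2, 3} and {2, 3, 4} share two of four distinct items.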
Log-likelihood similarity
Class name: LogLikelihoodSimilarity
Principle: based on three counts: the number of items the two have in common, the number of items only one of them has, and the number of items neither has.
Range: for details, see the paper "Accurate Methods for the Statistics of Surprise and Coincidence" (it can be found on Baidu Library, for example).
Note: handles preference data without rating values more intelligently than the Tanimoto coefficient.
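The counts named in the principle form a 2x2 contingency table, and the log-likelihood ratio from the paper measures how far that table is from what independence would predict. The Python sketch below illustrates the statistic. The final mapping 1 - 1 / (1 + LLR) to a similarity value is an assumption for this example, and all function and parameter names are hypothetical.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of counts.

    k11: both have it, k12: only the first, k21: only the second,
    k22: neither. Empty cells contribute nothing to the sum.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:
                total += k * math.log(k * n / (rows[i] * cols[j]))
    return max(0.0, 2.0 * total)  # clamp tiny negative rounding errors

def log_likelihood_similarity(a, b, universe_size):
    """Similarity of two item sets drawn from universe_size items (assumed mapping)."""
    a, b = set(a), set(b)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = universe_size - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    return 1.0 - 1.0 / (1.0 + llr)
```

Unlike the Tanimoto coefficient, this takes the "neither" count into account, so an overlap that could easily arise by chance scores low.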