How to implement similarity calculation in Mahout

2025-03-28 Update From: SLTechnology News&Howtos

This article describes in detail how similarity calculation is implemented in Mahout and can be used as a reference.

Recommendation systems in real-world use are generally based on collaborative filtering algorithms, which usually need to calculate the similarity between users or between items. Data sources differ in volume and data type, so different similarity calculation methods are needed to improve recommendation performance. Mahout provides a large number of similarity components, each implementing a different calculation method. The following figures show the relationships between these components:

Figure 1. Item similarity calculation component

Figure 2. User similarity calculation component

Here are some key similarity calculation methods:

Pearson correlation

Class name: PearsonCorrelationSimilarity

Principle: a statistic used to reflect the linear correlation of two variables.

Range: [-1, 1]. The larger the absolute value, the stronger the correlation; negative correlation is of little use for recommendation.

Note: (1) the number of overlapping items is not taken into account; (2) if only one item overlaps, the similarity cannot be calculated (the calculation divides by n-1); (3) if the overlapping values are all equal, the similarity cannot be calculated (the standard deviation is 0 and is used as a divisor).

This similarity is neither the best choice nor the worst; it is often mentioned in early research mainly because it is easy to understand. Using the Pearson linear correlation coefficient assumes that the paired data are drawn from a normal distribution and that the data are at least logically evenly spaced. Mahout provides an extension to the Pearson correlation calculation: an enumeration-type parameter (Weighting) makes the number of overlapping items a factor in the computed similarity.
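
The calculation and the two failure cases listed in the note can be sketched in plain Python. This illustrates the math only and is not Mahout's implementation; the function name `pearson_similarity` and the dictionary-based rating format are assumptions made for the example.

```python
import math

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users rated; None if undefined.

    prefs_a and prefs_b map item IDs to ratings (a hypothetical format).
    """
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # only one overlapping item: nothing to correlate
    xs = [prefs_a[i] for i in common]
    ys = [prefs_b[i] for i in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # all overlapping ratings equal: standard deviation is 0
    return cov / math.sqrt(var_x * var_y)

alice = {"a": 1.0, "b": 2.0, "c": 3.0}
bob = {"a": 2.0, "b": 4.0, "c": 6.0, "d": 1.0}
print(pearson_similarity(alice, bob))
```

Only items "a", "b" and "c" enter the calculation; item "d" is ignored because only one user rated it, which is exactly the "overlap count is not considered" limitation.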

Euclidean distance similarity

Class name: EuclideanDistanceSimilarity

Principle: a similarity defined from the Euclidean distance d.

Range: [0, 1]. The higher the value, the smaller d is; that is, the closer the distance, the greater the similarity.

Explanation: like Pearson similarity, this similarity does not take the number of overlapping items into account. Mahout likewise adds an enumeration-type parameter (Weighting) so that the overlap count becomes a factor in the computed similarity.
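
As a rough sketch, a distance d can be mapped into the (0, 1] range with 1 / (1 + d). That mapping is a common convention assumed here for illustration; Mahout's exact formula may differ, and the function name and rating format are hypothetical.

```python
import math

def euclidean_similarity(prefs_a, prefs_b):
    """Similarity 1 / (1 + d) over co-rated items; None if there is no overlap."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None
    # Euclidean distance restricted to the items both users rated
    d = math.sqrt(sum((prefs_a[i] - prefs_b[i]) ** 2 for i in common))
    return 1.0 / (1.0 + d)

u1 = {"x": 1.0, "y": 2.0}
u2 = {"x": 4.0, "y": 6.0}
print(euclidean_similarity(u1, u2))  # d = 5, so the similarity is 1/6
```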

Cosine similarity

Class names: PearsonCorrelationSimilarity and UncenteredCosineSimilarity

Principle: the cosine of the angle formed at a reference point (the origin) by two points in multi-dimensional space.

Range: [-1, 1]. The larger the value, the smaller the angle, the closer the two points, and the greater the similarity.

Description: mathematically, if the attribute values of the two items are centered (the mean is subtracted), the cosine similarity and the Pearson similarity are identical. Mahout's implementation performs this centering, so Pearson similarity is also the cosine similarity of centered data. In addition, newer versions of Mahout provide the UncenteredCosineSimilarity class, which calculates the cosine similarity of uncentered data.
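
The relationship between the centered and uncentered variants can be checked numerically. The sketch below is plain Python with hypothetical helper names (`cosine`, `centered`); it is not Mahout code.

```python
import math

def cosine(xs, ys):
    """Cosine of the angle between two equal-length vectors; None for a zero vector."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm = math.sqrt(sum(x * x for x in xs)) * math.sqrt(sum(y * y for y in ys))
    return dot / norm if norm else None

def centered(xs):
    """Subtract the mean from every component."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 4.0]
uncentered_cos = cosine(xs, ys)                    # cosine of the raw vectors
centered_cos = cosine(centered(xs), centered(ys))  # equals the Pearson correlation
print(uncentered_cos, centered_cos)
```

On this data the raw vectors look very similar (cosine near 0.94) while the centered value is about 0.5, which shows why the choice between the two classes matters.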

Spearman rank correlation coefficient

Class name: SpearmanCorrelationSimilarity

Principle: the Spearman rank correlation coefficient is usually regarded as the Pearson linear correlation coefficient between the rank-ordered variables.

Range: [-1.0, 1.0]; 1.0 when the two rankings agree completely, -1.0 when they are exactly reversed.

Description: the calculation is very slow because it requires a large amount of sorting. For the datasets found in recommendation systems, the Spearman rank correlation coefficient is not a suitable similarity measure.
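
The "Pearson on ranks" idea, and the sorting cost it implies, can be sketched as follows. This is a simplified illustration that assumes no tied values; the helper names are hypothetical and the code is not taken from Mahout.

```python
import math

def ranks(values):
    """Replace each value by its 1-based rank (assumes no ties); needs a sort."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = float(rank)
    return result

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

a = [1.0, 2.5, 4.0, 5.0]
b = [10.0, 20.0, 300.0, 4000.0]   # same ordering as a, but far from linear
print(spearman(a, b), spearman(a, list(reversed(b))))
```

Because both users must be re-ranked for every pair compared, the sort dominates the running time, which matches the performance warning above.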

Manhattan distance

Class name: CityBlockSimilarity

Principle: an implementation of Manhattan distance; like Euclidean distance, it measures the spatial distance between multi-dimensional data points.

Range: [0, 1], consistent with Euclidean distance similarity: the smaller the distance, the greater the similarity.

Description: requires less computation and performs better than Euclidean distance.
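
A minimal sketch of a Manhattan-distance similarity, again assuming the 1 / (1 + d) mapping for illustration (Mahout's CityBlockSimilarity may compute the distance differently). The function name and rating format are hypothetical.

```python
def city_block_similarity(prefs_a, prefs_b):
    """Similarity 1 / (1 + d) with d the Manhattan distance over co-rated items."""
    common = set(prefs_a) & set(prefs_b)
    if not common:
        return None
    # Sum of absolute differences: no squaring and no square root needed
    d = sum(abs(prefs_a[i] - prefs_b[i]) for i in common)
    return 1.0 / (1.0 + d)

p = {"x": 1.0, "y": 2.0}
q = {"x": 4.0, "y": 6.0}
print(city_block_similarity(p, q))  # d = 3 + 4 = 7, so the similarity is 0.125
```

The absence of squares and square roots is where the lower computational cost relative to Euclidean distance comes from.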

Tanimoto coefficient

Class name: TanimotoCoefficientSimilarity

Principle: also known as the generalized Jaccard coefficient, it is an extension of the Jaccard coefficient. For vectors x and y the equation is T(x, y) = (x · y) / (|x|^2 + |y|^2 - x · y); for two sets A and B of preferred items this reduces to |A ∩ B| / |A ∪ B|.

Range: [0, 1]; it is 1 when there is complete overlap and 0 when there is no overlap. The closer it is to 1, the more similar.

Description: handles preference data without rating values.
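
For preference data without ratings, each user can be represented as a set of item IDs, and the coefficient becomes the size of the intersection divided by the size of the union. The sketch below assumes that set representation; the function name is hypothetical.

```python
def tanimoto_similarity(items_a, items_b):
    """|A intersect B| / |A union B| for two collections of item IDs; None if both are empty."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    if union == 0:
        return None
    return len(a & b) / union

liked_by_u1 = {"a", "b", "c"}
liked_by_u2 = {"b", "c", "d"}
print(tanimoto_similarity(liked_by_u1, liked_by_u2))  # 2 shared / 4 total = 0.5
```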

Log-likelihood similarity

Class name: LogLikelihoodSimilarity

Principle: based on the number of items the two users have in common (overlap), the number only one of them has (non-overlap), and the number neither has.

Range: see the paper "Accurate Methods for the Statistics of Surprise and Coincidence", which can be found on Baidu Library.

Note: handles preference data without rating values more intelligently than the Tanimoto coefficient.
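
The three kinds of counts named in the principle form a 2x2 contingency table, from which a log-likelihood ratio (the G statistic) can be computed. The sketch below is a plain-Python illustration of that statistic; the final mapping 1 - 1 / (1 + LLR) into [0, 1) is an assumption made for the example, and all function and parameter names are hypothetical.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G statistic of a 2x2 table: 2 * sum(observed * ln(observed / expected))."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return max(0.0, 2.0 * g)  # clamp tiny negative rounding errors

def log_likelihood_similarity(items_a, items_b, num_items):
    """Build the table from two item sets, then squash the ratio into [0, 1)."""
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)                 # items both users have (overlap)
    k12 = len(a - b)                 # items only the first user has
    k21 = len(b - a)                 # items only the second user has
    k22 = num_items - len(a | b)     # items neither user has
    llr = log_likelihood_ratio(k11, k12, k21, k22)
    return 1.0 - 1.0 / (1.0 + llr)

print(log_likelihood_ratio(10, 10, 10, 10))  # independent table
print(log_likelihood_ratio(10, 0, 0, 10))    # perfectly associated table
```

Unlike the Tanimoto coefficient, the "neither has" count enters the calculation, so an overlap that could easily arise by chance among very popular items scores lower than the same overlap among rare items.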
