2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article walks through an example of principal component analysis in the R language, first outlining the theory and then applying it to a data set.
In data analysis we often encounter high-dimensional data sets, and reducing the dimension simplifies both the computation and the model. Principal component analysis (PCA) is a classical method of dimensionality reduction. It requires the analyzed variables to be correlated; if they are not, PCA loses its purpose, because there is no shared variation to compress. In the comprehensive evaluation of students' performance, of regional development, of athletes and so on, there are often many evaluation indicators, so the dimension of the data needs to be reduced: a few new variables replace the original ones while retaining as much of the original information as possible, a comprehensive score is calculated from them, and the evaluation result is based on that score.
Now suppose the raw data consists of n records (rows), each measured on p indicators (columns).
Principal component analysis steps:
Construct the original data matrix.
Standardize the data to remove the effect of units and scale.
Compute the covariance matrix (for standardized data this is the correlation coefficient matrix).
Find the eigenvalues and eigenvectors.
Determine the number of principal components from the variance contribution rate and the cumulative variance contribution rate.
Compute the comprehensive score and explain its practical significance.
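The steps above can be sketched in a few lines of R. This is a minimal sketch, not the article's own code: the data is simulated, `pca_by_steps()` is a hypothetical helper name, and the 85% cumulative contribution threshold is an assumed convention.

```r
# Simulated data (hypothetical): 30 records, 4 indicators, three of them correlated.
set.seed(1)
base <- rnorm(30)
X <- cbind(a = base + rnorm(30, sd = 0.5),
           b = base + rnorm(30, sd = 0.5),
           c = -base + rnorm(30, sd = 0.5),
           d = rnorm(30))

# pca_by_steps() is a hypothetical helper, not an existing R function.
pca_by_steps <- function(X, threshold = 0.85) {
  X <- as.matrix(X)                      # step 1: raw data matrix
  Z <- scale(X)                          # step 2: standardize each column
  R <- cov(Z)                            # step 3: covariance of standardized data
  e <- eigen(R, symmetric = TRUE)        # step 4: eigenvalues and eigenvectors
  contrib <- e$values / sum(e$values)    # step 5: variance contribution rates
  k <- which(cumsum(contrib) >= threshold)[1]
  scores <- Z %*% e$vectors[, 1:k, drop = FALSE]
  # step 6: comprehensive score, weighting each component by its contribution rate
  composite <- as.vector(scores %*% contrib[1:k])
  list(values = e$values, contrib = contrib, k = k,
       scores = scores, composite = composite)
}

res <- pca_by_steps(X)
round(res$contrib, 3)   # contribution rate of each component
res$k                   # number of components kept
head(res$composite)     # comprehensive score of the first records
```

The sign of each eigenvector is arbitrary, so the loadings should be inspected before a high comprehensive score is read as "good" or "bad".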
The following briefly explains the theory behind these steps and the meaning of the variables involved.
Because the data has p columns, there are at most p principal components; we start by assuming p of them and then keep only those with large eigenvalues. Each principal component is assumed to be a linear combination of the p indicators (columns).
The coefficient vectors A = (A_1, A_2, ..., A_p) that satisfy this condition turn out to be the eigenvectors of the covariance matrix of the p variables.
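The relationship between the principal components and the indicators can be written out explicitly. The formula below assumes the usual linear-combination form of PCA; the symbols $F_i$ (the i-th principal component) and $a_{ji}$ (the j-th entry of the coefficient vector $A_i$) are notation chosen here, and $X_j$ is the j-th standardized indicator.

```latex
\begin{aligned}
F_1 &= a_{11} X_1 + a_{21} X_2 + \cdots + a_{p1} X_p \\
F_2 &= a_{12} X_1 + a_{22} X_2 + \cdots + a_{p2} X_p \\
    &\;\;\vdots \\
F_p &= a_{1p} X_1 + a_{2p} X_2 + \cdots + a_{pp} X_p
\end{aligned}
\qquad
A_i = (a_{1i}, a_{2i}, \ldots, a_{pi})^{\mathsf{T}}
```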
For a data set with n records and p indicators, the data matrix is X = (x_ij) of size n x p. Standardizing the matrix means standardizing each of its columns separately: for a given column, subtract the column mean from every value and divide the result by the column standard deviation.
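As a small illustration, column-wise standardization, z_ij = (x_ij - mean of column j) / (standard deviation of column j), can be done by hand and compared with R's `scale()`. The matrix below is made up for the example.

```r
# Small made-up matrix: 5 records (rows), 2 indicators (columns).
X <- matrix(c(2, 4, 6, 8, 10,
              1, 3, 2, 5, 4), ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))

# Manual standardization: subtract the column mean, divide by the column sd.
Z_manual <- sweep(X, 2, colMeans(X), "-")
Z_manual <- sweep(Z_manual, 2, apply(X, 2, sd), "/")

# scale() does the same with its defaults (center = TRUE, scale = TRUE).
Z_scale <- scale(X)

all.equal(Z_manual, Z_scale, check.attributes = FALSE)
colMeans(Z_scale)       # numerically zero for each column
apply(Z_scale, 2, sd)   # 1 for each column
```

Both `sd()` and `scale()` use the sample standard deviation (divisor n - 1), which is why the two results agree.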
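For the variables in column i (X_i) and column j (X_j), the covariance and the correlation coefficient are computed for every pair of columns; collecting them over all pairs gives the covariance matrix and the correlation coefficient matrix.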
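In standard notation, and assuming the sample versions with divisor n - 1 (the convention used by R's `cov()` and `cor()`), the two quantities for columns i and j are:

```latex
\operatorname{cov}(X_i, X_j)
  = \frac{1}{n-1} \sum_{k=1}^{n} (x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j),
\qquad
r_{ij} = \frac{\operatorname{cov}(X_i, X_j)}{s_i \, s_j},
\qquad
s_i = \sqrt{\operatorname{cov}(X_i, X_i)}
```

Here $\bar{x}_i$ is the mean of column i and $s_i$ its standard deviation.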
Once the data has been standardized, the covariance matrix and the correlation coefficient matrix are exactly the same: every standardized column has variance 1, and the only difference between a covariance and a correlation coefficient is the division by the two standard deviations. Both matrices are of size p x p.
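This equality is easy to check numerically. The sketch below uses simulated data, with one column deliberately put on a much larger scale.

```r
set.seed(42)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)
X[, 2] <- X[, 1] + 0.5 * X[, 2]   # make two columns correlated
X[, 3] <- 100 * X[, 3]            # put one column on a much larger scale

Z <- scale(X)                     # standardized data

# Before standardization the two matrices differ ...
isTRUE(all.equal(cov(X), cor(X)))   # FALSE
# ... after standardization the covariance matrix equals the correlation matrix.
isTRUE(all.equal(cov(Z), cor(X)))   # TRUE
dim(cov(Z))                         # 3 3, i.e. p x p
```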
We need the eigenvalues and eigenvectors of this matrix, which we call R (the eigenvectors are the column vectors A_1, A_2, ..., A_p mentioned above). They are found exactly as in linear algebra; in the R language, all you need to do is call the function eigen(). Each eigenvector corresponds to one eigenvalue, and the size of the eigenvalue measures the importance of that component. The eigenvector with the largest eigenvalue gives the first principal component, and components whose contribution rate (eigenvalue / sum of eigenvalues) is small can be dropped. Then, using the retained eigenvectors, the score of each data record (row) on each principal component is calculated, and the final comprehensive score is the sum of these scores weighted by eigenvalue / sum of eigenvalues.
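A short sketch of what `eigen()` returns, using a small made-up correlation matrix (the numbers are illustrative only):

```r
# A made-up 3 x 3 correlation matrix.
R <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.2,
              0.3, 0.2, 1.0), nrow = 3, byrow = TRUE)

e <- eigen(R, symmetric = TRUE)
e$values     # eigenvalues, returned in decreasing order
e$vectors    # one eigenvector per column, matching e$values

# Each column v of e$vectors satisfies R v = lambda v.
v1 <- e$vectors[, 1]
all.equal(as.vector(R %*% v1), e$values[1] * v1)

# For a correlation matrix the eigenvalues sum to p (the trace of R).
sum(e$values)

# Variance contribution rate and cumulative contribution rate.
contrib <- e$values / sum(e$values)
cumsum(contrib)
```

Because the eigenvalues come back sorted, the first column of `e$vectors` holds the coefficients of the first principal component.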
Detailed explanation of an example of principal component analysis
The example below uses swiss, a data set built into R, which contains 6 indicators for each of 47 French-speaking provinces of Switzerland.
Data preprocessing and data exploration
head(swiss)
# check whether the data is suitable for principal component analysis
cor(swiss)
Several of the variables are fairly strongly correlated, which indicates that this data is suitable for principal component analysis.
# standardize the data: (value - mean) / standard deviation
sc.swiss <- scale(swiss)
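One way the remaining steps could be carried out on swiss is sketched below. It uses `prcomp()` from base R, which the article itself does not use, and the 85% cumulative contribution threshold is an assumption made for the example.

```r
# Standardized data: (value - mean) / standard deviation.
sc.swiss <- scale(swiss)

# prcomp() with scale. = TRUE standardizes the columns internally.
pca <- prcomp(swiss, scale. = TRUE)

# Its squared standard deviations are the eigenvalues of cor(swiss).
eig <- pca$sdev^2
all.equal(eig, eigen(cor(swiss), symmetric = TRUE)$values)

# Variance contribution rate and cumulative contribution rate.
contrib <- eig / sum(eig)
round(cumsum(contrib), 3)

# Number of components needed to reach the assumed 85% threshold.
k <- which(cumsum(contrib) >= 0.85)[1]

# Scores of each province on the first k components, and the
# comprehensive score weighted by contribution rate.
scores <- sc.swiss %*% pca$rotation[, 1:k, drop = FALSE]
composite <- as.vector(scores %*% contrib[1:k])
names(composite) <- rownames(swiss)
head(sort(composite, decreasing = TRUE))
```

The loadings in `pca$rotation` show which indicators drive each component. Since the sign of a component is arbitrary, they should be checked before the ranking is interpreted.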