
The Operation Process of the PCA Method


This article shares the operation process of the PCA method. The editor found it quite practical and offers it here as a reference; let's take a look together.

1 Introduction

When we carry out data analysis, we often face one of two dilemmas. The first is that the original data has too few feature attributes; as the saying goes, "even a clever housewife cannot cook without rice", and it is hard to dig out the underlying patterns. For this situation, the only remedy is to put more effort into data collection. The second dilemma is the opposite: there are too many feature attributes. This is a happy kind of annoyance, because many feature attributes mean a large amount of information and great value to be mined, but they may also cause a sharp increase in the risk of overfitting and in the amount of computation. For this problem, the best approach is to reduce the dimensionality of the data in the preprocessing stage.

When it comes to dimensionality reduction, principal component analysis (Principal Component Analysis, PCA) naturally comes to mind, because it is the leading and most widely used of the many dimensionality reduction methods. PCA is an unsupervised learning method. Its main idea is that linear correlation between the feature attributes of the data leads to information redundancy; through an orthogonal transformation, the linearly correlated features are represented by a smaller number of linearly independent ones, thereby achieving dimensionality reduction.

In the remainder of this article, the idea and operation process of the PCA method are introduced step by step, from the intuitive to the more precise.

2 The principle of the algorithm

2.1 The maximum projection variance method

To make the description easier, let us first take a data set on a two-dimensional plane as an example. As shown in the figure below, the data are distributed obliquely upward at roughly 45 degrees, from the lower left to the upper right. Now we want to reduce the dimensionality of this data set; since the data are two-dimensional, they can only be reduced to one dimension, so we just need to find a suitable axis onto which to project the data. The simplest approach is to project the data directly onto one of the two existing axes, as shown in panels (a) and (b). This is equivalent to simply discarding the other feature dimension, which means the information on that dimension is lost completely, and that is usually not acceptable. Although dimensionality reduction inevitably loses some information, we still want to preserve as much of the original information as possible. Since projecting onto an existing coordinate axis is not advisable, we construct a new coordinate system. As shown in panel (c), we construct a y-axis running obliquely upward at 45 degrees from the lower left to the upper right. Intuitively, projecting the data onto this y-axis feels more appropriate than projecting directly onto the x1 or x2 axis, because the y-axis best "fits" the data distribution and the projections of the data are most spread out along it; in other words, the variance of the projected data is largest on the y-axis. This is the maximum projection variance method: by this criterion, the variance of the data in the projected space is maximized, so the differences between data points are preserved as much as possible, and more of the original information is retained.

[Figure 1: projections of the two-dimensional data onto the x1 axis (a), the x2 axis (b), and a new 45-degree y-axis (c)]
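To make the intuition concrete, here is a minimal numerical sketch (the 45-degree data and all values are made up for illustration) that compares the variance of the projections onto the x1 axis, the x2 axis, and the 45-degree direction:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up two-dimensional data stretched along the 45-degree direction.
t = rng.normal(0.0, 3.0, size=200)        # large spread along the diagonal
noise = rng.normal(0.0, 0.5, size=200)    # small spread across it
x1 = (t - noise) / np.sqrt(2)
x2 = (t + noise) / np.sqrt(2)
X = np.column_stack([x1, x2])

# Variance of the scalar projections onto each candidate axis (unit direction vectors).
for name, d in [("x1 axis", np.array([1.0, 0.0])),
                ("x2 axis", np.array([0.0, 1.0])),
                ("45-degree axis", np.array([1.0, 1.0]) / np.sqrt(2))]:
    proj = X @ d                          # projection of every sample onto direction d
    print(name, proj.var(ddof=1))

# The 45-degree direction gives the largest projection variance,
# so it preserves the most information about the data.
```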

From a mathematical point of view, let us analyze why the new coordinate system with the maximum variance is the best.

As shown in figure 2 below, suppose that points A, B and C are sample points from the data set in figure 1 after zero-mean centering. Points A', B' and C' are the projections of A, B and C onto the rotated x1' axis, and O is the coordinate origin. |AA'| is the distance from the original point A to its projection A' on the x1' axis, also known as the projection error. Obviously, the smaller the projection error, the more similar A and A' are, and the more information the projected data retains, so the smaller the projection error the better. Equivalently, the sum of squared projection errors |AA'|² + |BB'|² + |CC'|² should be as small as possible. Because the lengths of the hypotenuses |OA|, |OB| and |OC| are fixed, the Pythagorean theorem tells us that the total |AA'|² + |BB'|² + |CC'|² + |OA'|² + |OB'|² + |OC'|² stays constant, which means that the smaller the projection error, the larger |OA'|² + |OB'|² + |OC'|². But |OA'|² + |OB'|² + |OC'|² is (up to a constant factor) the sum of the sample variances along the new axis, so the coordinate system with the maximum variance is the best one.

[Figure 2: sample points A, B, C and their projections A', B', C' onto the rotated x1' axis]
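Written out explicitly, the argument above is just the Pythagorean theorem applied to each sample point and summed (a short restatement in the same notation):

```latex
% For each centered sample point P in {A, B, C}, the right triangle O P P' gives
% |OP|^2 = |OP'|^2 + |PP'|^2. Summing over the three points:
\underbrace{|OA|^2 + |OB|^2 + |OC|^2}_{\text{fixed by the data}}
= \underbrace{|OA'|^2 + |OB'|^2 + |OC'|^2}_{\propto\ \text{variance of the projections}}
+ \underbrace{|AA'|^2 + |BB'|^2 + |CC'|^2}_{\text{total squared projection error}}
% The left-hand side does not depend on the chosen axis, so minimizing the
% projection error is equivalent to maximizing the projected variance.
```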

Now we know how to determine the best direction for the projection, but two problems remain unsolved:

(1) The description above uses two-dimensional data as an example. For two-dimensional data, it is enough to find a single direction, i.e. one axis, to project onto. But when higher-dimensional data are reduced, they usually cannot be reduced all the way to one dimension; several projection axes must be found. For the first axis, maximizing the projection variance works fine. However, if we again simply maximize the variance when looking for the second axis, the second axis will essentially coincide with the first, so the projected data would be highly correlated and the second dimension would be meaningless. So, for high-dimensional data, how do we determine multiple projection axes?

(2) Once the new coordinate system is found, how do we map the original data into it?

With these two questions in mind, let us continue the analysis.

2.2 The covariance matrix

The dimensionality reduction of the PCA algorithm is achieved mainly by reducing redundant information in the original data, where redundancy refers to correlation between different feature attributes in the data set. For example, working years, education and salary are indeed three different feature attributes, but working years and education both influence salary: in most cases, the longer the working years and the higher the education, the higher the salary. So working years, education and salary are correlated, and the goal of the PCA algorithm is to eliminate such correlations in order to reduce the dimensionality.

In mathematics, correlation is usually described by the covariance. Suppose the data set X contains n samples and m feature attributes, and let x_i and x_j be two different feature attributes of X. Then the covariance between x_i and x_j is:

$$\mathrm{Cov}(x_i, x_j) = \frac{1}{n-1} \sum_{k=1}^{n} \left(x_{ik} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_j\right)$$

In the formula, x_ik and x_jk are the values of the k-th sample on the feature attributes x_i and x_j, and \bar{x}_i and \bar{x}_j are the means of x_i and x_j, respectively.

For standardized feature attributes, the covariance lies in the interval [-1, 1] (in that case it coincides with the correlation coefficient). The larger the absolute value of the covariance, the stronger the correlation between the two feature attributes: a covariance less than 0 means the two feature attributes are negatively correlated, a covariance greater than 0 means they are positively correlated, and a covariance of 0 means they are uncorrelated, which in linear algebra terms means the (centered) feature vectors are orthogonal.

In particular, Cov(x_i, x_i) is simply the variance of the feature attribute x_i.
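As a quick check of the covariance formula, here is a small NumPy sketch; the two feature attributes and their values are hypothetical and chosen only for illustration:

```python
import numpy as np

# Two hypothetical feature attributes, 5 samples each (illustrative values only).
xi = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. working years
xj = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # e.g. salary, arbitrary units

n = len(xi)
# Covariance computed directly from the formula above (note the 1/(n-1) factor).
cov_manual = np.sum((xi - xi.mean()) * (xj - xj.mean())) / (n - 1)

# NumPy's built-in covariance; np.cov also uses the 1/(n-1) convention by default.
cov_numpy = np.cov(xi, xj)[0, 1]

print(cov_manual, cov_numpy)                 # the two values agree
# Cov(xi, xi) is just the variance of xi:
print(np.cov(xi, xj)[0, 0], xi.var(ddof=1))  # these also agree
```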

From the previous section, we know that the first projection direction is chosen by maximizing the variance. When choosing the subsequent projection directions, we cannot allow correlation between the reduced dimensions. Therefore, each subsequent direction must be the direction with the largest variance among those orthogonal to all previously selected projection directions, i.e. those whose covariance with them is 0. To summarize the dimensionality reduction process: if we need to reduce from m dimensions to k dimensions, we first select, among all possible directions, the one with the largest projection variance as the first dimension; then, among all directions orthogonal to the first, we select the one with the largest variance as the second dimension; and we repeat this step until k dimensions have been selected.

It can be seen that the whole dimensionality reduction process requires computing not only the variances but also the covariances between the feature attributes. Is there a way to unify the two? Yes: the covariance matrix. Each element of the covariance matrix is the covariance between two feature attributes: the element in row i and column j is the covariance between feature attributes x_i and x_j, and the elements on the diagonal are the variances of the individual feature attributes. The covariance matrix C of the data set X can be written as:

$$C = \begin{pmatrix} \mathrm{Cov}(x_1, x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_m) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Cov}(x_2, x_2) & \cdots & \mathrm{Cov}(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_m, x_1) & \mathrm{Cov}(x_m, x_2) & \cdots & \mathrm{Cov}(x_m, x_m) \end{pmatrix}$$
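To see this structure numerically, here is a sketch with a hypothetical data matrix of working years, education and salary (all values invented); np.cov builds exactly this matrix when rowvar=False, i.e. when samples are stored as rows:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data set: n = 100 samples, m = 3 correlated feature attributes.
n = 100
years = rng.uniform(0, 30, n)                       # working years
education = 0.3 * years + rng.normal(0, 2, n)       # loosely tied to years
salary = 2.0 * years + 1.5 * education + rng.normal(0, 5, n)
X = np.column_stack([years, education, salary])     # shape (n, m)

# Covariance matrix: entry (i, j) is Cov(x_i, x_j); the diagonal holds the variances.
C = np.cov(X, rowvar=False)                         # 1/(n-1) convention

print(C.shape)                                          # (3, 3)
print(np.allclose(C, C.T))                              # True: real symmetric matrix
print(np.allclose(np.diag(C), X.var(axis=0, ddof=1)))   # True: diagonal = variances
```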

If we look carefully at the covariance matrix, we can see that it is a real symmetric matrix, and real symmetric matrices happen to have some very useful properties:

(1) A real symmetric matrix can always be diagonalized, and the diagonal elements of the resulting diagonal matrix are its m eigenvalues.

(2) The eigenvalues of a real symmetric matrix are real numbers, and its eigenvectors are real vectors.

(3) The eigenvectors corresponding to different eigenvalues of a real symmetric matrix are orthogonal.

Please note that these three properties are very important; even if you do not fully understand why they hold, keep them in mind, because the rest of the discussion rests on them. The eigenvectors are exactly the basis vectors of the new coordinate axes we are looking for, and each eigenvalue equals the variance of the data along the corresponding axis after projection. So once we have the covariance matrix, the next thing to do is to diagonalize it. Diagonalization can be understood as rotating the original coordinate axes, i.e. as the process of finding the best projection axes. Through diagonalization, all elements except the diagonal ones become zero, which means the covariances become zero and the feature attributes become uncorrelated. After the covariance matrix has been diagonalized, the diagonal elements are the eigenvalues, i.e. the variances along the projected coordinate axes. We select the eigenvector corresponding to the largest eigenvalue as the basis with which to transform the original data, and thereby obtain the projection of the original data onto the new coordinate axis.
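These three properties, and the claim that each eigenvalue equals the projected variance, can be verified numerically. Since the covariance matrix is real and symmetric, np.linalg.eigh is the appropriate routine; the data matrix below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 200 samples, 4 feature attributes, then zero-mean centered.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)

C = np.cov(X, rowvar=False)            # real symmetric covariance matrix

# eigh is made for symmetric matrices: real eigenvalues, orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues ascending, eigenvectors in columns

print(np.allclose(eigvecs.T @ eigvecs, np.eye(4)))        # True: eigenvectors orthonormal

# Each eigenvalue equals the variance of the data projected onto its eigenvector.
proj = X @ eigvecs                                        # data in the eigenvector basis
print(np.allclose(proj.var(axis=0, ddof=1), eigvals))     # True
# Off-diagonal covariances of the projected data are (numerically) zero:
print(np.allclose(np.cov(proj, rowvar=False), np.diag(eigvals)))  # True
```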

Let us roughly describe the principle of this coordinate transformation. In machine learning we like to represent data with vectors and matrices, because they have many good mathematical properties and are very convenient to compute with. As shown in figure 3 below, take a point from the data set in figure 1 and suppose its coordinates are (3, 1). We can represent it as an arrow that starts at the origin and ends at the point (3, 1); the arrow projects to 3 on the x1 axis and to 1 on the x2 axis. We can think of x1 and x2 as the projections of the vector onto the two axes, so the vector can be written as x1·(1, 0)^T + x2·(0, 1)^T, where (1, 0) and (0, 1) are a set of basis vectors of the black Cartesian coordinate system. A basis can be roughly understood as the foundation of the coordinate axes: only with a basis do coordinates have meaning, and by default we take (1, 0) and (0, 1), which are mutually orthogonal unit vectors, as the basis. If we rotate the black Cartesian coordinate system counterclockwise by 45 degrees, we get a new coordinate system whose basis vectors are (1/√2, 1/√2) and (−1/√2, 1/√2). We will not discuss here how this basis is obtained; in the PCA method, the eigenvectors obtained by diagonalizing the covariance matrix are the basis of the new coordinate system. Given the basis of the new coordinate system, how are coordinates in the original system converted into it? We simply take the inner product of each new basis vector with the original coordinates: the two results are the first and second coordinates in the new system, respectively. This process is called a change of basis, and it can be expressed as a matrix operation:

$$\begin{pmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{pmatrix} \begin{pmatrix} 3 \\ 1 \end{pmatrix} = \begin{pmatrix} \frac{4}{\sqrt{2}} \\ -\frac{2}{\sqrt{2}} \end{pmatrix} = \begin{pmatrix} 2\sqrt{2} \\ -\sqrt{2} \end{pmatrix}$$

So the coordinates of the point (3, 1) in the new coordinate system are (2√2, −√2). This method of changing basis also works in higher dimensions, because the multiplication of two matrices is essentially a linear transformation: it can be understood as transforming each column vector of the right matrix into the space in which the row vectors of the left matrix are the basis.

In general, if the rows of a matrix P are the new basis vectors and the columns of a matrix X are the original data points, then Y = PX gives the coordinates of the data in the new coordinate system.
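The worked example and the general form can be reproduced directly; in the sketch below, P holds the rotated basis vectors as rows (the extra sample points are made up for illustration):

```python
import numpy as np

# Basis vectors of the 45-degree rotated coordinate system, stacked as rows of P.
P = np.array([[ 1/np.sqrt(2), 1/np.sqrt(2)],
              [-1/np.sqrt(2), 1/np.sqrt(2)]])

v = np.array([3.0, 1.0])          # the point (3, 1) in the original coordinates

new_coords = P @ v                # inner product of each basis vector with v
print(new_coords)                 # [ 2.828..., -1.414...]  i.e. (2*sqrt(2), -sqrt(2))

# The same operation applied to several points at once: if the columns of X_cols
# are the sample points, then P @ X_cols gives their coordinates in the new basis.
X_cols = np.array([[3.0, 0.0, 1.0],
                   [1.0, 2.0, 1.0]])
print(P @ X_cols)
```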

To sum up: after diagonalization, the diagonal elements of the matrix are the eigenvalues, i.e. the projection variances along the axes that have been found. Each time, the largest remaining eigenvalue is taken and its corresponding eigenvector is computed; this eigenvector is the basis vector of the corresponding new axis, and the inner product of the original data with this basis gives the projection of the original data onto the new axis. Repeating this process k times completes the dimensionality reduction.

Putting all of the above together, principal component analysis can be summarized in the following five steps (a code sketch follows the list):

(1) Zero-mean the data; in many cases, to remove the influence of differing scales (units), it is best to standardize directly.

(2) Compute the covariance matrix.

(3) Diagonalize the covariance matrix to find its eigenvalues.

(4) Sort the eigenvalues from large to small, select the largest k of them, and take the corresponding k eigenvectors as row vectors to form the eigenvector matrix P.

(5) Use the k eigenvectors as the basis of the new coordinate system to transform the original data.
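Putting the five steps together, here is a minimal from-scratch sketch in NumPy; the function name, variable names and the random test data are my own, and in practice a library implementation such as scikit-learn's PCA would normally be used:

```python
import numpy as np

def pca(X, k):
    """Reduce the (n, m) data matrix X to k dimensions using the five steps above."""
    # (1) Zero-mean (optionally standardize to remove the effect of differing scales).
    X_centered = X - X.mean(axis=0)

    # (2) Covariance matrix of the centered data, shape (m, m).
    C = np.cov(X_centered, rowvar=False)

    # (3) Diagonalize the real symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(C)        # ascending eigenvalues, vectors in columns

    # (4) Sort eigenvalues from large to small and keep the top-k eigenvectors
    #     as row vectors, forming the matrix P.
    order = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, order].T                     # shape (k, m)

    # (5) Use the k eigenvectors as the new basis and transform the original data.
    return X_centered @ P.T                     # shape (n, k): data in the new coordinates

# Usage with made-up data: 100 samples, 5 features, reduced to 2 dimensions.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 5))   # rank-3 data in 5 dimensions
Z = pca(X, k=2)
print(Z.shape)                                  # (100, 2)
print(np.cov(Z, rowvar=False))                  # ~diagonal: the new features are uncorrelated
```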

3 Summary

The PCA algorithm is an unsupervised learning method: it operates only on the feature attributes of the data set itself, eliminating correlations in order to denoise and compress the data through dimensionality reduction. The main advantages of the PCA algorithm are:

(1) The amount of information is measured only by variance and is not affected by factors outside the data set.

(2) The principal components are orthogonal to each other, which eliminates the mutual influence between the components of the original data.

(3) The computation is simple and easy to implement.

The main disadvantages of the PCA algorithm are:

(1) The individual dimensions of the principal components no longer have a direct physical meaning, so they are less interpretable than the original sample features.

(2) Non-principal components with small variance may still contain important information about differences between samples, and discarding them during dimensionality reduction may affect subsequent data processing.

This concludes the sharing of the operation process of the PCA method. I hope the content above has been helpful and has taught you something new. If you found the article useful, feel free to share it so more people can see it.
