PCA Principle Analysis
PCA (Principal Component Analysis) is a commonly used data analysis method. Through a linear transformation, PCA converts the original data into a representation whose dimensions are linearly uncorrelated, which can be used to extract the principal feature components of the data; it is often used to reduce the dimensionality of high-dimensional data. There are many articles about PCA on the Internet, but most of them only describe the analysis procedure of PCA without explaining the principles behind it. The purpose of this article is to introduce the basic mathematical principles of PCA and help readers understand how PCA works.
Of course, I do not intend to write this as a purely mathematical article; rather, I want to describe the mathematical principles of PCA in an intuitive and easy-to-understand way, so the article will not include rigorous mathematical derivations. I hope readers will come away with a better understanding of how PCA works.
Vector representation and dimensionality reduction of data
In general, in data mining and machine learning, data is represented as vectors. For example, the traffic and transactions of a Taobao store for the whole year of 2012 can be regarded as a collection of records, in which the data for each day is a record in the following format:
(date, number of views, number of visitors, number of orders placed, number of transactions, transaction amount)
Where "date" is a record flag rather than a measure, and data mining is mostly concerned with measures, so if we ignore the date field, we get a set of records, each of which can be represented as a five-dimensional vector. One of them looks something like this:
Note that I use the transpose here, because it is customary to represent a record as a column vector (we will see why later), and this convention will be followed for the rest of the article. However, for convenience I will sometimes omit the transpose symbol; when we speak of a vector, we mean a column vector by default.
Of course, we can analyze and mine this set of five-dimensional vectors directly, but we know that the complexity of many machine learning algorithms is closely related to the dimensionality of the data, sometimes even exponentially. Five dimensions may not matter much, but in practice it is not uncommon to deal with thousands or even hundreds of thousands of dimensions, in which case the resource consumption of machine learning is unacceptable, so we must reduce the dimensionality of the data.
Dimensionality reduction certainly means the loss of information, but in view of the relevance of the actual data itself, we can find ways to reduce the dimensionality while reducing the loss of information as much as possible.
For example, suppose a student-record dataset has two columns, M and F, where the M column takes the value 1 for a male student and 0 for a female student, while the F column takes 1 for a female student and 0 for a male student. If we look through all the records, we find that for any record, whenever M is 1, F must be 0, and whenever M is 0, F must be 1. In this case, removing either M or F loses no information at all, because as long as one column is kept, the other can be completely recovered.
Of course, the above is an extreme case that may never occur in practice, but similar situations are still very common. For example, from the Taobao store data above, we know from experience that "pageviews" and "number of visitors" tend to be strongly correlated, and "number of orders" and "number of transactions" are also strongly correlated. Here we use the word "correlated" informally; it can be understood intuitively as "when the pageviews of the store are high (or low) on a given day, we should largely expect the number of visitors on that day to be high (or low) as well." We will give a rigorous mathematical definition of correlation in a later section.
This suggests that if we delete one of the metrics of pageviews or the number of visitors, we should expect not to lose much information. So we can delete one to reduce the complexity of the machine learning algorithm.
The above is a simple ideological description of dimensionality reduction, which can help to intuitively understand the motivation and feasibility of dimensionality reduction, but it does not have operational guiding significance. For example, which column do we delete with the least loss of information? Or is it not simply deleting a few columns at all, but transforming the original data into fewer columns while minimizing the loss of information? How exactly do you measure how much information is lost? How to determine the specific steps of dimensionality reduction based on the original data?
In order to answer the above questions, the problem of dimensionality reduction should be discussed mathematically and formally. PCA is a dimensionality reduction method with strict mathematical foundation and has been widely used. I won't describe PCA directly below, but let's "invent" PCA together by analyzing the problem step by step.
Vector representation and Base Transformation
Since the data we face is abstracted as a set of vectors, it is necessary to study some mathematical properties of vectors. These properties will become the theoretical basis for the subsequent derivation of PCA.
Inner product and projection
Let's first look at a vector operation we learned in high school: the inner product. The inner product of two vectors of the same dimension is defined as:

$(a_1, a_2, \cdots, a_n)^\mathsf{T} \cdot (b_1, b_2, \cdots, b_n)^\mathsf{T} = a_1 b_1 + a_2 b_2 + \cdots + a_n b_n$
The inner product maps two vectors to a real number. Its computation is very easy to carry out, but its meaning is not so obvious. Let's analyze the geometric meaning of the inner product. Assume A and B are two n-dimensional vectors. We know that an n-dimensional vector can be represented as a directed line segment starting from the origin of n-dimensional space. For simplicity, assume A and B are both two-dimensional vectors; then A and B can be drawn as two directed line segments from the origin in the two-dimensional plane.
Now let's drop a perpendicular from point A to the line on which B lies. We know that the foot of this perpendicular is the projection of A onto B. If the angle between A and B is $a$, then the vector length of the projection is $|A|\cos(a)$, where $|A| = \sqrt{x_1^2 + x_2^2}$ is the modulus of A, that is, the scalar length of segment A.
Note that here we deliberately distinguish the vector length from the scalar length: the scalar length is always greater than or equal to 0, and its value is the length of the line segment; the vector length may be negative, its absolute value is the length of the segment, and its sign depends on whether its direction is the same as or opposite to the standard direction.
So far we still cannot see what the inner product has to do with any of this. However, if we rewrite the inner product in another familiar form:

$A \cdot B = |A||B|\cos(a)$
Now things start to look better: the inner product of A and B equals the projection length of A onto B multiplied by the modulus of B. Going one step further, if we assume that the modulus of B is 1, that is, $|B| = 1$, then this becomes:

$A \cdot B = |A|\cos(a)$
That is, if the modulus of vector B is 1, then the inner product of A and B equals the vector length of the projection of A onto the line on which B lies. This is one geometric interpretation of the inner product, and it is also the first important conclusion we obtain. We will use this conclusion repeatedly in the derivations that follow.
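As a quick illustration of this conclusion, here is a minimal numpy sketch (the vectors are arbitrary values chosen for illustration):

```python
import numpy as np

A = np.array([3.0, 2.0])            # an arbitrary 2-D vector
B = np.array([1.0, 1.0])            # another arbitrary vector
B_unit = B / np.linalg.norm(B)      # rescale B so that its modulus is 1

# Inner product with a unit vector = signed length of the projection of A onto B
projection_length = np.dot(A, B_unit)

# Cross-check against |A| * cos(a), where a is the angle between A and B
cos_a = np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))
assert np.isclose(projection_length, np.linalg.norm(A) * cos_a)
print(projection_length)            # 5 / sqrt(2), about 3.5355
```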
Base
Let's continue with vectors in two-dimensional space. As mentioned above, a two-dimensional vector corresponds to a directed line segment starting from the origin of a two-dimensional Cartesian coordinate system. Take the vector with end point $(3, 2)$ as an example.
In algebraic terms, we often use the coordinates of the end point of the segment to represent the vector, so this vector is written as $(3, 2)$, a notation we are all very familiar with.
However, we often overlook the fact that the pair $(3, 2)$ by itself cannot accurately represent a vector. Looking more carefully, the 3 here really means that the projection of the vector onto the x-axis is 3, and the 2 means that its projection onto the y-axis is 2. In other words, we have implicitly introduced a convention: we take as the standard the vectors of length 1 pointing in the positive directions of the x-axis and the y-axis. So the vector $(3, 2)$ actually means that its projection on the x-axis is 3 and its projection on the y-axis is 2. Note that a projection is a vector quantity, so it can be negative.
More formally, the vector $(x, y)$ actually represents the linear combination:

$x(1, 0)^\mathsf{T} + y(0, 1)^\mathsf{T}$

It is not hard to prove that every two-dimensional vector can be expressed as such a linear combination. Here $(1, 0)$ and $(0, 1)$ are called a set of basis vectors (a basis) of two-dimensional space.
Therefore, to describe a vector accurately, we must first determine a set of basis vectors, and then give the projection values onto the lines on which the basis vectors lie. It's just that we usually omit the first step and default to $(1, 0)$ and $(0, 1)$ as the basis.
Of course, it is convenient to choose $(1, 0)$ and $(0, 1)$ as the default basis, because they are the unit vectors in the positive directions of the x-axis and the y-axis respectively, so point coordinates and vectors correspond one to one on the two-dimensional plane. But in fact, any two linearly independent two-dimensional vectors can form a basis; in the two-dimensional plane, "linearly independent" can be intuitively understood as two vectors that do not lie on the same line.
For example, $(1, 1)$ and $(-1, 1)$ can also form a basis. Generally speaking, we want the modulus of each basis vector to be 1, because we can see from the meaning of the inner product that if the modulus of a basis vector is 1, then the coordinate of a vector on that basis vector is obtained simply by taking the inner product of the two. In fact, for any vector we can always find a vector of modulus 1 in the same direction, just by dividing each component by the modulus. For example, the basis above becomes:

$\left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right) \text{ and } \left(-\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)$
Now, if we want the coordinates of $(3, 2)$ on the new basis, that is, its projection values in those two directions, then by the geometric meaning of the inner product we only need to compute the inner product of $(3, 2)$ with each of the two basis vectors separately; the new coordinates turn out to be $\left(\tfrac{5}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}\right)$.
One more thing worth noting: the basis vectors in our example are orthogonal (their inner product is 0, or intuitively, they are perpendicular to each other), but the only requirement for a set of vectors to form a basis is that they be linearly independent; a non-orthogonal basis is also possible. However, because orthogonal bases have nicer properties, the bases used in practice are usually orthogonal.
Matrix representation of basis transformation
Let's find a simpler way to represent a change of basis. Take the example above: to transform $(3, 2)$ into its coordinates on the new basis, we take the inner product of $(3, 2)$ with the first basis vector as the first new coordinate, and the inner product of $(3, 2)$ with the second basis vector as the second new coordinate. In fact, this transformation can be written concisely as a matrix multiplication:

$\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 5/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$
Very elegant! The two rows of the matrix on the left are exactly the two basis vectors; multiplying it by the original vector gives exactly the coordinates on the new basis. We can generalize this a little: if we have m two-dimensional vectors, we simply arrange them by column into a two-row, m-column matrix, and then multiply the "basis matrix" by this matrix to obtain the coordinates of all of these vectors under the new basis. For example, to transform the points $(1, 1)$, $(2, 2)$, and $(3, 3)$ to that basis, we can write:

$\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \end{pmatrix} = \begin{pmatrix} 2/\sqrt{2} & 4/\sqrt{2} & 6/\sqrt{2} \\ 0 & 0 & 0 \end{pmatrix}$
So the base transformation of a set of vectors is cleanly represented as the multiplication of matrices.
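The two basis changes above can be reproduced in a few lines of numpy; a minimal sketch using the same numbers as in the text:

```python
import numpy as np

s = 1 / np.sqrt(2)
# Rows of P are the new basis vectors (1/sqrt(2), 1/sqrt(2)) and (-1/sqrt(2), 1/sqrt(2))
P = np.array([[ s, s],
              [-s, s]])

v = np.array([[3.0], [2.0]])        # the vector (3, 2) as a column
print(P @ v)                        # [[5/sqrt(2)], [-1/sqrt(2)]]

# The vectors (1,1), (2,2), (3,3) arranged as columns of a matrix
B = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 3.0]])
print(P @ B)                        # first row: 2/sqrt(2), 4/sqrt(2), 6/sqrt(2); second row: all zeros

# Keeping only the first row of P projects the data onto a single direction,
# i.e. a dimensionality-reducing transformation with R < N
print(P[:1] @ B)
```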
More generally, if we have M N-dimensional vectors and want to transform them into a new space represented by R N-dimensional basis vectors, then we first arrange the R basis vectors by row into a matrix A, and arrange the vectors by column into a matrix B; the product AB of the two matrices is the transformed result, where the m-th column of AB is the result of transforming the m-th column of B.
The mathematical expression is as follows:

$\begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_R \end{pmatrix} \begin{pmatrix} a_1 & a_2 & \cdots & a_M \end{pmatrix} = \begin{pmatrix} p_1 a_1 & p_1 a_2 & \cdots & p_1 a_M \\ p_2 a_1 & p_2 a_2 & \cdots & p_2 a_M \\ \vdots & \vdots & \ddots & \vdots \\ p_R a_1 & p_R a_2 & \cdots & p_R a_M \end{pmatrix}$

where each $p_i$ is a row vector representing the i-th basis vector, and each $a_j$ is a column vector representing the j-th original data record.
It is particularly important to note that here R can be less than N, and R determines the dimension of the transformed data. In other words, we can transform an N-dimensional data into a lower dimensional space, and the transformed dimension depends on the number of bases. Therefore, the representation of matrix multiplication can also represent the dimensionality reduction transformation.
Finally, the above analysis also gives matrix multiplication a physical interpretation: multiplying two matrices means transforming every column vector of the right-hand matrix into the space represented by the row vectors of the left-hand matrix. More abstractly, a matrix can represent a linear transformation. Many students find the rule for multiplying matrices strange when they first learn linear algebra, but once the physical meaning of matrix multiplication is understood, its rationale becomes clear at a glance.
Covariance matrix and optimization objective
We discussed above that different choices of basis give different representations of the same set of data, and if the number of basis vectors is smaller than the dimension of the vectors themselves, we achieve dimensionality reduction. But we have not yet answered the most critical question: how do we choose the best basis? In other words, if we have a set of N-dimensional vectors and want to reduce them to K dimensions (K less than N), how should we choose K basis vectors so as to retain as much of the original information as possible?
A fully rigorous mathematical treatment of this problem is quite involved; here we discuss it in an informal, intuitive way.
To avoid an overly abstract discussion, we again start from a concrete example. Suppose our data consists of five records, arranged in matrix form as:

$\begin{pmatrix} 1 & 1 & 2 & 4 & 2 \\ 1 & 3 & 3 & 4 & 4 \end{pmatrix}$
Each column is a data record, and each row is a field. To facilitate later processing, we first subtract from every value in each field the mean of that field; the result is that every field has mean 0 (the reason for and benefit of doing this will become clear later).
Looking at the data above, the mean of the first field is 2 and the mean of the second field is 3, so after the transformation we get:

$\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}$
We can plot these five data points in a plane Cartesian coordinate system to see how they are distributed.
Now the question is: if we have to use one-dimensional representation of the data, and want to retain the original information as much as possible, how do you choose?
Through the discussion of the base transformation in the previous section, we know that this problem is actually to choose a direction in the two-dimensional plane, project all the data onto the straight line in that direction, and use the projection value to represent the original record. This is a practical problem of reducing two dimensions to one dimension.
So how do you choose this direction (or base) to retain as much raw information as possible? One intuitive view is that you want the projected values to be as scattered as possible.
Taking the example above: if we project onto the x-axis, the two leftmost points overlap, and the middle two points also overlap, so four distinct two-dimensional points leave only two distinct values after projection, which is a serious loss of information. Similarly, if we project onto the y-axis, the topmost two points and the two points lying on the x-axis also overlap. So neither the x-axis nor the y-axis is the best choice of projection. By visual inspection, if we project onto the oblique line passing through the first and third quadrants, the five points can still be distinguished after projection.
Next, we express the problem mathematically.
Variance
As mentioned above, we want the projected values to be as spread out as possible, and the degree of spread can be measured mathematically by the variance. The variance of a field can be seen as the mean of the squared differences between each element and the field mean, that is:

$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}\left(a_i - \mu\right)^2$
Since we have already reduced the mean of each field to 0, the variance can be written directly as the sum of the squares of the elements divided by the number of elements:

$\mathrm{Var}(a) = \frac{1}{m}\sum_{i=1}^{m}a_i^2$
So the problem above is formally stated as: find a one-dimensional basis such that, after all the data are transformed into coordinates on this basis, the variance is maximized.
Covariance
For the two-dimensional-to-one-dimensional problem above, it suffices to find the direction that maximizes the variance. But for higher dimensions there is another issue to resolve. Consider reducing from three dimensions to two. As before, we first find the direction that maximizes the variance after projection; that settles the first direction. Then we choose the second projection direction.
If we still simply chose the direction with the largest variance, it is clear that this direction would "almost coincide" with the first one, and obviously such a dimension is useless. So there should be some other constraint. Intuitively, in order for the two fields to represent as much of the original information as possible, we do not want any (linear) correlation between them, because correlation means the two fields are not completely independent and must contain redundant information.
Mathematically, the correlation can be expressed by the covariance of two fields. Since the mean of each field has already been set to 0, we have:

$\mathrm{Cov}(a, b) = \frac{1}{m}\sum_{i=1}^{m}a_i b_i$
As you can see, when the mean value of the field is 0, the covariance of the two fields is succinctly expressed as the inner product divided by the number of elements m.
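In code, for two zero-mean fields both quantities are just inner products divided by m. A minimal sketch, using the centered data from the example above (note that the article divides by m, unlike the m - 1 default of some library routines):

```python
import numpy as np

# Two zero-mean fields of length m (the centered example data from above)
a = np.array([-1.0, -1.0, 0.0, 2.0, 0.0])
b = np.array([-2.0,  0.0, 0.0, 1.0, 1.0])
m = len(a)

var_a = np.dot(a, a) / m        # variance of a  -> 1.2
var_b = np.dot(b, b) / m        # variance of b  -> 1.2
cov_ab = np.dot(a, b) / m       # covariance of a and b -> 0.8
print(var_a, var_b, cov_ab)
```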
When the covariance is 0, the two fields are completely uncorrelated. To make the covariance 0, when choosing the second basis vector we can only choose among directions orthogonal to the first one. Therefore the two directions finally chosen must be orthogonal.
At this point we have obtained the optimization goal of the dimensionality reduction problem: to reduce a set of N-dimensional vectors to K dimensions (0 < K < N), the goal is to select K unit (modulus 1) orthogonal basis vectors such that, after the original data are transformed onto this basis, the covariance between every pair of fields is 0, while the variance of each field is as large as possible (under the orthogonality constraint, take the K largest variances).
Covariance matrix
We have derived the optimization goal above, but this goal by itself is not an operational guide (an algorithm): it says what we want but not how to do it. So we need to keep working toward a concrete computational scheme.
We see that the goal we ultimately want to achieve is closely related to the within-field variance and the between-field covariance, so we would like to unify the two. Looking carefully, we find that both can be written as inner products, and inner products are closely related to matrix multiplication. This leads to the following idea:
Suppose we only have two fields a and b, and we form the matrix X from them by rows:

$X = \begin{pmatrix} a_1 & a_2 & \cdots & a_m \\ b_1 & b_2 & \cdots & b_m \end{pmatrix}$
Then we multiply X by its transpose and scale by the coefficient 1/m:

$\frac{1}{m}XX^\mathsf{T} = \begin{pmatrix} \frac{1}{m}\sum_{i=1}^{m}a_i^2 & \frac{1}{m}\sum_{i=1}^{m}a_i b_i \\ \frac{1}{m}\sum_{i=1}^{m}a_i b_i & \frac{1}{m}\sum_{i=1}^{m}b_i^2 \end{pmatrix}$
It's a miracle! The two diagonal elements of this matrix are the variances of the two fields, while the off-diagonal elements are the covariance of a and b. The two quantities are unified in a single matrix.
By the rules of matrix multiplication, this conclusion easily generalizes to the general case:
Suppose we have m n-dimensional data records, arranged by column into an n-by-m matrix X, and let

$C = \frac{1}{m}XX^\mathsf{T}$

Then C is a symmetric matrix whose diagonal elements are the variances of the respective fields, while the element in row i, column j (which equals the element in row j, column i) is the covariance of fields i and j.
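As a minimal sketch, the whole covariance matrix is one line of numpy once X is centered (again using the example data, with columns as records and rows as fields):

```python
import numpy as np

# Centered data: n = 2 fields (rows), m = 5 records (columns)
X = np.array([[-1.0, -1.0, 0.0, 2.0, 0.0],
              [-2.0,  0.0, 0.0, 1.0, 1.0]])
m = X.shape[1]

C = (X @ X.T) / m
print(C)
# [[1.2 0.8]
#  [0.8 1.2]]
# This matches np.cov(X, bias=True), which also divides by m rather than m - 1.
```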
Diagonalization of covariance matrix
From the derivation above, we find that the optimization we want is equivalent to diagonalizing the covariance matrix: that is, making all off-diagonal elements 0 and arranging the diagonal elements from largest to smallest from top to bottom. This achieves our optimization goal. This may still not be very intuitive, so let's look more closely at the relationship between the covariance matrix of the original data and that of the data after a basis transformation.
Let the covariance matrix of the original data matrix X be C, let P be a matrix whose rows are a set of basis vectors, and let Y = PX; then Y is the data obtained by transforming X onto the basis P. Let the covariance matrix of Y be D. Let us derive the relationship between D and C:

$D = \frac{1}{m}YY^\mathsf{T} = \frac{1}{m}(PX)(PX)^\mathsf{T} = \frac{1}{m}PXX^\mathsf{T}P^\mathsf{T} = P\left(\frac{1}{m}XX^\mathsf{T}\right)P^\mathsf{T} = PCP^\mathsf{T}$
Now things are clear! The P we are looking for is nothing other than a matrix that diagonalizes the original covariance matrix. In other words, the optimization goal becomes: find a matrix P such that $PCP^\mathsf{T}$ is a diagonal matrix whose diagonal elements are arranged from largest to smallest. Then the first K rows of P are exactly the basis we want: multiplying the matrix formed by the first K rows of P by X reduces X from N dimensions to K dimensions while satisfying the optimization conditions above.
At this point, we are only one step away from "inventing" PCA!
Now that everything comes down to diagonalizing the covariance matrix, we should be grateful that mathematicians got there first: the diagonalization of matrices is a thoroughly studied topic in linear algebra, so mathematically this is not a problem at all.
As we saw above, the covariance matrix C is a real symmetric matrix, and in linear algebra real symmetric matrices have a series of very nice properties:
1) The eigenvectors corresponding to distinct eigenvalues of a real symmetric matrix are orthogonal.
2) If an eigenvalue has multiplicity r, then there must exist r linearly independent eigenvectors corresponding to it, and those r eigenvectors can be orthonormalized.
From these two facts it follows that an n-row, n-column real symmetric matrix always has n orthonormal eigenvectors. Let these n eigenvectors be $e_1, e_2, \cdots, e_n$, and arrange them by column into the matrix

$E = \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix}$
Then for the covariance matrix C we have the following conclusion:

$E^\mathsf{T} C E = \Lambda = \begin{pmatrix} \lambda_1 & & & \\ & \lambda_2 & & \\ & & \ddots & \\ & & & \lambda_n \end{pmatrix}$

where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to the respective eigenvectors (possibly with repetition).
We will not give a rigorous proof of these conclusions here; readers interested in the proof can consult a linear algebra textbook on the diagonalization of real symmetric matrices.
At this point, we find that we have already found the matrix P we need:

$P = E^\mathsf{T}$
P is the matrix obtained by normalizing the eigenvectors of the covariance matrix to unit length and arranging them as rows; each row of P is an eigenvector of C. If the rows of P are ordered from top to bottom according to the corresponding eigenvalues in $\Lambda$, from largest to smallest, then multiplying the matrix formed by the first K rows of P by the original data matrix X gives the dimension-reduced data matrix Y that we need.
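Here is a small numpy check of this construction, using the example covariance matrix computed above; `np.linalg.eigh` is used because C is real symmetric:

```python
import numpy as np

C = np.array([[1.2, 0.8],
              [0.8, 1.2]])            # covariance matrix of the example data

eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition for a symmetric matrix
order = np.argsort(eigvals)[::-1]     # sort eigenvalues from largest to smallest
P = eigvecs[:, order].T               # rows of P are unit eigenvectors of C

D = P @ C @ P.T                       # should be diag(2, 0.4): the off-diagonals vanish
print(np.round(D, 10))
```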
So far, we have completed the discussion of the mathematical principles of the whole PCA. In the following section, we will give an example of PCA.
Algorithms and examples
In order to consolidate the above theory, we give a specific PCA example in this section.
PCA algorithm
To summarize, the algorithmic steps of PCA are as follows (a small code sketch follows the list).
Suppose there are m pieces of n-dimensional data.
1) Arrange the original data by column into an n-row, m-column matrix X
2) Zero-center each row of X (each row represents an attribute field), that is, subtract the mean of that row
3) Compute the covariance matrix $C = \frac{1}{m}XX^\mathsf{T}$
4) Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors
5) Arrange the eigenvectors as rows, ordered from top to bottom by the size of their eigenvalues, and take the first k rows to form the matrix P
6) $Y = PX$ is the data reduced to k dimensions
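Putting the six steps together, here is a compact numpy sketch of the whole procedure (the function and variable names are my own, not from the original text):

```python
import numpy as np

def pca(X, k):
    """Reduce X (n fields as rows, m records as columns) to k dimensions."""
    # Step 2: zero-center each row (field)
    Xc = X - X.mean(axis=1, keepdims=True)
    m = Xc.shape[1]

    # Step 3: covariance matrix C = (1/m) X X^T
    C = (Xc @ Xc.T) / m

    # Step 4: eigenvalues and eigenvectors of the symmetric matrix C
    eigvals, eigvecs = np.linalg.eigh(C)

    # Step 5: order eigenvectors by eigenvalue (largest first), keep the first k as rows of P
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]].T

    # Step 6: Y = PX is the reduced data
    return P @ Xc, P

# The two-dimensional example data from the article, reduced to one dimension
X = np.array([[1.0, 1.0, 2.0, 4.0, 2.0],
              [1.0, 3.0, 3.0, 4.0, 4.0]])
Y, P = pca(X, k=1)
print(Y)   # approximately [[-2.121, -0.707, 0, 2.121, 0.707]], up to an overall sign
```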
Example
Here we take the matrix mentioned above,

$\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}$

as an example, and use the PCA method to reduce this set of two-dimensional data to one dimension.
Since each row of this matrix already has zero mean, we directly compute the covariance matrix:

$C = \frac{1}{5}\begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix}\begin{pmatrix} -1 & -2 \\ -1 & 0 \\ 0 & 0 \\ 2 & 1 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 6/5 & 4/5 \\ 4/5 & 6/5 \end{pmatrix}$
Then we find its eigenvalues and eigenvectors; the details of the computation are omitted here and can be found in any linear algebra reference. The eigenvalues are:

$\lambda_1 = 2, \qquad \lambda_2 = 2/5$
The corresponding eigenvectors, normalized to unit length, are:

$e_1 = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \qquad e_2 = \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}$

so the matrix P is:

$P = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$
Projecting with the first row of P (the eigenvector with the larger eigenvalue), the dimension-reduced result is:

$Y = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} -1 & -1 & 0 & 2 & 0 \\ -2 & 0 & 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} -3/\sqrt{2} & -1/\sqrt{2} & 0 & 3/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$
Further discussion
From the explanation of the mathematical principles of PCA above, we can understand some of the capabilities and limitations of PCA. PCA essentially takes the directions of maximum variance as the principal features and "de-correlates" the data along a set of orthogonal directions, so that the data have no correlation between different orthogonal directions.
Therefore PCA also has some limitations. For example, it removes linear correlation very well, but it cannot handle higher-order correlations; for data with higher-order correlations, one can consider Kernel PCA, which converts nonlinear correlation into linear correlation through a kernel function. In addition, PCA assumes that the principal features of the data lie along orthogonal directions; if several directions of large variance are not orthogonal to one another, the effectiveness of PCA is greatly reduced.
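For reference, scikit-learn ships a Kernel PCA implementation; a minimal sketch is below (the circle data and the kernel parameters are purely illustrative choices of mine):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Points on a circle: no single linear direction captures this structure well
theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # samples as rows, as sklearn expects

kpca = KernelPCA(n_components=1, kernel="rbf", gamma=2.0)
Z = kpca.fit_transform(X)    # a nonlinear one-dimensional representation, shape (50, 1)
print(Z.shape)
```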
Finally, it should be noted that PCA is a parameter-free technique: given the same data (leaving aside cleaning), anyone who runs it gets the same result, with no subjective parameters to tune. This makes PCA convenient to implement as a general-purpose tool, but it also means it cannot be individually tuned or optimized for a particular task.
I hope this article helps readers understand the mathematical foundations and working principles of PCA, and thereby its applicable scenarios and limitations, so as to make better use of this algorithm.
Choosing the value of K in PCA
In PCA, the value of k, that is, the number of principal components we retain, is an important concern. If k is too large, the data compression ratio is low; in the limiting case k = n, we are simply using the original data. If k is too small, the approximation error of the data is too large.
When determining the value of k, we usually consider the percentage of variance retained for different values of k. If k = n, then our approximation of the data is perfect: 100% of the variance is retained and all variation in the data is preserved. If k = 0, then only 0% of the variation is retained.
The percentage of retained variance can be expressed as the ratio of the sum of the selected eigenvalues to the sum of all eigenvalues. For example, in the two-dimensional case, if the first eigenvalue is 8 and the second is 2, then keeping only the first component retains 8 / (8 + 2) = 80% of the variance.
In image processing, for example, a common convention is to retain 99% of the variance; in other application areas, retaining 90% to 98% of the variance is also common.
Therefore, if you introduce the details of your PCA algorithm to others and tell them that the k you choose retains 95% variance, it is easier to understand than to tell them that you have chosen the first 100 principal components.
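A small sketch of choosing k this way, assuming the eigenvalues have already been computed as above (the helper name and the 95% threshold are illustrative):

```python
import numpy as np

def choose_k(eigenvalues, variance_to_keep=0.95):
    """Smallest k whose top-k eigenvalues retain the requested fraction of total variance."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # largest first
    retained = np.cumsum(vals) / np.sum(vals)                    # cumulative retained-variance ratio
    return int(np.searchsorted(retained, variance_to_keep) + 1)

print(choose_k([8.0, 2.0], 0.80))   # 1: the first component alone keeps 80% of the variance
print(choose_k([8.0, 2.0], 0.95))   # 2
```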