How to Use scikit-learn for PCA Dimensionality Reduction


This article shows how to perform PCA dimensionality reduction with scikit-learn. The content is concise and easy to follow, and I hope the detailed walkthrough below gives you something useful.

1. Introduction to the scikit-learn PCA classes

In scikit-learn, the PCA-related classes all live in the sklearn.decomposition package. The most commonly used one is sklearn.decomposition.PCA, and most of the explanation below is based on this class.

Besides the PCA class, the most commonly used PCA-related class is KernelPCA, which is mainly used for dimensionality reduction of nonlinear data and relies on the kernel trick, as mentioned in the principle section. When using it, you therefore need to choose an appropriate kernel function and tune its parameters.
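As a rough sketch of how KernelPCA might be used (the dataset, kernel choice and gamma value below are our own illustrative choices, not from this article):

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_circles

# two concentric circles: a nonlinear structure that linear PCA cannot unfold
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# an RBF kernel maps the data into a feature space where the structure becomes easier to separate;
# in practice the kernel and gamma need to be chosen, e.g. by cross-validation
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X_circles)
print(X_kpca.shape)   # (400, 2)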

Another commonly used PCA-related class is IncrementalPCA, which is mainly designed to work around single-machine memory limits. Sometimes the sample size may be in the millions and the dimensionality in the thousands, so fitting all the data at once may blow up memory. In that case IncrementalPCA can help: it splits the data into multiple batches and calls partial_fit on each batch in turn, arriving at the final dimensionality reduction of the sample step by step.
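A minimal sketch of this batch-wise fitting, with made-up array sizes and batch count chosen purely for illustration:

import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X_large = rng.rand(100000, 50)       # stand-in for data too large to decompose in one go
ipca = IncrementalPCA(n_components=10)

# feed the data batch by batch; each partial_fit call updates the running estimate of the components
for batch in np.array_split(X_large, 20):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X_large)  # transform can also be applied batch by batch if needed
print(X_reduced.shape)               # (100000, 10)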

There are also SparsePCA and MiniBatchSparsePCA. Their main difference from the PCA class above is that they use L1 regularization, which drives the loadings of many non-major components to zero, so that the dimensionality reduction only involves the relatively major components and is less affected by noise and similar factors. The difference between the two is that MiniBatchSparsePCA works on mini-batches of the data with a given number of iterations to speed up the otherwise slow decomposition on large samples, at the possible cost of some accuracy in the reduction. To use SparsePCA and MiniBatchSparsePCA, the L1 regularization parameter needs to be tuned.
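A minimal sketch of the sparse variants, assuming a made-up data matrix and an untuned alpha (the L1 regularization strength):

import numpy as np
from sklearn.decomposition import SparsePCA, MiniBatchSparsePCA

rng = np.random.RandomState(0)
X_sparse_demo = rng.rand(200, 30)

spca = SparsePCA(n_components=5, alpha=1.0, random_state=0)
spca.fit(X_sparse_demo)
# the L1 penalty can drive many loadings to exactly zero, suppressing minor components
print(np.mean(spca.components_ == 0))   # fraction of zero loadings

# MiniBatchSparsePCA trades some accuracy for speed by working on mini-batches
mbspca = MiniBatchSparsePCA(n_components=5, alpha=1.0, batch_size=20, random_state=0)
mbspca.fit(X_sparse_demo)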

2. Introduction to the sklearn.decomposition.PCA parameters

Below we explain how to do PCA dimensionality reduction with scikit-learn, based mainly on sklearn.decomposition.PCA. The PCA class requires almost no parameter tuning: generally we only need to specify the target dimension, or a threshold on the proportion of variance that the retained principal components must explain relative to the total variance of the original features.

Now let's go through the main parameters of sklearn.decomposition.PCA:

1) n_components: this parameter specifies the number of feature dimensions we want PCA to keep. The most common usage is to give the target dimension directly, in which case n_components is an integer greater than or equal to 1. Alternatively, we can specify a minimum threshold on the proportion of variance the retained principal components must explain, and let the PCA class work out the number of dimensions from the sample feature variances; in that case n_components is a number in (0, 1). We can also set the parameter to "mle", in which case the PCA class uses the MLE algorithm to choose the number of principal components according to the distribution of the feature variances. Finally, we can leave n_components unset, in which case it defaults to min(number of samples, number of features). All three explicit forms are illustrated in the sketch at the end of this section.

2) whiten: whether or not to whiten. Whitening means normalizing each feature of the reduced-dimensional data so that its variance is 1. For PCA dimensionality reduction itself whitening is generally unnecessary, but if there are further data processing steps after the reduction, whitening may be worth considering. The default is False, i.e. no whitening.

3) svd_solver: the method used for the singular value decomposition (SVD). Because eigendecomposition is a special case of SVD, PCA libraries are generally implemented on top of SVD. There are four possible values: {'auto', 'full', 'arpack', 'randomized'}. 'randomized' is generally suitable for PCA on data with many samples, many dimensions and a low proportion of principal components; it uses randomized algorithms to speed up the SVD. 'full' is the traditional SVD, using the corresponding scipy implementation. 'arpack' is suited to scenarios similar to 'randomized', except that 'randomized' uses scikit-learn's own SVD implementation while 'arpack' uses the sparse SVD implementation from scipy directly. The default is 'auto': the PCA class weighs the three algorithms above and picks an appropriate one to do the reduction. In general the default is sufficient.

In addition to these input parameters, two attributes of the PCA class are worth noting. The first is explained_variance_, the variance carried by each principal component after dimensionality reduction: the larger the variance, the more important the component. The second is explained_variance_ratio_, the proportion of the total variance accounted for by each principal component after dimensionality reduction: the larger this ratio, the more important the component.
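To tie these parameters together, here is a small sketch on made-up data (the array shapes and parameter values are our own choices for illustration, not from this article). It shows the three ways of setting n_components, the whiten and svd_solver options, and how to read the two attributes just mentioned:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_demo = rng.rand(1000, 10)   # made-up data: 1000 samples, 10 features

pca_int = PCA(n_components=3)                       # reduce to exactly 3 dimensions
pca_frac = PCA(n_components=0.95)                   # keep enough components for >= 95% of the variance
pca_mle = PCA(n_components='mle')                   # let the MLE algorithm choose the dimension
pca_rand = PCA(n_components=3, svd_solver='randomized', random_state=0)   # randomized SVD for large data
pca_white = PCA(n_components=3, whiten=True)        # whitened components have (roughly) unit variance

pca_frac.fit(X_demo)
print(pca_frac.n_components_)              # number of dimensions actually kept
print(pca_frac.explained_variance_)        # variance carried by each retained component
print(pca_frac.explained_variance_ratio_)  # proportion of total variance per retained component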

3. PCA example

Let's learn how to use the scikit-learn PCA class through an example. To make visualization easy and the result intuitive, we use three-dimensional data for the dimensionality reduction.

First, we generate random data and visualize it. The code is as follows:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
from sklearn.datasets import make_blobs

# X holds the sample features, y the sample cluster label; 10,000 samples, 3 features each, 4 clusters
X, y = make_blobs(n_samples=10000, n_features=3,
                  centers=[[3, 3, 3], [0, 0, 0], [1, 1, 1], [2, 2, 2]],
                  cluster_std=[0.2, 0.1, 0.2, 0.2], random_state=9)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=30, azim=20)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker='o')

The distribution of 3D data is as follows:

First, without reducing the dimension, let's just project the data and look at the variance distribution of the three projected dimensions. The code is as follows:

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

The output is as follows:

[0.98318212 0.00850037 0.00831751]

[3.78483785 0.03272285 0.03201892]

It can be seen that the variance ratios of the three projected feature dimensions are roughly 98.3%, 0.8% and 0.8%: after projection, the first feature accounts for the vast majority of the variance.

Now let's reduce the dimension from 3D to 2D, and the code is as follows:

pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

The output is as follows:

[0.98318212 0.00850037]

[3.78483785 0.03272285]

This result is to be expected: the variances of the three projected feature dimensions above were [3.78483785 0.03272285 0.03201892], so when projecting down to two dimensions the first two features are certainly kept and the third is dropped.

In order to have an intuitive understanding, let's look at the transformed data distribution at this time. The code is as follows:

X_new = pca.transform(X)
plt.scatter(X_new[:, 0], X_new[:, 1], marker='o')
plt.show()

The figure of the output is as follows:

It can be seen that the four clusters from the earlier three-dimensional plot are still clearly visible in the reduced-dimensional data.

Now, instead of specifying the target dimension directly, let's specify the proportion of the total variance that the retained principal components must account for.

pca = PCA(n_components=0.95)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

We require the retained principal components to account for at least 95% of the variance, and the output is as follows:

[0.98318212]

[3.78483785]

1

It can be seen that only the first projected feature is retained. This is easy to understand: the first principal component accounts for 98.3% of the variance, so keeping this one dimension already meets the 95% threshold. Now let's raise the threshold to 99%. The code is as follows:

pca = PCA(n_components=0.99)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

The output at this time is as follows:

[0.98318212 0.00850037]

[3.78483785 0.03272285]

2

This result is also easy to understand: the first principal component accounts for 98.3% of the variance and the second for 0.8%, and the two together are needed to meet the 99% threshold.

Finally, let's see what happens when we let the MLE algorithm choose the reduced dimension on its own. The code is as follows:

pca = PCA(n_components='mle')
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
print(pca.n_components_)

The output is as follows:

[0.98318212]

[3.78483785]

1

It can be seen that because the variance proportion of the first projection feature of our data is as high as 98.3%, the MLE algorithm retains only our first feature.

That is how to use scikit-learn for PCA dimensionality reduction. Hopefully you have picked up some useful knowledge or skills from it.
