In this article, the editor introduces in detail how to implement the PCA dimensionality reduction algorithm in Python. The content is detailed, the steps are clear, and the details are handled carefully. I hope this article helps resolve your doubts; follow the editor's ideas below and learn this new knowledge step by step.
I. Algorithm overview
Principal Component Analysis (PCA) is a statistical method for grasping the principal aspects of a problem. It identifies the main influencing factors among many, reveals the essence of things, and simplifies complex problems.
PCA is the most commonly used dimensionality reduction method. Its goal is to map high-dimensional data into a lower-dimensional space through a linear projection, choosing the projection so that the variance of the data along the projected dimensions is as large as possible; in this way fewer dimensions are used while as much of the original data's information as possible is retained.
The PCA algorithm finds the eigenvalues and eigenvectors of the covariance matrix of the sample data; the directions of those eigenvectors are the directions onto which PCA projects. After the samples are projected into the lower-dimensional space, they still represent the original data as faithfully as possible.
PCA combines correlated high-dimensional variables into linearly independent low-dimensional variables called principal components, which retain as much of the original data's information as possible.
PCA is usually used for the exploration and visualization of high-dimensional datasets, and can also be used for data compression and preprocessing.
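To make this concrete, here is a minimal sketch (using NumPy and scikit-learn on hypothetical random data) checking that the principal components produced by PCA are indeed uncorrelated with each other:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# build two correlated columns so the original features are correlated
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=200)
X = np.column_stack([x1, x2])

print(np.corrcoef(X, rowvar=False))      # strong off-diagonal correlation in the raw features
Z = PCA(n_components=2).fit_transform(X)
print(np.cov(Z, rowvar=False).round(3))  # off-diagonal terms are ~0: the components are uncorrelated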
II. Algorithm steps
1. Arrange the original data by rows into a matrix X with m rows (samples) and n columns (attributes).
2. Zero-center each column of X (each column represents an attribute field) by subtracting that column's mean.
3. Compute the covariance matrix of the centered data.
4. Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors.
5. Arrange the eigenvectors as columns from left to right in order of decreasing eigenvalue, and take the first k columns to form the matrix P.
6. Compute the data reduced to k dimensions by projecting the centered data onto P, as shown in the sketch below.
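The six steps above can be followed literally in a few lines of NumPy; this sketch assumes a small random data matrix and k = 2 purely for illustration:

import numpy as np

X = np.random.rand(10, 5)                 # step 1: m=10 samples in rows, n=5 attributes in columns
X_centered = X - X.mean(axis=0)           # step 2: zero-center each column
C = np.cov(X_centered, rowvar=False)      # step 3: covariance matrix (n x n)
eig_vals, eig_vecs = np.linalg.eigh(C)    # step 4: eigenvalues and eigenvectors (symmetric matrix)
order = np.argsort(eig_vals)[::-1]        # step 5: order eigenvectors by decreasing eigenvalue
k = 2
P = eig_vecs[:, order[:k]]                # first k columns form the projection matrix P
X_reduced = X_centered @ P                # step 6: data reduced to k dimensions
print(X_reduced.shape)                    # (10, 2)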
III. Related concepts
Variance: describes the degree of dispersion of a set of data.
Covariance: describes the correlation between two variables; when normalized to the correlation coefficient, a value close to 1 indicates positive correlation, close to -1 indicates negative correlation, and close to 0 indicates no correlation.
Covariance matrix: a symmetric matrix whose diagonal entries are the variances of each dimension (see the short check after this list).
Eigenvalues: the k largest eigenvalues of the covariance matrix, used for dimensionality reduction.
Eigenvectors: the k eigenvectors corresponding to those eigenvalues; they give the projection directions used for dimensionality reduction.
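As a quick check of these concepts, the following sketch (the data values are assumed, for illustration only) prints a covariance matrix and verifies that it is symmetric with the per-column variances on its diagonal:

import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
C = np.cov(X, rowvar=False)
print(C)                          # symmetric 2 x 2 covariance matrix
print(np.var(X, axis=0, ddof=1))  # matches the diagonal of C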
IV. Advantages and disadvantages of the algorithm
Advantages
The amount of information is measured only by variance and is not affected by factors outside the data set.
The principal components are orthogonal to each other, which eliminates mutual influence between the components of the original data.
The calculation method is simple, and the main operation is eigenvalue decomposition, which is easy to implement.
Disadvantages
The meaning of each principal component dimension is somewhat ambiguous and less interpretable than the original sample features.
Non-principal components with small variance may still contain important information about differences between samples, and discarding them during dimensionality reduction may affect subsequent processing.
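Because the retained or discarded information is measured by variance, scikit-learn's explained_variance_ratio_ can be inspected to judge how much is lost when the smaller components are dropped. A short sketch on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)
print(pca.explained_variance_ratio_)           # share of total variance explained by each component
print(pca.explained_variance_ratio_.cumsum())  # cumulative share retained by the first k components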
V. Implementation of the algorithm
Custom implementation
import numpy as np

# zero-center the initial data
def zeroMean(dataMat):
    # column means
    meanVal = np.mean(dataMat, axis=0)
    # subtract the column means
    newData = dataMat - meanVal
    return newData, meanVal

# reduce the dimensionality of the initial data
def pca(dataMat, percent=0.19):
    newData, meanVal = zeroMean(dataMat)
    # covariance matrix
    covMat = np.cov(newData, rowvar=0)
    # eigenvalues and eigenvectors
    eigVals, eigVects = np.linalg.eig(np.mat(covMat))
    # number of eigenvectors to extract
    n = percentage2n(eigVals, percent)
    print("data reduced to: " + str(n) + " dimensions")
    # sort eigenvalues from smallest to largest
    eigValIndice = np.argsort(eigVals)
    # indices of the n largest eigenvalues
    n_eigValIndice = eigValIndice[-1:-(n + 1):-1]
    # eigenvectors of the n largest eigenvalues
    n_eigVect = eigVects[:, n_eigValIndice]
    # data reduced to n dimensions
    lowDataMat = newData * n_eigVect
    reconMat = (lowDataMat * n_eigVect.T) + meanVal
    return reconMat, lowDataMat, n

# determine the number of eigenvectors to extract from the variance percentage
def percentage2n(eigVals, percentage):
    # sort in descending order
    sortArray = np.sort(eigVals)[-1::-1]
    # total sum
    arraySum = sum(sortArray)
    tempSum = 0
    num = 0
    for i in sortArray:
        tempSum += i
        num += 1
        if tempSum >= arraySum * percentage:
            return num

if __name__ == '__main__':
    # initialize the raw data (rows represent samples, columns represent dimensions)
    data = np.random.randint(1, 20, size=(6, 8))
    print(data)
    # dimensionality reduction
    fin = pca(data, 0.9)
    mat = fin[1]
    print(mat)
Implementation using the sklearn library
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# load the data
data = load_iris()
x = data.data
y = data.target

# set the target dimensionality and reduce the data
pca = PCA(n_components=2)
reduced_x = pca.fit_transform(x)

red_x, red_y = [], []
green_x, green_y = [], []
blue_x, blue_y = [], []

# group the reduced samples by class
for i in range(len(reduced_x)):
    if y[i] == 0:
        red_x.append(reduced_x[i][0])
        red_y.append(reduced_x[i][1])
    elif y[i] == 1:
        green_x.append(reduced_x[i][0])
        green_y.append(reduced_x[i][1])
    else:
        blue_x.append(reduced_x[i][0])
        blue_y.append(reduced_x[i][1])

# plot the three classes with different colors and markers
plt.scatter(red_x, red_y, c='r', marker='x')
plt.scatter(green_x, green_y, c='g', marker='D')
plt.scatter(blue_x, blue_y, c='b', marker='.')
plt.show()

VI. Algorithm optimization
PCA is a linear feature extraction algorithm: it re-orders a set of features by importance and produces a set of uncorrelated new features. However, it weights all attributes equally when constructing the projection, ignoring the fact that different attributes contribute differently to classification.
KPCA algorithm
KPCA is a nonlinear dimensionality reduction algorithm that improves on PCA: it uses the kernel trick to apply a nonlinear transformation to the sample data and then performs PCA in the transformed space, thereby realizing nonlinear PCA.
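A minimal sketch of this idea using scikit-learn's KernelPCA; the RBF kernel and the gamma value are chosen here only for illustration:

from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA

X = load_iris().data
# nonlinear dimensionality reduction with an RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.1)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (150, 2)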
Local PCA algorithm
Local PCA is a local dimensionality reduction algorithm that improves on PCA: when searching for principal components it adds a regularization term with local smoothness, so that the principal components retain more local information.
That concludes this article on how to implement the PCA dimensionality reduction algorithm in Python. Mastering the material still requires practice; experimenting with the code above is the best way to consolidate it.