
How Python uses PCA to visualize data


This article focuses on how Python uses PCA to visualize data. Interested readers may wish to take a look: the method introduced here is simple, fast and practical. Let's learn how Python uses PCA to visualize data.

What is PCA?

Let's review the theory first. We won't go into the details of exactly how PCA works; there are plenty of learning resources online.

PCA is used to reduce the number of features used to train a model. It does this by constructing so-called principal components (PCs) from the original features.

The PCs are constructed so that PC1 captures as much of the variance in your features as possible. PC2 then captures as much of the remaining variance as possible, and so on. Together, PC1 and PC2 can usually explain a large part of the total variance in the features.

Another way to think about it is that the first two PCs are usually a good summary of all the features. This is important because, as we will see, it allows us to visualize how well the data can be separated into classes on a two-dimensional plane.
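
To make this concrete, here is a minimal sketch (not part of the original article) using scikit-learn on two synthetic, correlated features; the explained variance ratio shows PC1 capturing most of the variance:

# Minimal sketch on synthetic data: PC1 captures most of the variance
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(scale=0.3, size=500)  # correlated with x1
X = np.column_stack([x1, x2])

pca_demo = PCA().fit(X)
print(pca_demo.explained_variance_ratio_)  # PC1's share is much larger than PC2's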

Data set

Let's look at a practical example. We will use PCA to explore the breast cancer dataset (http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)), which we load with the following code.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer['data'], columns=cancer['feature_names'])
data['y'] = cancer['target']

The target variable is the result of the breast cancer test: malignant or benign. For each test, multiple cancer cells are sampled, and 10 different measurements are taken from each cell. These measurements include things like cell radius and cell symmetry. Finally, to obtain the feature values, the mean, standard error and worst (largest) value of each measurement are calculated, giving 30 features in total.
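
As a quick illustrative check (assuming the data DataFrame created above), we can confirm this mean / error / worst structure from the column names:

# Inspect the feature names: 10 measurements x (mean, error, worst) = 30 features
feature_cols = [c for c in data.columns if c != 'y']
print(len(feature_cols))                          # 30
print([c for c in feature_cols if 'symmetry' in c])
# ['mean symmetry', 'symmetry error', 'worst symmetry']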

In the figure, we take a closer look at two of these features: mean symmetry and worst smoothness.

In the figure, we see that these two features can help distinguish between the two classes: benign tumors tend to be more symmetrical and smooth. However, there is still a lot of overlap, so a model using only these two features would not do very well.
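
The figure itself is not reproduced here, but a plot of this kind could be recreated with something like the following sketch (assuming the data DataFrame from above and matplotlib):

# Sketch of the scatter plot described above: mean symmetry vs worst smoothness
import matplotlib.pyplot as plt

colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
plt.figure(figsize=(8, 8))
plt.scatter(data['mean symmetry'], data['worst smoothness'],
            c=colour, edgecolors='#000000')
plt.xlabel('mean symmetry', size=16)
plt.ylabel('worst smoothness', size=16)
plt.show()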

We could create a graph like this for each feature to understand its predictive power. But with 30 features there would be quite a few graphs to analyze, and they would not tell us how well the dataset as a whole can separate the classes. This is where PCA comes in.

PCA - the entire dataset

First, we perform principal component analysis on the whole dataset, using the following code.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# standardize the features (excluding the target column)
features = data.drop('y', axis=1)
scaler = StandardScaler()
scaler.fit(features)
scaled = scaler.transform(features)

# PCA
pca = PCA().fit(scaled)
pc = pca.transform(scaled)
pc1 = pc[:, 0]
pc2 = pc[:, 1]

# plot the first two principal components
plt.figure(figsize=(10, 10))
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]
plt.scatter(pc1, pc2, c=colour, edgecolors='#000000')
plt.xlabel('PC1', size=20)
plt.ylabel('PC2', size=20)
plt.xticks(size=12)
plt.yticks(size=12)
plt.show()

We first standardize the features so that each has a mean of 0 and a variance of 1. This is important because PCA works by maximizing the variance explained by each principal component.

Without standardization, some features would naturally have higher variances simply because of their scale. For example, a distance measured in centimeters has a higher variance than the same distance measured in kilometers. Without scaling, PCA would be "attracted" to those high-variance features.
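
As a tiny illustration of the units point (not from the original article), the same distances have a far larger variance when expressed in centimeters than in kilometers:

# Same distances, different units: variance depends on the scale
import numpy as np

km = np.array([1.2, 3.4, 2.1, 5.0])
cm = km * 100000                  # the same distances in centimeters
print(np.var(km), np.var(cm))     # the variance in cm is 1e10 times larger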

After scaling, we fit the PCA model and transform our features into PCs. Since we have 30 features, we can have up to 30 PCs. For visualization, however, we are only interested in the first two. PC1 and PC2 are then used to create the scatter plot shown in figure 2.
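
As a quick sanity check (assuming the pc array and fitted pca object from the code above), all 30 PCs are computed even though only the first two are plotted:

print(pc.shape)           # (569, 30): 569 samples, 30 principal components
print(pca.n_components_)  # 30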

In figure 2, we can see two distinct clusters. Although there is still some overlap, the clusters are much clearer than in the previous figure. This tells us that, taken as a whole, the features in this dataset do a good job of distinguishing malignant from benign tumors.

We should also keep in mind that we are only looking at the first two PCs, so not all of the variation in the features is captured. This means that a model trained on all the features may still correctly classify the apparent outliers (that is, the points that do not fall clearly into either cluster).

At this point, we should mention a caveat of this approach. We said that PC1 and PC2 can usually explain a large part of the variance in your features. However, this is not always true. When it is not, the first two PCs give a poor summary of your features, which means that even if your data can be separated well, you may not get the clear clusters shown above.

We can check this by looking at the scree plot, which we create for this analysis using the following code.

# proportion of variance explained by the first 10 PCs
var = pca.explained_variance_ratio_[0:10]
labels = ['PC1', 'PC2', 'PC3', 'PC4', 'PC5', 'PC6', 'PC7', 'PC8', 'PC9', 'PC10']

plt.figure(figsize=(15, 7))
plt.bar(labels, var)
plt.xlabel('Principal Component')
plt.ylabel('Proportion of Variance Explained')
plt.show()

It is essentially a bar chart, where the height of each bar is the proportion of variance explained by the corresponding PC. We see that only about 20% of the feature variance is explained by PC1 and PC2. Even with only 20% explained, we still get two distinct clusters, which emphasizes the predictive power of the data.
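
To complement the bar chart, we could also print the cumulative proportion of variance explained (again assuming the fitted pca object from above):

# Cumulative proportion of variance explained by the leading PCs
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative[:5])  # cumulative proportion after PC1, PC2, ... PC5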

PCA - feature groups

So far, we have used PCA to understand how well the full feature set separates the data. We can also use this process to compare different groups of features. For example, suppose we want to know whether the symmetry and smoothness of the cells are better predictors than their perimeter and concavity.

group_1 = ['mean symmetry', 'symmetry error', 'worst symmetry',
           'mean smoothness', 'smoothness error', 'worst smoothness']
group_2 = ['mean perimeter', 'perimeter error', 'worst perimeter',
           'mean concavity', 'concavity error', 'worst concavity']

We first create two groups of features. The first group contains all the features related to symmetry and smoothness, and the second group contains all the features related to perimeter and concavity. Then, using each group of features separately, we carry out PCA in the same way as before. The result of this process is shown in the figure below.
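
The per-group code is not shown in the article; a minimal sketch of how it could be done (reusing data, group_1, group_2 and the imports from the earlier snippets) is:

# PCA on each feature group separately, plotted side by side
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
colour = ['#ff2121' if y == 1 else '#2176ff' for y in data['y']]

for ax, g, name in zip(axes, [group_1, group_2], ['Group 1', 'Group 2']):
    scaled_g = StandardScaler().fit_transform(data[g])
    pc_g = PCA().fit_transform(scaled_g)
    ax.scatter(pc_g[:, 0], pc_g[:, 1], c=colour, edgecolors='#000000')
    ax.set_title(name, size=16)
    ax.set_xlabel('PC1')
    ax.set_ylabel('PC2')
plt.show()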

We can see that for group 1 there is some separation, but there is still a lot of overlap. In contrast, group 2 produces two distinct clusters. From these plots, we would therefore expect the group 2 features (cell perimeter and concavity) to be better indicators of whether a tumor is malignant or benign.

Ultimately, this would mean that a model using the group 2 features should be more accurate than one using the group 1 features. Let's test this hypothesis.

We use the following code to train a logistic regression model on each group of features. In each case, we use 70% of the data to train the model and the remaining 30% to test it.

from sklearn.model_selection import train_test_split
import sklearn.metrics as metric
import statsmodels.api as sm

group = [group_1, group_2]
for i, g in enumerate(group):
    x = data[g]
    x = sm.add_constant(x)
    y = data['y']

    # 70% of the data for training, 30% for testing
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

    model = sm.Logit(y_train, x_train).fit()  # fit logistic regression model

    predictions = np.around(model.predict(x_test))
    accuracy = metric.accuracy_score(y_test, predictions)
    print("Accuracy of Group {}: {}".format(i + 1, accuracy))

The group 1 model achieved 74% accuracy on the test set, compared with 97% for group 2. So the group 2 features are clearly better predictors, which is exactly what the PCA results suggested.

In summary, we have seen how PCA can deepen our understanding of data before modeling. Knowing which features are predictive gives you an advantage in feature selection, and looking at the overall separating power of the features gives you a sense of the classification accuracy you can expect.

As mentioned earlier, this method is not foolproof and should be used together with other exploratory plots and summary statistics. In general, it is best to look at the data from as many different angles as possible before you start modeling.

At this point, I believe you have a deeper understanding of how Python uses PCA to visualize data. You might as well try it out in practice yourself.
