This article walks through how to achieve data compression in Python. Many people run into exactly these difficulties in real projects, so follow along and learn how to handle them. Read carefully, and I hope you come away with something!
Preface
In the previous article, we introduced the principles of principal component analysis in detail and put them into practice with a Python case on customer credit rating.
In that article, we pointed out three common application scenarios for principal component analysis, one of which is "data description". Take describing a product portfolio as an example: the famous Boston matrix, the business development of subsidiaries, regional investment potential, and so on all require many variables to be compressed into a few principal components. Compressing into exactly two principal components is ideal, because the result can then be shown in a single chart.
However, plain principal component analysis is generally not adequate for this kind of description; factor analysis does it better. The theory behind factor analysis is extensive and rather involved, so this article skips the principles and works straight through a hands-on PCA case as a transition from principal component analysis to factor analysis. There are two goals:
To estimate the meaning of the generated principal components from the results of the principal component analysis.
To bring out the advantages of factor analysis and why it is worth learning.
Requirement description
Your boss wants you, the data analyst, to summarize the economic phenomena reflected in the following dataset in just two short sentences.
Even a few long sentences may not capture the value of the dataset well, let alone two highly condensed short ones. Nine indicators are already a headache; what if the table were wider, with 20 or 30 variables?
Python in practice
In this section, we will use Python to analyze the data described above.
Data exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-whitegrid')
plt.rc('font', **{'family': 'Microsoft YaHei, SimHei'})  # set support for Chinese fonts
plt.rcParams['axes.unicode_minus'] = False  # fix the minus sign '-' rendering as a square in saved images
sns.set(font='SimHei')  # fix Chinese display in seaborn
df = pd.read_csv('urban economy.csv')
df
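Before looking at correlations, it helps to confirm the column types and value ranges. A quick sketch (assuming the frame contains an 'area' column plus nine numeric indicators, as above):
df.info()  # column types and missing-value counts
df.describe().round(2)  # value ranges; note how different the scales are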
Before doing principal component analysis, we should explore the correlations between the variables; after all, if the variables were independent, there would be nothing to compress.
plt.figure(figsize=(8, 6))
sns.heatmap(data=df.corr(), annot=True)  # annot=True: display the correlation values
The correlations between the variables turn out to be high, so compressing them is worthwhile.
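To make "high correlation" concrete, you can list the variable pairs whose absolute correlation exceeds some threshold; a minimal sketch (the 0.8 cutoff is an arbitrary choice for illustration):
corr = df.drop(columns='area').corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # keep each pair once, skip the diagonal
high = corr.abs().where(mask).stack().sort_values(ascending=False)
print(high[high > 0.8])  # pairs with |r| > 0.8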
PCA modeling
Data standardization
We use z-score standardization, transforming each variable to zero mean and unit variance, so that differences in scale do not distort the compression.
from sklearn.preprocessing import scale
data = df.drop(columns='area')  # drop the non-numeric category variable
data = scale(data)
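For reference, scale performs the same centering and scaling you could write by hand; a sketch of the equivalent z-score computation (note that sklearn uses the population standard deviation, i.e. ddof=0):
values = df.drop(columns='area').values
data_manual = (values - values.mean(axis=0)) / values.std(axis=0)  # z-scores, same result as scale(values)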
Preliminary modeling
Note that on the first pass it is best to set the n_components parameter (the number of principal components to retain) to a large value and observe explained_variance_ratio_, i.e. the fraction of the original variance that each principal component explains.
from sklearn.decomposition import PCA
pca = PCA(n_components=9)  # start with as many components as there are variables
pca.fit(data)
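Before plotting, it is worth printing the raw ratios to see how quickly the explained variance decays; a quick sketch:
print(pca.explained_variance_ratio_.round(3))  # variance explained by each component
print(np.cumsum(pca.explained_variance_ratio_).round(3))  # running total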
Result analysis
Cumulative explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=3)
plt.xlabel('components')
plt.ylabel('cumulative explained variance'); plt.grid(True)
It can be seen that with two principal components the cumulative explained variance already exceeds 0.97 (0.85 is generally considered sufficient), which means we only need to keep two principal components.
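As an aside, scikit-learn can pick the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch using the 0.85 threshold mentioned above:
pca_auto = PCA(n_components=0.85)  # keep enough components to explain >= 85% of the variance
pca_auto.fit(data)
print(pca_auto.n_components_)  # the number of components sklearn chose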
Re-modeling
To sum up, two principal components are enough.
pca = PCA(n_components=2)  # this time keep only two principal components
pca.fit(data)
pca.explained_variance_ratio_
new_data = pca.fit_transform(data)  # fit_transform returns the dimension-reduced data
# compare the dataset sizes
print("original dataset size:", data.shape)
print("dataset size after dimension reduction:", new_data.shape)
You can see that nine variables are compressed into two principal components!
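To check how little information the compression throws away, you can map the two components back into the original nine-dimensional space with pca.inverse_transform and measure the reconstruction error; a short sketch:
restored = pca.inverse_transform(new_data)  # back to nine (standardized) dimensions
print("reconstruction MSE:", ((data - restored) ** 2).mean())  # small, since ~97% of the variance is kept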
Weights of the variables in the principal components
First, look at the coefficients linking the two principal components to the nine variables.
results = pd.DataFrame(pca.components_).T
results.columns = ['pca_1', 'pca_2']
results.index = df.drop(columns='area').columns
results
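A heatmap makes the coefficient table easier to scan at a glance; a small sketch reusing the results frame above:
plt.figure(figsize=(4, 6))
sns.heatmap(results, annot=True, cmap='coolwarm', center=0)  # variable weights on each component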
Two things stand out:
Principal component 1 is almost unaffected by the second variable, per capita GDP (coefficient 0.034), while the other variables all influence it to a similar degree.
Principal component 2 is influenced most by per capita GDP, with a coefficient as high as 0.94.
Result description
Through the PCA modeling above, we compressed nine variables into two principal components, and we know which variables drive each component. The raw components themselves mean little, but can we name the two components according to how the variables load on them?
The first principal component weights the aggregate economic indicators roughly equally, so it can be named the "economic aggregate level"; the second carries a high weight only on per capita GDP, so for now we can call it the "per capita level".
Note: naming the principal components here (and in subsequent posts on factor analysis) applies to the reduced-dimensional data, not to the loading vectors themselves; this is what makes the names useful for comparison and description. The weights of the variables on each component merely guide the choice of name; the name is actually attached to the compressed data.
new_data = pca.fit_transform(data)  # new_data is the dimension-reduced data
results = df.join(pd.DataFrame(new_data,  # join the reduced data onto the original frame
                               columns=['economic aggregate level', 'per capita level']))
results
Now draw the Boston matrix. The point-labelling code in the scatter plot below is a well-tested snippet from our predecessors; feel free to use it as-is.
plt.figure(figsize=(10, 8))
# basic scatter plot
x, y = results['economic aggregate level'], results['per capita level']
label = results['area']
plt.scatter(x, y)
plt.xlabel('economic aggregate level'); plt.ylabel('per capita level')
# add a text label to each point in the scatter plot
## boilerplate code, no need to delve into it, just use it
## to label a point, unpack x, y and the label as in the code above
for a, b, l in zip(x, y, label):
    plt.text(a, b + 0.1, '%s' % l, ha='center', va='bottom', fontsize=14)
# add two reference lines at the means
plt.vlines(x=results['economic aggregate level'].mean(),
           ymin=-1.5, ymax=3, colors='red')
plt.hlines(y=results['per capita level'].mean(),
           xmin=-4, xmax=6, colors='red')
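If you prefer to read the quadrant membership off programmatically rather than from the plot, here is a small sketch that classifies each area relative to the two means (the quadrant names are ad hoc labels for illustration):
x_mean, y_mean = x.mean(), y.mean()
quadrant = np.where(x >= x_mean,
                    np.where(y >= y_mean, 'high aggregate / high per capita',
                             'high aggregate / low per capita'),
                    np.where(y >= y_mean, 'low aggregate / high per capita',
                             'low aggregate / low per capita'))
print(pd.DataFrame({'area': label, 'quadrant': quadrant}))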
Finally, as can be seen from the plot above:
Guangxi, Hebei, and Fujian are below average on both per capita level and economic aggregate level.
Shanghai's per capita level is very high, but its economic aggregate level is only slightly above average.
Guangdong's per capita level is slightly below average, but its economic aggregate level is very high.
That is all for "how to achieve data compression in Python". Thank you for reading. If you want to learn more, keep following the site; the editor will keep producing practical articles for you!