This article walks through how to achieve data compression in Python. Many people run into exactly these difficulties in real projects, so follow along and learn how to handle them. Read carefully, and I hope you come away with something!
Preface
In the previous article, we introduced the principles of principal component analysis in detail and put them into practice with a Python case on customer credit rating.
In that article, we pointed out three common application scenarios for principal component analysis, one of which is "data description". Take describing a product portfolio as an example: the famous Boston matrix, the business development of subsidiaries, regional investment potential, and so on all require many variables to be compressed into a few principal components. Compressing into exactly two principal components is ideal, because the result can then be shown in a single chart.
However, plain principal component analysis is generally not adequate for this kind of description; factor analysis does it better. The theory behind factor analysis is extensive and rather involved, so this article skips the principles and works straight through a hands-on PCA case as a transition from principal component analysis to factor analysis. There are two goals:
To estimate the meaning of the generated principal components from the results of the principal component analysis.
To bring out the advantages of factor analysis and why it is worth learning.
Requirement description
Your boss wants you, the data analyst, to summarize the economic phenomena reflected in the following dataset in just two short sentences.
Even a few long sentences may not capture the value of the dataset well, let alone two highly condensed short ones. Nine indicators are already a headache; what if the table were wider, with 20 or 30 variables?
Python in practice
In this section, we will use Python to analyze the data described above.
Data exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-whitegrid')
plt.rc('font', **{'family': 'Microsoft YaHei, SimHei'})  # set support for Chinese fonts
plt.rcParams['axes.unicode_minus'] = False  # fix the minus sign '-' rendering as a square in saved images
sns.set(font='SimHei')  # fix Chinese display in seaborn
df = pd.read_csv('urban economy.csv')
df
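Before looking at correlations, it helps to confirm the column types and value ranges. A quick sketch (assuming the frame contains an 'area' column plus nine numeric indicators, as above):
df.info()  # column types and missing-value counts
df.describe().round(2)  # value ranges; note how different the scales are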
Before doing principal component analysis, we should explore the correlations between the variables; after all, if the variables were independent, there would be nothing to compress.
plt.figure(figsize=(8, 6))
sns.heatmap(data=df.corr(), annot=True)  # annot=True: display the correlation values
The correlations between the variables turn out to be high, so compressing them is worthwhile.
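To make "high correlation" concrete, you can list the variable pairs whose absolute correlation exceeds some threshold; a minimal sketch (the 0.8 cutoff is an arbitrary choice for illustration):
corr = df.drop(columns='area').corr()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # keep each pair once, skip the diagonal
high = corr.abs().where(mask).stack().sort_values(ascending=False)
print(high[high > 0.8])  # pairs with |r| > 0.8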
PCA modeling
Data standardization
We use z-score standardization, transforming each variable to zero mean and unit variance, so that differences in scale do not distort the compression.
from sklearn.preprocessing import scale
data = df.drop(columns='area')  # drop the non-numeric category variable
data = scale(data)
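For reference, scale performs the same centering and scaling you could write by hand; a sketch of the equivalent z-score computation (note that sklearn uses the population standard deviation, i.e. ddof=0):
values = df.drop(columns='area').values
data_manual = (values - values.mean(axis=0)) / values.std(axis=0)  # z-scores, same result as scale(values)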
Preliminary modeling
Note that on the first pass it is best to set the n_components parameter (the number of principal components to retain) to a large value and observe explained_variance_ratio_, i.e. the fraction of the original variance that each principal component explains.
from sklearn.decomposition import PCA
pca = PCA(n_components=9)  # start with as many components as there are variables
pca.fit(data)
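Before plotting, it is worth printing the raw ratios to see how quickly the explained variance decays; a quick sketch:
print(pca.explained_variance_ratio_.round(3))  # variance explained by each component
print(np.cumsum(pca.explained_variance_ratio_).round(3))  # running total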
Result analysis
Cumulative explained variance
plt.plot(np.cumsum(pca.explained_variance_ratio_), linewidth=3)
plt.xlabel('components')
plt.ylabel('cumulative explained variance'); plt.grid(True)
It can be seen that with two principal components the cumulative explained variance already exceeds 0.97 (0.85 is generally considered sufficient), which means we only need to keep two principal components.
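As an aside, scikit-learn can pick the number of components for you: passing a float between 0 and 1 as n_components keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch using the 0.85 threshold mentioned above:
pca_auto = PCA(n_components=0.85)  # keep enough components to explain >= 85% of the variance
pca_auto.fit(data)
print(pca_auto.n_components_)  # the number of components sklearn chose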
Re-modeling
To sum up, two principal components are enough.
pca = PCA(n_components=2)  # this time keep only two principal components
pca.fit(data)
pca.explained_variance_ratio_
new_data = pca.fit_transform(data)  # fit_transform returns the dimension-reduced data
# compare the dataset sizes
print("original dataset size:", data.shape)
print("dataset size after dimension reduction:", new_data.shape)
You can see that nine variables are compressed into two principal components!
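To check how little information the compression throws away, you can map the two components back into the original nine-dimensional space with pca.inverse_transform and measure the reconstruction error; a short sketch:
restored = pca.inverse_transform(new_data)  # back to nine (standardized) dimensions
print("reconstruction MSE:", ((data - restored) ** 2).mean())  # small, since ~97% of the variance is kept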
Weights of the variables in the principal components
First, look at the coefficients linking the two principal components to the nine variables.
results = pd.DataFrame(pca.components_).T
results.columns = ['pca_1', 'pca_2']
results.index = df.drop(columns='area').columns
results
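A heatmap makes the coefficient table easier to scan at a glance; a small sketch reusing the results frame above:
plt.figure(figsize=(4, 6))
sns.heatmap(results, annot=True, cmap='coolwarm', center=0)  # variable weights on each component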
Two things stand out:
Principal component 1 is almost unaffected by the second variable, per capita GDP (coefficient 0.034), while the other variables all influence it to a similar degree.
Principal component 2 is influenced most by per capita GDP, with a coefficient as high as 0.94.
Result description
Through the PCA modeling above, we compressed nine variables into two principal components, and we know which variables drive each component. The raw components themselves mean little, but can we name the two components according to how the variables load on them?
The first principal component weights the aggregate economic indicators roughly equally, so it can be named the "economic aggregate level"; the second carries a high weight only on per capita GDP, so for now we can call it the "per capita level".
Note: naming the principal components here (and in subsequent posts on factor analysis) applies to the reduced-dimensional data, not to the loading vectors themselves; this is what makes the names useful for comparison and description. The weights of the variables on each component merely guide the choice of name; the name is actually attached to the compressed data.
new_data = pca.fit_transform(data)  # new_data is the dimension-reduced data
results = df.join(pd.DataFrame(new_data,  # join the reduced data onto the original frame
                               columns=['economic aggregate level', 'per capita level']))
results
Now draw the Boston matrix. The point-labelling code in the scatter plot below is a well-tested snippet from our predecessors; feel free to use it as-is.
plt.figure(figsize=(10, 8))
# basic scatter plot
x, y = results['economic aggregate level'], results['per capita level']
label = results['area']
plt.scatter(x, y)
plt.xlabel('economic aggregate level'); plt.ylabel('per capita level')
# add a text label to each point in the scatter plot
## boilerplate code, no need to delve into it, just use it
## to label a point, unpack x, y and the label as in the code above
for a, b, l in zip(x, y, label):
    plt.text(a, b + 0.1, '%s' % l, ha='center', va='bottom', fontsize=14)
# add two reference lines at the means
plt.vlines(x=results['economic aggregate level'].mean(),
           ymin=-1.5, ymax=3, colors='red')
plt.hlines(y=results['per capita level'].mean(),
           xmin=-4, xmax=6, colors='red')
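If you prefer to read the quadrant membership off programmatically rather than from the plot, here is a small sketch that classifies each area relative to the two means (the quadrant names are ad hoc labels for illustration):
x_mean, y_mean = x.mean(), y.mean()
quadrant = np.where(x >= x_mean,
                    np.where(y >= y_mean, 'high aggregate / high per capita',
                             'high aggregate / low per capita'),
                    np.where(y >= y_mean, 'low aggregate / high per capita',
                             'low aggregate / low per capita'))
print(pd.DataFrame({'area': label, 'quadrant': quadrant}))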
Finally, as can be seen from the plot above:
Guangxi, Hebei, and Fujian are below average on both per capita level and economic aggregate level.
Shanghai's per capita level is very high, but its economic aggregate level is only slightly above average.
Guangdong's per capita level is slightly below average, but its economic aggregate level is very high.
That is all for "how to achieve data compression in Python". Thank you for reading. If you want to learn more, keep following the site; the editor will keep producing practical articles for you!