
How to do data preprocessing with Python


This article introduces how to do data preprocessing with Python. The content is quite detailed; interested readers can use it as a reference, and I hope it will be helpful to you.

Preface

Before a piece of data is ready for mining and modeling, you need to do a preliminary exploratory analysis of it (would you like to spend ten minutes systematically understanding data analysis methods?). After the exploratory analysis, a series of preprocessing steps is needed. Raw data contain incomplete, inconsistent, and abnormal values, and this "dirty" data can seriously hurt the efficiency of mining and modeling and even bias the results, so data cleaning comes first. After cleaning is complete, a series of processes such as data integration, transformation, and normalization follow; together these are called data preprocessing. On the one hand, preprocessing improves data quality; on the other, it adapts the data better to a specific mining model. In real work, this part may account for 70% or more of the whole job.

01 Missing Value Handling

Because of causes such as data-entry errors or storage damage, almost every dataset contains some missing values, so they need to be handled first. The general principle of missing value handling is to replace a missing value with the most probable value, so as to preserve as much of its relationship with the other values as possible. The specific common methods are as follows:

Delete missing values (in cases where missing values account for a small proportion)

Manual filling (small dataset, few missing values)

Fill with a global constant (e.g. replace missing values with a constant such as "null")

Fill with the mean or median of the sample data

Use interpolation (e.g. Lagrange or Newton interpolation)

Python missing value handling example code:

Detect missing values: isnull, notnull

These functions can be used to compute the proportion of missing values in the whole dataset; if the proportion is very small, the rows with missing values can simply be deleted.
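A minimal sketch of this step with pandas (the DataFrame below is made up for illustration, not from the original article):

import pandas as pd
import numpy as np

# Illustrative data, not from the original article
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [3000, 4500, np.nan, 5200]})

print(df.isnull())         # boolean mask marking each missing entry
print(df.isnull().mean())  # proportion of missing values per column

# If the proportion is very small, drop the affected rows
df_dropped = df.dropna()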

Fill or replace missing values: fillna

If missing values account for a large proportion, they cannot simply be deleted; instead they can be filled with the interpolation methods mentioned above.
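A minimal sketch of fill-based handling on the same kind of made-up data; fillna and interpolate cover the common cases:

import pandas as pd
import numpy as np

# Illustrative data, not from the original article
df = pd.DataFrame({"age": [25, np.nan, 31, 40],
                   "income": [3000, 4500, np.nan, 5200]})

# Fill with the column mean (df.median works the same way for the median)
df_mean = df.fillna(df.mean(numeric_only=True))

# Interpolation-based filling; pandas interpolates linearly by default,
# and scipy.interpolate.lagrange is one way to do Lagrange interpolation
df_interp = df.interpolate()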


02 Outlier Handling

Outliers are data points that deviate from the bulk of the dataset. Numerically, a value whose deviation from the mean exceeds twice the standard deviation is usually treated as an outlier, and a value whose deviation from the mean exceeds three times the standard deviation (the 3σ rule) is called a highly abnormal outlier.

Outlier detection methods:

3σ rule (applicable when the data follow a normal distribution)

Box plot analysis (values beyond the inner or outer limits).

The common treatment methods are as follows:

Delete directly (outliers account for a small proportion)

Keep them for the time being and analyze them together with the whole model.

Fill with statistics computed from the existing sample (the mean, etc.)

Python outlier handling example code:

1. Check whether the data follow a normal distribution; if so, detect and handle outliers according to the 3σ rule. The core code is as follows:
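A minimal sketch on generated data (the Shapiro-Wilk normality check via scipy is an assumption of this sketch, not from the original article):

import numpy as np
import pandas as pd
from scipy import stats

# Generated, roughly normal data for illustration
s = pd.Series(np.random.normal(loc=0.0, scale=1.0, size=1000))

# Optional normality check; the 3σ rule assumes a normal distribution
stat, p_value = stats.shapiro(s)

# Flag points farther than 3 standard deviations from the mean
mean, std = s.mean(), s.std()
outliers = s[(s - mean).abs() > 3 * std]
print(len(outliers), "outliers out of", len(s))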

2. When the data do not follow a normal distribution, box plot (IQR) analysis can be used instead. The core code is as follows:
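A minimal sketch of the box plot approach, using skewed generated data for illustration:

import numpy as np
import pandas as pd

# Generated, non-normal (right-skewed) data for illustration
s = pd.Series(np.random.exponential(scale=2.0, size=1000))

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Inner limits: quartiles -/+ 1.5 * IQR (outer limits use 3 * IQR instead)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

# One way to handle them: cap values at the limits instead of deleting
s_capped = s.clip(lower, upper)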

03 Data Standardization

Standardizing data means scaling it proportionally so that it falls into a small, specific range. It is often used when comparing and evaluating indicators: it removes the unit limits of the data and turns the values into dimensionless pure numbers, so that indicators of different units or orders of magnitude can be compared and weighted. The most typical example is normalization, which maps the data into the [0, 1] interval.

Common data standardization methods:

Min-max standardization: (x - x_min) / (x_max - x_min)

Z-score standardization: (x - x_mean) / x_std (both are shown in the sketch after this list)

Decimal scaling standardization

Vector normalization

Linear proportional transformation method

Average value method

Exponential conversion method
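A minimal sketch of the first two methods (scikit-learn's MinMaxScaler and StandardScaler implement the same transforms):

import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # made-up values

# Min-max standardization: maps the data into the [0, 1] interval
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean, unit standard deviation
x_zscore = (x - x.mean()) / x.std()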

The purpose of normalization:

So that the preprocessed data is limited to a certain range.

Eliminate the adverse effects caused by extreme ("singular") sample values

In Andrew Ng's video course I heard that normalization also speeds up the convergence of gradient descent.

Application scenario description:

Optimization-based models such as SVM and linear regression need normalization; whether to normalize mainly depends on whether you care about the scale of the variables.

Neural networks need standardization; variable values are generally kept between -1 and 1 to prevent variables with large values from dominating the model. In the hidden layers of a neural network, the tanh activation function generally works better than the sigmoid activation function, because the tanh (hyperbolic tangent) function has a mean of 0.

In the K-nearest-neighbor algorithm, if the explanatory variables are not standardized, variables with small orders of magnitude have almost no influence on the distance calculation.

Note: there is no standardization method that improves accuracy and speeds up convergence for every problem and every model, so different problems may call for different normalization methods. In classification and clustering algorithms, when distance is used to measure similarity or when PCA is used for dimensionality reduction, Z-score standardization performs better.

04 Discretization of Continuous Attributes

Some data mining algorithms, especially classification algorithms, require the data to be in the form of categorical attributes, so continuous attributes often need to be transformed into categorical ones, that is, discretized. Commonly used discretization methods:

Equal-width method: divide the attribute range into intervals of the same width; the number of intervals is determined by the characteristics of the data or specified by the user, similar to building a frequency distribution table (see the sketch after this list).

Equal-frequency method: put the same number of records into each interval.

Methods based on cluster analysis and others: binning discretization, histogram-analysis discretization, and discretization by clustering, decision trees, or correlation analysis, plus concept hierarchies for nominal data.
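A minimal sketch of the first two methods with pandas (the ages below are made up for illustration):

import pandas as pd

ages = pd.Series([5, 18, 23, 35, 41, 52, 67, 74])

# Equal-width method: pd.cut splits the value range into equally wide intervals
width_bins = pd.cut(ages, bins=4)

# Equal-frequency method: pd.qcut puts roughly the same number of records
# into each interval
freq_bins = pd.qcut(ages, q=4)

For the cluster-based approach, one option is to fit sklearn.cluster.KMeans on the values and use its labels as bins.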

The above records some general data preprocessing steps used in data analysis; each processing method can be implemented with NumPy, pandas, Matplotlib, and so on.

That is all for this share on how to do data preprocessing with Python. I hope the content above is helpful and that you can learn more from it. If you think the article is good, feel free to share it so that more people can see it.
