What are the four steps of data preprocessing?
I would like to share with you what the four steps of data preprocessing are. Most people don't know much about this topic, so I am sharing this article for your reference; I hope you learn a lot from reading it. Let's get started!
Data preprocessing refers to the necessary processing, such as auditing, screening, and sorting, applied to collected data before it is classified or grouped. On the one hand, preprocessing improves the quality of the data; on the other, it adapts the data to the software or methods used for analysis. Generally speaking, data preprocessing consists of four steps: data cleaning, data integration, data transformation, and data reduction, and each big step contains smaller details. Of course, not all four steps have to be performed in every preprocessing job.
I. Data cleaning
Data cleaning, as the name implies, turns "dirty" data into "clean" data. Data can be dirty in form or in content:
Dirty in form, such as missing values or special symbols;
Dirty in content, such as outliers.
1. Missing values
Handling missing values involves two tasks: identifying them and processing them.
In R, missing values are identified with the function is.na, and the function complete.cases reports whether each sample row is complete.
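A minimal sketch of both functions, using a small illustrative data frame:

```r
# A small data frame with missing values (illustrative)
df <- data.frame(height = c(170, NA, 165, 180),
                 weight = c(65, 70, NA, 80))

is.na(df)                 # logical matrix: TRUE marks each missing cell
complete.cases(df)        # TRUE for rows with no missing values
sum(!complete.cases(df))  # number of incomplete rows
```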
The common methods for dealing with missing values are deletion, replacement, and interpolation.
Deletion method: depending on what is deleted, this divides into deleting observation samples and deleting variables. Deleting observation samples (the row deletion method) trades a smaller sample size for completeness of information; in R, the na.omit function deletes the rows that contain missing values. When a variable has missing values but little impact on the research goal, consider deleting the variable instead: in R this is written mydata[, -p], where mydata is the name of the data set, p is the column index of the variable to delete, and the minus sign indicates deletion. A short sketch follows.
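A sketch of both deletion approaches, using an illustrative data set named mydata:

```r
mydata <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA), z = c(7, 8, 9))

# Row deletion: drop every observation that contains a missing value
na.omit(mydata)

# Variable deletion: mydata[, -p] drops column p
p <- 2            # column index of the variable to delete
mydata[, -p]
```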
Replacement method: as its name implies, this method replaces the missing value, with different rules for different variable types. When the variable with missing values is numeric, the missing value is replaced by the mean of the other observations of that variable; when the variable is non-numeric, it is replaced by the median or mode of the other observations.
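A sketch of both replacement rules; the data frame is illustrative, and the mode is used for the non-numeric variable:

```r
df <- data.frame(age  = c(25, NA, 31, 40),
                 city = c("A", "B", NA, "B"))

# Numeric variable: replace missing values with the mean of the rest
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

# Non-numeric variable: replace with the mode (most frequent value)
mode_city <- names(which.max(table(df$city)))
df$city[is.na(df$city)] <- mode_city
```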
Interpolation: this method divides into regression interpolation and multiple interpolation.
Regression interpolation treats the variable to be interpolated as the dependent variable y and the other variables as independent variables, fits a regression model, and predicts the missing values from the fit; in R, the lm function fits the regression.
Multiple interpolation generates several complete data sets from a data set containing missing values, repeating the imputation to reflect the randomness of the missing values. In R, the mice package performs multiple interpolation.
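A sketch of both kinds of interpolation; mice is a real CRAN package, but the tiny data frame here is only for illustration:

```r
df <- data.frame(x = 1:8,
                 y = c(2.1, 3.9, 6.2, NA, 10.1, 11.8, NA, 16.0))

# Regression interpolation: fit y on x using the complete rows
# (lm drops rows with NA by default), then predict the missing y
fit  <- lm(y ~ x, data = df)
miss <- is.na(df$y)
df$y[miss] <- predict(fit, newdata = df[miss, , drop = FALSE])

# Multiple interpolation: mice generates several complete data sets
library(mice)
df2 <- data.frame(x = 1:8,
                  y = c(2.1, 3.9, 6.2, NA, 10.1, 11.8, NA, 16.0))
imp <- mice(df2, m = 5, printFlag = FALSE)  # five imputed data sets
complete(imp, 1)                            # extract the first one
```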
2. Outliers
Like missing values, handling outliers involves identification and treatment.
Outliers are usually identified with a univariate scatter chart or a box plot. In R, the dotchart function draws a univariate scatter chart and the boxplot function draws a box plot; in either graph, points far from the normal range are regarded as outliers.
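A minimal sketch of both charts on a variable with one injected outlier:

```r
# A variable with one suspicious value (illustrative)
x <- c(9.8, 10.1, 9.9, 10.4, 10.0, 25.0, 10.2, 9.7, 10.3, 9.9)

dotchart(x)           # univariate scatter chart: 25 sits far to the right
boxplot(x)            # box plot: 25 is drawn beyond the whisker
boxplot.stats(x)$out  # the values the box plot flags as outliers
```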
Treatments for outliers include deleting the observations that contain them (direct deletion; with a small sample this causes an insufficient sample size and changes the distribution of the variables), treating them as missing values (so the existing information can be used to fill them in), average correction (replacing the outlier with the mean of the two neighbouring observations), and leaving them alone. Before treating an outlier, first review its possible causes and then judge whether it should be discarded. A sketch of two of these treatments follows.
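A sketch of treating an outlier as a missing value and of average correction, reusing the illustrative vector from above:

```r
x <- c(9.8, 10.1, 9.9, 10.4, 10.0, 25.0, 10.2, 9.7, 10.3, 9.9)

# Treat the outlier as a missing value, so the missing-value methods
# from section I can be reused
y <- x
y[y %in% boxplot.stats(y)$out] <- NA

# Average correction: replace the outlier with the mean of the two
# neighbouring observations
i <- which(x %in% boxplot.stats(x)$out)
x[i] <- (x[i - 1] + x[i + 1]) / 2
```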
II. Data integration
Data integration merges multiple data sources into one data store; of course, if the data to be analyzed is already in a single data store, there is no need for integration (many-into-one).
In R, two data frames are integrated on a key with the merge function: merge(dataframe1, dataframe2, by = "keyword"). The result is arranged in ascending key order by default.
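A minimal sketch of merge with two illustrative data frames keyed on id:

```r
orders    <- data.frame(id = c(3, 1, 2), amount = c(5, 10, 25))
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bo", "Cy"))

# Merge on the key column; rows are matched by id and the result
# is sorted in ascending id order by default
merge(orders, customers, by = "id")
```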
The following problems may occur during data integration:
Homonyms: an attribute in data source A has the same name as an attribute in data source B, but the two represent different entities, so the attribute cannot be used as a key;
Synonyms: the two data sources use different attribute names for the same entity; such an attribute can be used as a key.
Data integration often produces data redundancy: the same attribute may appear multiple times, or inconsistent naming may cause an attribute to be duplicated under different names. Duplicate attributes should first be detected with correlation analysis and then deleted if found.
III. Data transformation
Data transformation converts the data into a form appropriate for the software or analysis theory being used.
1. Simple function transformation
Simple function transformations, such as squaring, taking square roots, taking logarithms, and differencing, are used to turn data without a normal distribution into data closer to a normal distribution. For example, in time series analysis a logarithm or difference operation is often applied to transform a non-stationary series into a stationary one.
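A sketch of the time-series case, using an illustrative exponentially growing series:

```r
# A non-stationary series with exponential growth (illustrative)
y <- c(100, 120, 150, 185, 230, 290, 360)

log_y  <- log(y)       # the logarithm compresses the growth
diff_y <- diff(log_y)  # differencing the log series yields roughly
                       # constant period-over-period growth rates
```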
2. Standardization
Standardization eliminates the influence of the variables' units of measurement. For example, height and weight cannot be compared directly, because their units and value ranges differ. Three common methods are listed below, followed by a short R sketch.
Min-max normalization: also known as deviation standardization; it transforms the data linearly and maps the range to [0, 1];
Zero-mean normalization: also known as standard deviation standardization; the processed data have mean 0 and standard deviation 1;
Decimal scaling normalization: moves the decimal point of the attribute values, mapping them into [-1, 1].
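A sketch of all three methods on one illustrative vector:

```r
x <- c(150, 160, 170, 180, 190)   # e.g. heights in cm

# Min-max (deviation) normalization: rescale to [0, 1]
mm <- (x - min(x)) / (max(x) - min(x))

# Zero-mean (standard deviation) standardization: mean 0, sd 1
z <- scale(x)                     # or (x - mean(x)) / sd(x)

# Decimal scaling: divide by 10^k so all values fall in [-1, 1]
k <- ceiling(log10(max(abs(x))))
d <- x / 10^k
```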
3. Discretization of continuous attributes
Discretization of a continuous attribute transforms a continuous variable into a categorical one. Some classification algorithms, such as ID3, require the data to be categorical.
The commonly used discretization methods are as follows (a sketch follows the list):
Equal width method: divide the range of the attribute into intervals of the same width, similar to constructing a frequency distribution table;
Equal frequency method: choose interval boundaries so that each interval holds the same number of records;
One-dimensional clustering: two steps; first run a clustering algorithm on the values of the continuous attribute, then merge the values in each cluster into one interval and give them the same label.
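A sketch of the three methods; k-means stands in for the clustering algorithm, which the text does not fix:

```r
x <- c(1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 8.9, 9.4, 10.0)

# Equal width: cut the range into 3 intervals of equal width
equal_width <- cut(x, breaks = 3)

# Equal frequency: boundaries at the quantiles, so each interval
# holds (roughly) the same number of records
equal_freq <- cut(x,
                  breaks = quantile(x, probs = seq(0, 1, by = 1/3)),
                  include.lowest = TRUE)

# One-dimensional clustering: k-means on the values, then the
# cluster label becomes the discrete category
km <- kmeans(x, centers = 3)
clustered <- factor(km$cluster)
```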
IV. Data reduction
Data reduction means finding, on the basis of an understanding of the mining task and of the data itself, the features that are useful for the discovery target, so that the size of the data set is reduced while its original character is preserved as far as possible.
Data reduction lessens the impact of invalid and erroneous data on modeling, shortens computation time, and reduces the space needed to store the data.
1. Attribute reduction
Attribute reduction looks for the smallest subset of attributes whose probability distribution is close to that of the original data. Common approaches:
Merge attributes: merge some old attributes into a new one
Stepwise forward selection: start with an empty attribute set; repeatedly select the best attribute remaining in the original attribute set and add it to the subset, until no best attribute can be selected or a constraint is satisfied;
Stepwise backward elimination: start with the full attribute set; repeatedly select the current worst attribute and eliminate it from the subset, until no worst attribute can be selected or a constraint is satisfied;
Decision tree induction: build a decision tree on the data; attributes that do not appear in the tree are deleted from the initial set, leaving a better subset of attributes;
Principal component analysis: use fewer, mutually uncorrelated variables to explain most of the variance in the original data (correlated original variables are transformed into independent or uncorrelated components). A PCA sketch follows the list.
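A minimal PCA sketch using R's built-in mtcars data set for illustration:

```r
# Principal component analysis on standardized variables
pca <- prcomp(mtcars, scale. = TRUE)

summary(pca)             # proportion of variance explained per component
reduced <- pca$x[, 1:3]  # keep the first components, which explain
                         # most of the variance
```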
2. Numerosity reduction
Numerosity reduction shrinks the amount of data by replacing it with a smaller representation. The methods are parametric or non-parametric: parametric methods include linear regression and multiple regression; non-parametric methods include histograms, sampling, and so on.
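A sketch of one parametric and two non-parametric reductions on illustrative data:

```r
set.seed(1)
x <- rnorm(10000)        # a large numeric variable (illustrative)

# Non-parametric: a histogram summarizes the data with bin counts
h <- hist(x, breaks = 30, plot = FALSE)
h$counts                 # the reduced representation

# Non-parametric: simple random sampling without replacement
sampled <- sample(x, size = 500)

# Parametric: a linear regression keeps only the fitted coefficients
df <- data.frame(a = rnorm(1000))
df$b <- 2 * df$a + rnorm(1000)
coef(lm(b ~ a, data = df))   # two numbers stand in for 1,000 points
```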
That is all of "What are the four steps of data preprocessing?" Thank you for reading! I hope this article has given you a clearer understanding of the topic.