R Language Data Mining Practice Series (4) -- Data Preprocessing
Data preprocessing serves two purposes: on the one hand it improves the quality of the data, and on the other hand it makes the data better suited to specific mining techniques or tools. Its main tasks are data cleaning, data integration, data transformation and data reduction.
I. Data cleaning
1. Missing value processing
Handling missing values generally involves two steps: identifying the missing data and then treating it. In R, a missing value is represented as NA. The function is.na() tests whether values are missing, and complete.cases() indicates whether each sample (row) is complete. Once missing values have been identified they must be treated; the common approaches are deletion, replacement and imputation.
(1) Deletion method
Depending on the angle from which the data are handled, deletion can target either observations or variables. Deleting observations, also called row deletion, removes every row that contains missing data; in R this is done with na.omit(). It trades sample size for completeness of information and is suitable when the proportion of missing values is small. Deleting a variable is appropriate when the variable has many missing values and little bearing on the research goal; the whole variable can then be dropped with data[, -p], where data is the target data set and p is the column index of the variable containing the missing values.
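A minimal sketch of identification and the two deletion approaches; the small data frame and the column index below are made up for illustration:

# Hypothetical data frame with missing values
data <- data.frame(x = c(1, 2, NA, 4), y = c("a", NA, "c", "d"))

is.na(data$x)               # TRUE where a value of x is missing
complete.cases(data)        # TRUE for rows that contain no missing values

clean_rows <- na.omit(data) # row deletion: drop every row containing NA
clean_cols <- data[, -2]    # variable deletion: drop column 2 (the variable with missing values)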
(2) Replacement method
Variables can be numeric or non-numeric, and the two types are handled differently: for a numeric variable, the missing value is usually replaced by the mean of that variable over all other objects; for a non-numeric variable, it is replaced by the median or mode of all other valid observations of that variable.
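A minimal sketch of the two replacement rules, using made-up vectors:

# Numeric variable: replace NA with the mean of the remaining values
num <- c(3.2, NA, 5.1, 4.8)
num[is.na(num)] <- mean(num, na.rm = TRUE)

# Non-numeric variable: replace NA with the mode of the valid observations
colr <- c("red", "blue", NA, "red")
mode_val <- names(which.max(table(colr)))
colr[is.na(colr)] <- mode_val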
(3) Imputation method
Common imputation methods include regression imputation, multiple imputation and others. Regression imputation builds a regression model in which the variable to be filled is the dependent variable and related variables are the independent variables; the fitted model (lm() in R) then predicts the missing values. Multiple imputation generates several complete data sets from the data set containing missing values, i.e. it draws random samples for the missing entries; the mice package in R implements it.
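A minimal sketch of both ideas; the data frame df and its columns x and y are hypothetical, and the mice package must be installed for the commented-out part:

df <- data.frame(x = 1:5, y = c(2.1, 4.2, NA, 8.1, NA))

# Regression imputation: predict the missing y values from x with lm()
fit <- lm(y ~ x, data = df)
miss <- is.na(df$y)
df$y[miss] <- predict(fit, newdata = df[miss, ])

# Multiple imputation with the mice package
# library(mice)
# imp <- mice(df, m = 5)          # generate five imputed data sets
# completed <- complete(imp, 1)   # extract the first completed data set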
2. Outlier handling
Outliers must be identified before they can be treated; univariate scatter plots or box plots are generally used for this. In R, dotchart() draws a univariate scatter plot and boxplot() draws a box plot.
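For example (the data below are simulated, with one value deliberately placed far from the rest):

x <- c(rnorm(50, mean = 10, sd = 2), 35)   # simulated data plus one obvious outlier

dotchart(x, main = "Univariate scatter plot")
bp <- boxplot(x, main = "Box plot")
bp$out                                     # values flagged as outliers by the box-plot rule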
Table 4-1 Common methods for handling outliers
Delete records containing outliers: remove the whole record from the data set.
Treat as missing values: regard the outlier as a missing value and apply the missing-value handling methods.
Average correction: replace the outlier with the average of the two neighbouring observations.
Do not process: mine and model directly on the data set that contains the outliers.
In many cases the possible cause of an outlier should be analysed first, and only then is it decided whether the outlier should be discarded; if the value turns out to be correct, mining and modelling can be carried out directly on the data set that contains it.
II. Data integration
Data integration is the process of merging multiple data sources into a consistent data store.
In R, data integration means merging the data stored in two data frames, row by row, on the basis of key columns; this is done with the function merge(), whose basic form is merge(data frame 1, data frame 2). The merged data are automatically sorted in ascending order of the key values.
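A minimal sketch of merge(); the two data frames and their key column id are invented for illustration:

orders    <- data.frame(id = c(3, 1, 2), amount = c(30, 10, 20))
customers <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))

merged <- merge(orders, customers)   # joins the two data frames on the common key column "id"
merged                               # rows are returned in ascending order of the key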
Entity identification
Entity identification recognises real-world entities across different data sources; its task is to reconcile contradictions between sources, such as different names for the same entity, the same name for different entities, and inconsistent units.
Redundant attribute recognition
Data integration often introduces redundancy, for example when the same attribute appears several times or when inconsistent naming of the same attribute leads to duplication.
Some redundant attributes can be detected by correlation analysis: given two numeric attributes A and B, the correlation coefficient computed from their values measures how strongly one attribute implies the other.
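A minimal sketch with two made-up attributes, one an almost exact linear copy of the other:

a <- c(1, 2, 3, 4, 5)
b <- 2 * a + rnorm(5, sd = 0.01)   # b is nearly a linear function of a

cor(a, b)                          # close to 1, so one of the two attributes is likely redundant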
III. Data transformation
Data transformation mainly deals with the standardization of data, the discretization of continuous variables and the construction of variable attributes to convert the data into "appropriate" form to meet the needs of mining tasks and algorithms.
Simple function transformation
Simple function transformation applies a mathematical function to the original data, such as squaring, taking square roots, taking logarithms or differencing. It is often used to turn data that do not follow a normal distribution into data that do.
Standardization
In order to eliminate the influence of differences in dimensions and value ranges between indicators, it is necessary to standardize the data and scale the data in proportion to make them fall into a specific area, which is convenient for comprehensive analysis.
Data normalization is particularly important for distance-based mining algorithms.
(1) Min-max normalization: also called dispersion normalization, it is a linear transformation of the original data that maps the values into [0, 1]. Its drawback is that if the data set contains one very large value, the normalized values of the other data cluster near 0 and differ little; and if a future value falls outside the current range [min, max], the method breaks down and min and max have to be determined anew.
(2) Zero-mean normalization: also called standard-deviation standardization; after processing, the data have mean 0 and standard deviation 1. It is currently the most widely used standardization method, but the mean and standard deviation are strongly affected by outliers, so they are often modified: the median M replaces the mean, and the absolute standard deviation replaces the standard deviation.
(3) Decimal scaling normalization: the decimal point of the attribute values is shifted so that the values are mapped into [-1, 1]; the number of places shifted depends on the maximum absolute value of the attribute.
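A minimal sketch of the three methods on an arbitrary vector:

x <- c(78, 521, 602, 2863)

min_max <- (x - min(x)) / (max(x) - min(x))   # min-max normalization: values fall in [0, 1]
z_score <- scale(x)                           # zero-mean normalization: mean 0, standard deviation 1
k       <- ceiling(log10(max(abs(x))))        # decimal scaling: number of decimal places to shift
decimal <- x / 10^k                           # values fall in [-1, 1]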
Discretization of continuous attributes
(1) The process of discretization
The discretization of continuous attributes is to set several discrete points within the value range of the data, divide the value range into some discretized intervals, and finally use different symbols or integer values to represent the data values that fall in each sub-interval. Therefore, discretization involves two subtasks: determining the number of classifications and how to map continuous attribute values to these classification values.
(2) Commonly used discretization methods
The commonly used discretization methods are equal width method, equal frequency method and (one-dimensional) clustering.
Equal width method: divide the range of attributes into intervals with the same width, and the number of intervals is determined by the characteristics of the data itself, or specified by the user, similar to making a frequency distribution table.
Equal frequency method: put the same number of records into each interval. The disadvantage is that it is sensitive to outliers and tends to distribute attribute values unevenly to each interval.
(One-dimensional) clustering: two steps are involved. First the values of the continuous attribute are clustered with a clustering algorithm; then the resulting clusters are processed, and the values falling in the same cluster are merged and given a single label. This discretization method also requires the user to specify the number of clusters, which determines the number of intervals (the three methods are sketched below).
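A minimal sketch of the three discretization methods; the data vector and the choice of 4 intervals are arbitrary:

x <- c(1, 3, 5, 7, 20, 21, 22, 50, 51, 100)
k <- 4

equal_width <- cut(x, breaks = k)                      # equal-width intervals
equal_freq  <- cut(x, breaks = quantile(x, probs = seq(0, 1, 1 / k)),
                   include.lowest = TRUE)              # equal-frequency intervals
clusters    <- kmeans(x, centers = k)$cluster          # one-dimensional clustering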
Attribute construction
In order to extract more useful information, mine deeper patterns and improve the accuracy of mining results, it is necessary to use the existing attribute set to construct new attributes and add them to the existing attribute set.
Wavelet transform
Wavelet transform is a relatively new data analysis tool. It is multi-resolution in nature and can represent the local characteristics of a signal in both the time domain and the frequency domain; it provides a time-frequency analysis method for non-stationary signals, allowing the signal to be observed progressively from coarse to fine and useful information to be extracted.
The characteristic quantity that can describe a problem is often hidden in one or some components of a signal. Wavelet transform can decompose the non-stationary signal into data sequences that express different levels and different frequency bands, namely wavelet coefficients. The appropriate wavelet coefficients are selected to complete the feature extraction of the signal.
(1) Feature extraction methods based on wavelet transform
Feature extraction methods based on wavelet transform include multi-scale energy distribution feature extraction, modulus-maxima feature extraction, feature extraction based on the wavelet packet transform, and feature extraction based on adaptive wavelet neural networks.
Table 4-2 Feature extraction based on wavelet transform
Multi-scale spatial energy distribution feature extraction based on wavelet transform: the smooth and detail signals in each scale space supply time-frequency local information about the original signal, in particular the composition of the signal in different frequency bands. The energies of the signal on the different decomposition scales are computed and arranged in order to form a feature vector for recognition.
Modulus-maxima feature extraction based on wavelet transform: the localization ability of the wavelet transform in multi-scale space is used; the modulus maxima of the wavelet transform are computed to detect local singularities of the signal, and the scale parameter s, the translation parameter t and the amplitude of each modulus maximum are taken as the target's feature values.
Feature extraction based on the wavelet packet transform: a random signal sequence in the time domain can be mapped into random coefficient sequences in the subspaces of the scale domain, and the coefficient sequence in the optimal subspace obtained by wavelet packet decomposition has the lowest uncertainty. The entropy of the optimal subspace and its position parameters in the complete binary tree are taken as feature quantities and can be used for target recognition.
Feature extraction based on adaptive wavelet neural networks: the features of the signal are extracted through an analytic-wavelet fitting representation of the signal.
(2) Wavelet basis function
A wavelet basis function is a function with local support whose mean value is 0, i.e. it satisfies Ψ(0) = ∫ψ(t)dt = 0. Commonly used wavelet bases include the Haar basis, the db-series bases, and so on.
(3) Wavelet transform
(4) Feature extraction of multi-scale spatial energy distribution based on wavelet transform
Wavelet analysis can be used to extract features of a signal in each frequency band: the multi-scale spatial energy distribution method decomposes the signal into frequency bands by the wavelet transform and then uses the energy computed in each band as the feature vector.
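A minimal sketch under the assumption that the waveslim package is installed; the simulated signal, the "haar" wavelet and the 4 decomposition levels are arbitrary choices:

library(waveslim)

signal <- sin(seq(0, 8 * pi, length.out = 256)) + rnorm(256, sd = 0.1)
coeffs <- dwt(signal, wf = "haar", n.levels = 4)        # wavelet decomposition into sub-bands
energy <- sapply(coeffs, function(band) sum(band^2))    # energy of each sub-band
energy                                                  # feature vector for recognition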
IV. Data reduction
Data reduction produces a smaller new data set that preserves the integrity of the original data. Its significance lies in reducing the impact of invalid and erroneous data on modelling and thus improving accuracy; a small amount of representative data greatly shortens the time needed for mining; and it lowers the cost of storing the data.
Attribute reduction
Attribute reduction creates new attribute dimensions by merging attributes, or reduces the dimensionality directly by deleting irrelevant attributes (dimensions), so as to improve mining efficiency and cut computational cost. The goal of attribute reduction is to find the smallest attribute subset whose probability distribution is as close as possible to that of the original data set. Common attribute reduction methods are:
Merge attributes: combine some old attributes into new attributes
Step-by-step (forward) selection: starting from an empty attribute set, each step selects the current best attribute from the original set and adds it to the subset, until no better attribute can be selected or a threshold constraint is met.
Step-by-step (backward) deletion: starting from the full attribute set, each step selects the worst attribute in the current subset and removes it, until no worse attribute can be selected or a threshold constraint is met.
Decision tree induction: an initial decision tree is learned from the initial data by classification; every attribute that does not appear in the tree can be regarded as irrelevant, so deleting these attributes from the initial set yields a better attribute subset.
Principal component analysis: explain most of the variance in the original data with fewer variables, i.e. transform many highly correlated variables into independent or uncorrelated ones (see the sketch after this list).
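A minimal sketch of attribute reduction with principal component analysis, using the built-in iris data as stand-in data:

pca <- princomp(iris[, 1:4], cor = TRUE)   # PCA on the four numeric attributes
summary(pca)                               # proportion of variance explained by each component
reduced <- pca$scores[, 1:2]               # keep the first two components as the new attributes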
Numerosity reduction
Numerosity reduction lowers the data volume by choosing alternative, smaller representations of the data; it includes parametric and non-parametric methods. Parametric methods fit a model to the data, so only the model parameters need to be stored rather than the actual data; examples are regression (linear regression and multiple regression) and log-linear models (which approximate multi-dimensional probability distributions on discrete attribute sets). Non-parametric methods do store actual data, for example histograms, clustering and sampling.
(1) Histogram
The histogram of attribute A partitions the data distribution of A into disjoint subsets, or buckets. If each bucket represents only a single attribute value/frequency pair it is called a singleton bucket; usually a bucket represents a continuous interval of the given attribute. In R, histograms are drawn with hist() to show the distribution of a variable's values.
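A minimal sketch with simulated attribute values:

x <- rexp(200, rate = 0.1)                       # simulated attribute values
h <- hist(x, breaks = 10, main = "Histogram of x")
h$counts                                         # frequency stored in each bucket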
(2) Clustering
Clustering treats data tuples (records, i.e. rows of a data table) as objects. It partitions the objects into clusters so that objects within a cluster are "similar" to each other and "dissimilar" from objects in other clusters. In data reduction the actual data are replaced by the clusters. The effectiveness of this technique depends on how well the cluster structure matches the distribution of the data. Commonly used clustering functions in R are hclust(), for hierarchical (systematic) clustering, and kmeans(), for fast clustering.
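A minimal sketch using the built-in iris data as stand-in records; the choice of 3 clusters is arbitrary:

d   <- dist(iris[, 1:4])                 # distance matrix between records
hc  <- hclust(d)                         # hierarchical (systematic) clustering
grp <- cutree(hc, k = 3)                 # assign each record to one of 3 clusters
centers <- aggregate(iris[, 1:4], by = list(cluster = grp), FUN = mean)
centers                                  # cluster means stand in for the raw records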
(3) Sampling
Sampling is also a data reduction technique: a random sample (subset) much smaller than the original data is used to represent the original data set. Sampling types include simple random sampling with replacement, simple random sampling without replacement, cluster sampling, stratified sampling and so on. In data reduction, sampling is most often used to estimate the answer to an aggregate query; within a specified error bound, the sample size s needed to estimate a given function can be determined (using the central limit theorem), and s is usually very small relative to the data size N.
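A minimal sketch of the common schemes, again using iris as stand-in data; the sample sizes are arbitrary:

N <- nrow(iris); s <- 15

srs_wor <- iris[sample(N, s), ]                   # simple random sampling without replacement
srs_wr  <- iris[sample(N, s, replace = TRUE), ]   # simple random sampling with replacement
strat   <- do.call(rbind, lapply(split(iris, iris$Species),
                   function(g) g[sample(nrow(g), 5), ]))   # stratified sampling: 5 records per stratum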
(4) Parametric regression
Simple linear models and log-linear models can be used to approximate given data. A (simple) linear model fits the data to a straight line; in R this is done with the function lm().
Log-linear model: describes the relationship between expected frequencies and covariates (variables that are linearly related to the dependent variable and are controlled by statistical means when the relationship between the independent and dependent variables is studied). Log-linear models are generally used to approximate discrete multi-dimensional probability distributions.
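A minimal sketch of both parametric models; the regression data are simulated, and the log-linear example uses the HairEyeColor table that ships with R:

x <- 1:20
y <- 3 + 2 * x + rnorm(20)
fit <- lm(y ~ x)           # the fitted intercept and slope stand in for the raw data
coef(fit)

ll <- loglin(HairEyeColor, margin = list(c(1, 2), c(1, 3), c(2, 3)), fit = TRUE)
ll$fit                     # approximated cell frequencies of the multi-way table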
V. Main data preprocessing functions in R
Table 4-3 Main R data preprocessing functions
lm(): builds a linear regression model from dependent and independent variables (general package).
predict(): predicts new data from an existing model (general package).
mice(): performs multiple imputation of missing data (mice package).
which(): returns the positions of the observations that satisfy a condition (general package).
scale(): performs zero-mean normalization of the data (general package).
rnorm(): randomly generates a column of normally distributed numbers (general package).
ceiling(): rounds up to the nearest integer (general package).
kmeans(): performs fast (k-means) cluster analysis of the data (general package).
dwt(): performs wavelet decomposition of the data (waveslim package).
princomp(): performs principal component analysis on a matrix of indicator variables (general package).