This article introduces the main knowledge points of Python data analysis. In practical cases many people run into difficulties with these tasks, so let the editor walk you through how to handle them. I hope you read it carefully and get something out of it!
Data preprocessing
Data preprocessing serves two purposes: on the one hand, it improves the quality of the data; on the other hand, it makes the data better suited to specific mining techniques or tools. Statistics show that data preprocessing accounts for about 60% of the workload of a data mining project.
4.1. Data cleaning
Data cleaning mainly involves deleting irrelevant and duplicate data from the original data set, smoothing noisy data, filtering out data unrelated to the analysis topic, and handling missing values, outliers, and so on.
4.1.1. Missing value handling
Methods for dealing with missing values fall into three categories: deleting records, data interpolation, and no processing. The commonly used interpolation methods are listed in Table 4-1.
Table 4-1 commonly used interpolation methods
Interpolation method — Description
Mean / median / mode interpolation — Fill the missing value with the mean, median, or mode of the attribute, depending on the attribute type.
Fixed-value interpolation — Replace the missing attribute value with a constant. For example, the vacant "basic wage" attribute of ordinary migrant workers in a factory in Guangzhou can be filled with the 2015 Guangzhou salary standard for ordinary migrant workers, 1895 yuan per month.
Nearest-neighbour interpolation — Find the sample closest to the sample with the missing value among the existing records and fill in with that sample's attribute value.
Regression method — For the variable with missing values, build a fitting model based on the existing data and the data of other related variables, and use it to predict the missing attribute values.
Interpolation method — Build a suitable interpolation function f(x) from the known points; the unknown value at a point xi is approximated by the function value f(xi).
If the goal can be achieved simply by deleting a small number of records, then deleting the records with missing values is the most effective approach. However, this method has serious limitations: it trades historical data for completeness, wastes resources, and discards the information hidden in the deleted records. Especially when the data set already contains few records, deleting even a small number of them may seriously compromise the objectivity and correctness of the analysis results. Some models can treat a missing value as a special value and allow modeling directly on data that contains missing values.
This section focuses on Lagrange interpolation and Newton interpolation. Other interpolation methods include Hermite interpolation, piecewise interpolation, spline interpolation, and so on.
Newton interpolation is also polynomial interpolation, but it constructs the interpolating polynomial in a different way. Compared with Lagrange interpolation, Newton interpolation supports incremental construction (inheritance) and makes it easy to change nodes. In essence, the two give the same result (a polynomial of the same degree with the same coefficients), only in different forms. Python's SciPy library therefore provides only a Lagrange interpolation function (because it is easy to implement); if you need Newton interpolation, you have to write it yourself, as sketched below.
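Since SciPy does not provide a ready-made Newton interpolation routine, a minimal divided-difference sketch is given below purely as an illustration of how one could be written by hand; the helper name newton_interp and the sample points are hypothetical and not part of the book's listings.

import numpy as np

def newton_interp(x, y):
    # Build the Newton divided-difference coefficients and return an evaluator p(t).
    x = np.asarray(x, dtype=float)
    coef = np.asarray(y, dtype=float).copy()
    n = len(x)
    for j in range(1, n):
        coef[j:] = (coef[j:] - coef[j-1:-1]) / (x[j:] - x[:-j])  # j-th order divided differences
    def p(t):
        result = coef[-1]
        for k in range(n - 2, -1, -1):  # Horner-style evaluation of the Newton form
            result = result * (t - x[k]) + coef[k]
        return result
    return p

# Usage: interpolate at t = 2.5 from four known points of y = x**2.
f = newton_interp([1, 2, 3, 4], [1.0, 4.0, 9.0, 16.0])
print(f(2.5))  # 6.25

Because the coefficients are built incrementally, adding a new node only appends one more divided-difference term, which is exactly the inheritance property mentioned above.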
Listing 4-1. Interpolation by the Lagrange method
#-*- coding: utf-8 -*-
# Lagrange interpolation code
import pandas as pd  # import the data analysis library pandas
from scipy.interpolate import lagrange  # import the Lagrange interpolation function

inputfile = '../data/catering_sale.xls'  # sales data path
outputfile = '../tmp/sales.xls'  # output data path

data = pd.read_excel(inputfile)  # read the data
data[u'销量'][(data[u'销量'] < 400) | (data[u'销量'] > 5000)] = None  # filter outliers in the sales (销量) column and set them to null

# Custom column-vector interpolation function
# s is the column vector, n is the position to be interpolated, k is the number of points taken before and after (default 5)
def ployinterp_column(s, n, k=5):
    y = s[list(range(n-k, n)) + list(range(n+1, n+1+k))]  # take the neighbouring values
    y = y[y.notnull()]  # drop null values
    return lagrange(y.index, list(y))(n)  # interpolate and return the result

# Check element by element whether interpolation is needed
for i in data.columns:
    for j in range(len(data)):
        if (data[i].isnull())[j]:  # interpolate if the value is null
            data[i][j] = ployinterp_column(data[i], j)

data.to_excel(outputfile)  # output the result to a file

4.1.2. Outlier handling
In data preprocessing, whether outliers are removed depends on the specific situation, because some outliers may contain useful information. The common methods for handling outliers are shown in Table 4-3.
Table 4-3 Common methods for handling outliers
Outlier handling method — Description
Delete records containing outliers — Directly delete the records that contain outliers.
Treat as missing values — Regard outliers as missing values and handle them with missing-value processing methods.
Average correction — Correct the outlier with the average of the two observations immediately before and after it.
No processing — Mine and model directly on the data set that contains the outliers.

4.2. Data integration
The data needed by data mining is often distributed in different data sources. Data integration is the process of merging multiple data sources into a consistent data store (such as a data warehouse).
In data integration, the same real-world entity may be expressed differently in different data sources, and the expressions may not match. Entity identification and attribute redundancy must be considered so that the source data can be transformed, refined, and integrated at the lowest level.
4.2.1. Entity identification
Entity identification refers to recognizing real-world entities across different data sources; its task is to reconcile the conflicts between those sources. Common forms of conflict are as follows.
(1) Same name, different meaning
The attribute ID in data source A and the attribute ID in data source B describe the dish number and the order number respectively; that is, the same name refers to different entities.
(2) Different name, same meaning (synonyms)
sales_dt in data source A and sales_date in data source B both describe the sale date; that is, A.sales_dt = B.sales_date.
(3) Inconsistent units
The international unit and the traditional Chinese unit of measurement are used to describe the same entity.
Detecting and resolving these conflicts is the task of entity identification.
4.2.2. Redundant attribute identification
Data integration often introduces data redundancy, for example:
(1) The same attribute appears multiple times.
(2) Inconsistent naming of the same attribute leads to duplication.
One illustrative way to detect such redundant attributes is sketched below.
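Correlation analysis is one common way to flag potentially redundant numeric attributes after integration. The sketch below is only an illustration with a made-up DataFrame, not one of the article's listings.

import numpy as np
import pandas as pd

# Hypothetical integrated data set: two columns carry the same attribute under different names.
df = pd.DataFrame({
    'sales_amt': np.arange(10, dtype=float),
    'sales_amount': np.arange(10, dtype=float),
    'price': np.random.rand(10),
})

corr = df.corr().abs()  # absolute pairwise correlation matrix
# Flag pairs in the upper triangle with near-perfect correlation as redundancy candidates.
candidates = [(a, b, corr.loc[a, b])
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if corr.loc[a, b] > 0.95]
print(candidates)  # e.g. [('sales_amt', 'sales_amount', 1.0)]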
4.3. Data transformation
The main purpose of data transformation is to standardize the data and convert the data into an "appropriate" form, which is suitable for the needs of mining tasks and algorithms.
4.3.1. Simple function transformation
Simple function transformation applies a mathematical function to the original data, such as squaring, taking the square root, taking the logarithm, or differencing.
Simple function transformations are often used to turn data that is not normally distributed into data that is approximately normally distributed; a small sketch follows.
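As a quick illustration (not one of the book's listings; the Series values are made up), these transformations can be applied to a pandas Series as follows.

import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0, 100.0, 10000.0])  # hypothetical right-skewed data

squared = s ** 2        # square
rooted = np.sqrt(s)     # square root
logged = np.log(s)      # logarithm, often used to compress a long right tail
diffed = s.diff()       # first-order differencing (the first element becomes NaN)
print(logged)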
4.3.2. Standardization
Data normalization (standardization) is a basic task in data mining. Different evaluation indicators often have different dimensions, and their values may differ greatly; if left unprocessed, this can distort the results of the analysis. To eliminate the influence of differences in dimension and value range between indicators, the data must be standardized, scaling it proportionally so that it falls into a specific range for comprehensive analysis. For example, a salary attribute may be mapped to [-1, 1] or [0, 1].
Data normalization is particularly important for distance-based mining algorithms.
(1) Min-max normalization
Min-max normalization, also known as deviation standardization, is a linear transformation of the original data that maps the values to [0, 1]: x* = (x - min) / (max - min).
(2) Zero-mean normalization
Zero-mean normalization, also called standard deviation standardization, transforms the data so that it has mean 0 and standard deviation 1: x* = (x - mean) / std. It is currently the most widely used method of data standardization.
(3) Decimal scaling normalization
By moving the decimal point of the attribute values, the values are mapped to [-1, 1]; the number of places moved depends on the maximum absolute value of the attribute, i.e. x* = x / 10^k.
Code listing 4-2 data normalization code
#-*- coding: utf-8 -*-
# Data normalization
import pandas as pd
import numpy as np

datafile = '../data/normalization_data.xls'  # parameter initialization
data = pd.read_excel(datafile, header=None)  # read the data

(data - data.min()) / (data.max() - data.min())  # min-max normalization
(data - data.mean()) / data.std()  # zero-mean normalization
data / 10 ** np.ceil(np.log10(data.abs().max()))  # decimal scaling normalization

4.3.3. Continuous attribute discretization
Some data mining algorithms, especially certain classification algorithms (such as ID3 and Apriori), require the data to be given as categorical attributes. It is therefore often necessary to transform continuous attributes into categorical ones, that is, to discretize continuous attributes.
The process of discretization
Discretizing a continuous attribute means setting several cut points within the value range of the data, dividing the range into discrete intervals, and finally using different symbols or integer values to represent the data values falling in each sub-interval. Discretization therefore involves two subtasks: determining the number of categories and deciding how to map the continuous attribute values to those categories.
Commonly used discretization methods
The commonly used discretization methods are equal width method, equal frequency method and (one-dimensional) clustering.
(1) Equal width method
The attribute's value range is divided into intervals of equal width; the number of intervals is determined by the characteristics of the data or specified by the user, similar to making a frequency distribution table.
(2) Equal frequency method
Put the same number of records into each interval.
These two methods are simple and easy to apply, but both require the number of intervals to be specified manually. The equal-width method is also sensitive to outliers and tends to distribute attribute values unevenly across intervals: some intervals contain many data points while others contain very few, which can seriously damage the resulting decision model. The equal-frequency method avoids this problem, but it may split identical data values into different intervals in order to keep the number of data points in each interval fixed.
(3) Method based on cluster analysis
One-dimensional clustering involves two steps: first, the values of the continuous attribute are clustered with a clustering algorithm (such as K-Means); then the resulting clusters are processed, with the continuous attribute values in the same cluster merged and given the same label. This discretization method also requires the user to specify the number of clusters in order to determine the number of intervals.
Listing 4-3 data discretization
#-*- coding: utf-8 -*-
# Data discretization
import pandas as pd

datafile = '../data/discretization_data.xls'  # parameter initialization
data = pd.read_excel(datafile)  # read the data
data = data[u'肝气郁结证型系数'].copy()
k = 4

d1 = pd.cut(data, k, labels=range(k))  # equal-width discretization; the bins are labeled 0, 1, 2, 3

# equal-frequency discretization
w = [1.0*i/k for i in range(k+1)]
w = data.describe(percentiles=w)[4:4+k+1]  # use the describe function to compute the quantiles automatically
w[0] = w[0]*(1-1e-10)
d2 = pd.cut(data, w, labels=range(k))

from sklearn.cluster import KMeans  # introduce KMeans
kmodel = KMeans(n_clusters=k, n_jobs=4)  # build the model; n_jobs is the degree of parallelism, usually set to the number of CPUs
kmodel.fit(data.values.reshape((len(data), 1)))  # train the model
c = pd.DataFrame(kmodel.cluster_centers_).sort_values(0)  # output and sort the cluster centers (they are in random order by default)
w = c.rolling(2).mean().iloc[1:]  # take the midpoint of each pair of adjacent centers as a boundary point
w = [0] + list(w[0]) + [data.max()]  # add the first and last boundary points
d3 = pd.cut(data, w, labels=range(k))

def cluster_plot(d, k):  # custom plotting function to display the clustering results
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels properly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign properly
    plt.figure(figsize=(8, 3))
    for j in range(0, k):
        plt.plot(data[d == j], [j for i in d[d == j]], 'o')
    plt.ylim(-0.5, k - 0.5)
    return plt

cluster_plot(d1, k).show()
cluster_plot(d2, k).show()
cluster_plot(d3, k).show()
(Figure: equal-width discretization result)
(Figure: equal-frequency discretization result)
(Figure: (one-dimensional) clustering discretization result)
4.3.4. Attribute construction
In the process of data mining, in order to extract more useful information, mine deeper patterns, and improve the accuracy of mining results, we need to use the existing attribute set to construct new attributes and add them to the existing attribute set.
Listing 4-4 Line loss rate attribute construction
#-*- coding: utf-8 -*-
# Line loss rate attribute construction
import pandas as pd

# parameter initialization
inputfile = '../data/electricity_data.xls'  # power supply data
outputfile = '../tmp/electricity_data.xls'  # output data file

data = pd.read_excel(inputfile)  # read the data
# line loss rate = (power supplied in - power supplied out) / power supplied in
data[u'线损率'] = (data[u'供入电量'] - data[u'供出电量']) / data[u'供入电量']
data.to_excel(outputfile, index=False)  # save the result

4.3.5. Wavelet transform
The wavelet transform is a relatively new data analysis tool and a signal analysis technique that has risen rapidly in recent years. The theory and methods of wavelet analysis are increasingly widely used in signal processing, image processing, speech processing, pattern recognition, quantum physics, and other fields, and it is regarded as a major breakthrough in analytical tools and methods. The wavelet transform has multi-resolution characteristics and can represent the local features of a signal in both the time and frequency domains. Through dilation and translation operations it performs multi-scale focused analysis of a signal, providing a time-frequency analysis method for non-stationary signals, and useful information can be extracted by examining the signal step by step from coarse to fine.
The characteristic quantities that describe a problem are often hidden in one or several components of a signal. The wavelet transform can decompose a non-stationary signal into data sequences that express different levels and frequency bands, namely the wavelet coefficients; selecting the appropriate wavelet coefficients completes the feature extraction of the signal. Feature extraction methods based on the wavelet transform are introduced below.
(1) Feature extraction methods based on the wavelet transform
The main wavelet-based feature extraction methods are: multi-scale space energy distribution feature extraction, multi-scale space modulus maxima feature extraction, wavelet packet transform feature extraction, and adaptive wavelet neural network feature extraction; see Table 4-5 for details.
Table 4-5 feature extraction based on Wavelet transform
Wavelet-based feature extraction method — Description
Multi-scale space energy distribution feature extraction — The smooth (approximation) signal and detail signals at each scale provide time-frequency local information about the original signal, in particular the composition of the signal in different frequency bands. The energies of the signal at the different decomposition scales are computed and arranged in order to form a feature vector for recognition.
Multi-scale space modulus maxima feature extraction — Using the localized analysis capability of the wavelet transform, the modulus maxima of the wavelet transform are computed to detect the local singularities of the signal; the scale parameter s, the translation parameter, and the modulus maxima are used as features.
Wavelet packet transform feature extraction — A random signal sequence in the time domain is mapped to random coefficient sequences in the subspaces of the scale domain. The coefficient sequence in the optimal subspace obtained by wavelet packet decomposition has the lowest uncertainty; the coefficients of the optimal subspace and the position of that subspace in the complete binary tree are taken as the feature quantities.
Adaptive wavelet neural network feature extraction — Can be used for target recognition; the signal is represented by fitting with analysis wavelets and features are extracted with an adaptive wavelet neural network.
The wavelet transform can thus be used to extract features from an acoustic signal, obtaining vector data that represents the signal, that is, completing the transformation from acoustic signal to feature vector data.
In Python, SciPy itself provides some signal processing functions, but they are not comprehensive enough; a better signal processing library for this purpose is PyWavelets (pywt).
Code listing 4-5, wavelet transform feature extraction code
#-*- coding: utf-8 -*-
# Feature extraction using wavelet analysis

# parameter initialization
inputfile = '../data/leleccum.mat'  # the signal file extracted from MATLAB

from scipy.io import loadmat  # .mat is a MATLAB-specific format that must be read with loadmat
mat = loadmat(inputfile)
signal = mat['leleccum'][0]

import pywt  # import PyWavelets
coeffs = pywt.wavedec(signal, 'bior3.7', level=5)
# returns level+1 arrays: the first is the approximation coefficient array,
# the rest are the detail coefficient arrays

4.4. Data reduction
Complex data analysis and mining on a large data set takes a long time. Data reduction produces a new data set that is smaller but preserves the integrity of the original data, so analysis and mining on the reduced data set are more efficient.
The significance of data reduction is:
Reduce the influence of invalid and wrong data on modeling and improve the accuracy of modeling.
A small amount of representative data will greatly reduce the time required for data mining.
Reduce the cost of storing data.
4.4.1. Attribute reduction
Attribute reduction creates new attribute dimensions by merging attributes, or reduces the data dimensionality directly by deleting irrelevant attributes (dimensions), in order to improve the efficiency of data mining and reduce computational cost. The goal of attribute reduction is to find the smallest attribute subset whose probability distribution is as close as possible to that of the original data set.
Stepwise forward selection, stepwise backward elimination, and decision tree induction all belong to the methods that directly delete irrelevant attributes (dimensions). Principal component analysis (PCA) is a dimensionality reduction method for continuous attributes: it constructs an orthogonal transformation of the original data so that the basis of the new space removes the correlations present under the original basis, and most of the variation in the original data can then be explained by only a few new variables. In practice, a set of new variables smaller than the original set, the so-called principal components, which explain most of the variation in the data, is selected to replace the original variables for modeling.
Calculate the principal components:
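The computation behind the principal components can be sketched with NumPy as follows. This is only an illustrative sketch on a random matrix, not the book's listing; scikit-learn performs equivalent steps internally (typically via SVD).

import numpy as np

X = np.random.rand(10, 4)                # hypothetical observation matrix: 10 samples, 4 variables
Xc = X - X.mean(axis=0)                  # center each variable
cov = np.cov(Xc, rowvar=False)           # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (eigenvalues in ascending order)
order = np.argsort(eigvals)[::-1]        # sort components by explained variance, descending
components = eigvecs[:, order].T         # rows are the principal component directions
explained_ratio = eigvals[order] / eigvals.sum()
print(explained_ratio)                   # contribution rate of each component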
In Python, the principal component analysis function is located under Scikit-Learn:
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)
Parameter description:
(1) n_components
Meaning: the number of principal components n to be retained by the PCA algorithm, that is, the number of features to keep.
Type: int or string. The default is None, meaning all components are retained. An int value such as n_components=1 reduces the original data to one dimension. A string value such as n_components='mle' automatically selects the number of features n so that the required percentage of variance is satisfied.
(2) copy
Type: bool, True or False. Default is True.
Meaning: indicates whether to make a copy of the original training data when running the algorithm. If True, the value of the original training data will not change after running the PCA algorithm, because the operation is carried out on the copy of the original data; if False, the value of the original training data will be changed after running the PCA algorithm, because the dimensionality reduction calculation is carried out on the original data.
(3) whiten
Type: bool. Default is False.
Meaning: whitening, so that each feature has the same variance.
The program that uses principal component analysis to reduce dimensionality is shown in listing 4-6.
Code listing 4-6, principal component analysis dimensionality reduction code
#-*- coding: utf-8 -*-
# Dimensionality reduction with principal component analysis
import pandas as pd

# parameter initialization
inputfile = '../data/principal_component.xls'
outputfile = '../tmp/dimention_reducted.xls'  # the reduced data

data = pd.read_excel(inputfile, header=None)  # read the data

from sklearn.decomposition import PCA
pca = PCA()
pca.fit(data)
print(pca.components_)  # return the eigenvectors (components) of the model
print(pca.explained_variance_ratio_)  # return the variance percentage (contribution rate) of each component

# The larger the variance percentage, the larger the weight of the component. The cumulative
# contribution rate of the first four principal components already reaches 97.37%, which shows
# that selecting the first three principal components is already quite good, so the PCA model
# can be rebuilt with n_components=3 and the component results recalculated.
pca = PCA(n_components=3)
pca.fit(data)
low_d = pca.transform(data)  # reduce the dimensionality
print(low_d)
pd.DataFrame(low_d).to_excel(outputfile)  # save the result
pca.inverse_transform(low_d)  # inverse_transform() can restore the original data if necessary

4.4.2. Numerosity reduction
Numerosity reduction refers to reducing the amount of data by choosing an alternative, smaller representation of it, using either parametric or non-parametric methods. Parametric methods fit a model to the data, so only the model parameters need to be stored rather than the actual data; examples are regression (linear and multiple regression) and log-linear models (which approximate multi-dimensional probability distributions over discrete attribute sets). Non-parametric methods need to store reduced forms of the actual data, such as histograms, clustering, and sampling; a small sampling sketch is given below.
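As a brief illustration of a non-parametric technique, the sketch below shows simple random sampling with pandas; the DataFrame is made up and this is not one of the book's listings.

import numpy as np
import pandas as pd

data = pd.DataFrame({'value': np.random.rand(10000)})  # hypothetical large data set

sample = data.sample(frac=0.1, random_state=1)  # keep a 10% simple random sample
print(len(sample))  # 1000 rows retained for faster mining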
4.5. Main data preprocessing functions in Python
Table 4-7 Python main data preprocessing functions
Function name — Function — Library
interpolate — Interpolation for one-dimensional and high-dimensional data — SciPy
unique — Remove duplicate elements from the data and obtain the list of unique values; it is a method name of the object — Pandas/Numpy
isnull — Determine whether each value is null — Pandas
notnull — Determine whether each value is non-null — Pandas
PCA — Perform principal component analysis on an indicator variable matrix — Scikit-Learn
random — Generate random matrices — Numpy
(1) interpolate
1) Function: interpolate is a sub-library of SciPy that contains a large number of interpolation functions, such as Lagrange interpolation, spline interpolation, and high-dimensional interpolation. Before use, import the corresponding interpolation function with from scipy.interpolate import *; look up the exact function name on the official website as needed.
2) Usage format: f = scipy.interpolate.lagrange(x, y). Only the command for Lagrange interpolation of one-dimensional data is shown here, where x and y are the corresponding independent and dependent variable data. After the interpolation is built, a new interpolated value can be computed as f(a). Spline interpolation, multidimensional interpolation, and so on are used similarly and are not shown here.
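A short usage sketch of this format, with made-up sample points:

from scipy.interpolate import lagrange

x = [1, 2, 3, 4]     # independent variable data
y = [3, 5, 7, 10]    # dependent variable data
f = lagrange(x, y)   # build the Lagrange interpolating polynomial
print(f(2.5))        # evaluate the interpolation at a new point a = 2.5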
(2) unique
1) function: remove the repeated elements in the data and get a list of single-valued elements. It is both a function of the Numpy library (np.unique ()) and a method of the Series object.
2) use the format:
np.unique(D), where D is one-dimensional data and can be a list, array, or Series.
D.unique(), where D is a Pandas Series object.
3) Example: find the unique elements of a vector.
>>> D = pd.Series([1, 1, 2, 3, 5])
>>> D.unique()
array([1, 2, 3, 5], dtype=int64)
>>> np.unique(D)
array([1, 2, 3, 5], dtype=int64)
(3) isnull/ notnull
1) function: determine whether each element is null / non-null.
2) Usage format: D.isnull() / D.notnull(). Here D must be a Series object; the call returns a Boolean Series. The null / non-null values in D can then be found with D[D.isnull()] or D[D.notnull()].
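A short illustrative example of this usage (the values are made up):

import numpy as np
import pandas as pd

D = pd.Series([1.0, np.nan, 3.0])
print(D.isnull())      # Boolean Series: False, True, False
print(D[D.notnull()])  # the non-null values of D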
(4) random
1) function: random is a sublibrary of Numpy (Python itself has its own random, but Numpy is more powerful). Various functions under this library can be used to generate random matrices that obey a specific distribution, which can be used when sampling.
2) use the format:
np.random.rand(k, m, n, ...) generates a k x m x n x ... random matrix whose elements are uniformly distributed on the interval (0, 1).
np.random.randn(k, m, n, ...) generates a k x m x n x ... random matrix whose elements follow the standard normal distribution.
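A brief illustrative example of the two formats above:

import numpy as np

U = np.random.rand(3, 4)   # 3 x 4 matrix, elements uniformly distributed on (0, 1)
N = np.random.randn(3, 4)  # 3 x 4 matrix, elements from the standard normal distribution
print(U.shape, N.shape)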
(5) PCA
1) Function: perform principal component analysis on an indicator variable matrix. Before use, import it with from sklearn.decomposition import PCA.
2) Usage format: model = PCA(). Note that PCA in Scikit-Learn is a modeling object; the general procedure is to build the model and then train it with model.fit(D), where D is the data matrix for principal component analysis. After training, the model parameters are available, for example .components_ for the eigenvectors and .explained_variance_ratio_ for the contribution rate of each component.
3) Example: principal component analysis of a 10 x 4 random matrix using PCA().
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> D = np.random.rand(10, 4)
>>> pca = PCA()
>>> pca.fit(D)
PCA(copy=True, n_components=None, whiten=False)
>>> pca.components_  # return the eigenvectors (components) of the model
>>> pca.explained_variance_ratio_  # return the variance percentage of each component

This concludes the introduction to "what are the knowledge points of python data analysis". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will publish more high-quality practical articles for you!