Steps to Use Sklearn for Data Mining


This article introduces the steps of using Sklearn for data mining. In real cases many people run into difficulties with these operations, so let the editor walk you through how to handle them. I hope you read it carefully and get something out of it!

1 Using sklearn for data mining

1.1 Steps of data mining

Data mining usually includes steps such as data acquisition, data analysis, feature engineering, model training and model evaluation. Feature engineering and model training can be carried out conveniently with sklearn tools. In "Using sklearn for stand-alone feature engineering" we left some questions open: the feature-processing classes all provide three methods, fit, transform and fit_transform, and the fit method has the same name as the fit method used for model training (not only the same name, but also the same parameter list). Is this all a coincidence?

Obviously it is not a coincidence: this is exactly sklearn's design style, and it lets us carry out feature engineering and model training more elegantly. At this point, let us start from a basic data-mining scenario:

We use sklearn to carry out the work in this scenario (sklearn can also extract text features). Looking at the sklearn source code, we find that apart from training, prediction and evaluation, the classes that handle all the other work implement three methods: fit, transform and fit_transform. As the naming suggests, the fit_transform method simply calls fit and then transform, so we only need to focus on the fit method and the transform method.
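
As a quick illustration (a minimal sketch of my own, not from the original article), calling fit_transform on a transformer such as StandardScaler gives the same result as calling fit followed by transform:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
# fit_transform is equivalent to fit(...) followed by transform(...)
a = StandardScaler().fit_transform(X)
b = StandardScaler().fit(X).transform(X)
assert np.allclose(a, b)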

The transform method is mainly used to transform features. From the point of view of the information used, transformations can be divided into no-information transformations and information-based transformations. A no-information transformation uses no other information at all, for example an exponential or logarithmic function transformation. Information-based transformations can be further divided into unsupervised and supervised transformations, depending on whether the target vector is used. An unsupervised transformation uses only statistics of the features themselves (mean, standard deviation, boundaries and so on), for example standardization or PCA dimensionality reduction. A supervised transformation uses both the feature information and the target values, for example feature selection through a model, or dimensionality reduction with LDA. The commonly used transformation classes can be grouped along these lines.
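
For example (a minimal sketch of my own, not from the original article), an unsupervised transformation such as StandardScaler needs only the feature matrix, while a supervised transformation such as SelectKBest also needs the target vector:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
# Unsupervised transformation: fit uses only statistics of the features (mean, standard deviation)
X_std = StandardScaler().fit_transform(iris.data)
# Supervised transformation: fit also needs the target values in order to score each feature
X_sel = SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)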

It is not hard to see that only the fit method of an information-based transformation class actually does useful work: its main job is to extract information from the features and, where relevant, from the target values. On this point the fit method of a transformation class and the fit method used in model training can be viewed together: both extract valuable information by analysing the features and target values. For a transformation class this information is a set of statistics; for a model it may be the weight coefficients of the features. In addition, only the fit and fit_transform methods of supervised transformation classes take two parameters, the features and the target values. That the fit method of a no-information transformation is "useless" does not mean it is not implemented; it simply does nothing with the features and target values beyond a validity check. The fit method of Normalizer, for example, is implemented as follows:

def fit(self, X, y=None):
    """Do nothing and return the estimator unchanged.

    This method is just there to implement the usual API and hence
    work in pipelines.
    """
    X = check_array(X, accept_sparse='csr')
    return self

Since all these pieces of work share a common way of being handled, it is natural to ask whether they can be combined. In the scenario assumed in this article there are two ways of combining them: pipelined and parallel. Pipelined work is carried out in sequence, with the output of the previous step serving as the input of the next; parallel work runs at the same time on the same input, and the individual outputs are merged once all of the work is finished. sklearn provides the pipeline package for both pipelined and parallel processing.

1.2 A first look at the data

Here we still use the IRIS dataset for illustration. To fit the scenario proposed above, the original dataset needs a little processing first:

from numpy import hstack, vstack, array, median, nan
from numpy.random import choice
from sklearn.datasets import load_iris

iris = load_iris()

# Feature matrix processing
# Use vstack to add a row of samples with missing values (nan, nan, nan, nan)
# Use hstack to add a column representing the colour of the flower (0-white, 1-yellow, 2-red);
# the colour is random, meaning that it does not affect the classification of the flowers
iris.data = hstack((choice([0, 1, 2], size=iris.data.shape[0] + 1).reshape(-1, 1), vstack((iris.data, array([nan, nan, nan, nan]).reshape(1, -1)))))

# Target value vector processing
# Add a target value corresponding to the sample with missing values; its value is the median of the target values
iris.target = hstack((iris.target, array([median(iris.target)])))
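
After this processing the feature matrix has gained one column (the flower colour) and one row (the sample with missing values), and the target vector has gained one value accordingly. A quick sanity check (my own sketch, assuming the code above ran as intended):

print(iris.data.shape)    # expected (151, 5): 150 samples + 1 with missing values, 4 features + colour
print(iris.target.shape)  # expected (151,)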

1.3 Key technologies

Parallel processing, pipelined processing, automatic parameter tuning and persistence are the core of elegant data mining with sklearn. Parallel processing and pipelining combine several feature-processing steps, and even the model-training step, into a single object (from a code point of view, several objects become one). On top of such a combination, automatic parameter tuning saves us the tedium of tuning parameters by hand. A trained model is just data held in memory; persistence saves that data to the file system so that it can later be loaded directly without retraining.

2 Parallel processing

Parallel processing lets several feature-processing tasks run side by side. Depending on how they read the feature matrix, it can be divided into whole parallel processing and partial parallel processing. In whole parallel processing, the input of every parallel job is the entire feature matrix; in partial parallel processing, we can specify which columns of the feature matrix each job takes as input.

2.1 Whole parallel processing

The pipeline package provides the FeatureUnion class for whole parallel processing:

from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer
from sklearn.pipeline import FeatureUnion

# New object for logarithmic function transformation of the whole feature matrix
step2_1 = ('ToLog', FunctionTransformer(log1p))
# New object for binarization of the whole feature matrix
step2_2 = ('ToBinary', Binarizer())
# New whole parallel processing object
# This object also has fit and transform methods; both call, in parallel, the fit and transform
# methods of the objects being combined
# The parameter transformer_list is the list of objects to be processed in parallel; it is a list of
# 2-tuples, where the first element is the name of the object and the second element is the object
step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))
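
As a usage sketch (mine, not the article's), the combined object behaves like any other transformer, and its output is the column-wise concatenation of the outputs of the two transformations:

import numpy as np

# Hypothetical small feature matrix, just to exercise the FeatureUnion defined above
X = np.array([[1.0, 2.0], [3.0, 4.0]])
name, union = step2
X_new = union.fit_transform(X)   # log1p-transformed columns followed by binarized columns
print(X_new.shape)               # expected (2, 4)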

2.2 Partial parallel processing

Whole parallel processing has its drawbacks: in some scenarios we only want to transform certain columns of the feature matrix, not all of them. The pipeline package does not provide a class for this (only the OneHotEncoder class implements this kind of per-column behaviour), so we need to build on FeatureUnion ourselves:

(The collapsed "View Code" block in the original article contains the implementation of the FeatureUnionExt class: a FeatureUnion variant that takes an extra idx_list parameter specifying which columns of the feature matrix each transformer reads.)
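
A minimal sketch of such a class is given below. This is my own reconstruction under two assumptions, not the article's exact implementation: each transformer is given only the columns listed for it in idx_list, and the transformers are run sequentially rather than in parallel.

import numpy as np
from sklearn.pipeline import FeatureUnion

class FeatureUnionExt(FeatureUnion):
    """A FeatureUnion variant that feeds each transformer only the columns
    of the feature matrix listed in idx_list, instead of the whole matrix."""

    def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
        self.idx_list = idx_list
        super().__init__(transformer_list, n_jobs=n_jobs,
                         transformer_weights=transformer_weights)

    def fit(self, X, y=None):
        X = np.asarray(X)
        for (name, trans), idx in zip(self.transformer_list, self.idx_list):
            trans.fit(X[:, idx], y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        Xs = [trans.transform(X[:, idx])
              for (name, trans), idx in zip(self.transformer_list, self.idx_list)]
        return np.hstack(Xs)

    def fit_transform(self, X, y=None, **fit_params):
        return self.fit(X, y).transform(X)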

In the scenario proposed in this article, we apply qualitative feature encoding to the first column of the feature matrix (the colour of the flower), a logarithmic function transformation to the second, third and fourth columns, and quantitative feature binarization to the fifth column. The code for partial parallel processing with the FeatureUnionExt class is as follows:

from numpy import log1p
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer

# New object for qualitative feature encoding of part of the feature matrix
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
# New object for logarithmic function transformation of part of the feature matrix
step2_2 = ('ToLog', FunctionTransformer(log1p))
# New object for binarization of part of the feature matrix
step2_3 = ('ToBinary', Binarizer())
# New partial parallel processing object
# The parameter transformer_list is the list of objects to be processed in parallel; it is a list of
# 2-tuples, where the first element is the name of the object and the second element is the object
# The parameter idx_list gives the columns of the feature matrix each object reads
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))

3 Pipelined processing

The pipeline package provides the Pipeline class for pipelined processing. Every step of the pipeline except the last must implement the fit_transform method, and the output of each step is the input of the next. The last step only has to implement the fit method; its input is the output of the previous step, and no transform method is required, because the last step of a pipeline may well be model training!

For the scenario proposed in this article, combining pipelined and parallel processing, the code to build the complete pipeline is as follows:

from numpy import log1p
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# New object for imputing missing values
step1 = ('Imputer', Imputer())
# New object for qualitative feature encoding of part of the feature matrix
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
# New object for logarithmic function transformation of part of the feature matrix
step2_2 = ('ToLog', FunctionTransformer(log1p))
# New object for binarization of part of the feature matrix
step2_3 = ('ToBinary', Binarizer())
# New partial parallel processing object; its output is the merged output of the parallel jobs
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
# New object for scaling the features to a common range (dimensionless)
step3 = ('MinMaxScaler', MinMaxScaler())
# New object for selecting features with the chi-square test
step4 = ('SelectKBest', SelectKBest(chi2, k=3))
# New object for dimensionality reduction with PCA
step5 = ('PCA', PCA(n_components=2))
# New logistic regression object; the model to be trained is itself a step of the pipeline
step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
# New pipeline object
# The parameter steps is the list of objects to be pipelined; it is a list of 2-tuples,
# where the first element is the name of the object and the second element is the object
pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])
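
As a usage sketch (assuming the iris data prepared in section 1.2 and the FeatureUnionExt class sketched above), the whole pipeline can then be fitted and used for prediction like a single model:

# Fit every step in sequence, then predict with the trained pipeline
pipeline.fit(iris.data, iris.target)
predictions = pipeline.predict(iris.data)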

4 Automatic parameter tuning

Grid search is one of the most common techniques for automatic parameter tuning. The grid_search package provides tools for it, including the GridSearchCV class. The code for training the combined object and tuning its parameters is as follows:

from sklearn.grid_search import GridSearchCV

# New grid search object
# The first parameter is the model to be trained
# The parameter param_grid is the grid of parameters to tune, in dictionary form; the key is the
# parameter name (in the format "object name__sub-object name__parameter name") and the value is
# the list of candidate parameter values
grid_search = GridSearchCV(pipeline, param_grid={'FeatureUnionExt__ToBinary__threshold': [1.0, 2.0, 3.0, 4.0], 'LogisticRegression__C': [0.1, 0.2, 0.4, 0.8]})

# Training and parameter tuning
grid_search.fit(iris.data, iris.target)
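
After fitting, GridSearchCV exposes the best parameter combination it found and a refitted best estimator, for example (a short sketch using standard GridSearchCV attributes):

print(grid_search.best_params_)            # best combination of threshold and C found on the grid
print(grid_search.best_score_)             # the corresponding cross-validated score
best_model = grid_search.best_estimator_   # the pipeline refitted with the best parameters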

5 Persistence

The externals.joblib package provides dump and load methods to persist and load in-memory data:

from sklearn.externals.joblib import dump, load

# Persist the data
# The first parameter is the object in memory
# The second parameter is the file name it is saved under in the file system
# The third parameter is the compression level: 0 means no compression, 3 is a suitable level
dump(grid_search, 'grid_search.dmp', compress=3)

# Load the data from the file system back into memory
grid_search = load('grid_search.dmp')

6 Review

Note: both combination and persistence involve pickling. sklearn's documentation points out that a function defined with lambda cannot be pickled when used as the custom transformation function of a FunctionTransformer.
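
A minimal illustration of this constraint (my own example, not from the article): prefer a named, importable function over a lambda when the FunctionTransformer will be persisted.

from numpy import log1p
from sklearn.preprocessing import FunctionTransformer

# Picklable: log1p is a named, importable function
ok = FunctionTransformer(log1p)

# Not picklable: a lambda has no importable name, so dumping a pipeline that contains it will fail
not_ok = FunctionTransformer(lambda x: x + 1)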

This is the end of "Steps to Use Sklearn for Data Mining". Thank you for reading. If you want to learn more about the industry, you can follow the website, where the editor will keep publishing practical articles for you!
