Today, I will talk to you about how to use sklearn for data mining. Many people may not know much about it, so this article summarizes the main ideas; I hope you get something out of it.
1.1 Steps in data mining
Data mining usually includes steps such as data acquisition, data analysis, feature engineering, model training and model evaluation. Feature engineering and model training are easy to carry out with sklearn. In "Using sklearn for standalone feature engineering" we left an open question: the feature-processing classes all have three methods, fit, transform and fit_transform, and the fit method shares its name with the fit method used for model training (not only the same name, but also the same parameter list). Is this all a coincidence?
Obviously, this is not a coincidence; it is exactly the design style of sklearn, and it lets us do feature engineering and model training more elegantly. With that in mind, consider a basic data mining scenario:
(Figure: the data mining process)
We use sklearn for the work inside the dotted boxes (sklearn can also extract text features). Looking at the sklearn source code, we find that apart from training, prediction and evaluation, the classes that handle the other work all implement three methods: fit, transform and fit_transform. As the name suggests, fit_transform simply calls fit and then transform, so we only need to look at the fit method and the transform method.
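As a quick illustration of this shared interface (a minimal sketch, not code from the original article), a transformer such as StandardScaler and a model such as LogisticRegression both expose a fit method with the same (features, target values) parameter list:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# A transformer: fit() learns the column means and standard deviations,
# transform() applies them; fit_transform() is simply the two in sequence.
scaler = StandardScaler()
X_scaled = scaler.fit(X, y).transform(X)   # same result as scaler.fit_transform(X, y)

# A model: fit() takes exactly the same parameter list (features, target values).
model = LogisticRegression(max_iter=1000).fit(X_scaled, y)
print(model.score(X_scaled, y))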
The transform method is used to transform features. From the point of view of the information used, transformations fall into information-free transformations and informed transformations. An information-free transformation uses no other information at all, for example exponential or logarithmic function transformations. Informed transformations can be further divided, according to whether the target value vector is used, into unsupervised and supervised transformations. An unsupervised transformation uses only statistics of the features themselves (mean, standard deviation, boundaries and so on), for example standardization or PCA dimensionality reduction. A supervised transformation uses both the feature information and the target value information, for example feature selection through a model, or dimensionality reduction with LDA. The commonly used transformation classes can be grouped along exactly these lines.
It is not difficult to see that only the fit methods of informed transformation classes actually do anything useful. Clearly, the main job of fit is to collect information about the features and the target values, and on this point it lines up with the fit method used in model training: both extract valuable information by analysing the features and target values; for a transformation class this information is a set of statistics, while for a model it may be the feature weights. In addition, only the fit and transform methods of supervised transformation classes take two parameters, the features and the target values. That the fit method of an information-free transformation class is "useless" does not mean it is not implemented; it simply does nothing with the features and target values beyond a validity check. The fit method of Normalizer, for example, is implemented roughly as follows:
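From the sklearn source (older releases; the exact validation helper differs between versions), it does little more than validate the input and return self:

def fit(self, X, y=None):
    """Do nothing and return the estimator unchanged.

    This method is just there to implement the usual API and hence
    work in pipelines.
    """
    # check_array (from sklearn.utils) only performs input validation here.
    X = check_array(X, accept_sparse='csr')
    return self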
Since all of this feature-processing work shares these common methods, can you imagine combining the pieces? In the scenario assumed in this article, the work combines in two ways: pipelined and parallel. Pipelined work is carried out in sequence, the output of one step being the input of the next; parallel work runs on the same input at the same time, and the individual outputs are merged once every step has finished. sklearn provides the pipeline package for both pipelined and parallel work.
1.2 A first look at the data
Here we still use the IRIS dataset for illustration. To fit the proposed scenario, the original dataset needs a little reworking:
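The original snippet is not reproduced here; a minimal sketch of that reworking, assuming a random flower-colour column (0 = white, 1 = yellow, 2 = red) prepended to the feature matrix and one extra sample whose four original features are all missing, is:

from numpy import array, hstack, nan, vstack
from numpy.random import choice
from sklearn.datasets import load_iris

iris = load_iris()

# Append one sample whose four original features are all missing ...
features = vstack((iris.data, array([nan, nan, nan, nan])))
# ... and prepend a random "flower colour" column (0 = white, 1 = yellow, 2 = red).
# The colour is random, so it carries no information about the class.
colour = choice([0, 1, 2], size=features.shape[0]).reshape(-1, 1)
iris.data = hstack((colour, features))

# Give the extra sample a target value as well (class 1 is assumed here).
iris.target = hstack((iris.target, array([1])))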
1.3 Key technologies
Parallel processing, pipelined processing, automatic parameter tuning and persistence are the core of elegant data mining with sklearn. Parallel processing and pipelining combine several pieces of feature-processing work, and even the model training, into one (from the code's point of view, several objects become a single object). On top of such a combination, automatic parameter tuning saves us the tedium of tuning parameters by hand. A trained model is just data held in memory; persistence saves that data to the file system, so that later it can be loaded straight from the file system without retraining.
2 Parallel processing
Parallel processing lets several feature-processing tasks run in parallel. According to how the feature matrix is read, it can be divided into whole parallel processing and partial parallel processing. In whole parallel processing, every parallel job takes the entire feature matrix as its input; in partial parallel processing, each job can be told which columns of the feature matrix it should take as input.
2.1 Whole parallel processing
The pipeline package provides the FeatureUnion class for whole parallel processing:
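A minimal sketch of two whole-matrix transformations combined with FeatureUnion (the step names are illustrative):

from numpy import log1p
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import Binarizer, FunctionTransformer

# Both transformers receive the whole feature matrix; their outputs are
# concatenated column-wise, so the result has twice as many columns.
whole_parallel = FeatureUnion(transformer_list=[
    ('ToLog', FunctionTransformer(log1p)),   # log(1 + x) on every column
    ('ToBinary', Binarizer()),               # binarise every column against a threshold
])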
2.2 Partial parallel processing
Whole parallel processing has its drawbacks. In some scenarios we only want to transform some columns of the feature matrix rather than all of them, and pipeline does not provide a class for this, so we have to build one on top of FeatureUnion.
In the scenario proposed in this article, we qualitatively encode the first column of the feature matrix (the colour of the flower), apply a logarithmic function transformation to the second, third and fourth columns, and quantitatively binarize the fifth column. The code for partial parallel processing uses the FeatureUnionExt class built for this purpose:
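FeatureUnionExt is a custom extension of FeatureUnion defined in the original article; its definition is not reproduced in this copy. As an alternative sketch, current sklearn ships ColumnTransformer in sklearn.compose, which performs the same column-wise partial parallel processing for the scenario above (step names are illustrative):

from numpy import log1p
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer, FunctionTransformer, OneHotEncoder

# Each transformer is given only the column indices it should read.
partial_parallel = ColumnTransformer(transformers=[
    ('ToOneHot', OneHotEncoder(), [0]),                 # column 0: flower colour, qualitative encoding
    ('ToLog', FunctionTransformer(log1p), [1, 2, 3]),   # columns 1-3: log(1 + x)
    ('ToBinary', Binarizer(), [4]),                     # column 4: quantitative binarisation
])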
3 Pipelined processing
The pipeline package provides the Pipeline class for pipelined processing. Every step on the pipeline except the last must implement the fit_transform method, and the output of each step is the input of the next. The last step only has to implement fit, taking the previous step's output as its input; it is not required to have a transform method, because the last step of the pipeline may well be model training!
For the scenario proposed in this article, combined with parallel processing, the code to build a complete pipeline is as follows:
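A sketch of such a pipeline written against current sklearn (SimpleImputer and ColumnTransformer stand in for the older Imputer and the article's custom FeatureUnionExt; step names and parameter values are illustrative):

from numpy import log1p
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, FunctionTransformer, MinMaxScaler, OneHotEncoder

# Partial parallel processing of the five columns built above;
# sparse_threshold=0 forces a dense output so the scaler below can handle it.
parallel = ColumnTransformer(transformers=[
    ('ToOneHot', OneHotEncoder(), [0]),
    ('ToLog', FunctionTransformer(log1p), [1, 2, 3]),
    ('ToBinary', Binarizer(), [4]),
], sparse_threshold=0)

pipeline = Pipeline(steps=[
    ('Imputer', SimpleImputer()),             # fill in the missing values added earlier
    ('Parallel', parallel),                   # partial parallel processing
    ('MinMaxScaler', MinMaxScaler()),         # scale to [0, 1]; chi2 needs non-negative input
    ('SelectKBest', SelectKBest(chi2, k=3)),  # supervised feature selection
    ('PCA', PCA(n_components=2)),             # unsupervised dimensionality reduction
    ('LogisticRegression', LogisticRegression(penalty='l2')),  # the final training step
])

Calling pipeline.fit(iris.data, iris.target) then runs fit_transform on every step in turn and fit on the final model.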
4 Automatic parameter tuning
Grid search is one of the common techniques for automatic parameter tuning. The grid_search package provides tools for it, including the GridSearchCV class. The code to train the combined object and tune its parameters is as follows:
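In current sklearn the class lives in sklearn.model_selection (the old grid_search module has been removed). A sketch continuing from the pipeline and the reworked IRIS data above (the candidate parameter values are illustrative):

from sklearn.model_selection import GridSearchCV

# Nested parameters are addressed with the step__substep__parameter convention.
grid_search = GridSearchCV(pipeline, param_grid={
    'Parallel__ToBinary__threshold': [1.0, 2.0, 3.0, 4.0],
    'LogisticRegression__C': [0.1, 0.2, 0.4, 0.8],
})
grid_search.fit(iris.data, iris.target)
print(grid_search.best_params_)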
5 Persistence
The externals.joblib package provides dump and load methods to persist and load in-memory data:
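In current sklearn, externals.joblib has been removed in favour of the standalone joblib package. A sketch continuing from the tuned object above (the file name is illustrative):

from joblib import dump, load   # older sklearn: from sklearn.externals.joblib import dump, load

# Save the fitted GridSearchCV object (including the whole pipeline) to disk ...
dump(grid_search, 'grid_search.dmp', compress=3)

# ... and later load it back without retraining.
grid_search = load('grid_search.dmp')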
6 Review
Note: both combination (pipelined and parallel processing) and persistence rely on pickle. sklearn's documentation states that functions defined with lambda cannot be pickled, so they cannot be used as custom transformation functions for FunctionTransformer.