This article mainly explains automatic hyperparameter optimization for algorithm models. The explanations are meant to be simple, clear and easy to learn; let's study automatic hyperparameter optimization of algorithm models step by step.
What is a hyperparameter?
There are generally two kinds of parameters in a learner model. One kind can be estimated from the data; these we call parameters (Parameter). The other kind cannot be estimated from the data and can only be designed and specified from human experience; these we call hyperparameters (Hyperparameter). A hyperparameter is a parameter whose value is set before the learning process begins; by contrast, the values of the other parameters are obtained through training.
Hyperparameters define higher-level concepts about the model, such as its complexity or learning capacity. They cannot be learned directly from the data during standard model training and must be defined in advance; their values are determined by trying different settings, training a model for each, and choosing the values that test best. A parameter-space search generally consists of the following parts:
An estimator (regression or classifier)
A parameter space
A search or sampling method to obtain a set of candidate parameters
A cross-validation mechanism
A scoring function
Hyper-parameter optimization method in Scikit-Learn
In machine learning models there are many quantities that must be specified in advance, such as the number of trees in a random forest, the number of hidden layers and the number of nodes per layer in an artificial neural network, or the size of the regularization constant. If the hyperparameters are chosen poorly, the model will underfit or overfit. In Scikit-Learn, a hyperparameter is a parameter whose value is set before the learning process begins. Typical examples include C, kernel, gamma and so on.
class sklearn.svm.SVC(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)
You can use estimator.get_params() to get the list of hyperparameters of a learner model and their current values.
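For instance, a minimal sketch (assuming scikit-learn is installed) of inspecting an SVC's hyperparameters:

from sklearn.svm import SVC

model = SVC(C=10, kernel='rbf', gamma='scale')
# get_params() returns a dict of every hyperparameter and its current value
print(model.get_params())
# individual values can be changed with set_params() and read back
model.set_params(C=1.0)
print(model.get_params()['C'])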
Sklearn provides two general hyperparameter optimization methods: grid search and random search.
Cross-Validation (CV): an introduction
In machine learning, generally speaking, we cannot use all of the data to train the model, because we would then have no data left to validate the model and evaluate its predictive performance. To solve this problem, the following methods are commonly used:
The Validation Set Approach (validation set Scheme)
This is the simplest method and the easiest to think of: split the entire dataset into two parts, one for training and one for validation; these are the training set (training set) and test set (test set) we often talk about.
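For reference, a minimal sketch of this validation set approach in scikit-learn (the 70/30 split and the SVC are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# hold out 30% of the data purely for evaluating the fitted model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = SVC().fit(X_train, y_train)
print(model.score(X_test, y_test))   # accuracy on the held-out validation data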
However, this simple approach has two drawbacks:
The final choice of model and parameters ends up depending heavily on how you split the data into training and test sets. Under different splits the test MSE varies considerably, and the corresponding optimal degree of model complexity differs as well. So if the split between training and test set is not good, we may fail to pick the best model and parameters.
This method also uses only part of the data to train the model. The more data used for training, the better the trained model usually is, so splitting off a test set means we cannot make full use of the data we have, and the model's performance suffers to some extent.
Against this background, the cross-validation (Cross-Validation) method was proposed.
LOOCV (Leave-One-Out Cross-Validation)
LOOCV stands for Leave-One-Out Cross-Validation. Like the validation set approach, LOOCV involves splitting the dataset into a training set and a test set. The difference is that only one observation is used as the test set, all the other observations are used as the training set, and this step is repeated n times (where n is the number of observations in the dataset).
Suppose we now have a dataset with n observations. LOOCV takes one observation at a time as the sole element of the test set, while the other n-1 observations are used as the training set to train the model and tune the parameters. In the end we train n models and obtain an MSE each time; the final test MSE is the average of these n MSEs.
Compared with the validation set approach, LOOCV has many advantages. First, it is not affected by how the training and test sets are split, because every observation gets tested exactly once. It also trains on n-1 observations each time, so almost all of the data is used, which keeps the bias of the model small. The drawback of LOOCV is just as obvious: the amount of computation is very large, making it far more time-consuming than the validation set approach.
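A small sketch of LOOCV using scikit-learn's LeaveOneOut splitter (a regression model is used so the score is an MSE, matching the discussion above):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
# one model is fitted per observation: n fits in total
scores = cross_val_score(LinearRegression(), X, y,
                         cv=LeaveOneOut(), scoring='neg_mean_squared_error')
print(len(scores))        # equals the number of observations n
print(-np.mean(scores))   # the final test MSE is the average of the n MSEs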
K-fold Cross Validation (k-fold cross-validation)
K-fold cross-validation differs from LOOCV in that each test set no longer contains a single observation but several, with the exact number determined by the choice of K. For example, with K=5 the steps of 5-fold cross-validation are as follows:
Divide the whole dataset into 5 parts.
Each time, take one part (without repetition) as the test set, use the other four parts as the training set, and compute the model's MSE on the test set.
Take the average of the 5 MSEs as the final test MSE.
It is not hard to see that LOOCV is just a special case of K-fold cross-validation. Finally, the choice of K is a trade-off between bias and variance: the larger K is, the more data goes into each training set, and the smaller the model's bias. But the larger K is, the higher the correlation between the selected training sets (consider the most extreme case, K=N, i.e. LOOCV, where the training data are almost identical each time). This high correlation means the final test error has a larger variance. K is usually set to 5 or 10.
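A short sketch of 5-fold cross-validation with scikit-learn (a regression model is used so that the score is an MSE, as in the discussion above):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)   # split the data into 5 parts
scores = cross_val_score(LinearRegression(), X, y,
                         cv=kf, scoring='neg_mean_squared_error')
print(-np.mean(scores))   # average the 5 MSEs to get the final test MSE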
Grid search GridSearchCV
There are two ways to select hyperparameters: (1) based on experience; (2) try parameters of different values, plug them into the model, and keep whichever performs best. When choosing hyperparameters the second way, tuning by hand demands too much attention to be worthwhile, and a for loop (or something like one) is too rigid, not concise or flexible, costly to supervise and error-prone. GridSearchCV, grid search with cross-validation, traverses every combination of the parameters passed in and returns the evaluation score for each combination.
GridSearchCV sounds sophisticated, but it is really just a brute-force search. Note that this method is useful on small datasets, but not when the dataset is large. When the amount of data is large, we can use a faster tuning method, coordinate descent. It is essentially a greedy algorithm: tune the parameter that has the greatest impact on the model until it is optimal, then tune the next most influential parameter, and so on, until all the parameters have been tuned. The drawback of this method is that it may end up at a local rather than a global optimum, but it saves time and effort.
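A rough sketch of that coordinate-descent idea (this is not a built-in scikit-learn routine; the parameter order and candidate values are illustrative assumptions) could tune one parameter at a time while freezing the others:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# tune the parameters one at a time, most influential first (assumed order)
search_order = [('n_estimators', [50, 100, 200]),
                ('max_depth', [3, 5, 10, None]),
                ('min_samples_split', [2, 5, 10])]

best_params = {}
for name, values in search_order:
    grid = GridSearchCV(RandomForestClassifier(random_state=0, **best_params),
                        param_grid={name: values}, cv=5)
    grid.fit(X, y)
    best_params[name] = grid.best_params_[name]   # freeze the winner and move on
    print(name, '->', best_params[name])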
GridSearchCV usage:

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, refit=True, cv='warn', verbose=0, pre_dispatch='2*n_jobs', error_score='raise-deprecating', return_train_score='warn')

The parameters in detail:

estimator: the model to use, passed in with every parameter set except the ones being searched. Every estimator needs a score method, or else a scoring parameter must be passed.
param_grid: the dictionary of parameters to search over. The value type is a dictionary (dict) or a list of dictionaries (list), used to set the parameters to be evaluated and their candidate values.
scoring: the model evaluation criterion, default None, in which case the estimator's own score method is used; alternatively pass, for example, scoring='roc_auc', noting that the appropriate criterion differs depending on the chosen model. It may be a string (the name of a scoring function) or a callable with the signature scorer(estimator, X, y); if None, the estimator's own error/score function is used.
n_jobs: the number of parallel jobs, default 1; it can be set to -1 (the number of CPU cores) to make full use of all the machine's processors.
refit: default True. After the search, the estimator with the best parameters found by cross-validation is refitted on the whole dataset.
cv: the cross-validation setting; accepted values:
None (the default), which uses 3-fold cross-validation.
An integer specifying the number of folds.
A CV splitter object.
An iterable yielding (train, test) index splits.
verbose: the verbosity of logging.
0: do not output the training process.
1: output occasionally.
>1: output for every sub-model.
pre_dispatch: controls the total number of jobs dispatched during parallel execution. When n_jobs is greater than 1, the data is copied for each dispatched job, which can cause memory problems; setting pre_dispatch caps the number of jobs dispatched up front, so the data is copied at most pre_dispatch times.
error_score: the value assigned to the score if an error occurs while fitting. If set to 'raise', the error is raised; if a numeric value is given, a FitFailedWarning is raised instead. The default changes from 'raise' to np.nan in version 0.22.
return_train_score: if False, the cv_results_ attribute will not include training scores.
GridSearchCV attributes:
cv_results_: the cross-validation results, a dict of numpy arrays that can also be converted to a DataFrame.
best_estimator_: the best estimator found by the search; not available when refit=False.
best_score_: float, the best cross-validated score.
best_params_: the parameter setting that gave the best score, found by the grid search.
best_index_: the index (into the cv_results_ arrays) of the best candidate parameter setting. The dict at cv_results_['params'][search.best_index_] gives the parameter setting of the best model, which achieves the highest mean score (best_score_).
scorer_: the scoring function used.
n_splits_: the number of cross-validation splits (folds).
refit_time_: the time taken to refit the best model on the whole dataset; not available when refit=False.
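To make these attributes concrete, here is a small self-contained sketch (the SVC and the tiny parameter grid are illustrative choices) that fits a grid search and reads the attributes back:

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=3).fit(X, y)

results = pd.DataFrame(grid.cv_results_)              # one row per candidate combination
print(results[['params', 'mean_test_score', 'rank_test_score']])
print(grid.best_params_, grid.best_score_)            # winning combination and its CV score
print(grid.cv_results_['params'][grid.best_index_])   # the same parameters, via best_index_
print(grid.best_estimator_)                           # available because refit=True by default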
GridSearchCV methods:
decision_function(X): returns the values of the decision function (e.g. the decision distances in an SVM) computed with the best found estimator.
fit(X, y): run the fit with every parameter combination on the dataset.
get_params(deep=True): returns the parameters of this estimator.
inverse_transform(Xt): call inverse_transform on the estimator with the best found params.
predict(X): returns the predicted labels (e.g. 0 or 1) from the best found estimator.
predict_log_proba(X): call predict_log_proba on the estimator with the best found parameters.
predict_proba(X): returns the predicted probability for each class (one column per class).
score(X, y=None): returns the score on the given data, using the best found parameters.
set_params(**params): set the parameters of this estimator.
transform(X): call transform on the estimator with the best found parameters.
Example of use:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

grid = GridSearchCV(estimator=SVR(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100],
                                'epsilon': [0.0001, 0.001, 0.01, 0.1, 1, 10],
                                'gamma': [0.001, 0.01, 0.1, 1]},
                    cv=5, scoring='neg_mean_squared_error',
                    verbose=0, n_jobs=-1)
grid.fit(X, y)
print(grid.best_score_)
print(grid.best_params_)

Random search RandomizedSearchCV
When searching for hyperparameters, if there are only a few of them (three or four or fewer), we can use grid search, an exhaustive method. But when there are many hyperparameters, grid search time grows exponentially. For this reason, random search was proposed: randomly sample a few hundred points in the hyperparameter space, among which there may be values close to the optimum. This method is faster than the sparse grid method above, and experiments show that random search results are slightly better than those of a sparse grid.
RandomizedSearchCV is used in much the same way as GridSearchCV, but instead of trying every possible combination, it evaluates a fixed number of combinations by sampling a random value for each hyperparameter. This has two advantages:
Compared with the full parameter space, a relatively small number of parameter combinations can be chosen. If the random search is left to run, it will explore different values of each hyperparameter, and the amount of computation is easily controlled by setting the number of search iterations; adding parameter candidates does not hurt performance or reduce efficiency. RandomizedSearchCV is used just like GridSearchCV, but it replaces GridSearchCV's exhaustive grid search with random sampling in the parameter space. For parameters that are continuous variables, RandomizedSearchCV samples them from a distribution, which grid search cannot do, and its search ability depends on the n_iter parameter that is set.
RandomizedSearchCV usage:

class sklearn.model_selection.RandomizedSearchCV(estimator, param_distributions, *, n_iter=10, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score=nan, return_train_score=False)

It differs from GridSearchCV in the following two parameters:

param_distributions: the parameter distributions, in dictionary format; the parameters of the model we pass in are combined into a dictionary. The search strategy is as follows:
For hyperparameters whose search range is a distribution, sample randomly from the given distribution.
For hyperparameters whose search range is a list, sample with equal probability from the given list.
n_iter: the number of parameter settings that are sampled, i.e. how many candidate models are trained (for example, 300). The higher the value, the more precise the parameters found, but the longer the search takes. Example of use:
from scipy.stats import randint as sp_randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# load data
digits = load_digits()
X, y = digits.data, digits.target

# build a classifier (or regressor)
clf = RandomForestClassifier(n_estimators=20)

# parameter search ranges: lists or distributions
param_dist = {"max_depth": [3, None],                  # given as a list
              "max_features": sp_randint(1, 11),       # given as a distribution
              "min_samples_split": sp_randint(2, 11),  # given as a distribution
              "bootstrap": [True, False],              # given as a list
              "criterion": ["gini", "entropy"]}        # given as a list

# use RandomizedSearchCV to select the hyperparameters
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search, cv=5,
                                   iid=False)  # iid only applies to older scikit-learn versions
random_search.fit(X, y)
print(random_search.best_score_)
print(random_search.best_params_)

Automatic hyperparameter optimization methods
Bayesian optimization (Bayesian Optimization)
Bayesian optimization for machine learning hyperparameter tuning was proposed by J. Snoek (2012). The main idea: given an objective function to optimize (a generalized black-box function, of which only the inputs and outputs need to be specified, with no knowledge of its internal structure or mathematical properties), keep adding sample points to update the posterior distribution of the objective function (a Gaussian process) until the posterior distribution essentially matches the true distribution. Put simply, it takes the information from previous evaluations into account in order to better choose the next parameters to try.
The difference between Bayesian optimization and ordinary grid or random search is that Bayesian tuning uses a Gaussian process and keeps updating the prior by taking the previous parameter information into account, whereas grid search does not consider previous information.
Bayesian tuning needs few iterations and is fast; grid search is slow and, when there are many parameters, easily leads to a combinatorial explosion.
Bayesian tuning remains robust on non-convex problems; grid search easily ends up in a local optimum on non-convex problems.
Bayesian optimization provides an elegant framework to find the global minimum in as few steps as possible.
Suppose we have a function c(x), or a model that takes an input x. The optimizer, of course, does not know this function; it is called the "objective function".
Bayesian optimization works by surrogate optimization: a surrogate function is fitted from sampled points of the objective.
From the surrogate function we can roughly tell which points are likely to be minima. We then take more samples near those points and update the surrogate function.
At each iteration we look at the current surrogate function, learn more about the regions of interest by sampling, and update the function. Note that the surrogate function is mathematically far cheaper to evaluate than the objective. After a certain number of iterations we are bound to arrive at the global minimum, unless the function has a very strange shape.
Let's take a closer look at the surrogate function, which is usually represented by a Gaussian process. A Gaussian process can be thought of as a dice roll that returns functions fitted to the given data points (such as sin or log) instead of numbers from 1 to 6; the process returns several functions, each with an associated probability. There is a good reason why Gaussian processes, rather than some other curve-fitting method, are used to model the surrogate function: they are Bayesian. The surrogate function, expressed as a probability distribution (a prior), is updated via an "acquisition function". This function is responsible for proposing new test points while trading off exploration and exploitation.
Exploitation tries to sample where the surrogate model predicts a good objective value, taking advantage of known promising points. However, if a region has already been explored enough, continually exploiting the known information gains little.
Exploration tries to sample at locations of high uncertainty. This ensures that no major region of the space is left unexplored; the global minimum may be hiding right there.
An acquisition function that encourages too much exploitation and too little exploration will cause the model to settle at the first minimum it finds (usually a local one: "only go where there is light"). An acquisition function that encourages the opposite will never settle in any minimum, local or global; good results come from a delicate balance. The acquisition function, which we write as a(x), must weigh exploitation and exploration against each other. Common acquisition functions include expected improvement and maximum probability of improvement, both of which measure the probability that a particular point will pay off in the future, given the prior information (the Gaussian process).
Let's put all of this together. Bayesian optimization proceeds as follows:
1. Initialize a Gaussian process "surrogate function" prior distribution.
2. Choose several data points x that maximize the acquisition function a(x) computed on the current prior distribution.
3. Evaluate the data points x in the objective cost function c(x) and obtain the results y.
4. Update the Gaussian process prior distribution with the new data to produce a posterior (which becomes the prior for the next step).
5. Repeat steps 2-4 for several iterations.
6. Interpret the current Gaussian process distribution (which is very cheap to do) to find the global minimum.
Bayesian optimization, in short, puts probabilistic reasoning behind the idea of surrogate optimization. To sum up: surrogate optimization uses a surrogate (approximate) function, estimated by sampling, to stand in for the objective function.
Bayesian optimization places surrogate optimization in a probabilistic framework, representing the surrogate function as a probability distribution that is updated as new information arrives.
The acquisition function is used to evaluate the probability that sampling a given point in the search space will yield a "good" payoff under the currently known prior, balancing exploration and exploitation.
Bayesian optimization is mainly used when evaluating the objective function is very expensive, and it is frequently applied to hyperparameter tuning.
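To make the loop above concrete, here is a toy one-dimensional sketch that uses scikit-learn's GaussianProcessRegressor as the surrogate and a simple lower-confidence-bound rule in place of a full acquisition function; the objective c(x) and all constants are illustrative assumptions, not part of the original article.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# toy objective c(x); in real tuning this would be an expensive model-training run
def c(x):
    return np.sin(3 * x) + 0.3 * x ** 2

rng = np.random.default_rng(0)
X_grid = np.linspace(-3, 3, 600).reshape(-1, 1)

# step 1: initialize the surrogate with a few random evaluations
X_obs = rng.uniform(-3, 3, size=(3, 1))
y_obs = c(X_obs).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)

for _ in range(15):
    gp.fit(X_obs, y_obs)                        # steps 1/4: (re)fit the surrogate
    mu, sigma = gp.predict(X_grid, return_std=True)
    lcb = mu - 2.0 * sigma                      # step 2: lower confidence bound as acquisition
    x_next = X_grid[np.argmin(lcb)]             # most promising point for a minimum
    y_next = c(x_next)                          # step 3: evaluate the true objective
    X_obs = np.vstack([X_obs, x_next])          # step 4: add the new observation
    y_obs = np.append(y_obs, y_next)

# step 6: read the minimum off the cheap surrogate / observed data
best = np.argmin(y_obs)
print("approximate minimum: x = %.3f, c(x) = %.3f" % (X_obs[best, 0], y_obs[best]))

In real hyperparameter tuning, x would be a vector of hyperparameters and c(x) a cross-validated model score, which is exactly what libraries such as Hyperopt automate.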
Hyperopt
Hyperopt is a powerful Python library for hyperparameter optimization developed by James Bergstra. Hyperopt tunes parameters using a form of Bayesian optimization, allowing you to obtain the best parameters for a given model. It can optimize models with hundreds of parameters at scale.
Hyperopt contains four important features:
1. Search space
Hyperopt provides different functions to specify ranges for the input parameters; these define the stochastic search space. The most commonly used search options:
hp.choice(label, options) - can be used for categorical parameters; it returns one of the options, which should be a list or tuple. Example: hp.choice("criterion", ["gini", "entropy"])
hp.randint(label, upper) - can be used for integer parameters; it returns a random integer in the range [0, upper). Example: hp.randint("max_features", 50)
hp.uniform(label, low, high) - returns a value uniformly distributed between low and high. Example: hp.uniform("max_leaf_nodes", 1, 10)
Other options you can use include:
hp.normal(label, mu, sigma) - returns a real value drawn from a normal distribution with mean mu and standard deviation sigma
hp.qnormal(label, mu, sigma, q) - returns a value like round(normal(mu, sigma) / q) * q
hp.lognormal(label, mu, sigma) - returns exp(normal(mu, sigma))
hp.qlognormal(label, mu, sigma, q) - returns a value like round(exp(normal(mu, sigma)) / q) * q
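As a small illustrative sketch (the parameter names and ranges below are assumptions chosen for this example, not taken from the article), these options can be mixed freely in a single search-space dictionary:

from hyperopt import hp

# hypothetical search space for a decision-tree-style model, mixing the option types above
space = {
    'criterion': hp.choice('criterion', ['gini', 'entropy']),          # categorical
    'max_depth': hp.choice('max_depth', list(range(2, 20))),           # integer choices
    'min_samples_split': hp.uniform('min_samples_split', 0.01, 0.5),   # continuous
}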
2. Objective function
This is the function to be minimized: it takes hyperparameter values from the search space as input and returns the loss. During optimization, we train the model with the chosen hyperparameter values, predict the target, evaluate the prediction error and return it to the optimizer. The optimizer then decides which values to check next and iterates again. You will see how to create an objective function in the practical example below.
3. The fmin function
fmin is the optimization function that iterates over different algorithms and their hyperparameters and minimizes the objective function. fmin takes five inputs:
Minimized objective function
Defined search space
The search algorithm to use, such as random search, TPE (Tree of Parzen Estimators) or adaptive TPE. Note: hyperopt.rand.suggest and hyperopt.tpe.suggest provide the logic for the sequential search of the hyperparameter space.
Maximum number of evaluations
Trials object (optional)
4. Trials object
The Trials object is used to hold all hyperparameters, losses, and other information, which means that you can access them after running the optimization. In addition, trials can help you save and load important information, and then continue the optimization process.
The use of Hyperopt
Now that you understand the important features of Hyperopt, here's how to use Hyperopt.
Initialize the space to search
Define the objective function
Select the search algorithm to use
Run the hyperopt function
Analyze the evaluation output stored in the test object
from sklearn import datasets
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X = iris.data
y = iris.target

def hyperopt_train_test(params):
    clf = KNeighborsClassifier(**params)
    return cross_val_score(clf, X, y).mean()

# define the parameter space
space_knn = {'n_neighbors': hp.choice('n_neighbors', list(range(1, 100)))}

# define the minimization (objective) function
def fn_knn(params):
    acc = hyperopt_train_test(params)
    # fmin minimizes, so the accuracy gets a minus sign
    return {'loss': -acc, 'status': STATUS_OK}

# instantiate the Trials object to record every evaluation made while searching for the best loss
trials = Trials()
best = fmin(fn_knn, space_knn, algo=tpe.suggest, max_evals=100, trials=trials)
print("Best: {}".format(best))
print(trials.results)  # the list of dictionaries returned by the objective during the search

algo specifies the search algorithm; the following algorithms are currently supported:
Random search (hyperopt.rand.suggest)
Simulated annealing (hyperopt.anneal.suggest)
TPE algorithm (tpe.suggest; full name: Tree-structured Parzen Estimator Approach)
In addition to Hyperopt, Python packages for Bayesian optimization include:
https://github.com/optuna/optuna
https://github.com/fmfn/BayesianOptimization
https://github.com/HIPS/Spearmint
Genetic algorithm (Genetic Algorithms)
Genetic algorithms try to apply the mechanism of natural selection to machine learning. They are inspired by Darwinian natural selection, which is why they are often called evolutionary algorithms. Suppose we create N machine learning models with some predefined hyperparameters. We can then compute the accuracy of each model and decide to keep only half of them (the best performing ones). We can now generate offspring whose hyperparameters are similar to those of the best models, so as to obtain a population of N models again. At this point we can again compute the accuracy of each model and repeat the cycle for a defined number of generations. In this way only the best models survive at the end of the process.
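To make that loop concrete, here is a toy sketch of the selection-and-mutation cycle using scikit-learn models; the search space, population size and generation count are illustrative assumptions, not TPOT's actual implementation.

import random
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
SPACE = {'n_estimators': list(range(10, 110, 10)),
         'max_depth': list(range(2, 11))}

def random_individual():
    return {k: random.choice(v) for k, v in SPACE.items()}

def fitness(params):
    model = RandomForestClassifier(random_state=0, **params)
    return cross_val_score(model, X, y, cv=3).mean()

def mutate(params):
    child = dict(params)
    key = random.choice(list(SPACE))            # randomly perturb one "gene"
    child[key] = random.choice(SPACE[key])
    return child

population = [random_individual() for _ in range(10)]
for generation in range(5):
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:5]                      # natural selection: keep the best half
    offspring = [mutate(random.choice(survivors)) for _ in range(5)]
    population = survivors + offspring          # back to a population of N models

best = max(population, key=fitness)
print(best, fitness(best))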
TPOT is a Python automated machine learning tool that uses genetic algorithms to optimize machine learning pipelines (pipeline). Simply put, TPOT intelligently explores thousands of possible pipelines and finds the best one for the dataset, automating the most tedious part of machine learning.
More importantly, once TPOT has finished searching, it also provides the Python code, so we can see exactly what the best-performing pipeline contains, which makes subsequent modification very convenient.
TPOT is a wrapper library built on top of sklearn. It mainly wraps sklearn's model modules, preprocessing modules and feature_selection module, so TPOT's main job is to use pipelines to carry out data preprocessing, feature selection and model selection. In addition, TPOT already supports xgboost.
Although TPOT uses a genetic algorithm rather than traditional grid search for hyperparameter selection, because of the randomness of the default initialization, with a small number of generations (iterations) the final model chosen by TPOT is often different from run to run.
There is also the question of computational efficiency. As the author notes in the code, the more generations (iterations) and the more individuals retained per generation, the better the final model score, but this can also take a long time. If you use a fairly complex dataset or run TPOT only briefly, different TPOT runs may recommend different pipelines. TPOT's optimization algorithm is stochastic by nature, meaning it uses randomness (in part) to search the space of possible pipelines. When two TPOT runs recommend different pipelines, it means either that the runs did not converge for lack of time, or that several pipelines perform roughly equally well on the dataset. This is actually an advantage over fixed grid search techniques: TPOT is an assistant that offers ideas on how to solve a particular machine learning problem by exploring pipeline configurations you may never have considered, and then leaves the fine-tuning to more constrained parameter-tuning techniques such as grid search.
To develop a model with TPOT (version 0.9.5), keep the following in mind:
Necessary data cleaning and feature engineering must be done before modeling with TPOT.
TPOT can currently only do supervised learning.
The main classifiers currently supported by TPOT are naive Bayes, decision trees, tree ensembles, SVM, KNN, linear models and xgboost.
The main regressors currently supported by TPOT are decision trees, tree ensembles, linear models and xgboost.
TPOT also applies further processing to the input data, such as binarization, clustering, dimensionality reduction, standardization, regularization, one-hot encoding and so on.
Depending on model performance, TPOT performs feature selection on the input features, including tree-model-based percentile selection, variance-based and F-value-based selection.
The training process can be exported as a .py file in the form of an sklearn pipeline via the export() method.
Sample code:
from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
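As noted in the list above, the best pipeline found can then be exported; assuming the fitted tpot object from the sample code (the file name is an illustrative choice):

tpot.export('tpot_iris_pipeline.py')  # writes the best pipeline as a standalone sklearn script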
Main parameters of TPOT:
generations - the number of iterations (generations) over which offspring (new individuals) are created
population_size - the number of individuals created initially (these are used to create offspring)
offspring_size - the number of new individuals created in each generation
mutation_rate - the probability that a parameter value is changed at random (this can introduce parameter values that were not present in the initial population)
crossover_rate - the percentage of individuals used to create offspring (by crossover)
Through this iterative process we select the best configuration. The results of a genetic algorithm generally depend on the initial state, so the randomly generated initial population affects the output, and re-running the same settings may produce different results.
Thank you for reading. The above is the tutorial on automatic hyperparameter optimization of algorithm models. After studying this article, I believe you have a deeper understanding of automatic hyperparameter optimization; how well it works in a specific situation still needs to be verified in practice.