This article gives a detailed, example-based walk-through of XGBoost on a weather dataset in Python. The editor finds it very practical and shares it here as a reference; I hope you get something out of it after reading.
I. XGBoost
XGBoost is not itself a model, but a software package that lets users easily solve classification, regression, or ranking problems.
1 Advantages of XGBoost
Easy to use. Compared with other machine learning libraries, XGBoost is simple to use and gives fairly good results out of the box.
Efficient and scalable. It is fast and effective on large-scale datasets and has modest requirements on hardware resources such as memory.
Robust. Compared with deep learning models, it achieves comparable results without fine parameter tuning.
XGBoost implements a boosted tree model internally and can handle missing values automatically; a short sketch follows below.
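As a minimal sketch of this behaviour (the tiny feature array X and the labels y below are made up purely for illustration), XGBoost can be trained directly on data that contains np.nan, without any imputation:

# Illustrative sketch: XGBoost accepts np.nan in the features natively
import numpy as np
from xgboost.sklearn import XGBClassifier

X = np.array([[1.0, np.nan], [2.0, 0.5], [np.nan, 1.5], [3.0, 2.0]])  # features with missing values
y = np.array([0, 1, 0, 1])                                            # binary labels

model = XGBClassifier(n_estimators=10)  # missing entries are routed along a learned default direction
model.fit(X, y)
print(model.predict(X))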
2 Disadvantages of XGBoost
Compared with deep learning models, it cannot model spatial or temporal structure and cannot capture high-dimensional data such as images, speech, and text.
When we have a large amount of training data and can find a suitable deep learning model, the accuracy of deep learning can be far higher than that of XGBoost.
II. Implementation process
1 Dataset
Weather data set extraction code: 1234
2 Implementation

#%% Import the basic libraries
import numpy as np
import pandas as pd
## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# Read the data
data = pd.read_csv(r'D:\Python\ML\data\XGBtrain.csv')
View the sample data in the IDE's variable explorer.
You can also use the head() or tail() functions to view the first and last rows of the sample. It is easy to see that the dataset contains NaN values, i.e. there are missing values, which may come from errors during data collection or processing. Here -1 is used to fill the missing values; other filling strategies include:
Median filling
Average filling
Note: when pre-processing data, the handling of missing values deserves particular attention, because the quality of this step strongly affects whether reasonable results can be obtained later; a short sketch of the median and mean alternatives follows below.
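A minimal sketch of these alternatives, assuming the DataFrame data loaded above (the numeric_cols helper variable is introduced here just for illustration):

# Illustrative sketch: fill missing values with the median or the mean instead of -1
numeric_cols = data.select_dtypes(include='number').columns
data_median = data.copy()
data_median[numeric_cols] = data_median[numeric_cols].fillna(data_median[numeric_cols].median())  # median filling
data_mean = data.copy()
data_mean[numeric_cols] = data_mean[numeric_cols].fillna(data_mean[numeric_cols].mean())          # average filling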
data = data.fillna(-1)
# Use value_counts() to count the training-set labels (RainTomorrow = yes/no)
print(pd.Series(data['RainTomorrow']).value_counts())
data_des = data.describe()
After filling:
#%% Visualize the data (the features include numeric and non-numeric features)
numerical_features = [x for x in data.columns if data[x].dtype == np.float]
category_features = [x for x in data.columns if data[x].dtype != np.float and x != 'RainTomorrow']

#%% Scatter-plot matrix of three selected features together with the label
sns.pairplot(data=data[['Rainfall', 'Evaporation', 'Sunshine'] + ['RainTomorrow']],
             diag_kind='hist', hue='RainTomorrow')
plt.show()
#%% Box plot of each numeric feature
i = 0
for col in data[numerical_features].columns:
    if col != 'RainTomorrow':
        plt.subplot(2, 8, i + 1)
        sns.boxplot(x='RainTomorrow', y=col, saturation=0.5, palette='pastel', data=data)
        plt.title(col)
        i = i + 1
plt.show()
#%% Non-numeric features
tlog = {}
for i in category_features:
    tlog[i] = data[data['RainTomorrow'] == 'Yes'][i].value_counts()
flog = {}
for i in category_features:
    flog[i] = data[data['RainTomorrow'] == 'No'][i].value_counts()

#%% Rain in different regions
plt.figure(figsize=(20, 10))
plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['Location']).sort_index()['Location'],
            y=pd.DataFrame(tlog['Location']).sort_index().index, color="red")
plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['Location']).sort_index()['Location'],
            y=pd.DataFrame(flog['Location']).sort_index().index, color="blue")
plt.show()
#%% RainToday vs. RainTomorrow
plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(tlog['RainToday'][:2]).sort_index().index, color="red")
plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(flog['RainToday'][:2]).sort_index().index, color="blue")
plt.show()
XGBoost cannot work with string-valued features directly, so the string data need to be converted to numeric values.
#%% Encode the discrete variables
## Map every value of a categorical feature to an integer
def get_mapfunction(x):
    mapp = dict(zip(x.unique().tolist(), range(len(x.unique().tolist()))))
    def mapfunction(y):
        if y in mapp:
            return mapp[y]
        else:
            return -1
    return mapfunction

# Discretize the non-numeric features
for i in category_features:
    data[i] = data[i].apply(get_mapfunction(data[i]))

#%% Train and test with XGBoost
# To evaluate the model correctly, the data are split into a training set and a test set;
# the model is trained on the training set and its performance is checked on the test set.
from sklearn.model_selection import train_test_split

data_target_part = data['RainTomorrow']
data_features_part = data[[x for x in data.columns if x != 'RainTomorrow']]

## Test set is 20% of the data, training set is 80%
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)

#%% Import and train the XGBoost model
from xgboost.sklearn import XGBClassifier
## Define the XGBoost model
clf = XGBClassifier()
# Train the XGBoost model on the training set
clf.fit(x_train, y_train)

#%% Predict on the training set and the test set with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

from sklearn import metrics
## Evaluate the model with accuracy (the proportion of correctly predicted samples among all predictions)
print('The accuracy of the XGBoost is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## Confusion matrix (counts of each combination of predicted and true labels)
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the confusion matrix with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
#%% Feature selection with XGBoost:
# the feature_importances_ attribute shows the importance of each feature
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)
In addition to feature_importances_, the following importance measures in XGBoost can also be used to assess feature importance (a short sketch of reading these scores follows the list):
weight: the number of times a feature is used to split the data
gain: the average gain (reduction in the loss) of the splits that use the feature
cover: the average coverage of the splits that use the feature, measured via the second-order derivatives of the covered samples
total_gain: the total gain over all splits that use the feature
total_cover: the total coverage over all splits that use the feature
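As a minimal sketch, assuming clf is the XGBClassifier trained above, these scores can also be read directly from the underlying booster:

# Illustrative sketch: query the different importance measures from the trained booster
booster = clf.get_booster()
for imp_type in ['weight', 'gain', 'cover', 'total_gain', 'total_cover']:
    scores = booster.get_score(importance_type=imp_type)  # dict: feature name -> score
    top5 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(imp_type, top5)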
# Use the other importance measures of XGBoost to assess feature importance
from sklearn.metrics import accuracy_score
from xgboost import plot_importance

def estimate(model, data):
    # sns.barplot(data.columns, model.feature_importances_)
    ax1 = plot_importance(model, importance_type="gain")
    ax1.set_title('gain')
    ax2 = plot_importance(model, importance_type="weight")
    ax2.set_title('weight')
    ax3 = plot_importance(model, importance_type="cover")
    ax3.set_title('cover')
    plt.show()

def classes(data, label, test):
    model = XGBClassifier()
    model.fit(data, label)
    ans = model.predict(test)
    estimate(model, data)
    return ans

ans = classes(x_train, y_train, x_test)
XGBoost includes, but is not limited to, the following parameters that have a significant impact on the model:
learning_rate: also called eta; the default is 0.3. It is the step size of each boosting iteration and is very important: too large and accuracy suffers, too small and training becomes slow.
subsample: the default is 1. This parameter controls the fraction of samples drawn at random for each tree. Lowering it makes the algorithm more conservative and helps avoid over-fitting; the value ranges from 0 to 1.
colsample_bytree: the default is 1; it is usually set to around 0.8. It controls the fraction of columns (each column is a feature) sampled at random for each tree.
max_depth: the default is 6, and values between 3 and 10 are common. It is the maximum depth of a tree and is used to control over-fitting: the larger max_depth is, the more specific the patterns the model learns.
Common ways to tune model parameters include greedy search, grid search, Bayesian optimization, and so on. Here we use grid search; its basic idea is exhaustive search: loop over all candidate parameter combinations, try every possibility, and take the best-performing combination as the final result.
#%% Get better results by tuning the parameters
## Import the grid search function from the sklearn library
from sklearn.model_selection import GridSearchCV

## Define the parameter ranges
learning_rate = [0.1, 0.3, 0.6]
subsample = [0.8, 0.9]
colsample_bytree = [0.6, 0.8]
max_depth = [3, 5, 8]

parameters = {'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree': colsample_bytree,
              'max_depth': max_depth}
model = XGBClassifier(n_estimators=50)

## Run the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf = clf.fit(x_train, y_train)

#%% Best parameters found by the grid search
print(clf.best_params_)
#%% Predict on the training and test sets using the best model parameters
## Define the XGBoost model with the tuned parameters
clf = XGBClassifier(colsample_bytree=0.6, learning_rate=0.3, max_depth=8, subsample=0.9)
# Train the XGBoost model on the training set
clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Evaluate the model with accuracy (the proportion of correctly predicted samples among all predictions)
print('The accuracy of the XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## Confusion matrix (counts of each combination of predicted and true labels)
confusion_matrix_result = metrics.confusion_matrix(test_predict, y_test)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the confusion matrix with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
III. Keys: important parameters of XGBoost
eta [default 0.3]: the same as learning_rate; it shrinks the weight of each tree, which makes the model more robust. Typical values: 0.01-0.2.
min_child_weight [default 1]: the minimum sum of sample weights in a leaf node. This parameter helps avoid over-fitting: when its value is large, it prevents the model from learning local, special samples; but if it is too high, the model will under-fit.
max_depth [default 6]: this value is also used to avoid over-fitting. The larger max_depth is, the more specific and local the patterns the model learns. Typical values: 3-10.
max_leaf_nodes: the maximum number of leaf nodes in a tree. It can be used instead of max_depth; if this parameter is defined, max_depth is ignored.
gamma [default 0]: a node is split only if the split decreases the loss function; gamma specifies the minimum loss reduction required to make a split. The larger the value, the more conservative the algorithm. Its value is closely related to the loss function.
max_delta_step [default 0]: limits the maximum step size of each tree's weight update. A value of 0 means no constraint; a positive value makes the algorithm more conservative. It is very helpful for classification problems when the classes are extremely imbalanced.
subsample [default 1]: controls the fraction of samples drawn at random for each tree. Lowering it makes the algorithm more conservative and helps avoid over-fitting, but a value that is too low may cause under-fitting. Typical values: 0.5-1.
colsample_bytree [default 1]: controls the fraction of columns (features) sampled at random for each tree. Typical values: 0.5-1.
colsample_bylevel [default 1]: controls the fraction of columns sampled for each split at each level of the tree. The subsample and colsample_bytree parameters usually play this role already, so it is rarely needed.
lambda [default 1]: the L2 regularization term on the weights (analogous to Ridge regression). It controls the regularization part of XGBoost; although it is rarely used by most data scientists, it can help reduce over-fitting.
alpha [default 0]: the L1 regularization term on the weights (analogous to Lasso regression). It can be useful in very high-dimensional settings and can make the algorithm faster.
scale_pos_weight [default 1]: when the classes are very imbalanced, setting this parameter to a positive value can help the algorithm converge faster. A short sketch of passing these parameters to XGBClassifier follows.
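A minimal sketch of how these parameters map onto the scikit-learn wrapper, assuming the x_train and y_train split from earlier (the values shown are illustrative, not tuned; in XGBClassifier the lambda and alpha terms are spelled reg_lambda and reg_alpha):

# Illustrative sketch: passing the parameters above to the scikit-learn wrapper
from xgboost.sklearn import XGBClassifier

clf_demo = XGBClassifier(
    learning_rate=0.1,       # eta: step size shrinkage per boosting round
    min_child_weight=1,      # minimum sum of sample weights in a leaf
    max_depth=6,             # maximum tree depth
    gamma=0,                 # minimum loss reduction required to split
    max_delta_step=0,        # 0 means no constraint on the weight update step
    subsample=0.8,           # fraction of samples used per tree
    colsample_bytree=0.8,    # fraction of features used per tree
    colsample_bylevel=1,     # fraction of features used per tree level
    reg_lambda=1,            # lambda: L2 regularization term
    reg_alpha=0,             # alpha: L1 regularization term
    scale_pos_weight=1,      # raise (e.g. to the negative/positive sample ratio) for imbalanced data
    n_estimators=100,
)
clf_demo.fit(x_train, y_train)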
This is the end of the article on "Example analysis of XGBoost on a weather dataset in Python". I hope the content above is helpful and that you have learned something from it. If you found the article good, please share it so that more people can see it.