Example Analysis of LightGBM in Python

2025-03-26 Update From: SLTechnology News&Howtos

This article walks through an example analysis with LightGBM in Python in detail and is intended as a practical reference.

1. Introduction

LightGBM is a scalable machine learning system: a distributed gradient boosting framework based on the GBDT (gradient boosting decision tree) algorithm. Its design focuses on reducing memory usage, improving computational performance, and lowering the communication cost of multi-machine parallel training.

1 Advantages of LightGBM

Easy to use. It provides mainstream Python, C++, and R interfaces, so users can build models with LightGBM easily and get good results.

Efficient and scalable. On large data sets it is fast and accurate, and it does not demand much hardware, memory in particular.

Robust. Compared with deep learning models, it reaches comparable results without fine-grained parameter tuning.

LightGBM directly supports missing values and categorical features, with no additional special processing of the data.

2 Disadvantages of LightGBM

Compared with deep learning models, it cannot model spatial or temporal structure, and it does not capture high-dimensional data such as images, speech, and text well.

When a large amount of training data is available and a suitable deep learning model can be found, the accuracy of deep learning can be far higher than that of LightGBM.

2. Implementation process

1 Introduction to the dataset

League of Legends dataset extraction code: 1234

This data set is used for LightGBM classification. It contains 9879 ranked games from Diamond rank and above on the League of Legends Korean server, and records the state of each game at the 10-minute mark, including kills, gold, experience, level, and other information.

2 Coding

# Import basic libraries
import numpy as np
import pandas as pd
# Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

#%% Read the data and convert it to a DataFrame
df = pd.read_csv(r'D:\Python\ML\data\high_diamond_ranked_10min.csv')
y = df.blueWins

#%% View sample data
# print(y.value_counts())
# Separate the label from the feature columns
drop_cols = ['gameId', 'blueWins']
x = df.drop(drop_cols, axis=1)
# Statistical description of the numerical features
x_des = x.describe()

#%% Remove redundant data. Red and blue are opponents, so one side's figures imply the other's; drop the redundant columns
drop_cols = ['redFirstBlood', 'redKills', 'redDeaths', 'redGoldDiff', 'redExperienceDiff',
             'blueCSPerMin', 'blueGoldPerMin', 'redCSPerMin', 'redGoldPerMin']
x.drop(drop_cols, axis=1, inplace=True)

#%% Visual description. For readability, two violin plots show the first nine features and the next nine features; the remaining features are handled the same way and are not repeated
data = x
data_std = (data - data.mean()) / data.std()
# Concatenate the label with the first nine columns; the shape is now (9879, 10)
data = pd.concat([y, data_std.iloc[:, 0:9]], axis=1)
# Melt the data into long format; the shape is now (88911, 3)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
# Draw the violin plot
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[0], palette='Blues')
# Rotate the x-axis tick labels by 45 degrees so they do not pile up
fig.autofmt_xdate(rotation=45)

data = x
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std.iloc[:, 9:18]], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
# Draw the violin plot
sns.violinplot(x='Features', y='Values', hue='blueWins', data=data, split=True,
               inner='quart', ax=ax[1], palette='Blues')
fig.autofmt_xdate(rotation=45)
plt.show()
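Standardizing and then melting changes the table shape in a predictable way: n rows with k feature columns become n × k rows with three columns, which is where 9879 × 9 = 88911 comes from. A minimal sketch on made-up data, where the feature names `f1` to `f3` are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_features = 6, 3

features = pd.DataFrame(
    rng.normal(loc=5, scale=2, size=(n_rows, n_features)),
    columns=["f1", "f2", "f3"],
)
label = pd.Series(rng.integers(0, 2, size=n_rows), name="blueWins")

# z-score standardization: each column gets mean 0 and (sample) std 1
standardized = (features - features.mean()) / features.std()

# Wide table: label plus one column per feature
wide = pd.concat([label, standardized], axis=1)

# Long table: one row per (sample, feature) pair
long = pd.melt(wide, id_vars="blueWins", var_name="Features", value_name="Values")

print(wide.shape)  # (6, 4)
print(long.shape)  # (18, 3)
```

The long format is what seaborn expects when one categorical axis (`Features`) is shared by many variables.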

#%% Draw the correlation between the features
fig, ax = plt.subplots(figsize=(15, 18))
sns.heatmap(round(x.corr(), 2), cmap='Blues', annot=True)
fig.autofmt_xdate(rotation=45)
plt.show()
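Reading strongly correlated pairs off a large heatmap by eye is error-prone, so the same information can also be extracted programmatically. The sketch below uses made-up columns `a`, `b`, `c` and an assumed threshold of 0.9; neither comes from the article.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)

frame = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.01, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                          # independent noise
})

corr = frame.corr().abs()

# Keep only the upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "strongly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)  # ['b']
```

Which member of a correlated pair to drop remains a modelling decision; this sketch simply drops the later column.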

#%% Based on the heatmap above, remove the strongly correlated redundant features (redAvgLevel, blueAvgLevel)
# Remove redundant features
drop_cols = ['redAvgLevel', 'blueAvgLevel']
x.drop(drop_cols, axis=1, inplace=True)
sns.set(style='whitegrid', palette='muted')
# Construct two new features
x['wardsPlacedDiff'] = x['blueWardsPlaced'] - x['redWardsPlaced']
x['wardsDestroyedDiff'] = x['blueWardsDestroyed'] - x['redWardsDestroyed']
data = x[['blueWardsPlaced', 'blueWardsDestroyed', 'wardsPlacedDiff', 'wardsDestroyedDiff']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
plt.figure(figsize=(15, 8))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.show()

#%% The swarm plot of ward counts above shows no significant pattern between the number of wards and the outcome of the game; warding in the first ten minutes has little effect on the final result. Therefore remove these features
## Remove the ward-related features
drop_cols = ['blueWardsPlaced', 'blueWardsDestroyed', 'wardsPlacedDiff',
             'wardsDestroyedDiff', 'redWardsPlaced', 'redWardsDestroyed']
x.drop(drop_cols, axis=1, inplace=True)

#%% The distributions of kills, deaths and assists look much alike, but the distributions of the differences (kills minus deaths, and the assist difference) differ clearly from the original ones. Construct two new features
x['killsDiff'] = x['blueKills'] - x['blueDeaths']
x['assistsDiff'] = x['blueAssists'] - x['redAssists']
x[['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists']].hist(figsize=(15, 8), bins=20)
plt.show()

#%%
data = x[['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists']].sample(1000)
data_std = (data - data.mean()) / data.std()
data = pd.concat([y, data_std], axis=1)
data = pd.melt(data, id_vars='blueWins', var_name='Features', value_name='Values')
plt.figure(figsize=(10, 6))
sns.swarmplot(x='Features', y='Values', hue='blueWins', data=data)
plt.xticks(rotation=45)
plt.show()

#%%
data = pd.concat([y, x], axis=1).sample(500)
sns.pairplot(data, vars=['blueKills', 'blueDeaths', 'blueAssists', 'killsDiff', 'assistsDiff', 'redAssists'],
             hue='blueWins')
plt.show()

#%% Some pairwise feature combinations separate the data better
x['dragonsDiff'] = x['blueDragons'] - x['redDragons']  # dragons taken
x['heraldsDiff'] = x['blueHeralds'] - x['redHeralds']  # Rift Heralds taken
x['eliteDiff'] = x['blueEliteMonsters'] - x['redEliteMonsters']  # elite monsters killed
data = pd.concat([y, x], axis=1)
eliteGroup = data.groupby(['eliteDiff'])['blueWins'].mean()
dragonGroup = data.groupby(['dragonsDiff'])['blueWins'].mean()
heraldGroup = data.groupby(['heraldsDiff'])['blueWins'].mean()
# One panel per group
fig, ax = plt.subplots(1, 3)
eliteGroup.plot(kind='bar', ax=ax[0])
dragonGroup.plot(kind='bar', ax=ax[1])
heraldGroup.plot(kind='bar', ax=ax[2])
print(eliteGroup)
print(dragonGroup)
print(heraldGroup)
plt.show()
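Because `blueWins` is 0 or 1, its mean within a group is the share of games blue won in that group, which is why `groupby(...).mean()` yields a win rate. A minimal sketch with invented rows (the values are not from the data set):

```python
import pandas as pd

# Invented example rows: dragon difference and match outcome
games = pd.DataFrame({
    "dragonsDiff": [-1, -1, 0, 0, 1, 1, 1, 1],
    "blueWins":    [0, 0, 0, 1, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = proportion of wins in that group
win_rate = games.groupby("dragonsDiff")["blueWins"].mean()
print(win_rate)
```

Here the groups -1, 0 and 1 give win rates of 0.0, 0.5 and 0.75 respectively.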

#%% Number of towers destroyed versus game outcome
x['towerDiff'] = x['blueTowersDestroyed'] - x['redTowersDestroyed']
data = pd.concat([y, x], axis=1)
towerGroup = data.groupby(['towerDiff'])['blueWins']
print(towerGroup.count())
print(towerGroup.mean())
fig, ax = plt.subplots(1, 2)
towerGroup.mean().plot(kind='line', ax=ax[0])
ax[0].set_title('Proportion of Blue Wins')
ax[0].set_ylabel('Proportion')
towerGroup.count().plot(kind='line', ax=ax[1])
ax[1].set_title('Count of Towers Destroyed')
ax[1].set_ylabel('Count')
plt.show()

#%% Train and predict with LightGBM
## To evaluate the model correctly, split the data into a training set and a test set, train the model on the training set, and verify its performance on the test set
from sklearn.model_selection import train_test_split
## Target (blueWins, 0 or 1) and features
data_target_part = y
data_features_part = x
## The test set is 20% of the data and the training set is 80%
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)

#%%
## Import the LightGBM model
from lightgbm.sklearn import LGBMClassifier
## Define the LightGBM model
clf = LGBMClassifier()
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)

#%% Use the trained model to predict on the training set and the test set
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
from sklearn import metrics
## Evaluate the model with accuracy (the ratio of correctly predicted samples to the total number of samples)
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_test, test_predict))
## View the confusion matrix (counts for each combination of true and predicted values); true labels go first
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)
# Visualize the result with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
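Both metrics are simple enough to compute by hand, which helps when checking how a plot is labelled. scikit-learn's `metrics.confusion_matrix` takes the true labels first and the predictions second, so rows correspond to true labels and columns to predictions. The helper names and toy labels below are illustrative assumptions, written in plain Python:

```python
def accuracy(y_true, y_pred):
    """Share of samples whose prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix_2x2(y_true, y_pred):
    """Binary confusion matrix: rows are true labels, columns are predictions."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Toy labels, invented for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

print(accuracy(y_true, y_pred))              # 0.625
print(confusion_matrix_2x2(y_true, y_pred))  # [[3, 1], [2, 2]]
print(confusion_matrix_2x2(y_pred, y_true))  # [[3, 2], [1, 2]] (transposed)
```

Swapping the two arguments transposes the matrix, so the axis labels of a heatmap have to match the argument order that was used.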

#%% Use LightGBM for feature selection; the feature_importances_ attribute shows the importance of each feature
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)

#%% Besides feature_importances_, other importance types in LightGBM (gain, split) can also be used for evaluation
from sklearn.metrics import accuracy_score
from lightgbm import plot_importance

def estimate(model, data):
    # Plot feature importance by total gain and by number of splits
    ax1 = plot_importance(model, importance_type="gain")
    ax1.set_title('gain')
    ax2 = plot_importance(model, importance_type="split")
    ax2.set_title('split')
    plt.show()

def classes(data, label, test):
    # Fit a default model, predict on the test data, and plot the importances
    model = LGBMClassifier()
    model.fit(data, label)
    ans = model.predict(test)
    estimate(model, data)
    return ans

ans = classes(x_train, y_train, x_test)

Better results by adjusting parameters: important parameters in LightGBM

learning_rate: sometimes called eta; the LightGBM default is 0.1. It sets the step size of each boosting iteration and matters a great deal: too large and accuracy suffers, too small and training is slow.

num_leaves: the default is 31. This parameter controls the maximum number of leaf nodes in each tree.

feature_fraction: the default is 1.0; a value around 0.8 is common. It controls the fraction of columns (each column is a feature) randomly sampled for each tree.

max_depth: the default is -1 (no limit); values between 3 and 10 are common. This is the maximum depth of a tree and is used to control overfitting: the larger max_depth is, the more specific the patterns the model learns.

#%% Tune the parameters to get better results
## Import the grid search function from the sklearn library
from sklearn.model_selection import GridSearchCV
## Define the parameter ranges
learning_rate = [0.1, 0.3, 0.6]
feature_fraction = [0.5, 0.8, 1]
num_leaves = [16, 32, 64]
max_depth = [-1, 3, 5, 8]
parameters = {'learning_rate': learning_rate,
              'feature_fraction': feature_fraction,
              'num_leaves': num_leaves,
              'max_depth': max_depth}
model = LGBMClassifier(n_estimators=50)
## Perform the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=3, n_jobs=-1)
clf = clf.fit(x_train, y_train)

#%% See what the best parameter values are
print(clf.best_params_)
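Grid search cost grows multiplicatively with every parameter list, so it is worth counting the fits before launching one. The sketch below mirrors the grid of the tuning cell; treat the four `max_depth` candidates as an assumption.

```python
from itertools import product

grid = {
    "learning_rate": [0.1, 0.3, 0.6],
    "feature_fraction": [0.5, 0.8, 1],
    "num_leaves": [16, 32, 64],
    "max_depth": [-1, 3, 5, 8],  # assumed candidate list
}
cv_folds = 3

# Every combination of one value per parameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(combinations) * cv_folds

print(len(combinations))  # 108
print(total_fits)         # 324
```

With its default `refit=True`, `GridSearchCV` trains one more model on the full training set using the best combination.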

#%% Predict on the training set and the test set using the best model parameters
## Define the LightGBM model with parameters
clf = LGBMClassifier(feature_fraction=1, learning_rate=0.1, max_depth=3, num_leaves=16)
# Train the LightGBM model on the training set
clf.fit(x_train, y_train)
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
## Evaluate the model with accuracy (the ratio of correctly predicted samples to the total number of samples)
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the LightGBM is:', metrics.accuracy_score(y_test, test_predict))
## View the confusion matrix (counts for each combination of true and predicted values); true labels go first
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)
# Visualize the result with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()

3. Keys: important parameters of LightGBM

Basic parameter adjustment

num_leaves: this is the main parameter controlling the complexity of the tree model. In general, num_leaves should be set below 2^max_depth to prevent overfitting. Because LightGBM grows trees leaf-wise, unlike the depth-wise growth used by XGBoost, num_leaves plays a more important role than depth.

min_data_in_leaf: this is a very important parameter for dealing with overfitting. A suitable value depends on the number of training samples and on num_leaves. Setting it large avoids growing an overly deep tree, but may lead to underfitting. In practice, a value in the hundreds or thousands is enough for large data sets.

max_depth: the depth of the tree. The concept of depth is less useful for a leaf-wise tree, because there is no reasonable mapping from leaves to depth.
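The rule of thumb about num_leaves and max_depth is plain arithmetic: a binary tree of depth d has at most 2^d leaves, so any num_leaves at or above that bound is never reached. A small sketch with hypothetical helper names:

```python
def max_leaves_for_depth(max_depth):
    """Upper bound on the number of leaves of a binary tree of the given depth."""
    return 2 ** max_depth


def num_leaves_is_below_bound(num_leaves, max_depth):
    """Check the rule of thumb num_leaves < 2 ** max_depth.

    A max_depth of 0 or below means "no limit" in LightGBM, so there is
    no bound to compare against and the check passes.
    """
    if max_depth <= 0:
        return True
    return num_leaves < max_leaves_for_depth(max_depth)


print(max_leaves_for_depth(3))           # 8
print(num_leaves_is_below_bound(16, 3))  # False: depth 3 caps the tree at 8 leaves
print(num_leaves_is_below_bound(31, 6))  # True: 31 < 64
print(num_leaves_is_below_bound(31, -1)) # True: depth is unlimited
```

For the combination max_depth=3 and num_leaves=16, the depth limit is the binding constraint: no tree can have more than 8 leaves.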

Parameter adjustment for training speed

Use the bagging method by setting the bagging_fraction and bagging_freq parameters.

Use subsampling of features by setting the feature_fraction parameter.

Use a smaller max_bin.

Use save_binary to speed up data loading in later training runs.
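The speed-oriented settings above can be collected into one parameter dictionary. The numeric values are illustrative assumptions, not recommendations from the article. One detail worth encoding: bagging only takes effect when bagging_freq is non-zero and bagging_fraction is below 1.

```python
def bagging_is_active(params):
    """Bagging needs both a non-zero frequency and a fraction below 1."""
    return (params.get("bagging_freq", 0) > 0
            and params.get("bagging_fraction", 1.0) < 1.0)


# Illustrative speed-oriented settings (values are assumptions)
speed_params = {
    "objective": "binary",
    "bagging_fraction": 0.8,  # train each bagging round on 80% of the rows
    "bagging_freq": 5,        # re-sample the rows every 5 iterations
    "feature_fraction": 0.8,  # use 80% of the columns per tree
    "max_bin": 63,            # fewer histogram bins than the default 255
}

print(bagging_is_active(speed_params))               # True
print(bagging_is_active({"bagging_fraction": 0.8}))  # False: bagging_freq is 0
```

A dictionary like this is the form the native training API accepts; the scikit-learn wrapper takes the same names as keyword arguments.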

Parameter adjustment for accuracy

Use a larger max_bin (learning speed may be slower)

Use a smaller learning_rate and a larger num_iterations

Use a larger num_leaves (may result in overfitting)

Use larger training data

Try dart mode

Parameter adjustment for overfitting

Use a smaller max_bin

Use a smaller num_leaves

Use min_data_in_leaf and min_sum_hessian_in_leaf

Use bagging by setting bagging_fraction and bagging_freq

Use feature subsampling by setting feature_fraction

Use larger training data

Use lambda_l1, lambda_l2, and min_gain_to_split for regularization

Try limiting max_depth to avoid growing too deep a tree
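To make the overfitting checklist concrete, the sketch below starts from LightGBM's documented default values and applies one possible set of overrides. The override values are assumptions chosen only to show the direction of each change.

```python
# LightGBM default values for the parameters named in the list above
lightgbm_defaults = {
    "max_bin": 255,
    "num_leaves": 31,
    "min_data_in_leaf": 20,
    "min_sum_hessian_in_leaf": 1e-3,
    "bagging_fraction": 1.0,
    "bagging_freq": 0,
    "feature_fraction": 1.0,
    "lambda_l1": 0.0,
    "lambda_l2": 0.0,
    "min_gain_to_split": 0.0,
    "max_depth": -1,
}

# Illustrative overrides that push towards a simpler model (assumed values)
overrides = {
    "max_bin": 127,           # coarser histograms
    "num_leaves": 15,         # fewer leaves per tree
    "min_data_in_leaf": 100,  # each leaf must cover more samples
    "bagging_fraction": 0.8,  # row subsampling ...
    "bagging_freq": 1,        # ... applied at every iteration
    "feature_fraction": 0.8,  # column subsampling
    "lambda_l2": 1.0,         # L2 regularization
    "max_depth": 6,           # explicit depth limit
}

regularized = {**lightgbm_defaults, **overrides}
changed = sorted(k for k in regularized if regularized[k] != lightgbm_defaults[k])
print(changed)
```

Each change trades some fitting capacity for stability, so the values are best chosen against a validation set rather than copied as they are.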

That concludes this example analysis of LightGBM in Python. Thank you for reading; we hope it helps.
