

How to implement spot-check classification methods in Python


In this issue, the editor shows how to spot-check classification algorithms in Python. The article is rich in content and written from a professional point of view; I hope you get something out of it after reading.

11.1 Algorithm spot-checking

Before you experiment, you cannot know which algorithm works best on your data. You need to try a number of different algorithms and use the results to decide what to do next. This is what I call algorithm spot-checking.
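As a preview of the pattern, spot-checking just means cross-validating a handful of candidate models on the same data and comparing their mean scores. A minimal sketch; make_classification here is only a stand-in for your own dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# stand-in data; swap in your own X and y
X, y = make_classification(n_samples=500, n_features=8, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in [('LR', LogisticRegression(max_iter=1000)),
                    ('KNN', KNeighborsClassifier())]:
    scores = cross_val_score(model, X, y, cv=kfold)
    print(name, scores.mean())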

11.2 Overview of algorithms

Two linear algorithms:

Logistic regression

Linear discriminant analysis (LDA)

Four nonlinear machine learning algorithms:

K-nearest neighbors (KNN)

Naive Bayes

Classification and regression trees (CART), a kind of decision tree

Support vector machines (SVM)

11.3 Linear machine learning algorithm

A question first: what counts as linear and what counts as nonlinear?

In fact, most books do not divide algorithms this way; it is probably done here to highlight the importance of linearity. The algorithms are split into linear and nonlinear, with logistic regression and LDA as the linear ones.

11.3.1 Logistic regression

Although its name says regression, it is actually a classification method.

Logistic regression is ordinary linear regression passed through a sigmoid function, which squashes values from the whole real line into the interval (0, 1). The two classes are then easy to read off: where the linear part is greater than zero the output is above 0.5, and where it is less than zero the output is below 0.5.
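For reference, here is a tiny sketch of the sigmoid and the resulting decision rule; the numbers are purely illustrative and not part of the book's code:

import numpy as np

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])            # the linear part, w.x + b
print(sigmoid(z))                          # about [0.047, 0.5, 0.953]
print((sigmoid(z) >= 0.5).astype(int))     # predict class 1 where z >= 0, else class 0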

Logistic regression assumes:

An (approximately) Gaussian distribution of the input variables

Numeric input variables

A binary (two-class) classification problem

If you have time, it is worth reviewing the important machine learning methods again: main ideas and examples. That could be a separate series later.

# Logistic Regression classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]   # first 8 columns are the input features
Y = array[:, 8]     # last column is the class label
# shuffle=True is required by newer scikit-learn when random_state is set;
# the numbers quoted in this article were produced without shuffling, so yours may differ slightly
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(solver='liblinear')  # liblinear avoids convergence warnings on this small dataset
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.76951469583

11.3.2 Linear discriminant analysis (LDA)

LDA is a statistical technique for binary and multi-class classification problems.

It also assumes that the input variables follow a Gaussian distribution.

# LDA classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
model = LinearDiscriminantAnalysis()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.773462064252

As you can see, scikit-learn has encapsulated a lot for us; the calling pattern is the same each time and presents no difficulty. You can call the simplest setup and see the effect right away.

11.4 Nonlinear machine learning algorithms

11.4.1 K-nearest neighbors (KNN)

KNN is distance-based: it finds the k training samples closest to a new sample and, for classification, takes a majority vote among them (for regression, their average). Different choices of k can give different results.

As follows

# KNN classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
model = KNeighborsClassifier()  # uses the default number of neighbors
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.726555023923

A couple of notes: 1) we did not specify k, so the default is used (KNeighborsClassifier defaults to n_neighbors=5; see the API docs); 2) KNN can be slow at prediction time because it computes distances to the training samples, but with a dataset this small it is not noticeable.
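If you want to see the effect of k yourself, pass n_neighbors explicitly. A quick sketch, assuming the same X, Y and kfold already defined above; the particular k values are arbitrary:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

for k in (3, 5, 7, 11):  # 5 is the scikit-learn default
    model = KNeighborsClassifier(n_neighbors=k)
    print(k, cross_val_score(model, X, Y, cv=kfold).mean())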

11.4.2 Naive Bayes

Naive Bayes is an algorithm based on Bayes' theorem. Its key assumption is that the input variables are distributed independently of one another, i.e., uncorrelated. It estimates the probability of each class and the conditional probability of each feature given the class, then combines these to score new data and produce a prediction for the new sample.

Assuming each feature also follows a Gaussian distribution, the density function of the Gaussian distribution can be used to estimate these conditional probabilities.
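As a reminder, that density is simple to compute; this throwaway snippet only illustrates the function GaussianNB evaluates per feature and per class, and the numbers are made up:

import math

def gaussian_pdf(x, mean, std):
    """Normal density at x, the per-feature quantity used by Gaussian naive Bayes."""
    exponent = math.exp(-((x - mean) ** 2) / (2 * std ** 2))
    return exponent / (math.sqrt(2 * math.pi) * std)

print(gaussian_pdf(1.0, 0.0, 1.0))  # about 0.242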

It occurs to me that, in practice, we could first run PCA to obtain mutually orthogonal components and then apply naive Bayes on those; whether that actually works better is worth trying.
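A minimal sketch of that idea, assuming the same X and Y arrays as in the examples here; this is untested speculation on my part, not the book's code, and n_components=5 is an arbitrary choice:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score

# decorrelate the features with PCA, then apply Gaussian naive Bayes
pipeline = Pipeline([('pca', PCA(n_components=5)), ('nb', GaussianNB())])
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
print(cross_val_score(pipeline, X, Y, cv=kfold).mean())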

# Gaussian Naive Bayes classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
model = GaussianNB()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.75517771702

11.4.3 CART

There are several kinds of decision trees, such as CART and C4.5. The basic idea is to pick a feature, split the samples on it, and then keep splitting each branch on further features, until every branch contains samples of a single class or the features are used up.

One question is how to choose which feature to split on first. There are criteria such as the Gini index and maximum entropy (information gain); choosing a good feature order makes building the tree more efficient.
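In scikit-learn you can switch the splitting criterion directly on DecisionTreeClassifier; a quick comparison sketch, reusing the X, Y and kfold defined in the examples above:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

for criterion in ('gini', 'entropy'):
    model = DecisionTreeClassifier(criterion=criterion, random_state=7)
    print(criterion, cross_val_score(model, X, Y, cv=kfold).mean())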

As for its shortcomings, it can be quite sensitive to outliers, so be careful with that.

Of course, later methods such as random forests and boosting are built on top of decision trees and achieve very good classification performance; more on that later.

# CART classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
model = DecisionTreeClassifier()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.697795625427 (tree building is not deterministic, so this varies)

11.4.4 Support vector machines (SVM)

Support vector machines are, I think, the most complex of the basic machine learning algorithms.

The main idea is to find a separating hyperplane that distinguishes the categories; strictly speaking, two categories.

The problem is that if the data cannot be separated by a plane in the original space, it can be mapped into a higher-dimensional space.

When the dimension becomes very high, the kernel function gets to show its value: it lets the computation be done implicitly, so that a very high, even infinite, dimension does not make the calculation impractical.

With this idea, the goal of SVM is to find the line, or plane, that is farthest from the samples on both sides, i.e., the one with the largest margin.

For outliers, a parameter C is added as a tolerance for misclassification.

The basic idea of SVM may not be complicated, but the derivation is not simple. For details, see jly's blog, which is itself built on the understanding of several excellent write-ups. I may write it up separately if I get the chance.

Then, for practice, there are a few points:

The choice of kernel function: Gaussian (RBF) or something else?

Tuning of the parameters: this also deserves a separate discussion later.
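A minimal grid-search sketch over the kernel and C, reusing the same X and Y as above; the grid values are just illustrative guesses, not recommendations:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, KFold

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
grid = GridSearchCV(SVC(), param_grid, cv=kfold, scoring='accuracy')
grid.fit(X, Y)
print(grid.best_params_, grid.best_score_)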

# SVM classification
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
model = SVC()
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())  # reported: 0.651025290499

11.5 Summary

This chapter covered how to spot-check several classification algorithms. The next chapter is about regression.

Chapter 13 How to choose a good algorithm

13.1 Choosing an algorithm

This is easy to understand. Just compare the results below. Look at the code.

# Compare Algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# load dataset
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

# prepare models
models = []
models.append(('LR', LogisticRegression(solver='liblinear')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)  # shuffle=True needed on newer scikit-learn
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()

Result

LR: 0.769515 (0.048411)
LDA: 0.773462 (0.051592)
KNN: 0.726555 (0.061821)
CART: 0.691302 (0.069249)
NB: 0.755178 (0.042766)
SVM: 0.651025 (0.072141)

The above is how to spot-check classification methods in Python, as shared by the editor. If you have similar questions, you can refer to the analysis above. If you want to learn more, you are welcome to follow the industry information channel.
