
Example Analysis of Python Data Mining Algorithms


This article shares an example-based analysis of Python data mining algorithms. The editor finds it very practical, so it is shared here as a reference; follow along and have a look.

1. A brief overview of the data mining process

Step 1: data selection

Data can be obtained from raw business data, public data sets, or web crawlers.
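
A minimal sketch of the public-dataset route, using scikit-learn's bundled iris data as an arbitrary example:

from sklearn.datasets import load_iris

# load a small public dataset that ships with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target      # feature matrix and labels
print(X.shape, y.shape)            # (150, 4) (150,)
print(iris.feature_names)          # names of the four measured features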

Step 2: data preprocessing

Raw data is very likely to contain noise, missing values, and other defects, so it needs to be standardized. Common methods are min-max normalization, z-score standardization, and the modified z-score.
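
A minimal sketch of these three methods: min-max and z-score scaling are available in scikit-learn's preprocessing module, while the modified z-score (a robust variant based on the median and the median absolute deviation) is computed by hand here on a small made-up array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 1000.0]])   # toy data, made up for illustration

X_minmax = MinMaxScaler().fit_transform(X)      # min-max: scales each column to [0, 1]
X_zscore = StandardScaler().fit_transform(X)    # z-score: zero mean, unit variance per column

# modified z-score: uses the median and median absolute deviation instead of mean/std
median = np.median(X, axis=0)
mad = np.median(np.abs(X - median), axis=0)
X_modified = 0.6745 * (X - median) / mad

print(X_minmax)
print(X_zscore)
print(X_modified)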

Step 3: feature transformation

Features are extracted from the data so that it fits the input expected by a specific data mining algorithm. There are many such models; they are explained in detail later.

Step 4: model training

Choose a suitable data mining algorithm and train a model on the data.

Step 5: model testing and evaluation

There are two mainstream approaches:

Ten-fold cross-validation: the data set is randomly divided into ten equal parts; each round uses nine parts as the training set and one part as the test set, iterating 10 times. The key to ten-fold cross-validation is dividing the ten parts as evenly as possible.

N-fold cross-validation, also known as leave-one-out: train on almost all of the data, leave a single sample out for testing, and iterate so that every sample is tested once. The advantage of leave-one-out is that it is deterministic.
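
A minimal sketch of both schemes using scikit-learn's cross-validation helpers, with the iris data and a decision tree chosen arbitrarily as placeholders:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier()

# ten-fold cross-validation: 9 parts for training, 1 for testing, repeated 10 times
kf_scores = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0))
print("10-fold mean accuracy:", kf_scores.mean())

# leave-one-out: train on all samples but one, test on the one left out, for every sample
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print("leave-one-out mean accuracy:", loo_scores.mean())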

Step 6: model usage

The trained model is used to make predictions on new data.

Step 7: explanation and evaluation

The information obtained from data mining is analyzed, interpreted, and applied in the actual field of work.

2. The main algorithm models, based on sklearn.

1) Linear regression: the idea is that all points should fall on, or as close as possible to, a straight line. First assume values for a and b in y = ax + b, then compute the sum of the distances from each data point to the line, and minimize that sum.

from sklearn.linear_model import LinearRegression

# define the linear regression model
# note: the normalize argument has been removed in newer scikit-learn releases
model = LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)
"""
Parameters
- fit_intercept: whether to calculate the intercept; False means the model has no intercept
- normalize: ignored when fit_intercept is set to False; if True, the regressors X are
  normalized before regression by subtracting the mean and dividing by the L2 norm
- n_jobs: number of threads/jobs to use
"""
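
A short usage sketch of the model defined above, fitted on a few made-up points that roughly follow y = 2x + 1:

import numpy as np
from sklearn.linear_model import LinearRegression

# made-up points scattered around y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # estimated a and b of y = ax + b
print(model.predict([[5.0]]))          # prediction for a new x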

2) Logistic regression: a binary classification algorithm, used for two-class problems. You need to assume the "approximate form" of the decision function, such as linear or non-linear.

As mentioned above, this assumes the dataset can be separated by a linear boundary; different data require different boundaries.

from sklearn.linear_model import LogisticRegression

# define the logistic regression model
model = LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
                           fit_intercept=True, intercept_scaling=1, class_weight=None,
                           random_state=None, solver='liblinear', max_iter=100,
                           multi_class='ovr', verbose=0, warm_start=False, n_jobs=1)
"""
Parameters
- penalty: the regularization term to use (default: l2)
- dual: take False (the default) when n_samples > n_features
- C: inverse of regularization strength; the smaller the value, the stronger the regularization
- n_jobs: number of threads/jobs to use
- random_state: seed for the random number generator
- fit_intercept: whether a constant (intercept) is added
"""

3) Naive Bayes (NB): used to estimate the probability that something happens. I have used this algorithm as a public-opinion (sentiment) classifier: turn sentences into a 0/1 two-dimensional matrix, count the occurrences of words, and judge the emotional tone of each sentence.

It is very efficient, but there is a certain probability of error.

from sklearn import naive_bayes

model = naive_bayes.GaussianNB()  # Gaussian naive Bayes
model = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
"""
MultinomialNB is commonly used in text classification problems.
Parameters
- alpha: smoothing parameter
- fit_prior: whether to learn class prior probabilities; False means a uniform prior is used
- class_prior: optionally specify the class prior probabilities; if specified, the priors
  are not adjusted based on the data
- binarize (BernoulliNB): threshold for binarizing features; if None, the input is assumed
  to already consist of binary vectors
"""
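
A minimal sketch of the sentiment-classifier idea described above; the sentences and labels are made up for illustration, and CountVectorizer(binary=True) builds the 0/1 word matrix that BernoulliNB expects:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

texts = ["great product very happy", "terrible service never again",
         "happy with the quality", "awful experience terrible"]
labels = [1, 0, 1, 0]                      # 1 = positive, 0 = negative (toy labels)

vectorizer = CountVectorizer(binary=True)  # sentences -> 0/1 word-occurrence matrix
X = vectorizer.fit_transform(texts)

clf = BernoulliNB(alpha=1.0).fit(X, labels)
print(clf.predict(vectorizer.transform(["terrible product"])))  # expected: [0]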

4) Decision tree (DT): a flowchart-like tree structure that uses branching to illustrate every possible outcome of a decision. Each node in the tree represents a test on a specific variable, and each branch is an outcome of that test.

from sklearn import tree

# note: min_impurity_split and presort have been removed in newer scikit-learn releases
model = tree.DecisionTreeClassifier(criterion='gini', max_depth=None, min_samples_split=2,
                                    min_samples_leaf=1, min_weight_fraction_leaf=0.0,
                                    max_features=None, random_state=None, max_leaf_nodes=None,
                                    min_impurity_decrease=0.0, min_impurity_split=None,
                                    class_weight=None, presort=False)
"""
Parameters
- criterion: feature selection criterion, gini or entropy
- max_depth: maximum depth of the tree; None means grow as deep as possible
- min_samples_split: minimum number of samples required to split an internal node
- min_samples_leaf: minimum number of samples required at a leaf node
- max_features: maximum number of features considered when looking for the best split
- max_leaf_nodes: maximum number of leaf nodes, grown in best-first fashion
- min_impurity_decrease: a node is split if the split induces a decrease of impurity
  greater than or equal to this value
"""
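
A short usage sketch on the bundled iris dataset (an arbitrary choice), printing the learned splitting rules and feature importances:

from sklearn.datasets import load_iris
from sklearn import tree

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier(criterion='gini', max_depth=3).fit(X, y)

print(clf.feature_importances_)   # how much each feature contributes to the splits
print(tree.export_text(clf))      # the tree as nested if/else rules
print(clf.predict(X[:5]))         # predictions for the first five samples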

5) Support vector machine (SVM): determines whether the data can be linearly separated into two classes. The theory extends to three dimensions and even to higher-dimensional feature spaces. In three dimensions a plane separates the data; in four or more dimensions humans cannot perceive or draw the space intuitively, yet the data can still be separated, and such a separating surface is called a hyperplane.

from sklearn.svm import SVC

model = SVC(C=1.0, kernel='rbf', gamma='auto')
"""
Parameters
- C: penalty parameter of the error term
- gamma: kernel coefficient (float); if gamma is 'auto', 1/n_features is used instead
"""

6) k-nearest neighbors (KNN): an algorithm that classifies data by measuring the distance between feature values.

Given a set of labeled samples, called the training set, a newly entered data point without a label is classified by computing its distance to every sample and selecting the k nearest ones, usually with k less than 20. The label that occurs most frequently among those k nearest samples is assigned to the new data point.

In other words: given a training data set and a new input instance, find the K instances in the training set nearest to it; if most of those K instances belong to a certain class, classify the input instance into that class (similar to "the minority obeys the majority" in everyday life). The classic illustration quoted from Wikipedia, with a green dot to be classified among red triangles and blue squares, works as follows:

If the three nearest points to the green dot are two red triangles and one blue square, then by majority vote the green dot is classified as a red triangle.

If the five nearest neighbors of the green dot are two red triangles and three blue squares, then by majority vote the green dot is classified as a blue square.

from sklearn import neighbors

# define kNN models
model = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=1)  # classification
model = neighbors.KNeighborsRegressor(n_neighbors=5, n_jobs=1)   # regression
"""
Parameters
- n_neighbors: number of neighbors to use
- n_jobs: number of parallel jobs
"""
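
A tiny sketch mirroring the triangle/square illustration above, with made-up 2D coordinates: class 0 stands in for the red triangles, class 1 for the blue squares, and the query point plays the green dot:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# made-up 2D points: class 0 ("red triangles") and class 1 ("blue squares")
X = np.array([[1.0, 1.0], [1.5, 1.8], [2.0, 1.2],
              [6.0, 6.5], [6.5, 7.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

green_dot = np.array([[2.0, 2.0]])          # the point to classify

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict(green_dot))               # majority of the 3 nearest neighbors -> [0]
print(clf.kneighbors(green_dot))            # distances and indices of those neighbors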

7) K-means clustering (K-means):

Define the target number of clusters k, for example k = 3

Randomly initialize k cluster centers (centroids)

Calculate the Euclidean distance from each data point to the k cluster centers, and assign each point to the cluster whose center is nearest

Recalculate the cluster center of each cluster

Repeat steps 3-4 until a termination condition is reached (maximum number of iterations, minimum change in error, etc.)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.DataFrame({
    "x": [25, 34, 22, 27, 33, 33, 31, 22, 35, 34, 67, 54, 57, 43, 50,
          57, 59, 52, 65, 47, 49, 48, 35, 33, 44, 45, 38, 43, 51, 46],
    "y": [79, 51, 53, 78, 59, 74, 73, 57, 69, 75, 51, 32, 40, 47, 53,
          36, 35, 59, 59, 50, 25, 20, 14, 12, 20, 5, 29, 27, 8, 7]
})

kmeans = KMeans(n_clusters=3).fit(df)
centroids = kmeans.cluster_centers_
# print the cluster centers
print(type(centroids), centroids)

# visualize the clustering result
fig, ax = plt.subplots()
ax.scatter(df['x'], df['y'], c=kmeans.labels_.astype(float), s=50, alpha=0.5)
ax.scatter(centroids[:, 0], centroids[:, 1], c='red', s=50)  # mark the cluster centers
plt.show()

Different from KNN, K-means clustering belongs to unsupervised learning.

Supervised learning knows in advance what is to be learned from the data (the target), while unsupervised learning needs no predefined target; it uncovers common structure in the data through the algorithm itself. Take classification versus clustering: classification knows the categories in advance, whereas clustering does not, and simply groups objects into clusters on the basis of similarity.

PS: in machine learning we constantly run into two kinds of problems, regression problems and classification problems. Taken literally, a classification problem divides the existing data into several categories and assigns new data to one of them, while a regression problem fits the existing data with a function and predicts new data from that fitted function. The difference between the two is the type of output variable: regression has quantitative output, i.e. it predicts continuous variables, while classification has qualitative output, i.e. it predicts discrete variables.

3. Sklearn provides joblib to save a trained model.

from sklearn.externals import joblib
# note: in newer scikit-learn releases, use `import joblib` directly

# save the model
joblib.dump(model, 'model.pickle')
# load the model
model = joblib.load('model.pickle')

Thank you for reading! This is the end of this article on "Example Analysis of Python Data Mining Algorithms". I hope the above content has been of some help and that you have learned something new. If you think the article is good, please share it so more people can see it!
