

What are the scenarios of machine learning?



This article explains the application scenarios of machine learning in detail. The editor thinks it is very practical, so it is shared here as a reference; I hope you will get something out of it after reading.

Algorithm classification

Supervised learning:

In supervised learning, the input data is called "training data", and each group of training data has a clear label or result, such as "spam" vs. "non-spam" in an anti-spam system, or the digits "1", "2", "3", "4" in a handwritten-digit recognition system. When building a prediction model, supervised learning sets up a learning process that compares the model's predictions with the actual results of the training data and keeps adjusting the model until its predictions reach an expected accuracy. Common application scenarios of supervised learning are classification and regression problems. Common algorithms include logistic regression (Logistic Regression) and the back-propagation neural network (Back Propagation Neural Network).
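A minimal sketch of this supervised workflow, assuming scikit-learn is available; the synthetic dataset and the choice of logistic regression are illustrative, not taken from the original text.

```python
# Supervised-learning sketch: fit on labeled training data, then check
# predictions against held-out labels (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic "training data" with known labels (e.g. spam / non-spam).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```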

Unsupervised learning:

In unsupervised learning, the data has no labels; the model infers some internal structure of the data. Unsupervised learning is used to find hidden patterns or relationships in the raw data (there is no labeled training data), so unsupervised models work on unlabeled data sets. Common application scenarios include association-rule learning and clustering. Common algorithms include the Apriori algorithm and the k-Means algorithm. Examples: social networks, language prediction.
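A minimal unsupervised sketch with k-Means, assuming scikit-learn; the synthetic blobs and the choice of three clusters are illustrative.

```python
# Unsupervised sketch: k-Means finds cluster structure without any labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # inferred cluster centers
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
```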

Semi-supervised learning:

In this learning mode, part of the input data is labeled and part is not. Such a model can be used for prediction, but it first needs to learn the internal structure of the data in order to organize it reasonably for prediction. Application scenarios include classification and regression, and the algorithms include extensions of common supervised learning algorithms: they first try to model the unlabeled data and then use that model to make predictions on the labeled data. Examples are graph inference algorithms (Graph Inference) and the Laplacian support vector machine (Laplacian SVM). Examples: image classification, speech recognition.
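A minimal semi-supervised sketch, assuming scikit-learn; LabelSpreading stands in here for the graph-based methods mentioned above, and the iris data and 80% masking rate are illustrative.

```python
# Semi-supervised sketch: most labels are hidden (set to -1) and the model
# propagates the few known labels through the structure of the data.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = np.copy(y)
y_partial[rng.rand(len(y)) < 0.8] = -1   # hide ~80% of the labels (-1 = unlabeled)

model = LabelSpreading().fit(X, y_partial)
print("accuracy on all points:", (model.transduction_ == y).mean())
```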

Reinforcement learning:

In this learning mode, the input data serves as feedback to the model; unlike supervised learning, where the input data is only used to check whether the model is right or wrong, in reinforcement learning the feedback is fed back directly to the model, and the model must adjust to it immediately. The reinforcement learning model seeks to maximize a target reward function through different behaviors. Common application scenarios include dynamic systems, robot control, artificial intelligence (AI), and so on. Common algorithms include Q-Learning and temporal-difference learning (Temporal Difference Learning).
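A minimal tabular Q-Learning sketch in plain NumPy; the tiny one-dimensional "corridor" environment, the hyperparameters, and the reward scheme are all illustrative assumptions, only meant to show reward feedback driving immediate updates.

```python
# Tabular Q-learning: the agent receives reward feedback from the environment
# and immediately updates its value estimates.
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right; goal = rightmost state
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

rng = np.random.RandomState(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        a = rng.randint(n_actions) if rng.rand() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Move the estimate toward reward + discounted best future value.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1)[:-1])   # policy for non-terminal states: should all be 1 (move right)
```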

In enterprise data applications, supervised and unsupervised learning are probably the most commonly used models. In the field of image recognition, where there is a large amount of unlabeled data and only a small amount of labeled data, semi-supervised learning is a hot topic. Reinforcement learning is used more widely in robot control and other fields that require system control.

Algorithm similarity

We can also group algorithms according to the similarity of their function and form, such as tree-based algorithms, neural-network-based algorithms, and so on. Of course, the scope of machine learning is so large that some algorithms are hard to assign to a single category, and for some categories, algorithms in the same category can target different types of problems. Here we try to classify the commonly used algorithms in the easiest way to understand.

Regression algorithm:

Regression algorithms try to explore the relationships between variables by measuring error, and they are a powerful tool in statistical machine learning. In the field of machine learning, when people talk about regression, sometimes they mean a class of problems and sometimes a class of algorithms, which often puzzles beginners. Common regression algorithms include ordinary least squares (Ordinary Least Squares), logistic regression (Logistic Regression), stepwise regression (Stepwise Regression), multivariate adaptive regression splines (Multivariate Adaptive Regression Splines), and locally estimated scatterplot smoothing (Locally Estimated Scatterplot Smoothing).

Case-based algorithm

Case-based algorithms are often used to model decision-making problems. Such models typically select a batch of sample data first and then compare new data with the samples according to some similarity measure, finding the best match in this way. Case-based algorithms are therefore often called "winner-takes-all" learning or "memory-based" learning. Common algorithms include k-Nearest Neighbor (KNN), learning vector quantization (Learning Vector Quantization, LVQ), and the self-organizing map (Self-Organizing Map, SOM).

Regularization method

Regularization methods are extensions of other algorithms (usually regression algorithms) that adjust for model complexity. Regularization usually rewards simple models and penalizes complex ones. Common algorithms include ridge regression (Ridge Regression), the least absolute shrinkage and selection operator (LASSO), and the elastic net (Elastic Net).
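A minimal sketch of the three regularizers just named, assuming scikit-learn; the synthetic regression data and the alpha values are illustrative.

```python
# Regularization sketch: ridge (L2), LASSO (L1) and elastic net all extend
# plain least squares by penalizing model complexity.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso, ElasticNet

X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    n_zero = (model.coef_ == 0).sum()   # L1-type penalties drive some coefficients to zero
    print(type(model).__name__, "R^2:", round(model.score(X, y), 3), "zeroed coefficients:", n_zero)
```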

Decision tree learning

Decision tree algorithms use a tree structure to build a decision model based on the attributes of the data, and decision tree models are often used to solve classification and regression problems. Common algorithms include classification and regression trees (Classification And Regression Tree, CART), ID3 (Iterative Dichotomiser 3), C4.5, chi-squared automatic interaction detection (CHAID), decision stumps (Decision Stump), random forests (Random Forest), multivariate adaptive regression splines (MARS), and the gradient boosting machine (Gradient Boosting Machine, GBM).

Bayesian method

Bayesian algorithms are a class of algorithms based on Bayes' theorem and are mainly used to solve classification and regression problems. Common algorithms include the naive Bayes algorithm, averaged one-dependence estimators (Averaged One-Dependence Estimators, AODE), and Bayesian belief networks (Bayesian Belief Network, BBN).

Kernel-based algorithm

The most famous kernel-based algorithm is the support vector machine (SVM). Kernel-based algorithms map the input data into a higher-dimensional vector space in which some classification or regression problems can be solved more easily. Common kernel-based algorithms include the support vector machine (Support Vector Machine, SVM), radial basis functions (Radial Basis Function, RBF), and linear discriminant analysis (Linear Discriminant Analysis, LDA).

Clustering algorithm

Clustering, like regression, sometimes describes a class of problems and sometimes a class of algorithms. Clustering algorithms usually group the input data around centroids or hierarchically. All clustering algorithms try to find the internal structure of the data in order to group it by its greatest commonalities. Common clustering algorithms include the k-Means algorithm and the expectation-maximization algorithm (Expectation Maximization, EM).
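A minimal sketch contrasting the two algorithms named above, assuming scikit-learn (GaussianMixture is used here as the standard EM-fitted mixture model); the synthetic data is illustrative.

```python
# Clustering sketch: k-Means assigns each point to the nearest center, while a
# Gaussian mixture fitted by EM gives soft probabilistic assignments.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

hard = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
gmm = GaussianMixture(n_components=3, random_state=1).fit(X)
soft = gmm.predict_proba(X)          # per-cluster membership probabilities

print(hard[:5])
print(soft[:5].round(2))
```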

Association rule learning

Association rule learning extracts useful rules from multivariate data sets by finding the rules that best explain the relationships between data variables. Common algorithms include the Apriori algorithm and the Eclat algorithm.
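A minimal plain-Python sketch of the idea behind association-rule mining; the toy transactions are made up, and this only counts frequent item pairs rather than implementing the full level-wise candidate pruning of Apriori.

```python
# Count frequent item pairs in a few toy transactions and report support and
# confidence for the most common pairs.
from itertools import combinations
from collections import Counter

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]
n = len(transactions)

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(combinations(sorted(t), 2))

for pair, count in pair_counts.most_common(3):
    support = count / n
    confidence = count / item_counts[pair[0]]     # confidence of rule pair[0] -> pair[1]
    print(pair, "support:", round(support, 2), "confidence:", round(confidence, 2))
```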

Artificial neural network

Artificial neural network algorithms simulate biological neural networks and are a class of pattern-matching algorithms, usually used to solve classification and regression problems. Artificial neural networks are a huge branch of machine learning with hundreds of different algorithms (deep learning is one of them, which we discuss separately). Important artificial neural network algorithms include the perceptron neural network (Perceptron Neural Network), back propagation (Back Propagation), the Hopfield network, the self-organizing map (Self-Organizing Map, SOM), and learning vector quantization (Learning Vector Quantization, LVQ).

Deep learning

Deep learning algorithms are a development of artificial neural networks. They have recently attracted a great deal of attention, especially in China after Baidu began to invest in deep learning. As computing power becomes increasingly cheap, deep learning tries to build much larger and more complex neural networks. Many deep learning algorithms are semi-supervised, designed to handle large data sets in which only a small amount of the data is labeled. Common deep learning algorithms include the restricted Boltzmann machine (Restricted Boltzmann Machine, RBM), deep belief networks (Deep Belief Networks, DBN), convolutional networks (Convolutional Network), and stacked auto-encoders (Stacked Auto-encoders).
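A minimal convolutional-network sketch, assuming PyTorch is installed; the architecture, image size, and random data are purely illustrative of the kind of model this family builds.

```python
# Tiny convolutional network: one conv layer, pooling, and a linear classifier,
# with a single backward pass on fake data.
import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1-channel input, 8 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Flatten(),
    nn.Linear(8 * 14 * 14, 10),                  # 10 output classes
)

x = torch.randn(4, 1, 28, 28)                    # a batch of 4 fake 28x28 images
y = torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                                  # gradients for one training step
print("loss:", float(loss))
```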

Dimension reduction algorithm

Like clustering algorithms, dimensionality reduction algorithms try to analyze the internal structure of the data, but they do so in an unsupervised way, trying to summarize or explain the data with less information. Such algorithms can be used to visualize high-dimensional data or to simplify data for supervised learning. Common algorithms include principal component analysis (Principal Component Analysis, PCA), partial least squares regression (Partial Least Squares Regression, PLS), Sammon mapping, multidimensional scaling (Multi-Dimensional Scaling, MDS), projection pursuit (Projection Pursuit), and so on.
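A minimal PCA sketch, assuming scikit-learn; the digits dataset and the choice of two components are illustrative.

```python
# Dimensionality-reduction sketch: compress 64-dimensional digit images to 2
# components for visualization or as simplified input to a supervised model.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)          # 64 features per sample
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)

print("reduced shape:", X_2d.shape)
print("variance explained:", pca.explained_variance_ratio_.round(3))
```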

Ensemble algorithm:

Ensemble algorithms train several relatively weak learning models independently on the same samples and then combine their outputs to make an overall prediction. The main difficulty of ensemble algorithms lies in choosing which independent weak models to combine and how to combine their results. This is a very powerful and very popular class of algorithms. Common algorithms include Boosting, bootstrapped aggregation (Bagging), AdaBoost, stacked generalization (Stacked Generalization, Blending), the gradient boosting machine (Gradient Boosting Machine, GBM), and random forests (Random Forest).
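A minimal ensemble sketch, assuming scikit-learn; the synthetic data and hyperparameters are illustrative, showing a bagging-style and a boosting-style ensemble side by side.

```python
# Ensemble sketch: random forest (bagging) and gradient boosting (boosting)
# both combine many weak learners into one stronger predictor.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, "test accuracy:", round(model.score(X_te, y_te), 3))
```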

Comparison of eight machine learning algorithms

There are a great many machine learning algorithms for classification, regression, clustering, recommendation, image recognition, and so on, so finding a suitable algorithm is not easy. In practice, we generally experiment with a heuristic learning approach.

Usually at the beginning we choose a generally accepted algorithm such as SVM, GBDT, or AdaBoost; now that deep learning is very popular, neural networks are also a good choice.

If you care about accuracy, the best approach is to test each algorithm through cross-validation, compare them, then tune the parameters so that each algorithm reaches its optimum, and finally choose the best one.
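A minimal sketch of this heuristic comparison, assuming scikit-learn; the candidate list (SVM, GBDT, AdaBoost) follows the examples named above, and the synthetic dataset is illustrative.

```python
# Compare several widely used algorithms with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10, random_state=1)

candidates = {
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "GBDT": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(name, "mean CV accuracy:", scores.mean().round(3))
```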

But if you are simply looking for a "good enough" algorithm for your problem, the following tips may help: below we analyze the advantages and disadvantages of each algorithm, which makes it easier to choose one.

Bias & variance

In statistics, the quality of a model is measured by its bias and variance, so let us first review bias and variance:

Bias: describes the gap between the expected value E of the predictions (estimates) and the true value Y. The larger the bias, the further the model deviates from the real data.

Variance: describes the range of variation, or degree of dispersion, of the predicted values P around their expected value E, i.e. the variance of the predictions. The larger the variance, the more dispersed the predictions are.

The real error of the model is the combination of the two:
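The figure referenced in the original is not reproduced here; for squared error, the decomposition it illustrates is usually written as follows (with sigma squared denoting the irreducible noise):

```latex
\underbrace{\mathbb{E}\!\left[(y-\hat f(x))^2\right]}_{\text{Error}}
= \underbrace{\left(\mathbb{E}[\hat f(x)]-f(x)\right)^{2}}_{\text{Bias}^2}
+ \underbrace{\mathbb{E}\!\left[\left(\hat f(x)-\mathbb{E}[\hat f(x)]\right)^{2}\right]}_{\text{Variance}}
+ \sigma^{2}
```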

On a small training set, a high-bias / low-variance classifier (for example, naive Bayes) has an advantage over a low-bias / high-variance classifier (for example, KNN), because the latter will overfit.

However, as your training set grows, the model predicts the original data better and better and the bias drops, so low-bias / high-variance classifiers gradually show their advantage (because they have lower asymptotic error), while high-bias classifiers are no longer sufficient to provide an accurate model.

Of course, you can also think of this as a difference between a generative model (NB) and a discriminative model (KNN).

Why does naive Bayes have high bias and low variance?

First, suppose you know the relationship between the training set and the test set. Simply put, we learn a model on the training set and then use it on the test set, and its quality is measured by the error rate on the test set.

But most of the time we can only assume that the test set and the training set follow the same data distribution, and we cannot get the real test data. How, then, do we estimate the test error rate when we can only see the training error rate?

Because the training samples are few (or at least not enough), the model obtained from the training set is never truly correct. (Even 100% accuracy on the training set does not mean the model depicts the real data distribution; depicting the real data distribution is our goal, not merely fitting the limited data points of the training set.)

Moreover, in practice the training samples often contain a certain amount of noise, so if we pursue perfection on the training set with a very complex model, the model will treat the errors in the training set as real characteristics of the data distribution and produce a wrong estimate of that distribution.

In that case the model falls apart on the real test set (this phenomenon is called overfitting). But we cannot use a model that is too simple either, otherwise when the data distribution is complex the model will not be able to describe it (reflected in a high error rate even on the training set, a phenomenon called underfitting).

Overfitting means that the model used is more complex than the real data distribution, while underfitting means that the model used is simpler than the real data distribution.

In the framework of statistical learning, when describing model complexity there is the view that Error = Bias + Variance. Error here can be understood as the model's prediction error rate, which has two components: the inaccuracy caused by the model being too simple (Bias), and the larger variation space and uncertainty caused by the model being too complex (Variance).

Naive Bayes is therefore easy to analyze. It simply assumes that the features are independent of each other and is a severely simplified model, so for such a simple model the Bias term is in most cases larger than the Variance term, that is, high bias and low variance.

In practice, to keep the Error as small as possible, we need to balance the proportions of Bias and Variance, that is, balance overfitting against underfitting, when choosing a model.

The relationship between bias, variance, and model complexity can be summarized as follows: as the complexity of the model increases, the bias gradually decreases and the variance gradually increases.

Advantages and disadvantages of common algorithms

1. Naive Bayes

Naive Bayes is a generative model (whether a model is generative or discriminative mainly depends on whether the joint distribution is required). It is very simple: you just do a bunch of counting.

If the conditional independence assumption holds (a rather strict condition), a naive Bayes classifier converges faster than a discriminative model such as logistic regression, so it needs less training data. Even when the conditional independence assumption does not hold, the NB classifier still performs very well in practice.

Its main disadvantage is that it cannot learn interactions between features; in terms of the R in mRMR, this is feature redundancy. To cite a classic example: even though you like both Brad Pitt's and Tom Cruise's movies, it cannot learn that you dislike the movies in which they appear together.

Advantages:

The naive Bayes model originates from classical mathematical theory, has a solid mathematical foundation, and offers stable classification efficiency.

Performs well on small-scale data, can handle multi-class tasks, and is suitable for incremental training.

It is not sensitive to missing data, and the algorithm is relatively simple, so it is often used in text classification.

Disadvantages:

The prior probabilities need to be calculated.

There is an error rate in the classification decision.

It is sensitive to the representation of the input data.
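A minimal naive Bayes sketch for text classification, the use case mentioned above; scikit-learn is assumed, and the toy corpus and spam labels are made up for illustration.

```python
# Count word frequencies and apply Bayes' theorem with the independence assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win money now", "limited offer win prize",
         "meeting at noon", "project review meeting notes"]
labels = [1, 1, 0, 0]                      # 1 = spam, 0 = not spam

vec = CountVectorizer().fit(texts)
clf = MultinomialNB().fit(vec.transform(texts), labels)
print(clf.predict(vec.transform(["win a free prize now"])))   # expected: [1]
```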

2. Logistic regression

Logistic regression is a discriminative model. There are many ways to regularize it (L0, L1, L2, etc.), and you do not have to worry as much about whether your features are correlated as you do with naive Bayes.

Compared with decision trees and SVMs, you also get a good probabilistic interpretation, and you can even easily update the model with new data (using an online gradient descent algorithm).

Use it if you need a probabilistic framework (for example, to simply adjust the classification threshold, indicate uncertainty, or obtain confidence intervals), or if you expect to quickly integrate more training data into the model in the future.

Sigmoid function: σ(z) = 1 / (1 + e^(−z))
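A minimal sketch of how logistic regression uses the sigmoid to turn a linear score into a probability; NumPy and scikit-learn are assumed, and the synthetic data is illustrative.

```python
# predict_proba is the sigmoid applied to the linear score w.x + b.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

scores = X @ clf.coef_.ravel() + clf.intercept_[0]
print(np.allclose(sigmoid(scores), clf.predict_proba(X)[:, 1]))   # True
```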

Advantages:

Easy to implement and widely used in industrial problems

When classifying, the amount of computation is very small, the speed is fast, and the storage requirements are low.

Provides convenient probability scores for the observed samples.

Multicollinearity is not a problem for logistic regression; it can be handled by combining the model with L2 regularization.

Disadvantages:

When the feature space is very large, logistic regression does not perform very well.

Prone to underfitting; the accuracy is generally not very high.

Cannot handle a large number of multi-class features or variables well.

Can only handle binary classification (the softmax derived from it can be used for multi-class classification), and requires the data to be linearly separable.

Nonlinear features need to be transformed first.

3. Linear regression.

Linear regression is used for regression, unlike logistic regression, which is used for classification. Its basic idea is to optimize the least-squares error function with gradient descent; of course, the parameters can also be obtained directly in closed form with the normal equation.

In LWLR (locally weighted linear regression), the parameters are computed as w = (XᵀWX)⁻¹XᵀWy, where W is a diagonal matrix assigning a weight to each training sample (compare the ordinary normal equation w = (XᵀX)⁻¹Xᵀy).

It can be seen that LWLR, unlike LR, is a non-parametric model, because every prediction requires traversing the training samples at least once.
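A minimal NumPy sketch of the two closed-form solutions above; the synthetic sine data and the bandwidth parameter tau are illustrative.

```python
# Ordinary normal equation vs. locally weighted linear regression, which
# re-solves a weighted normal equation for every query point.
import numpy as np

rng = np.random.RandomState(0)
X = np.linspace(0, 10, 100)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(100)
Xb = np.hstack([np.ones_like(X), X])          # add a bias column

# Ordinary least squares: w = (X^T X)^(-1) X^T y
w_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

def lwlr_predict(x_query, tau=0.5):
    """Locally weighted prediction: w = (X^T W X)^(-1) X^T W y per query."""
    weights = np.exp(-((X.ravel() - x_query) ** 2) / (2 * tau ** 2))
    W = np.diag(weights)
    w = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    return np.array([1.0, x_query]) @ w

print("global linear fit:", w_ols)
print("LWLR prediction at x=5:", lwlr_predict(5.0))
```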

Advantages:

Easy to implement and simple to calculate.

Disadvantages:

Cannot fit nonlinear data.

4. Nearest neighbor algorithm (KNN)

KNN is the nearest neighbor algorithm. Its main process is as follows (a minimal sketch of these steps appears after the list):

1. Calculate the distance of each sample point in the training sample and the test sample (common distance measures are Euclidean distance, Mahalanobis distance, etc.)

2. Sort all of the distance values computed above.

3. Select the k samples with the smallest distances.

4. Vote according to the labels of the k samples to get the final classification.
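A minimal NumPy implementation of the four steps just listed; the Euclidean distance, the toy data, and k = 3 are illustrative choices.

```python
# KNN: compute distances, sort, take the k nearest samples, and vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    distances = np.linalg.norm(X_train - x_test, axis=1)   # step 1: distances
    nearest = np.argsort(distances)[:k]                    # steps 2-3: sort, take k smallest
    votes = Counter(y_train[nearest])                      # step 4: majority vote
    return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))   # expected: 1
```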

How to choose the best value of K depends on the data. In general, a larger K reduces the impact of noise during classification, but it blurs the boundaries between categories.

A good K value can be obtained with various heuristic techniques, such as cross-validation. In addition, the presence of noise and uncorrelated feature vectors reduces the accuracy of the K-nearest-neighbor algorithm.

The nearest neighbor algorithm has strong consistency results: as the amount of data tends to infinity, the algorithm is guaranteed to have an error rate no more than twice the Bayes error rate. For some good values of K, K-nearest neighbors guarantees an error rate no greater than the Bayes theoretical error rate.

Advantages:

The theory is mature and the idea is simple; it can be used for both classification and regression.

It can be used for nonlinear classification.

The training time complexity is O(n).

Makes no assumptions about the data, has high accuracy, and is insensitive to outliers.

Disadvantages:

Large amount of computation.

Performs poorly when samples are imbalanced (that is, some categories have many samples while others have few).

Requires a lot of memory.

5. Decision tree

Decision trees are easy to interpret. They handle interactions between features without difficulty and are non-parametric, so you do not have to worry about outliers or whether the data is linearly separable (for example, a decision tree can easily handle the case where category A appears at the low end of feature dimension x, category B in the middle, and category A again at the high end).

One of its disadvantages is that it does not support online learning, so after the arrival of new samples, the decision tree needs to be completely rebuilt.

Another disadvantage is that it overfits easily, but this is exactly the entry point for ensemble methods such as random forests (RF) and boosted trees.

In addition, random forests are often the winner on many classification problems (usually slightly better than support vector machines); they are fast and easy to tune, and you do not have to worry about tuning a lot of parameters as with support vector machines, so they have always been popular.

An important point in decision trees is how to select an attribute for branching, so pay attention to the information-gain calculation and understand it well.

The formula for calculating information entropy is: H = − Σᵢ₌₁ⁿ pᵢ log₂ pᵢ

where n is the number of categories (for example, in a two-class problem n is 2). Computing the proportions p1 and p2 of the two classes among all samples gives the information entropy before any attribute is selected for branching.

Now select an attribute x_i for branching. The branching rule is: if x_i = v, the sample goes into one branch of the tree; otherwise it goes into the other branch.

Obviously, the samples in each branch are likely to include both categories. Compute the entropies H1 and H2 of the two branches, and the total entropy after the branch H' = p1·H1 + p2·H2; the information gain is then ΔH = H − H'. Following the information-gain principle, all attributes are tested and the one that maximizes the gain is selected as the branching attribute.
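A minimal NumPy sketch of this entropy and information-gain computation for a binary split; the toy labels and attribute are illustrative.

```python
# Entropy H = -sum(p_i * log2(p_i)) and gain = H(parent) - [p1*H1 + p2*H2].
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, split_mask):
    left, right = labels[split_mask], labels[~split_mask]
    p_left = len(left) / len(labels)
    h_after = p_left * entropy(left) + (1 - p_left) * entropy(right)
    return entropy(labels) - h_after

y = np.array([0, 0, 0, 1, 1, 1, 1, 1])
x = np.array([1, 1, 1, 0, 0, 0, 1, 0])          # a binary attribute
print(information_gain(y, x == 1))              # gain from splitting on x == 1
```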

Advantages

The computation is simple, easy to understand, and highly interpretable.

Relatively good at handling samples with missing attributes.

Able to handle irrelevant features.

Can produce feasible and effective results on large data sources in a relatively short time.

Disadvantages

Easy to over-fit (random forest can greatly reduce over-fitting)

Ignore the correlation between the data

For data whose attributes have different numbers of values, the information gain is biased toward attributes with more values (any method based on information gain has this drawback, including RF).

5.1 AdaBoost

AdaBoost is an additive model: each new model is built according to the error rate of the previous one, paying more attention to misclassified samples and less to correctly classified ones; after successive iterations a relatively good model is obtained. It is a typical boosting algorithm. Its advantages and disadvantages are summarized below.
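A minimal AdaBoost sketch, assuming scikit-learn; the synthetic data and parameters are illustrative.

```python
# AdaBoost combines many very simple weak learners into a stronger additive
# model, with later learners focusing on previously misclassified samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The default weak learner is a depth-1 decision tree (a decision stump).
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))
```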

Advantages

AdaBoost is a classifier with very high accuracy.

Sub-classifiers can be built with a variety of methods; the AdaBoost algorithm provides the framework.

When simple classifiers are used, the results are easy to understand, and constructing the weak classifiers is extremely simple.

Simple, with no need for feature selection.

Not prone to overfitting.

For the random forest and GBDT ensemble algorithms, refer to the article "Summary of machine learning: ensemble algorithms".

Disadvantages:

Sensitive to outliers.

6. SVM (support vector machine)

High accuracy; it provides good theoretical guarantees against overfitting, and even if the data are linearly inseparable in the original feature space, it works well as long as a suitable kernel function is supplied.

It is especially popular for ultra-high-dimensional problems such as text classification. Unfortunately it consumes a lot of memory, is hard to interpret, and running and tuning it is somewhat annoying, whereas random forests avoid exactly these shortcomings and are more practical.

Advantages

Can solve high-dimensional problems, i.e. large feature spaces.

Able to handle the interaction of nonlinear features.

Does not need to rely on the entire data set.

Has good generalization ability.

Disadvantages

When there are many observation samples, the efficiency is not very high.

There is no general solution to nonlinear problems, and sometimes it is difficult to find a suitable kernel function.

Sensitive to missing data

There are also tricks for kernel selection (libsvm comes with four kernel functions: linear kernel, polynomial kernel, RBF kernel, and sigmoid kernel):

First, if the number of samples is less than the number of features, then there is no need to choose a nonlinear kernel, just use a linear kernel.

Second, if the number of samples is greater than the number of features, nonlinear kernels can be used to map the samples to higher dimensions, and generally better results can be obtained.

Third, if the number of samples is equal to the number of features, the nonlinear kernel can be used in this case, and the principle is the same as the second one.

For the first case, we can also reduce the dimension of the data first, and then use nonlinear kernels, which is also a method.
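A minimal sketch that tries the four libsvm-style kernels named above on the same scaled data, assuming scikit-learn; the dataset is illustrative.

```python
# Kernel-selection sketch: compare cross-validated accuracy across kernels.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5)
    print(kernel, "mean CV accuracy:", scores.mean().round(3))
```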

7. Artificial neural network

Advantages:

High classification accuracy.

Strong parallel distributed processing ability, distributed storage and learning ability

Strong robustness and fault tolerance to noise, and able to fully approximate complex nonlinear relationships.

Have the function of associative memory.

Disadvantages:

Neural networks require a large number of parameters, such as the network topology and the initial values of the weights and thresholds.

The learning process cannot be observed and the output is difficult to interpret, which affects the credibility and acceptability of the results.

Training time is long, and the network may not even achieve the goal of learning.

This concludes the article "What are the scenarios of machine learning?". I hope the content above is helpful and that you have learned something from it. If you think the article is good, please share it so that more people can see it.
