What are the basic knowledge points of SVM in machine learning


This article introduces the basic knowledge points of SVM (support vector machines) in machine learning. The content is detailed yet easy to follow, and should have some reference value. I hope you will gain something from reading it. Let's take a look.

First, the linear classifier:

We start with a very simple classification problem (linearly separable): we need a straight line that separates the black points from the white points in the figure below. Clearly, the line shown in the figure is one of the lines we are looking for (there are infinitely many such lines).

Suppose we label the black points −1 and the white points +1, and write the line as f(x) = w·x + b, where x and w are vectors; this is equivalent to f(x) = w1x1 + w2x2 + … + wnxn + b. When the dimension of x is 2, f(x) = 0 is a straight line in the plane; when the dimension is 3, it is a plane in three-dimensional space; and when the dimension of x is n > 3, it is an (n−1)-dimensional hyperplane in n-dimensional space. This is fairly basic material; if it is unclear, you may want to review calculus and linear algebra.
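As a tiny concrete sketch (not from the original article; the weight vector and bias below are arbitrary placeholders), the decision function f(x) = w·x + b and the sign-based prediction rule described just below look like this in Python:

# Hypothetical sketch of the linear decision function f(x) = w.x + b and the
# prediction rule sgn(f(x)). The weight vector and bias are placeholder values.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (placeholder)
b = 0.5                     # bias term (placeholder)

def f(x):
    """Score of point x with respect to the line w.x + b = 0."""
    return np.dot(w, x) + b

def predict(x):
    """Return +1 (white point) or -1 (black point) via sgn(f(x))."""
    return 1 if f(x) > 0 else -1

print(predict(np.array([1.0, 1.0])))    # f = 2 - 1 + 0.5 = 1.5 > 0  -> +1
print(predict(np.array([-1.0, 1.0])))   # f = -2 - 1 + 0.5 = -2.5 < 0 -> -1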

As just stated, we label the white points +1 and the black points −1. When a new point x needs to be classified, we predict its class with sgn(f(x)), where sgn denotes the sign function: sgn(f(x)) = +1 when f(x) > 0, and sgn(f(x)) = −1 when f(x) < 0. But how do we obtain an optimal separating line f(x)? The lines in the figure below show several possible choices of f(x). An intuitive criterion is to make the line as far as possible from the closest points in the given sample. That sentence is a bit of a mouthful, so a few figures follow to illustrate it. The first method:

The second method:

Which of these two methods is better? Intuitively, the larger the gap, the better the points of the two categories are separated. It is like judging whether a person is male or female: we rarely get it wrong, because the gap between the two categories is large, which lets us classify more reliably. In SVM this is called the maximum margin (Maximum Marginal), and it is one of the theoretical foundations of SVM. There are many reasons for choosing the separating plane with the largest gap: from a probabilistic point of view, it maximizes the confidence of the least-confident point (also a mouthful); from a practical point of view, it simply works very well; and so on. We will not go into them here and simply take this as a given conclusion :)

The points circled in red and blue in the figure above are called support vectors.

The figure above illustrates the gap between the categories mentioned earlier. The Classifier Boundary is f(x); the red and blue lines (the plus plane and minus plane) are the planes on which the support vectors lie, and the gap between the red and blue lines is the margin between the categories that we want to maximize.

The formula for M is given here directly (it can easily be derived with high-school analytic geometry; you can also refer to the Moore ppt below):

In addition, the support vectors lie on the lines wx + b = 1 and wx + b = −1. Multiplying by the class y to which each such point belongs (remember, y is either +1 or −1) gives a single expression for the support vectors, y(wx + b) = 1, which describes them more conveniently.
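Since the formula images are not reproduced here, the relationships just described can be written out as a reconstruction from the statements above:

w \cdot x + b = +1 \ \text{(plus plane)}, \qquad w \cdot x + b = -1 \ \text{(minus plane)},
y_i (w \cdot x_i + b) = 1 \ \text{(support vectors)}, \qquad M = \frac{2}{\|w\|}.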

Once the support vectors are determined, the separating function is determined; the two problems are equivalent. Another consequence of identifying the support vectors is that the points behind the support vectors need not participate in the computation. This will be discussed in more detail later.

At the end of this section, we give the expression that we want to optimize:

Here ||w|| denotes the 2-norm of w, the same quantity as the denominator of the expression for M above. As derived before, M = 2/||w||, so maximizing that expression is equivalent to minimizing ||w||. Moreover, since ||w|| is non-negative and squaring is monotone on non-negative numbers, we can square it and put a coefficient of 1/2 in front; as the more experienced reader will notice, this is simply to make the differentiation cleaner.

There are still some constraints on this formula. Written out in full, the original problem looks like this:
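Since the original formula image is not reproduced, here is the standard hard-margin form consistent with the text:

\min_{w,\,b} \ \frac{1}{2}\|w\|^2 \qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, R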

s.t. means "subject to", i.e. under the constraints that follow; this abbreviation appears very often in SVM papers. This is in fact a constrained quadratic programming (QP) problem, and a convex one. A convex problem has no local optima: imagine a funnel, wherever you drop a ball into it, the ball ends up at the same spout, that is, at the global optimum. The constraints after s.t. can be viewed as a convex polyhedron, and what we need to do is find the optimal solution inside this convex region. We will not expand on these topics here, because doing so would fill a book; if you have questions, consult Wikipedia.

Second, transforming it into a dual problem and solving it:

This optimization problem can be solved with the Lagrange multiplier method, using the theory of the KKT conditions. Here we directly write down the Lagrangian objective function of this formula:
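For reference, with multipliers α_i ≥ 0 attached to the constraints y_i(w·x_i + b) ≥ 1, the Lagrangian referred to here is usually written as (a reconstruction, since the formula image is not shown):

L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{R} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0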

Solving it requires some knowledge of Lagrangian duality (pluskid also has an article on this topic) and involves a certain amount of formula derivation. If you are not interested, you can skip to the conclusion expressed by the blue formula. This part mainly follows pluskid's article.

First, to minimize L with respect to w and b, we take the partial derivatives of L with respect to w and b and set each of them to 0, obtaining two expressions in the original variables.
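The two resulting equations are standard; written out here since the images are not shown:

\frac{\partial L}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{R} \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ \sum_{i=1}^{R} \alpha_i y_i = 0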

Substituting these two equations back into L(w, b, a) gives the expression of the dual problem.

The new problem, together with its constraints, is (the dual problem):
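In its standard form (reconstructed here, since the formula image is not reproduced), the dual problem reads:

\max_{\alpha} \ \sum_{i=1}^{R} \alpha_i - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
\qquad \text{s.t.} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{R} \alpha_i y_i = 0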

This is the formula we ultimately need to optimize. At this point we have obtained the optimization problem for the linearly separable case.

There are many methods for solving this problem, such as SMO. In my view, solving this constrained convex optimization problem and deriving it are two relatively independent matters, so this article will not discuss how to solve it at all; if there is time later, an article on that topic can be added :)

Third, the linearly inseparable case (soft margin):

Now let's discuss the linearly inseparable case, because the assumption of linear separability is too restrictive:

The figure below shows a typical linearly inseparable classification problem: there is no way to use a straight line to split it into two regions that each contain points of only one color.

There are two ways to build a classifier in this case. One is to separate the points completely with a curve, which is the nonlinear situation related to the kernel functions discussed later.

The other is to still use a straight line, but without requiring perfect separation, that is, to tolerate the misclassified cases while adding a penalty function so that the result is as reasonable as possible. In fact, in many cases it is not desirable to make the classification function fit the training data as perfectly as possible, because some of the training data are inherently noisy and may have been labeled incorrectly when the class labels were assigned by hand. If we learn these wrong points during training, the model will inevitably make mistakes when it encounters similar cases later (if a teacher teaches you a knowledge point incorrectly and you believe it, you are bound to get it wrong in the exam). Learning this kind of "noise" is overfitting, which is a major taboo in machine learning: we would rather learn a little less than absorb wrong knowledge. Back to the subject: how do we divide linearly inseparable points with a straight line?

We can add a small penalty for each misclassified point; the penalty for a wrong point is the distance from that point to its correct region:

In the figure above, the blue and red lines are the boundaries on which the support vectors lie, the green line is the decision function, and the purple segments represent the distances from the misclassified points to the corresponding boundary. With this, we can add a penalty function to the original objective, together with the following constraints:
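Written out (a standard soft-margin form consistent with the description below; the penalty term C Σ ε_i is the part the next paragraph calls the "blue part"):

\min_{w,\,b,\,\varepsilon} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{R} \varepsilon_i
\qquad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \varepsilon_i, \quad \varepsilon_i \ge 0, \quad i = 1, \dots, R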

The blue part of the formula is the penalty added to the linearly separable problem. When a point x_i is on the correct side, ε_i = 0. R is the number of points, and C is a user-specified coefficient indicating how heavily misclassified points are penalized. When C is very large, there will be fewer misclassified points, but overfitting may be more severe; when C is very small, there may be many misclassified points, and the resulting model may not be accurate. So there is quite a bit of know-how in choosing C, although in most cases it is chosen from experience.

The next step is again to solve a Lagrangian dual problem, which gives the dual of this problem:
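The resulting dual (standard form, reconstructed here) keeps the same objective as before; only the constraint on α changes:

\max_{\alpha} \ \sum_{i=1}^{R} \alpha_i - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)
\qquad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{R} \alpha_i y_i = 0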

The blue part is where this differs from the dual of the linearly separable problem. In the linearly inseparable case, the only difference in the dual is that the range of α changes from [0, +∞) to [0, C]; the added penalty ε does not add any complexity to the dual problem.

Fourth, kernel function:

In the inseparable case just discussed, I mentioned that with certain nonlinear methods we can obtain a curve that perfectly separates the two classes, for example the kernel functions discussed next.

We can map the original linear space into a higher-dimensional space and then separate the data with a hyperplane in that high-dimensional linear space. Here is an example to help understand how higher dimensions help with classification (the example and pictures come from the kernel-function section of pluskid's article):

The following figure is a typical linear inseparable case.

But when we map these two sets of elliptical points into a high-dimensional space, with the mapping function:

Using this function, the points in the plane of the figure above can be mapped into a three-dimensional space (z1, z2, z3), and after rotating the mapped coordinates we obtain a linearly separable set of points.
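The exact mapping used in pluskid's example is not reproduced here, but one concrete degree-2 mapping of this kind (used purely as an illustration) sends a point (x1, x2) of the plane into three dimensions:

(z_1, z_2, z_3) = \varphi(x_1, x_2) = \left( x_1^2, \ \sqrt{2}\, x_1 x_2, \ x_2^2 \right)

Under this mapping, any centered ellipse a x1² + b x1x2 + c x2² = r in the plane becomes the plane a z1 + (b/√2) z2 + c z3 = r in the new coordinates, so two classes lying on different such ellipses become linearly separable.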

To use a more philosophical example: no two objects in the world are exactly identical. For any two objects, we can keep adding dimensions until they finally become distinguishable. For example, two books may look the same in the two dimensions (color, content); we can then add the dimension of author, and if that is not enough, the page count, the owner, the place of purchase, the notes inside, and so on. When the number of dimensions grows to infinity, any two objects can certainly be separated.

Recall the expression of the dual problem you just got:

We can modify the red part so that it becomes:
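In other words, the inner product x_i·x_j in the dual is replaced by a kernel k(x_i, x_j), so the objective becomes (the constraints are unchanged; this is the standard form, written out since the image is not shown):

\sum_{i=1}^{R} \alpha_i - \frac{1}{2} \sum_{i=1}^{R} \sum_{j=1}^{R} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)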

What this formula does is map the linear space into a high-dimensional space. There are many choices of k(x, x_j); here are two typical ones:
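The two kernels referred to here are commonly written as follows (standard forms, reconstructed since the images are not shown); the first is the one the next paragraph calls the polynomial kernel and the second the Gaussian kernel:

k(x, x_j) = (x \cdot x_j + c)^d
\qquad
k(x, x_j) = \exp\!\left( -\frac{\|x - x_j\|^2}{2\sigma^2} \right)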

The first is called the polynomial kernel and the second the Gaussian kernel. The Gaussian kernel even maps the original space into an infinite-dimensional space. Kernel functions also have some nice properties, for example they do not add much extra computation compared with the linear case, and so on. In general, different kernels may give different results for the same problem, and usually one has to try them to find out.

Fifth, some other questions:

How to handle multi-class classification:

For an N-class problem there are mainly two approaches: one is 1 vs (N−1) and the other is 1 vs 1. In the former, we train N classifiers, where the i-th classifier decides whether a point belongs to class i or to the complement of class i.

In the latter, we train N * (N−1) / 2 classifiers, each of which decides whether a point belongs to class i or class j.

This approach is used not only in SVM but also widely in many other classifiers. According to the conclusions of Professor Lin (the author of libsvm), 1 vs 1 works better than 1 vs (N−1).
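As a small illustration (not from the original article), here is how the two strategies can be tried with scikit-learn, whose SVC is built on libsvm; the dataset and parameters are placeholders:

# Hypothetical sketch: comparing one-vs-rest and one-vs-one SVM strategies
# with scikit-learn (whose SVC wraps libsvm). Dataset and parameters are
# illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)                       # 3-class toy data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0))  # N binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0))   # N*(N-1)/2 classifiers

for name, clf in [("one-vs-rest", ovr), ("one-vs-one", ovo)]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", clf.score(X_te, y_te))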

Can SVM overfit?

SVM avoids overfitting in two ways. One is by adjusting the C in the penalty function mentioned earlier. The other can be seen from the formula itself: does min ||w||^2 look familiar? We have seen this term in regularized least-squares regression, where it makes the function smoother. So SVM is a model that does not overfit easily.
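To make the role of C concrete, here is a hedged sketch (not from the original article) showing how small and large values of C trade margin width against training errors with scikit-learn's SVC; the data is synthetic:

# Hypothetical sketch: effect of the penalty coefficient C on a soft-margin SVM.
# Small C tolerates more training errors (wider margin, stronger regularization);
# large C penalizes errors heavily and can overfit noisy labels.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_)   # M = 2 / ||w||
    print(f"C={C:>6}: train accuracy={clf.score(X, y):.3f}, margin width={margin:.3f}")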

This is the end of the article on the basic knowledge points of SVM in machine learning. Thank you for reading; I hope it has given you a basic understanding of SVM.
