This article mainly explains what DBSCAN is in machine learning. Interested readers may wish to take a look: the method introduced here is simple, fast, and practical.
I. Introduction
DBSCAN is a density-based clustering algorithm. It assumes that categories can be determined by the compactness of the sample distribution: samples of the same category lie close to one another, that is, not far from any sample of a category there must be other samples of that same category.
By grouping closely connected samples into one category, we obtain a cluster; by grouping all the sets of closely connected samples into their respective categories, we obtain the final clustering result.
II. Definition of DBSCAN density
In the previous section we qualitatively described the basic idea of density clustering; in this section we look at how DBSCAN describes density. DBSCAN characterizes the compactness of a sample set through neighborhoods, using a pair of parameters (ϵ, MinPts) to describe how tightly the samples in a neighborhood are distributed.
Here, ϵ is the distance threshold that defines a sample's neighborhood, and MinPts is the threshold on the number of samples within that ϵ-neighborhood.
Assuming the sample set is D = (x1, x2, ..., xm), the concrete density definitions of DBSCAN are as follows:
1) ϵ-neighborhood: for a sample xj ∈ D, its ϵ-neighborhood is the set of samples in D whose distance from xj is at most ϵ. The number of points in this region is denoted |Nϵ(xj)|.
Note: 1. In code practice, the distance measure needs to be defined for the business at hand. 2. MinPts should be no less than 2; otherwise density-connected points cannot be merged into a cluster, and the original meaning of DBSCAN is lost.
2) Core object: for any xj ∈ D, if its ϵ-neighborhood contains at least MinPts samples, i.e. |Nϵ(xj)| ≥ MinPts, then xj is a core object. In plain terms, a point qualifies as a core object when enough points lie close around it.
3) Directly density-reachable: if xi lies in the ϵ-neighborhood of xj and xj is a core object, then xi is said to be directly density-reachable from xj. Note that the reverse may not hold: we cannot say that xj is directly density-reachable from xi unless xi is also a core object.
4) Density-reachable: for xi and xj, if there exists a sample sequence p1, p2, ..., pT satisfying p1 = xi, pT = xj, and p(t+1) directly density-reachable from p(t), then xj is said to be density-reachable from xi. In plain terms, xi and xj can be connected through a chain of other core objects.
All the intermediate samples in such a sequence must be core objects, because only a core object can make other samples directly density-reachable. Note that density-reachability does not satisfy symmetry either, which follows from the asymmetry of direct density-reachability.
5) Density-connected: for xi and xj, if there exists a core object sample xk such that both xi and xj are density-reachable from xk, then xi and xj are density-connected. Note that density-connectedness is symmetric.
You can easily understand the above definitions from the figure below, where MinPts = 5. The red points are all core objects, because each has at least 5 samples in its ϵ-neighborhood; the black samples are non-core objects. Every sample directly density-reachable from a core object lies within the hypersphere centered on that red core object; samples outside the hyperspheres are not directly density-reachable. The core objects linked by green arrows form density-reachable sample sequences, and all samples in the ϵ-neighborhoods of these sequences are density-connected to one another.
With the above definitions in hand, the definition of clustering in DBSCAN is simple. Before moving on, a small code sketch of the ϵ-neighborhood and core-object checks is given below.
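As an illustration of definitions 1) and 2), here is a minimal sketch assuming Euclidean distance and NumPy; the names eps_neighborhood, is_core_object, eps, and min_pts are illustrative choices, not from the original article:

```python
import numpy as np

def eps_neighborhood(D, j, eps):
    """Return indices of all samples in D within distance eps of D[j]."""
    dists = np.linalg.norm(D - D[j], axis=1)
    return np.where(dists <= eps)[0]          # includes j itself

def is_core_object(D, j, eps, min_pts):
    """D[j] is a core object if its eps-neighborhood holds >= min_pts samples."""
    return len(eps_neighborhood(D, j, eps)) >= min_pts

# Toy data: a tight group plus one far-away point.
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
print(is_core_object(D, 0, eps=0.2, min_pts=4))  # True: 4 points nearby
print(is_core_object(D, 4, eps=0.2, min_pts=4))  # False: isolated point
```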
III. The DBSCAN density clustering idea
DBSCAN's definition of a cluster is very simple: a maximal set of density-connected samples, derived from the density-reachability relation, is one category, or cluster, of the final clustering.
A DBSCAN cluster can contain one or more core objects. If there is only one core object, every non-core sample in the cluster lies in that core object's ϵ-neighborhood. If there are multiple core objects, then the ϵ-neighborhood of any core object in the cluster must contain at least one other core object; otherwise the two core objects could not be density-reachable from each other. The union of the ϵ-neighborhoods of all these core objects forms one DBSCAN cluster.
So how do we find such a cluster of samples? The method DBSCAN uses is very simple: it arbitrarily selects a core object without a category as a seed, then finds all samples density-reachable from that core object; this set is one cluster. It then selects another core object without a category and finds its density-reachable sample set, obtaining another cluster, and it runs until every core object has a category. A minimal sketch of this loop follows below.
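Here is a minimal sketch of this seed-expansion loop, assuming Euclidean distance and NumPy; the function name dbscan and the label convention (-1 for noise) are illustrative choices:

```python
import numpy as np

def dbscan(D, eps, min_pts):
    """Compact sketch of the seed-expansion procedure described above.
    Labels: -1 = noise, 0..k-1 = cluster ids."""
    n = len(D)
    labels = np.full(n, -1)            # everything starts as noise
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neigh = list(np.where(np.linalg.norm(D - D[i], axis=1) <= eps)[0])
        if len(neigh) < min_pts:
            continue                   # not a core object; stays noise for now
        labels[i] = cluster            # new seed: start a cluster
        queue = neigh
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border or core point joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            neigh_j = np.where(np.linalg.norm(D - D[j], axis=1) <= eps)[0]
            if len(neigh_j) >= min_pts:
                queue.extend(neigh_j)  # j is core: expand through its neighborhood
        cluster += 1
    return labels

# Two tight groups plus one isolated point.
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [10, 10]], dtype=float)
print(dbscan(X, eps=0.3, min_pts=3))   # [0 0 0 1 1 1 -1]
```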
That is essentially the whole content of the DBSCAN algorithm. Simple, isn't it? But there are still three questions we have not considered.
The first concerns anomalous sample points, or the few sample points that lie outside every cluster and are not around any core object. In DBSCAN, we generally mark such points as noise.
The second is distance measurement, that is, how to compute the distance between a sample and a core object. DBSCAN generally adopts the nearest-neighbor idea and measures sample distance with some distance metric, such as Euclidean distance. This is exactly the nearest-neighbor idea of the KNN classification algorithm. For a small number of samples, the nearest neighbors can be found by directly computing all pairwise distances; if the sample size is large, a KD tree or ball tree is generally used to search the nearest neighbors quickly. If you are not familiar with nearest neighbors, distance metrics, KD trees, or ball trees, refer to the earlier article, a summary of the principles of the K-nearest neighbor (KNN) method. A brief example of neighborhood queries via a KD tree is shown below.
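As an illustration of this fast-search point, the sketch below queries ϵ-neighborhoods through scikit-learn's NearestNeighbors with a KD tree instead of computing all pairwise distances; the radius and random data are arbitrary choices:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
D = rng.normal(size=(10_000, 3))       # 10k samples in 3 dimensions

# Build the KD tree once, then answer radius queries against it.
nn = NearestNeighbors(radius=0.3, algorithm='kd_tree',
                      metric='euclidean').fit(D)
neighborhoods = nn.radius_neighbors(D, return_distance=False)
print(len(neighborhoods[0]))           # samples within eps=0.3 of the first point
```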
The third question is more particular: a sample may lie within distance ϵ of two core objects that are not directly density-reachable from each other and do not belong to the same cluster. How is the category of such a sample defined? DBSCAN generally takes a first-come-first-served approach: whichever cluster is formed first marks the sample as its own. In other words, DBSCAN's algorithm is not completely stable.
IV. Summary of DBSCAN
Compared with the traditional K-Means algorithm, the biggest difference of DBSCAN is that it does not require the number of categories k as input. Its biggest advantage is that it can find clusters of arbitrary shape, unlike K-Means, which is generally only suitable for convex sample clusters. At the same time, it can find outliers while clustering, in which it is similar to the BIRCH algorithm.
So when should we use DBSCAN for clustering? In general, if the dataset is dense and not convex, DBSCAN clustering works much better than K-Means clustering; if the dataset is not dense, DBSCAN is not recommended. The sketch below illustrates this on a non-convex toy dataset.
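To make the comparison concrete, here is a minimal sketch assuming scikit-learn is available; the make_moons toy data and the values eps=0.2 and min_samples=5 are illustrative choices, not values from the original article:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moons: dense, but not convex.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(set(db.labels_))   # DBSCAN typically recovers the two moons (labels 0 and 1)
print(set(km.labels_))   # K-Means cuts them with a roughly straight boundary instead
```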
Here is a summary of the advantages and disadvantages of the DBSCAN algorithm.
The main advantages of DBSCAN are:
1) It can cluster dense datasets of arbitrary shape. In contrast, clustering algorithms such as K-Means are generally only suitable for convex datasets.
2) It can find outliers while clustering and is not sensitive to outliers in the dataset.
3) The clustering result has no inherent bias from initialization. In contrast, the initial values of clustering algorithms such as K-Means have a great influence on the clustering result.
The main disadvantages of DBSCAN are:
1) If the density of the sample set is uneven and the density gap between clusters is large, the clustering quality is poor, so DBSCAN is generally unsuitable in that case.
2) If the sample set is large, clustering convergence takes a long time; this can be improved by limiting the size of the KD tree or ball tree built for nearest-neighbor search.
3) Parameter tuning is somewhat more complex than for traditional clustering algorithms such as K-Means: the distance threshold ϵ and the neighborhood sample-count threshold MinPts must be tuned jointly, and different parameter combinations have a great impact on the final clustering result. The small experiment below shows this sensitivity.
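To illustrate the sensitivity mentioned in 3), here is a small sketch assuming scikit-learn; the (ϵ, MinPts) grid and the make_moons data are illustrative choices:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Same data, a few (eps, MinPts) combinations, very different outcomes.
for eps in (0.05, 0.2, 0.5):
    for min_samples in (3, 5, 20):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        print(f"eps={eps}, MinPts={min_samples}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```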
At this point, I believe you have a deeper understanding of what DBSCAN is in machine learning. You might as well try it out in practice.