This article explains what clustering analysis in Python is. The explanation is simple and clear, and easy to learn and understand.
What is cluster analysis?
Cluster analysis, or clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar to each other (in some sense) than to objects in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis, used in many fields including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Cluster analysis itself is not one specific algorithm but a general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including parameters such as the distance function, a density threshold, or the number of expected clusters) depend on the individual data set and the intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error. It is often necessary to adjust the data preprocessing and the model parameters until the result achieves the desired properties.
Common clustering methods
Common clustering algorithms can be grouped into partition-based, hierarchical, density-based, grid-based, statistical, and model-based methods. Typical algorithms include K-means (the classic clustering algorithm), DBSCAN, two-step clustering, BIRCH, and spectral clustering.
K-means
K-means is one of the most commonly used clustering algorithms, but it requires attention to data anomalies:
Outliers in the data. Outliers can obviously distort the distances between points, and this effect is very significant. Therefore, under a distance-based similarity measure, handling outliers is essential.
Differences in scale between dimensions. If different dimensions or variables differ in numerical scale or units, the variables need to be normalized or standardized before computing distances. For example, the bounce rate falls in the range [0, 1], while the order amount and the order quantity may span ranges like [0, 10000000]. Without normalization or standardization, the similarity would be dominated by the order amount.
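To make the scaling point concrete, here is a minimal sketch (not from the original text; the numbers are made up for illustration) that standardizes each column before running K-means, so that a wide-range variable such as the order amount no longer dominates the distance calculation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Column 0: bounce rate in [0, 1]; column 1: order amount with a much larger range
X = np.array([[0.3, 520000.0],
              [0.7, 310000.0],
              [0.2, 498000.0],
              [0.9, 15000.0]])

# Standardize each column to zero mean and unit variance before clustering
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, random_state=2018, n_init=10).fit_predict(X_scaled)
print(labels)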
DBSCAN
Noisy data can be handled by the DBSCAN clustering method. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise, that is, density-based spatial clustering with noise.
Compared with K-means, it has the following advantages:
It makes no strong assumptions about the distribution of the original data and can adapt to spatial clusters of any shape, so it applies to a wider range of data sets, especially for identifying non-convex, ring-shaped, and other irregularly shaped clusters.
There is no need to specify the number of clusters in advance, and it imposes few prior requirements on the results.
Because DBSCAN distinguishes core points, border points, and noise points, it filters noise well and can effectively handle noisy data.
Because it operates on the whole data set and uses a single global parameter to represent density during clustering, it has obvious weaknesses:
For high-dimensional data, the definition based on radius and density becomes problematic.
When the density varies too much across clusters, the clustering result is poor.
When the amount of data increases, a large amount of memory is required and the I/O cost is also very high.
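As a quick illustration of the interface (the eps and min_samples values below are assumptions, not recommendations from the text), DBSCAN can be run on the same kind of blob data used later in this article; points labeled -1 are the noise points mentioned above:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=[[1, 1], [-1, -1]], cluster_std=0.7, random_state=2018)

# eps is the neighborhood radius, min_samples is the density threshold for a core point
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

n_found_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_found_clusters, n_noise)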
MiniBatchKMeans
K-means performs very well in terms of stability, efficiency, and accuracy (relative to the true labels), and this remains true when dealing with large amounts of data. The upper bound of its time complexity is O(nkt), where n is the sample size, k is the number of clusters, and t is the number of iterations. When the number of clusters and the number of iterations are fixed, the time consumed by K-means is related only to the sample size, so it grows linearly.
However, with very large amounts of data, the slow computation of K-means causes delays, especially when the algorithm is used for real-time processing. To address this problem, many extension algorithms have appeared, and MiniBatchKMeans is a typical representative. MiniBatchKMeans uses a method called mini-batch processing to compute the distances between data points. The advantage of mini-batches is that not all data samples are used during computation; instead, a subset of samples is drawn to represent the whole and participate in the clustering. Because the computed sample size is small, the running time decreases accordingly; on the other hand, because of the sampling, the sampled subset cannot fully represent all characteristics of the whole sample, so accuracy drops slightly, though not noticeably.
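The sketch below (a rough illustration, not a benchmark; the sample size and batch_size are assumed values) shows that MiniBatchKMeans is used in the same way as KMeans, only with an extra batch_size parameter:

import time
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200000, centers=3, random_state=2018)

t0 = time.time()
KMeans(n_clusters=3, random_state=2018, n_init=10).fit(X)
t_full = time.time() - t0

t0 = time.time()
MiniBatchKMeans(n_clusters=3, batch_size=1024, random_state=2018, n_init=10).fit(X)
t_mini = time.time() - t0

# On large samples the mini-batch variant typically finishes much faster,
# at the cost of a small loss in accuracy
print('KMeans: %.2fs, MiniBatchKMeans: %.2fs' % (t_full, t_mini))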
Spectral clustering
In the context of big data, there are many high-dimensional data scenarios, such as e-commerce transaction data and web text data. Clustering high-dimensional data takes a long time, and the accuracy and stability of the results are often unsatisfactory. This is because, in high-dimensional data, computing distance-based similarity is very inefficient; with too many features, the probability that clusters exist across all dimensions is very low; and because of sparsity and nearest-neighbor effects, distance-based similarities are almost all close to zero, so data clusters rarely emerge in high-dimensional space. In this case, we can choose subspace clustering or dimensionality reduction.
Subspace clustering algorithms extend traditional clustering algorithms to high-dimensional data space. The idea is to select the dimensions closely related to a given cluster and then cluster in the corresponding subspace. Spectral clustering, for example, is a subspace clustering method; because the way the relevant dimensions are selected and the way subspaces are evaluated must be customized, this method places higher demands on the operator.
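For reference, here is a small sketch of spectral clustering with scikit-learn's SpectralClustering (the affinity and n_neighbors settings are assumptions for illustration); the two-moons data is a typical non-convex shape that K-means handles poorly:

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# A non-convex, two-moon shaped data set
X, _ = make_moons(n_samples=500, noise=0.05, random_state=2018)

sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=10, random_state=2018)
labels = sc.fit_predict(X)
print(labels[:20])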
Using cluster analysis as an intermediate processing step
Image compression
The process of representing the original pixel matrix with a small amount of data is called image encoding. A notable characteristic of digital images is their large data volume, which takes up a lot of storage space and makes image storage, computation, and transmission inconvenient. Therefore, most images on digital networks are compressed before further use, and one method of image compression is the clustering algorithm.
When using a clustering algorithm for image compression, we define the number of colors K (for example, 128 colors), and this number of colors is the number of clusters; the K-means clustering algorithm puts similar colors into K clusters, and each cluster is then represented by a single color in place of the original colors. The result is an image composed of as many colors as there are clusters, thereby achieving image compression.
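A minimal color-quantization sketch of this idea is shown below (the file name 'photo.png' and K = 128 are hypothetical choices): each pixel color is replaced by the center of the cluster it belongs to.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('photo.png')        # H x W x channels array of pixel colors (hypothetical file)
h, w, d = img.shape
pixels = img.reshape(-1, d)          # one row per pixel

K = 128                              # number of colors after compression = number of clusters
km = KMeans(n_clusters=K, random_state=2018, n_init=4).fit(pixels)

# Replace every pixel with its cluster center, then restore the image shape
compressed = km.cluster_centers_[km.labels_].reshape(h, w, d)
plt.imshow(compressed)
plt.show()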
Image segmentation
Image segmentation divides an image into several specific regions with unique properties and extracts the objects of interest; it is a key step in image processing and analysis. The objects extracted after segmentation can be used in image semantic recognition, image search, and other fields. For example, the foreground face is segmented from an image and then used for face recognition. Clustering is one of the image segmentation methods; the key to its implementation is to cluster on image color features that clearly differ between regions, and the number of clusters is the number of regions to be segmented.
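Below is a rough sketch of segmentation by clustering pixel features (the input file and the number of regions are hypothetical); besides color, the normalized pixel coordinates are added as features so that nearby pixels tend to fall into the same region:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

img = plt.imread('photo.png')                     # hypothetical input image
h, w, d = img.shape

# Features: pixel color plus normalized (row, column) coordinates
ys, xs = np.mgrid[0:h, 0:w]
coords = np.stack([ys.ravel() / h, xs.ravel() / w], axis=1)
features = np.hstack([img.reshape(-1, d), coords])

n_segments = 4                                    # number of regions to segment (assumed)
labels = KMeans(n_clusters=n_segments, random_state=2018, n_init=4).fit_predict(features)
segmentation = labels.reshape(h, w)               # per-pixel region index

plt.imshow(segmentation)
plt.show()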
Image understanding
In image understanding, there is a method called region-based extraction. Region-based extraction is carried out on the premise of image segmentation and object recognition; it uses object templates and scene classifiers, and mines semantics by recognizing the topological relationships between objects to generate the corresponding scene semantic information. For example, the segmented image regions are clustered by color, shape, and other features to form a small number of BLOBs, and then the co-occurrence probability of BLOBs and keywords is computed with a CMRM model.
Anomaly detection
There are many ways to implement anomaly detection, among which a commonly used one is distance-based anomaly detection. Even if the data set does not fit any specific distribution model, it can still find outliers effectively, and when the spatial dimension is relatively high, its efficiency is much higher than that of density-based methods. When implementing the algorithm, we first compute the distances between data samples (Manhattan distance, Euclidean distance, and so on); after preprocessing the data, outliers can then be detected according to the distance-based definition.
For example, K-means clustering can be used to extract the data points that are farthest from the cluster centers, or that do not belong to any cluster, and define them as outliers.
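A hedged sketch of that idea follows (the 1% threshold is an assumption for illustration): fit K-means, measure each point's distance to its nearest cluster center, and flag the most distant points as outliers.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=2000, centers=[[1, 1], [-1, -1]], cluster_std=0.7, random_state=2018)
km = KMeans(n_clusters=2, random_state=2018, n_init=10).fit(X)

# transform() returns the distance of each sample to every cluster center
dist_to_center = np.min(km.transform(X), axis=1)

# Treat the farthest 1% of points as outliers (assumed threshold)
threshold = np.percentile(dist_to_center, 99)
outliers = X[dist_to_center > threshold]
print(outliers.shape)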
The choice of clustering algorithm:
If the data is high-dimensional, choose subspace clustering (such as spectral clustering).
If the amount of data is less than one million samples, K-means works well; if it is more than one million, consider using MiniBatchKMeans.
If there is noise in the data, use density-based DBSCAN.
If the highest classification accuracy is required, spectral clustering will be more accurate than K-means.
Python code implementation
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import metrics
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
%matplotlib inline

# Data preparation
data = make_blobs(n_samples=2000, centers=[[1, 1], [-1, -1]], cluster_std=0.7, random_state=2018)
X = data[0]
y = data[1]

# Set the number of clusters and build the clustering model object
n_clusters = 2
kmeans = KMeans(n_clusters=n_clusters, random_state=2018)
# Train the clustering model
kmeans.fit(X)
# Predict with the clustering model
pre_y = kmeans.predict(X)

# # Model evaluation metrics # #
# Sum of distances from samples to their nearest cluster center
inertias = kmeans.inertia_
# Adjusted Rand index
adjusted_rand_s = metrics.adjusted_rand_score(y, pre_y)
# Mutual information
mutual_info_s = metrics.mutual_info_score(y, pre_y)
# Adjusted mutual information
adjusted_mutual_info_s = metrics.adjusted_mutual_info_score(y, pre_y)
# Homogeneity score
homogeneity_s = metrics.homogeneity_score(y, pre_y)
# Completeness score
completeness_s = metrics.completeness_score(y, pre_y)
# V-measure score
v_measure_s = metrics.v_measure_score(y, pre_y)
# Average silhouette coefficient
silhouette_s = metrics.silhouette_score(X, pre_y, metric='euclidean')
# Calinski and Harabasz score (named calinski_harabaz_score in older scikit-learn versions)
calinski_harabaz_s = metrics.calinski_harabasz_score(X, pre_y)

df_metrics = pd.DataFrame([[inertias, adjusted_rand_s, mutual_info_s, adjusted_mutual_info_s,
                            homogeneity_s, completeness_s, v_measure_s, silhouette_s,
                            calinski_harabaz_s]],
                          columns=['ine', 'ARI', 'MI', 'AMI', 'homo', 'comp', 'v_m', 'silh', 'c&h'])
df_metrics
# # Model visualization # #
centers = kmeans.cluster_centers_
# Color setting
colors = ['green', 'pink']
# Create the canvas
plt.figure(figsize=(12, 6))
titles = ['Real', 'Predict']
for j, y_ in enumerate([y, pre_y]):
    plt.subplot(1, 2, j + 1)
    plt.title(titles[j])
    # Loop over the categories
    for i in range(n_clusters):
        # Find the indices of the same category
        index_sets = np.where(y_ == i)
        # Take the data of this cluster as a subset
        cluster = X[index_sets]
        # Plot the sample points
        plt.scatter(cluster[:, 0], cluster[:, 1], c=colors[i], marker='.')
        if j == 1:
            # Mark the cluster center
            plt.plot(centers[i][0], centers[i][1], 'o', markeredgecolor='k', markersize=6)
plt.savefig('xx.png')
plt.show()
Analysis of evaluation indicators:
inertias: inertia is an attribute of the fitted K-means model object; it is the sum of squared distances of samples to their nearest cluster center. It is used as an unsupervised evaluation metric when no ground-truth labels are available. The smaller the value, the better: a smaller value means the samples within each cluster are more concentrated, that is, the within-cluster distances are smaller.
adjusted_rand_s: adjusted Rand index (Adjusted Rand Index, ARI). The Rand index computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs assigned to the same or different clusters in the predicted and true clusterings. The adjusted Rand index corrects the Rand index so that its expected value is close to 0 regardless of sample size and number of categories; its range is [-1, 1], negative values represent a bad result, and the closer to 1, the more consistent the clustering result is with the true labels.
mutual_info_s: mutual information (Mutual Information, MI). Mutual information is the amount of information one random variable contains about another; here it measures the similarity between two labelings of the same data, and the result is non-negative.
adjusted_mutual_info_s: adjusted mutual information (Adjusted Mutual Information, AMI). AMI is an adjustment of the mutual information score. It accounts for the fact that MI tends to be higher when there are more clusters, regardless of whether more information is actually shared, and corrects this effect by adjusting for chance. When two clusterings are identical (an exact match), AMI returns 1; for random partitions (independent labelings) the expected AMI is about 0, and it can also be negative.
homogeneity_s: homogeneity score (Homogeneity). A clustering result satisfies homogeneity if every cluster contains only data points that are members of a single class. The range is [0, 1]; the larger the value, the more consistent the clustering result is with the true labels.
completeness_s: completeness score (Completeness). A clustering result satisfies completeness if all data points that are members of a given class are elements of the same cluster. The range is [0, 1]; the higher the value, the more consistent the clustering result is with the true labels.
v_measure_s: the harmonic mean of homogeneity and completeness, v = 2 * (homogeneity * completeness) / (homogeneity + completeness). The range is [0, 1]; the higher the value, the more consistent the clustering result is with the true labels.
silhouette_s: silhouette coefficient (Silhouette), the mean silhouette coefficient over all samples, computed from each sample's mean intra-cluster distance and mean nearest-cluster distance. It is an unsupervised evaluation metric; the best value is 1 and the worst is -1. Values near 0 indicate overlapping clusters, and negative values usually indicate that samples have been assigned to the wrong cluster.
calinski_harabaz_s: the Calinski-Harabasz score, defined as the ratio of between-cluster dispersion to within-cluster dispersion; it is an unsupervised evaluation metric, and higher values are better.
Thank you for reading. The above is the content of "what is Python clustering analysis". After studying this article, you should have a deeper understanding of what Python clustering analysis is; the specific usage still needs to be verified in practice.