Read the article to understand how to do cluster analysis on mixed data!


2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

Generally speaking, clustering unlabeled data is not an easy task. Because there are no labels, there is no direct way to measure how accurate the resulting segmentation is, which makes it harder to process and explore the data.

In addition, introductory courses on unsupervised learning tend to discuss ideal cases such as k-means tutorials, and k-means applies only to numerical features.

In this article, the author walks through an unsupervised classification exercise on mixed-type data using the R language.

The first part covers methodology: the author discusses how the mathematical concept of distance is used to measure the similarity between individuals, then introduces the PAM clustering algorithm (Partitioning Around Medoids) and a method for selecting the best number of clusters (the silhouette coefficient).

In the second part, the author illustrates the method with the Bank Marketing dataset from the UCI Machine Learning Repository and some functions from the Rtsne package. The dataset relates to the telemarketing campaigns of a Portuguese banking institution. It is usually discussed in the context of supervised learning; here it is used for an unsupervised exercise.

Part I: methodology

How to measure similarity

When clustering unknown data, data scientists cannot afford to be like the blind men touching the elephant, each seeing only one side of things. Instead, they rely on a "distance" between observations, which gives a consistent and more comprehensive view across all features.

Distance is a numerical measure of how far apart individuals are, that is, a measure of proximity or similarity between individuals. Among the many available metrics, the author focuses on the Gower distance (1971).

The Gower distance is computed as the average of the partial dissimilarities between two individuals across all features, and it lies in the range [0, 1]:

$$d(i, j) = \frac{1}{p} \sum_{f=1}^{p} d_{ij}^{(f)}$$

where p is the number of features and the partial dissimilarity $d_{ij}^{(f)}$ depends on the type of the variable being evaluated. Each feature is therefore standardized in its own way, and the distance between two individuals is the average of all feature-specific distances.

For a numerical feature f, the partial dissimilarity is the ratio between the absolute difference of the observations $x_i$ and $x_j$ and the maximum range observed over all individuals: $d_{ij}^{(f)} = \frac{|x_i - x_j|}{|\max_N(x) - \min_N(x)|}$, where N is the number of individuals in the dataset.

Figure: calculation of the partial dissimilarity for a numerical feature (R_f = maximum range observed)

For a qualitative feature f, the partial dissimilarity is equal to 1 if the two observed values are different, and 0 otherwise.

Note: the Gower distance is available through the daisy() function in the R cluster package. Features are first standardized automatically (that is, rescaled to fall within the [0, 1] range).
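To make the two rules above concrete, here is a minimal from-scratch sketch in Python. It is not the implementation behind daisy(): the function name gower_matrix, the numeric_cols parameter and the toy customers are assumptions made for illustration, and the sketch handles neither missing values nor feature weights.

```python
def gower_matrix(rows, numeric_cols):
    """Pairwise Gower dissimilarities for a list of equal-length tuples.

    numeric_cols: set of column indices treated as numeric; every other
    column is treated as qualitative (simple matching).
    """
    n = len(rows)
    p = len(rows[0])

    # Range of each numeric column, used to rescale differences to [0, 1].
    ranges = {}
    for f in numeric_cols:
        values = [row[f] for row in rows]
        ranges[f] = max(values) - min(values)

    def partial(i, j, f):
        if f in numeric_cols:
            # A constant column cannot separate anyone: count it as 0.
            if ranges[f] == 0:
                return 0.0
            return abs(rows[i][f] - rows[j][f]) / ranges[f]
        return 0.0 if rows[i][f] == rows[j][f] else 1.0

    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Average of the partial dissimilarities over all p features.
            value = sum(partial(i, j, f) for f in range(p)) / p
            dist[i][j] = dist[j][i] = value
    return dist


# Toy customers (made-up values): (age, job, housing loan)
customers = [
    (30, "management", "no"),
    (50, "blue-collar", "yes"),
    (40, "management", "yes"),
]
d = gower_matrix(customers, numeric_cols={0})
```

In this toy example the first two customers differ on every feature and sit at the two ends of the age range, so their distance reaches the upper bound of the [0, 1] interval.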

Clustering algorithm: Partitioning Around Medoids (PAM)

The Gower distance fits well with the k-medoids algorithm. K-medoids is a classical partitioning technique that splits a dataset of n objects into k clusters, where k is chosen in advance.

Very similar to the k-means algorithm, PAM has the following characteristics:

Advantages: it is intuitive, it is more robust to noise and outliers than k-means (because of the properties of the distances used), and it produces a "typical individual" (the medoid) for each cluster.

Disadvantages: it is time-consuming and computationally intensive (runtime and memory are quadratic in the number of individuals).
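The idea behind k-medoids can be sketched in a few lines of Python. This is a simplified greedy variant written for illustration, not the exact procedure used by the R cluster package: the function name pam, its max_iter parameter and the toy points are assumptions. It works on a precomputed dissimilarity matrix, which is why memory grows quadratically with the number of individuals.

```python
def pam(dist, k, max_iter=100):
    """Naive k-medoids on a precomputed dissimilarity matrix.

    Returns (medoids, labels), where labels[i] is the position in
    `medoids` of the medoid closest to observation i.
    """
    n = len(dist)

    def cost(medoids):
        # Total dissimilarity of every point to its nearest medoid.
        return sum(min(dist[i][m] for m in medoids) for i in range(n))

    # Greedy initialisation: add, one at a time, the point that lowers
    # the total cost the most.
    medoids = []
    for _ in range(k):
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: cost(medoids + [c])))

    # Swap phase: replace a medoid by a non-medoid while that lowers cost.
    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        for pos in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
        if not improved:
            break

    labels = [min(range(k), key=lambda pos: dist[i][medoids[pos]])
              for i in range(n)]
    return medoids, labels


# Toy example (made-up values): two obvious groups on a line.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, k=2)
```

Each medoid is an actual observation, so it can be read directly as the "typical individual" of its cluster. In practice the matrix passed to such a function would hold Gower dissimilarities rather than the absolute differences used in this toy example.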

Evaluate consistency within a data cluster

Unless you have a good a priori reason to force a specific number of clusters k, you may want to ask the computer for a statistics-based recommendation. Several methods exist to assess the relevance of the chosen number of clusters. In Part II, we use the silhouette coefficient.
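For reference, the silhouette width of an individual compares a, its average dissimilarity to the other members of its own cluster, with b, its lowest average dissimilarity to the members of any other cluster: s = (b - a) / max(a, b). The sketch below computes it from a dissimilarity matrix. The function name silhouette_widths and the toy data are assumptions, and setting singleton clusters to 0 is a common convention rather than something stated in this article.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every observation (needs at least 2 clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)

    widths = []
    for i in range(len(dist)):
        own = clusters[labels[i]]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: average dissimilarity to the other members of the same cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: lowest average dissimilarity to the members of another cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Toy example (made-up values): two well-separated groups on a line.
points = [1, 2, 3, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
widths = silhouette_widths(dist, [0, 0, 0, 1, 1, 1])
average_width = sum(widths) / len(widths)
```

Widths close to 1 indicate individuals that sit well inside their cluster, while negative values suggest a probable misassignment. Comparing the average width across candidate values of k gives the statistics-based recommendation mentioned above.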

Interpretation

There are basically two ways to investigate the results of such a clustering exercise in order to derive a meaningful interpretation.

1. Summarize each cluster, for example with the summary() function in R.

2. Visualize the clusters in a lower dimensional space with t-SNE, a dimensionality reduction technique that is especially suitable for visualizing high-dimensional datasets.

Both approaches are covered in the use case (Part II). Let's apply and interpret!

Part II: use cases

In this use case, we will try to group bank customers based on the following characteristics:

Age (numeric)

Job type (categorical): 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown'

Marital status (categorical): 'divorced', 'married', 'single', 'unknown'

Education (categorical): 'primary', 'secondary', 'tertiary', 'unknown'

Default: does the customer have credit in default? (categorical): 'no', 'yes', 'unknown'

Balance (numeric): average yearly balance in euros

Housing: does the customer have a housing loan? (categorical): 'no', 'yes', 'unknown'

Most similar and most dissimilar customers according to the Gower distance:

In a business environment, we usually look for a number of clusters that is both meaningful and easy to remember, that is, between 2 and 8 at most. The silhouette plot helps us identify the best option.

The silhouette width is highest for k = 7, but k = 5 is simpler, so we choose k = 5.

Interpretation

Summary of each cluster

Here, you can try to derive some common patterns for the customers within each cluster. For example, cluster 1 consists of "management x tertiary x no default x no housing" customers, and cluster 2 consists of "blue-collar x secondary x no default x housing" customers.
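A rough Python counterpart to this kind of per-cluster summary is sketched below. It reports only the median of numeric features and the most frequent value of qualitative ones, which is much less than R's summary() prints. The function name cluster_profiles, the toy rows and their labels are made up for illustration and are not taken from the bank dataset.

```python
from collections import Counter
from statistics import median


def cluster_profiles(rows, labels, numeric_cols):
    """Median (numeric) or most frequent value (qualitative) per cluster."""
    grouped = {}
    for row, lab in zip(rows, labels):
        grouped.setdefault(lab, []).append(row)

    profiles = {}
    for lab, members in grouped.items():
        profile = []
        for f in range(len(members[0])):
            column = [member[f] for member in members]
            if f in numeric_cols:
                profile.append(median(column))
            else:
                profile.append(Counter(column).most_common(1)[0][0])
        profiles[lab] = tuple(profile)
    return profiles


# Toy rows (made-up values): (age, job, education, housing loan)
rows = [
    (45, "management", "tertiary", "no"),
    (39, "management", "tertiary", "no"),
    (51, "technician", "tertiary", "no"),
    (33, "blue-collar", "secondary", "yes"),
    (41, "blue-collar", "secondary", "yes"),
]
labels = [1, 1, 1, 2, 2]
profiles = cluster_profiles(rows, labels, numeric_cols={0})
```

Reading the resulting tuples side by side gives the kind of "job x education x housing" pattern described above for each cluster.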

Visualization in lower dimensional space

Although the result is not perfect (especially for cluster 3), points of the same color mostly fall in the same areas, which supports the relevance of the segmentation.

Conclusion

This article reviews the author's approach to applying clustering algorithms to unlabeled mixed-type datasets. The author hopes it offers some useful ideas to other data scientists.
