2025-03-28 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article explains in detail how cluster analysis based on user portraits is carried out in a big data project. We hope you will come away with a working understanding of the topic.
Clustering follows the old saying that "birds of a feather flock together". Its main idea is to group data into clusters according to a specific criterion, so that objects in the same cluster are as similar as possible and objects in different clusters are as different as possible. Put simply, it places similar objects in the same group.
Clustering algorithms usually do not need labeled training data: as long as the similarity between objects can be computed, the algorithm can be applied. In machine learning this is called unsupervised learning.
A large insurance company holds a large volume of policyholder data. Lacking big data technology and the staff to run it, the company has not yet built a unified data warehouse and operations platform, so the data accumulated over many years cannot deliver its due value. The company wants to build user portraits and use them for customer segmentation and personalized operations, in order to reactivate existing customers and tap a renewal market worth tens of billions. Zhongan Technology's data team modeled the company's data, produced user portraits, and built an intelligent marketing platform. The team then carried out customer clustering research based on the user portrait data and formulated personalized operation strategies.
This article focuses on the clustering algorithm in practice.
Step 1: Data preprocessing
In any big data project, preparing the data up front is tedious and dull, but it is very important work.
First, standardize the data, handle outliers, and fill in missing values. For the clustering algorithm to run smoothly, every tag in the user portrait must also be converted to numerical form.
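The missing-value and numeric-encoding steps can be sketched as follows. This is a minimal illustration in Python (standard library only) rather than the article's R. The records, the field names, and the choices of median filling and one-hot encoding are all assumptions made for the example, not the team's actual pipeline.

```python
from statistics import median

# Hypothetical user-portrait records; None marks a missing value.
users = [
    {"age": 34, "gender": "F", "annual_premium": 5200.0},
    {"age": None, "gender": "M", "annual_premium": 3100.0},
    {"age": 51, "gender": "F", "annual_premium": None},
    {"age": 45, "gender": "M", "annual_premium": 4000.0},
]

def fill_missing_with_median(rows, field):
    """Replace None in a numeric field with the median of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = median(observed)
    for r in rows:
        if r[field] is None:
            r[field] = fill

def one_hot(rows, field):
    """Replace a categorical field with one 0/1 column per category."""
    categories = sorted({r[field] for r in rows})
    for r in rows:
        value = r.pop(field)
        for c in categories:
            r[f"{field}_{c}"] = 1 if value == c else 0

for numeric_field in ("age", "annual_premium"):
    fill_missing_with_median(users, numeric_field)
one_hot(users, "gender")
```

After this step every value in each record is a number, which is what a distance-based clustering algorithm needs.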
Second, scale the numerical indicators so that every indicator has the same order of magnitude. Otherwise indicators with large values dominate the distance calculation and bias the clustering results.
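One common way to put indicators on the same scale is z-score standardization, sketched below in Python (standard library only). The column names and values are hypothetical, and the article does not say which scaling method the team used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information for clustering.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical columns on very different scales.
age = [25, 35, 45, 55]
annual_premium = [1000.0, 3000.0, 5000.0, 7000.0]

age_z = standardize(age)
premium_z = standardize(annual_premium)
```

Both columns end up with the same spread, so neither dominates a Euclidean distance simply because of its unit.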
The next step is feature extraction: reduce the dimensionality of the initial feature set and select effective features to feed into the clustering algorithm. The user portraits that Zhongan Technology customized for the insurance company contain more than 200 tags, providing rich multi-dimensional data for different operational scenarios. However, many of these tags are correlated. Two highly correlated features effectively give the same feature twice the weight, which distorts the clustering results.
Highly correlated features can be found and excluded through association rule analysis, and dimensionality can also be reduced through principal component analysis (PCA). These methods are not covered in detail here; interested readers can explore them on their own.
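As a simple illustration of excluding highly correlated features, the sketch below uses the Pearson correlation coefficient as a stand-in; it is not the association rule analysis or PCA mentioned above. The feature names, the data, and the 0.9 threshold are assumptions chosen for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: monthly_premium is nearly annual_premium / 12.
features = {
    "annual_premium": [1000.0, 3000.0, 5000.0, 7000.0, 9000.0],
    "monthly_premium": [85.0, 250.0, 415.0, 585.0, 750.0],
    "claims_count": [2.0, 0.0, 3.0, 0.0, 1.0],
}
selected = drop_correlated(features)
```

The greedy pass keeps the first feature of each correlated group, so the order of the input matters; in practice one would decide which of two redundant tags is more useful to the business.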
Step 2: Determine the number of clusters
Hierarchical clustering is a very commonly used clustering algorithm. It computes the distance between every pair of objects, merges the two nearest objects into a new one, then repeats the process with the merged objects, and so on, until all objects are combined into a single class.
Ward's method gives good classification results in practice and is widely used. It is based on the idea of analysis of variance: ideally, the sum of squared deviations within a class should be as small as possible, and the sum of squared deviations between classes should be as large as possible. This method requires the distance between samples to be Euclidean.
Note that in R's hclust function, the name used to call Ward's method has been updated from "ward" to "ward.D".
library(proxy)
Dist
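The merging procedure and Ward's criterion described above can be sketched in Python (standard library only) rather than R. This is a naive illustration on hypothetical two-dimensional points, not the article's original code. It assumes the standard form of Ward's merge cost: the increase in the within-cluster sum of squares, which is |A||B| / (|A| + |B|) times the squared Euclidean distance between the two centroids.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clustering(points, n_clusters):
    """Repeatedly merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    merge_costs = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        merge_costs.append(cost)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merge_costs

# Hypothetical, already-scaled data with two obvious groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
]
clusters, merge_costs = ward_clustering(points, n_clusters=2)
```

The recorded merge costs give one hint about the number of clusters: a sharp jump in cost means the next merge would join groups that are far apart. Here every merge within a group costs less than 1, while merging the two remaining clusters would cost 300. The pairwise search is cubic in the number of points, so real data calls for a library implementation.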