

How to optimize K-means clustering speed for time series data


This article explains in detail how to optimize K-means clustering speed for time series data. The content is of high quality, so the editor shares it as a reference. I hope you will have a solid grasp of the relevant concepts after reading it.

Time series data (Time Series Data) are data points ordered in time. Interest rates, exchange rates, and stock prices are all time series data. The time interval of a series can be minutes or seconds (as with high-frequency financial data), or days, weeks, months, quarters, years, or even longer units. New Relic, a provider of data analysis solutions, described on its blog a method for optimizing K-means clustering speed for time series data. This article is a compilation and introduction of that post.

At New Relic, we collect 1.37 billion data points every minute. A large part of the data we collect, analyze, and present for our customers is time series data. To build relationships between applications and other entities, such as servers and containers, and to build new intelligent products such as New Relic Radar, we are constantly exploring faster and more efficient ways to group time series data. Given the volume of data we collect, faster clustering time is essential.

Accelerated k-means clustering

K-means clustering is a popular method of grouping data. The basic principle of the k-means method is to determine the distance between each data point and a set of centroids, and to divide the points into meaningful clusters accordingly. We usually demonstrate this process with two-dimensional data on a plane. Clustering in more than two dimensions is certainly feasible, but visualizing such data becomes more complex. For example, a typical visualization shows k-means clustering in two arbitrary dimensions converging after only a few iterations.
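To make the process concrete, here is a small toy sketch (our own illustration, not New Relic's code) of the assign-and-update cycle of k-means on two-dimensional points, the setting such convergence figures describe:

# A toy sketch (ours, for illustration only); empty clusters are ignored here.
import numpy as np

points = np.random.rand(200, 2)                                      # 200 random points on the plane
centroids = points[np.random.choice(len(points), 3, replace=False)]  # 3 initial centroids

for _ in range(10):
    # Assign every point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    centroids = np.array([points[labels == c].mean(axis=0) for c in range(3)])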

Unfortunately, this approach does not work directly for time series data, which is usually one-dimensional data that changes over time. However, we can still use different functions to calculate a distance factor between two time series. In these cases, we can use mean squared error (MSE) to explore different k-means implementations. While testing these implementations, we noticed that many of them had serious performance problems, but we can still demonstrate possible ways to accelerate k-means clustering, in some cases by an order of magnitude.
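As a concrete illustration (our own sketch, not code from the original post), an MSE-style distance between two equal-length time series can be computed like this:

# A minimal sketch (ours, for illustration); the article's own distance function appears below.
import numpy as np

def mse_distance(t1, t2):
    return ((t1 - t2) ** 2).mean()

a = np.sin(np.linspace(0, 2 * np.pi, 500))
b = a + np.random.normal(0, 0.1, 500)   # a noisy copy of the same signal
print(mse_distance(a, b))               # a small value: the two series are similar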

Here we will use Python's NumPy package. If you want to follow along, you can copy and paste the code directly into a Jupyter Notebook. Let's start with the imports we will use throughout:

import time
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

For the next test, we first generate 10,000 random time series, each with a sample length of 500. Then we add noise to sine waves of random length. Although this kind of data is not ideal for the k-means clustering method, it is sufficient for an unoptimized implementation.

n = 10000
ts_len = 500
phases = np.array(np.random.randint(0, 50, [n, 2]))
pure = np.sin([np.linspace(-np.pi * x[0], -np.pi * x[1], ts_len) for x in phases])
noise = np.array([np.random.normal(0, 1, ts_len) for x in range(n)])
signals = pure * noise

# Normalize everything between 0 and 1
signals += np.abs(np.min(signals))
signals /= np.max(signals)
plt.plot(signals[0])

A basic implementation

Let's start with the most basic and straightforward implementation. euclid_dist implements a simple MSE-style estimator as the distance function, and k_means implements the basic k-means algorithm. We select num_clust random time series from our initial dataset as centroids (each representing the center of a cluster). Over num_iter iterations, we continually move the centroids while minimizing the distance between the centroids and the other time series.

def euclid_dist(t1, t2):
    return np.sqrt(((t1 - t2) ** 2).sum())

def k_means(data, num_clust, num_iter):
    centroids = signals[np.random.randint(0, signals.shape[0], num_clust)]
    for n in range(num_iter):
        assignments = {}
        for ind, i in enumerate(data):
            min_dist = float('inf')
            closest_clust = None
            for c_ind, j in enumerate(centroids):
                dist = euclid_dist(i, j)
                if dist < min_dist:
                    min_dist = dist
                    closest_clust = c_ind
            if closest_clust in assignments:
                assignments[closest_clust].append(ind)
            else:
                assignments[closest_clust] = []
                assignments[closest_clust].append(ind)
        for key in assignments:
            clust_sum = 0
            for k in assignments[key]:
                clust_sum = clust_sum + data[k]
            centroids[key] = [m / len(assignments[key]) for m in clust_sum]
    return centroids

t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

Took 1138.8745470046997 seconds

Clustering this data took nearly 20 minutes. That is not terrible, but it is certainly not good. To reach a faster speed in the next implementation, we decided to remove as many for loops as we could.

A vectorized implementation

One big advantage of NumPy is vectorized operations. (If you are not familiar with vectorized operations, see this link: http://www.scipy-lectures.org/intro/numpy/operations.html)

The k-means algorithm requires every centroid and every data point to be compared pairwise. In our previous iteration, that means comparing each of our 100 centroids against all 10,000 time series, or 1 million comparisons per iteration. Remember that each comparison involves two sets of 500 samples. Since we iterate 100 times, that gives us 100 million comparisons in total, a fairly heavy workload for a single CPU. Although Python is a reasonably efficient language, it cannot match instructions written in C. For this reason, most of NumPy's core operations are written in C and vectorized to minimize the overhead introduced by loops.

Let's explore how we can vectorize our code and remove as many loops as possible.

First, we split the code into functional building blocks. This gives us a better understanding of what each part is responsible for. Next, we change the calc_centroids step so that it only iterates over the centroids (instead of over every time series). This way, we pass all of the time series and one centroid to euclid_dist. We can also pre-allocate the dist matrix instead of treating it as a dictionary and growing it over time. NumPy's argmin can then compare every vector pair in one shot.

In move_centroids, we remove another for loop by using vector operations, and we only iterate over the unique set of centroids. If we lose a centroid, we add the appropriate number back by picking randomly from our time series dataset (this rarely happens in practice).

Finally, we add an early-stopping check to k_means: if the centroids stop updating, we stop iterating.

Let's look at the code:

def euclid_dist(t1, t2):
    return np.sqrt(((t1 - t2) ** 2).sum(axis=1))

def calc_centroids(data, centroids):
    dist = np.zeros([data.shape[0], centroids.shape[0]])
    for idx, centroid in enumerate(centroids):
        dist[:, idx] = euclid_dist(centroid, data)
    return np.array(dist)

def closest_centroids(data, centroids):
    dist = calc_centroids(data, centroids)
    return np.argmin(dist, axis=1)

def move_centroids(data, closest, centroids):
    k = centroids.shape[0]
    new_centroids = np.array([data[closest == c].mean(axis=0) for c in np.unique(closest)])
    if k - new_centroids.shape[0] > 0:
        print("adding {} centroid(s)".format(k - new_centroids.shape[0]))
        additional_centroids = data[np.random.randint(0, data.shape[0], k - new_centroids.shape[0])]
        new_centroids = np.append(new_centroids, additional_centroids, axis=0)
    return new_centroids

def k_means(data, num_clust, num_iter):
    centroids = signals[np.random.randint(0, signals.shape[0], num_clust)]
    last_centroids = centroids
    for n in range(num_iter):
        closest = closest_centroids(data, centroids)
        centroids = move_centroids(data, closest, centroids)
        if not np.any(last_centroids != centroids):
            print("early finish!")
            break
        last_centroids = centroids
    return centroids

t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

adding 1 centroid(s)
early finish!
Took 206.72993397712708 seconds

It takes a little more than 3.5 minutes. Very nice! But we wanted it to be even faster.

A k-means++ implementation

Our next implementation uses the k-means++ algorithm, whose purpose is to select better initial centroids: the first centroid is chosen at random, and each subsequent centroid is sampled with probability proportional to its squared distance from the centroids already chosen, which spreads the starting points across the data. Let's see whether this optimization helps.

def init_centroids(data, num_clust):
    centroids = np.zeros([num_clust, data.shape[1]])
    centroids[0, :] = data[np.random.randint(0, data.shape[0], 1)]
    for i in range(1, num_clust):
        D2 = np.min([np.linalg.norm(data - c, axis=1) ** 2 for c in centroids[0:i, :]], axis=0)
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        ind = np.where(cumprobs >= np.random.random())[0][0]
        centroids[i, :] = np.expand_dims(data[ind], axis=0)
    return centroids

def k_means(data, num_clust, num_iter):
    centroids = init_centroids(data, num_clust)
    last_centroids = centroids
    for n in range(num_iter):
        closest = closest_centroids(data, centroids)
        centroids = move_centroids(data, closest, centroids)
        if not np.any(last_centroids != centroids):
            print("early finish!")
            break
        last_centroids = centroids
    return centroids

t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

early finish!
Took 180.91435194015503 seconds

Compared with our previous iteration, adding the k-means++ algorithm gives slightly better performance. However, once we parallelize it, this optimization really begins to pay off.

Parallel implementation

So far, all of our implementations have been single-threaded, so we decided to explore parallelizing parts of the k-means++ algorithm. Because we are working in a Jupyter Notebook, we chose to manage parallelism with ipyparallel (https://github.com/ipython/ipyparallel). With ipyparallel, we don't have to worry about forking the entire server, but there are some specific issues to deal with. For example, we must instruct our worker nodes to load NumPy.

import ipyparallel as ipp

c = ipp.Client()
v = c[:]
v.use_cloudpickle()
with v.sync_imports():
    import numpy as np

For more information about setting up workers, see the ipyparallel getting started guide: https://ipyparallel.readthedocs.io/en/latest/

In this implementation, we focus on parallelizing two aspects. First, calc_centroids has a loop that iterates over each centroid and compares it with our time series data. We use map_sync to send each of these iterations to our workers.

Next, we parallelize a similar loop in the k-means++ centroid search. Notice the call to v.push: because our lambda references the data, we need to make sure it is available on the worker nodes. We achieve this by calling ipyparallel's push method to copy the variable into the workers' global scope.
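Before the full listing, here is a tiny illustrative sketch (ours, not the article's code) of the two ipyparallel calls involved: push copies a variable into each engine's namespace, and map_sync farms a function out to the engines and blocks until every result is back.

# A small usage sketch (ours); assumes the Client and DirectView `v` created above.
v.push(dict(offset=10))                            # copy a variable to every engine
print(v.pull('offset', block=True))                # read it back: one value of 10 per engine
squares = v.map_sync(lambda x: x * x, range(8))    # distribute a function across the engines
print(squares)                                     # [0, 1, 4, 9, 16, 25, 36, 49]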

Look at the code:

def calc_centroids(data, centroids):
    return np.array(v.map_sync(lambda x: np.sqrt(((x - data) ** 2).sum(axis=1)), centroids))

def closest_centroids(points, centroids):
    dist = calc_centroids(points, centroids)
    return np.argmin(dist, axis=0)

def init_centroids(data, num_clust):
    v.push(dict(data=data))
    centroids = np.zeros([num_clust, data.shape[1]])
    centroids[0, :] = data[np.random.randint(0, data.shape[0], 1)]
    for i in range(1, num_clust):
        D2 = np.min(v.map_sync(lambda c: np.linalg.norm(data - c, axis=1) ** 2, centroids[0:i, :]), axis=0)
        probs = D2 / D2.sum()
        cumprobs = probs.cumsum()
        ind = np.where(cumprobs >= np.random.random())[0][0]
        centroids[i, :] = np.expand_dims(data[ind], axis=0)
    return centroids

t1 = time.time()
centroids = k_means(signals, 100, 100)
t2 = time.time()
print("Took {} seconds".format(t2 - t1))

adding 2 centroid(s)
early finish!
Took 143.49819207191467 seconds

The result is only a little more than two minutes, which is the fastest speed we have achieved so far!

Coming up: faster!

In these tests, we only used CPUs. CPUs provide convenient parallelism, but we think that with a little more effort we could perform the clustering on GPUs and gain another order of magnitude in speed. We may be able to do this with TensorFlow, an open source package for numerical computation and machine learning. TensorFlow already includes a k-means implementation, but we would almost certainly need to adapt it for time series clustering. In any case, we will keep looking for faster and more efficient clustering algorithms to help manage our users' data.
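As a rough sketch of what that might look like (our own assumption, not New Relic's implementation), one vectorized k-means iteration can be written with basic TensorFlow ops, which run on a GPU automatically when one is available:

# A hedged sketch (ours): one vectorized k-means step in TensorFlow 2.x eager mode.
import numpy as np
import tensorflow as tf

def kmeans_step(data, centroids):
    # data: (n, ts_len), centroids: (k, ts_len)
    # Broadcast to (n, k, ts_len) and reduce to pairwise squared distances (n, k).
    # Note: this intermediate tensor can be large; full-size data may need chunking.
    diffs = tf.expand_dims(data, 1) - tf.expand_dims(centroids, 0)
    dists = tf.reduce_sum(diffs ** 2, axis=2)
    closest = tf.argmin(dists, axis=1)
    # Each new centroid is the mean of the series assigned to it; a cluster that
    # loses all its members collapses to zero here (a real version would reseed it,
    # as move_centroids does above).
    new_centroids = tf.math.unsorted_segment_mean(data, closest, tf.shape(centroids)[0])
    return new_centroids, closest

# Example usage with small random stand-ins for `signals`:
data = tf.constant(np.random.rand(1000, 500), dtype=tf.float32)
centroids = tf.constant(np.random.rand(10, 500), dtype=tf.float32)
for _ in range(20):
    centroids, closest = kmeans_step(data, centroids)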

That concludes this look at how to optimize K-means clustering speed for time series data. I hope the content above is helpful and teaches you something new. If you found the article useful, feel free to share it so that more people can see it.
