This article introduces the unsupervised anomaly detection algorithms commonly used in big data. The editor finds them quite practical and shares them here for study; I hope you get something out of it after reading.
The following sections cover anomaly detection, mainly introducing several unsupervised methods; the experimental parts are for reference only.
What is an anomaly?
In xianxia dramas there is a divide between the righteous path and the demonic path, yet it is hard to say which is truly which: those who call themselves righteous may one day fall to the demonic side, while those branded as demons may be filled with a sense of justice from beginning to end. So there is no absolute black and no absolute white.
That said, everyone probably has their own yardstick for the word "anomaly", and the word is indeed used very broadly (outliers, abnormal transactions, abnormal behavior, abnormal users, abnormal incidents, and so on). So what exactly is an anomaly?
An anomaly deviates so obviously from the other observations that it is suspected of coming from a different distribution than the normal points. Anomaly detection, also known as outlier detection, is a technique for identifying patterns that do not conform to expected behavior.
It has many business applications, such as network intrusion detection (identifying unusual patterns in network traffic that may indicate a hacker attack), system health monitoring, credit card fraud detection, equipment fault detection, risk identification, and so on. Anomalies generally fall into three types:
Point anomaly: a single data instance is anomalous if it lies too far away from the rest of the data. Business use case: detecting credit card fraud based on the spending amount.
Contextual anomaly: behavior that is anomalous only in its context, typically in time series data. Business use case: spending several times more than usual on a credit card while traveling is normal, but the same spike on a stolen card is an anomaly.
Collective anomaly: individual data points look unremarkable, and abnormality can only be judged from a set of instances taken together. Business use case: copying files out of a system little by little ("ant-moving" style), which usually signals a potential network attack.
Anomaly detection is related to noise removal and novelty detection. Noise removal is the process of eliminating unwanted observations, that is, removing noise from an otherwise meaningful signal. Novelty detection is concerned with identifying previously unobserved patterns in new data that were not present in the training data.
The difficulty of anomaly detection
Facing a real business scenario, we are often full of passion and ambition, convinced we can finally put theory into practice and do something big. Things are rarely that kind: the business is usually domain-specific, the background is complex, and simply becoming familiar with the business will eat up most of your time. Even after you rack your brains to frame the problem as an anomaly detection scenario, you usually still face the following major challenges:
There is no clear-cut definition of what is normal and what is abnormal; in some domains the boundary between the two is blurry.
The data itself contains noise, which is hard to distinguish from true anomalies.
Normal behavior is not static; it evolves over time, for example a series of illegitimate operations after a normal user's account is stolen.
Labeled data is hard to obtain, and without data even the best algorithm is useless.
With these challenges in mind, let's look at the concrete scenarios and the algorithms used for anomaly detection.
Statistics-based Outlier Detection
Statistics-based techniques form one class of anomaly detection methods, and there are many of them; below we mainly introduce MA and 3-Sigma.
MA moving average method
The easiest way to identify irregular data is to flag data points that deviate from the distribution, using statistics such as the mean, median, quantiles, and mode. Assume an abnormal data point is one that lies a certain number of standard deviations away from the mean; we can then compute a local mean over a sliding window of the time series and measure each point's deviation from it. This is the moving average (MA) method, which aims to smooth out short-term fluctuations and highlight long-term trends. Variants include the cumulative moving average, weighted moving average, exponentially weighted moving average, double exponential smoothing, triple exponential smoothing, and so on. Mathematically, an n-period simple moving average can also be viewed as a low-pass filter.
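To make this concrete, here is a minimal sketch (not from the original article) of flagging points that drift too far from a rolling mean; the window size, the threshold k, and the toy series are illustrative assumptions only:

import numpy as np
import pandas as pd

def moving_average_anomalies(series, window=20, k=3.0):
    # Flag points whose deviation from the rolling mean exceeds
    # k rolling standard deviations (window and k are illustrative choices)
    rolling_mean = series.rolling(window, min_periods=1).mean()
    rolling_std = series.rolling(window, min_periods=1).std().fillna(0)
    deviation = (series - rolling_mean).abs()
    return series[deviation > k * rolling_std]

# Toy example: a smooth signal with two injected spikes
ts = pd.Series(np.sin(np.linspace(0, 10, 200)))
ts.iloc[[50, 150]] += 5.0
print(moving_average_anomalies(ts))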
Disadvantages:
The data may contain noise that resembles abnormal behavior, so the boundary between normal and abnormal behavior is often not obvious.
The definition of abnormal or normal may change frequently, because malicious attackers keep adapting; a threshold based on a moving average may therefore not always apply.
3-Sigma
The 3-Sigma principle, also known as the Rajda Criterion, is defined as follows:
Assume that a group of measurements contains only random error. Compute the standard deviation from the raw data, then set an interval according to a chosen probability; any error falling outside this interval is regarded as an outlier. Concretely, points with |x − μ| > 3σ are flagged, since under a normal distribution only about 0.3% of values fall outside that range.
The premise of using 3-Sigma is that the data follow a normal distribution; if x is not normally distributed, a log transform can often bring it close to one. Below is a Python implementation of 3-Sigma.
import numpy as np

# 3-sigma: identify outliers in one column
def three_sigma(df_col):
    """df_col: a column (Series) of a DataFrame"""
    rule = (df_col.mean() - 3 * df_col.std() > df_col) | \
           (df_col.mean() + 3 * df_col.std() < df_col)
    index = np.arange(df_col.shape[0])[rule]
    outrange = df_col.iloc[index]
    return outrange

There are several ways to handle the detected outliers. If the value belongs to a time series, we can simply regard the operation at that moment as abnormal. If outlier detection is used in the data preprocessing stage, there are four common treatments:
Delete the records that contain outliers;
Treat the outliers as missing values and hand them to a missing-value method;
Correct them with the mean;
Of course, we can also choose to leave them untouched.
Density-based anomaly detection
Density-based anomaly detection has a precondition: normal data points cluster together ("birds of a feather flock together"), so normal data appear in dense neighborhoods while anomalies lie far away from them. In this setting we can compute a score against the nearest set of data points; the score may use Euclidean distance or another distance measure, depending on whether the data are categorical or numerical.
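As an illustration of the density-based idea, here is a minimal sketch using scikit-learn's LocalOutlierFactor (LOF, also mentioned below in the comparison with iForest); the synthetic data and the n_neighbors / contamination settings are assumptions for demonstration only:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: two dense clusters of normal points plus a few scattered outliers
rng = np.random.RandomState(42)
normal = np.concatenate([rng.normal(0, 0.5, size=(100, 2)),
                         rng.normal(5, 0.5, size=(100, 2))])
outliers = rng.uniform(-4, 9, size=(5, 2))
X_demo = np.concatenate([normal, outliers])

# LOF compares each point's local density with that of its neighbors;
# fit_predict returns 1 for inliers and -1 for points flagged as outliers
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.03)
labels = lof.fit_predict(X_demo)
print("indices flagged as anomalies:", np.where(labels == -1)[0])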
iForest anomaly detection
iForest (isolation forest) is a fast, ensemble-based anomaly detection algorithm with linear time complexity and high accuracy. It was published by Dr. Fei Tony Liu during his PhD at Monash University, under the supervision of Prof. Kai-Ming Ting and Prof. Zhi-Hua Zhou. Compared with LOF and OneClassSVM, it uses less memory and runs faster. The intuition is that anomalies are "few and different", so random partitioning isolates them with fewer splits (shorter paths in the random trees) than normal points. The code of the algorithm part is as follows:

from sklearn.ensemble import IsolationForest

# X is assumed to be the feature matrix prepared beforehand;
# state is a random seed that the original snippet left undefined
state = 42

classifiers = {
    "IsolationForest": IsolationForest(n_estimators=100, max_samples=len(X),
                                       contamination=0.005,
                                       random_state=state, verbose=0)
}

def train_model(clf, train_X):
    # Fit the train data and find outliers
    clf.fit(train_X)
    # scores_prediction = clf.decision_function(train_X)
    y_pred = clf.predict(train_X)
    return y_pred, clf

y_pred, clf = train_model(clf=classifiers["IsolationForest"], train_X=X)

These are the unsupervised anomaly detection algorithms commonly used in big data. The editor believes they cover knowledge points you may see or use in daily work; I hope you can learn more from this article.