This article explains how to implement a complete anomaly detection algorithm from scratch in Python: the formulas behind it, a step-by-step implementation, and how to choose the probability threshold that separates normal data from anomalies.
Anomaly detection algorithm based on probability
Anomaly detection can be treated as the statistical task of outlier analysis, but if we develop a machine learning model for it, the process can be automated and save a lot of time. There are many use cases for anomaly detection: credit card fraud detection, detecting faulty machines or hardware systems from their abnormal behavior, and disease detection from medical records are good examples, and the number of use cases keeps growing.
Formulas and Process
Compared with the other machine learning algorithms I explained earlier, this one is much simpler. The algorithm uses the mean and the variance to calculate the probability of each training example.
If the probability of a training example is high, it is normal. If the probability is low, it is an anomaly. What counts as high or low probability differs from training set to training set; we will discuss how to determine that later.
Anomaly detection works as follows:
(1) Calculate the mean with the following formula:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
Here $m$ is the length of the dataset (the number of training examples) and $x^{(i)}$ is a single training example. If you have multiple features, you will usually need to calculate the mean for each feature.
(2) Calculate the variance with the following formula:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
Here, $\mu$ is the mean calculated in the previous step.
(3) Now calculate the probability of each training example with the multivariate Gaussian probability formula:
$$p(x) = \frac{1}{(2\pi)^{k/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
Don't be confused by the summation-like symbol $\Sigma$ in this formula! It is not a sum: it is the covariance matrix, which here is simply the variances from step 2 arranged along a diagonal. You will see what it looks like when we implement the algorithm later.
(4) Now we need to find a threshold probability. As mentioned earlier, if the probability of a training example is low, it is an anomalous example.
What counts as low probability?
There is no universal rule. We need to find the answer for our own training dataset.
We get a range of probability values from the output of step 3. For each probability, we look up the label that says whether the data is anomalous or normal.
Then we calculate the precision, recall, and F1 score for a series of candidate probabilities.
Precision can be calculated with the following formula:
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
Recall can be calculated with the following formula:
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
In this case, "positive affirmation" refers to the number of cases in which the algorithm detects the example as an exception and is actually an exception.
False positives occur when the algorithm detects the example as an exception, but this is not the case.
False Negative indicates that the example detected by the algorithm is not an exception example, but in fact, it is an exception example.
From the above formula, you can see that higher accuracy and higher recall rate are always good, because it means that we have more positive advantages. But at the same time, as you can see in the formula, false positives and false positives also play a vital role. There needs to be a balance. Depending on your industry, you need to determine which one you can tolerate.
A good way is to take an average of the two. There is a special formula for this average, called the F1 score:
$$F_1 = \frac{2PR}{P + R}$$
Here, P and R are precision and recall, respectively.
I will not elaborate on why this formula is so special, because this article is about anomaly detection. If you are interested in learning more about precision, recall, and the F1 score, see the detailed articles on these topics:
Fully understand the concepts of precision, recall, and the F score
How to deal with skewed datasets in machine learning
Based on the F1 score, you need to select your threshold probability. An F1 score of 1 is perfect, and 0 is the worst.
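To make these formulas concrete, here is a minimal sketch (my own illustration, not part of the original walkthrough) that computes precision, recall, and the F1 score from raw counts:
# Hypothetical helper for illustration: precision, recall, and F1
# from true-positive, false-positive, and false-negative counts.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp)   # share of flagged examples that are true anomalies
    recall = tp / (tp + fn)      # share of true anomalies that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: 8 true positives, 2 false positives, 4 false negatives
print(prf1(8, 2, 4))   # (0.8, 0.666..., 0.727...)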
Anomaly detection algorithm
I will use the dataset from Andrew Ng's machine learning course, which has two training features. I am not using a real-world dataset for this article because this one is very suitable for learning: it has only two features, and no real-world dataset would have only two features.
Let's get started!
First, import the necessary packages (matplotlib is needed for the plots below):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Import the dataset. This is an Excel file in which the training data and the cross-validation data are stored in separate sheets. So let's bring in the training data first:
df = pd.read_excel('ex8data1.xlsx', sheet_name='X', header=None)
df.head()
Let's plot column 0 against column 1:
plt.figure()
plt.scatter(df[0], df[1])
plt.show()
By looking at this plot, you can probably tell which data points are anomalous.
Check how many training examples are in this dataset:
m = len(df)
Calculate the mean of each feature. We have only two features here: 0 and 1.
s = np.sum(df, axis=0)
mu = s/m
mu
Output:
0    14.112226
1    14.997711
dtype: float64
Calculate the variance according to the formula described in the "Formulas and Process" section above:
vr = np.sum((df - mu)**2, axis=0)
variance = vr/m
variance
Output:
0    1.832631
1    1.709745
dtype: float64
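As a quick cross-check (my own aside, not part of the original walkthrough), NumPy's built-in variance with its default ddof=0 computes the same population variance:
# should match the variance computed above
np.var(df.values, axis=0)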
Now arrange the variances along a diagonal. As I explained in the "Formulas and Process" section, the $\Sigma$ in the probability formula is the covariance matrix, here simply a diagonal matrix of the variances:
var_dia = np.diag(variance)
var_dia
Output:
array([[1.83263141, 0.        ],
       [0.        , 1.70974533]])
Calculate the probabilities:
k = len(mu)
X = df - mu
p = 1/((2*np.pi)**(k/2) * (np.linalg.det(var_dia)**0.5)) * np.exp(-0.5 * np.sum(X @ np.linalg.pinv(var_dia) * X, axis=1))
p
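As a sanity check (my own addition, assuming SciPy is available; it is not part of the original walkthrough), the same densities can be obtained from scipy.stats.multivariate_normal:
from scipy.stats import multivariate_normal

# evaluate the same diagonal-covariance Gaussian at every training example
p_check = multivariate_normal(mean=mu, cov=var_dia).pdf(df.values)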
The training part is complete.
The next step is to find the threshold probability. If the probability of an example is lower than the threshold, that example is anomalous. But we need to find the threshold for our particular case.
For this step we use the cross-validation data along with its labels. In this dataset, the cross-validation data and the labels are in separate sheets.
In your own case, you would simply hold back a portion of the original data for cross-validation.
Now import the cross-validation data and the labels:
cvx = pd.read_excel('ex8data1.xlsx', sheet_name='Xval', header=None)
cvx.head()
The labels:
cvy = pd.read_excel('ex8data1.xlsx', sheet_name='y', header=None)
cvy.head()
I convert "cvy" to a NumPy array only because I like working with arrays. A DataFrame would be fine too:
y = np.array(cvy)
Output:
# part of the array
array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       ...
In this dataset, a value of 0 in "y" indicates a normal example, while a value of 1 indicates an anomalous example.
Now, how do you choose the threshold?
I don't want to simply check every probability in the list; that is probably unnecessary. Let's examine the probabilities first:
p.describe()
Output:
count    3.070000e+02
mean     5.905331e-02
std      2.324461e-02
min      1.181209e-23
25%      4.361075e-02
50%      6.510144e-02
75%      7.849532e-02
max      8.986095e-02
dtype: float64
As you can see from this summary, we don't have much anomalous data, so starting from the 75% value would probably work. But to be safer, I will start from the mean.
So, we will take a series of probabilities from the mean downward, and check the F1 score for each probability in that range.
First, define a function to calculate the true positives, false positives, and false negatives:
def tpfpfn(ep):
    tp, fp, fn = 0, 0, 0
    for i in range(len(y)):
        if p[i] <= ep and y[i][0] == 1:
            tp += 1      # flagged as anomaly, actually an anomaly
        elif p[i] <= ep and y[i][0] == 0:
            fp += 1      # flagged as anomaly, actually normal
        elif p[i] > ep and y[i][0] == 1:
            fn += 1      # not flagged, but actually an anomaly
    return tp, fp, fn
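With tpfpfn in place, here is a minimal sketch of the remaining threshold search (my own completion, following the plan above: scan candidate thresholds from near zero up to the mean probability and keep the one with the best F1 score):
def f1(ep):
    tp, fp, fn = tpfpfn(ep)
    if tp == 0:                       # no true positives: define F1 as 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# candidate thresholds from just above zero up to the mean probability
eps = np.arange(p.mean() / 100, p.mean(), p.mean() / 100)
scores = [f1(ep) for ep in eps]

best_ep = eps[np.argmax(scores)]     # threshold with the highest F1 score
best_ep
Whichever threshold maximizes the F1 score becomes the cutoff: any example whose probability falls below it is flagged as an anomaly.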