Anomaly detection with sklearn's KNN (k-nearest neighbors) algorithm
Data source: the Masquerading User Data from http://www.schonlau.net/. It contains operation logs for 50 users, each with 15,000 operation commands; the first 5,000 are normal operations, and the remaining 10,000 may contain abnormal operations. See the book "Introduction to Web Security Machine Learning".
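To make that layout concrete, here is a quick sanity check (my addition; it assumes a single user's log saved as user.txt with one command per line, as in the code below):

# quick sanity check of the assumed file layout
with open('user.txt', 'r', encoding='utf-8') as f:
    commands = [line.strip() for line in f]
print(len(commands))         # expected: 15000 commands for one user
print(len(commands) // 150)  # 100 sequences of 150 commands each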
Basic knowledge
I originally wanted to write this part myself, but there is already plenty of material online, so there is no need. Just take a look at what others have written:
https://blog.csdn.net/zgcr654321/article/details/85219121
https://www.jianshu.com/p/3dcb39de04aa
If you read those two posts, the rest should pose no problem.
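As a quick refresher (my addition, not from the linked posts): sklearn's KNeighborsClassifier predicts the majority label among the k nearest training points. A minimal toy example of the fit/predict workflow used later:

from sklearn.neighbors import KNeighborsClassifier

# two well-separated clusters of 2-D points
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]

clf = KNeighborsClassifier(n_neighbors=3)  # vote among the 3 nearest neighbors
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [5.5, 5.5]]))  # [0 1]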
Implementation
Step 1: read a single user's log, where each line is one command. Group every 150 commands into one operation sequence and store the sequences in a list.
def load_user(filename):
    most_cmd = []
    mini_cmd = []
    cmd_list = []
    cmd_seq = []
    # build the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            if cnt == 150:
                # unlike the book, I use 150 commands per sequence here,
                # which exactly matches the labels (there are only 100 of them)
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()
Step 2: count all the commands in the user log and extract the 50 most frequent and the 50 least frequent commands.
# get the 50 most frequent and the 50 least frequent commands
fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency
most_cmd = [item[0] for item in fdist[:50]]
mini_cmd = [item[0] for item in fdist[-50:]]
Step 3: feature extraction. Taking each operation sequence from Step 1 as a unit, compute ① the number of distinct commands, ② the 10 most frequent commands, and ③ the 10 least frequent commands.
user_feature = []
for cmd_list in user_cmd_list:
    # number of distinct commands in each sequence
    seq_len = len(set(cmd_list))
    # sort each sequence's commands by frequency, high to low
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
    seq_freq = [item[0] for item in fdist]
    # the 10 most frequent and the 10 least frequent commands
    f2 = seq_freq[:10]
    f3 = seq_freq[-10:]
Step 4: KNN only accepts numeric input, but features ② and ③ from Step 3 are lists of command strings, so they must be converted to numbers. The conversion: compute the overlap (the "degree of coincidence") between them and the user's 50 most frequent and 50 least frequent commands from Step 2, as in the toy illustration below.
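For intuition, the overlap is just the size of a set intersection; a toy illustration with hypothetical commands (my addition):

# the overlap is the number of commands the two lists share
seq_top10 = ['ls', 'cd', 'cat', 'vim', 'grep']
user_top50 = ['ls', 'cd', 'ps', 'top', 'grep']
print(len(set(seq_top10) & set(user_top50)))  # 3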
# calculate the overlap
f2 = len(set(f2) & set(user_max_freq))
f3 = len(set(f3) & set(user_min_freq))
# merge the features: ① distinct commands per sequence;
# ② overlap of the sequence's top 10 with the user's 50 most frequent commands;
# ③ overlap of the sequence's bottom 10 with the user's 50 least frequent commands
user_feature.append([seq_len, f2, f3])

The complete Python 3 code is as follows:

from nltk.probability import FreqDist  # counts how often each command occurs
import operator
from sklearn.neighbors import KNeighborsClassifier
import numpy as np


def load_user(filename):
    most_cmd = []
    mini_cmd = []
    cmd_list = []
    cmd_seq = []
    # build the operation sequences
    with open(filename, 'r', encoding="utf-8") as f:
        cmd = f.readline()
        temp = []
        cnt = 0
        while cmd:
            cmd_list.append(cmd.strip('\n'))
            temp.append(cmd.strip('\n'))
            cnt = cnt + 1
            if cnt == 150:
                # unlike the book, I use 150 commands per sequence here,
                # which exactly matches the labels (there are only 100 of them)
                cmd_seq.append(temp)
                cnt = 0
                temp = []
            cmd = f.readline()
    # get the 50 most frequent and the 50 least frequent commands
    fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)  # sort by frequency
    most_cmd = [item[0] for item in fdist[:50]]
    mini_cmd = [item[0] for item in fdist[-50:]]
    return cmd_seq, most_cmd, mini_cmd


def get_user_feature(user_cmd_list, user_max_freq, user_min_freq):
    user_feature = []
    for cmd_list in user_cmd_list:
        # number of distinct commands in each sequence
        seq_len = len(set(cmd_list))
        # sort each sequence's commands by frequency, high to low
        fdist = sorted(FreqDist(cmd_list).items(), key=operator.itemgetter(1), reverse=True)
        seq_freq = [item[0] for item in fdist]
        # the 10 most frequent and the 10 least frequent commands
        f2 = seq_freq[:10]
        f3 = seq_freq[-10:]
        # overlap with the user's 50 most / 50 least frequent commands
        f2 = len(set(f2) & set(user_max_freq))
        f3 = len(set(f3) & set(user_min_freq))
        # merge the features: ① distinct commands per sequence;
        # ② overlap of the sequence's top 10 with the user's top 50;
        # ③ overlap of the sequence's bottom 10 with the user's bottom 50
        user_feature.append([seq_len, f2, f3])
    return user_feature


def get_labels(filename):
    # take the third column (character index 4) of each row as the label
    labels = []
    cnt = 0
    with open(filename, 'r', encoding="utf-8") as f:
        temp = f.readline().strip('\n')
        while temp:
            labels.append(int(temp[4]))
            cnt += 1
            temp = f.readline().strip('\n')
    return labels


if __name__ == "__main__":
    user_cmd_list, user_max_freq, user_min_freq = load_user('user.txt')
    user_feature = get_user_feature(user_cmd_list, user_max_freq, user_min_freq)
    labels = get_labels('labels.txt')
    # split the dataset into a training set and a test set
    x_train = user_feature[0:70]
    y_train = labels[0:70]
    x_test = user_feature[70:]
    y_test = labels[70:]
    # train
    neigh = KNeighborsClassifier(n_neighbors=3)
    neigh.fit(x_train, y_train)
    # predict
    y_predict = neigh.predict(x_test)
    # compute the score
    score = np.mean(y_test == y_predict) * 100
    print(score)  # 90.0
In the end, the accuracy is 90%.
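That 90% comes from a single 70/30 split with k=3, both of which are somewhat arbitrary. A minimal sketch (my addition, reusing the user_feature and labels lists from the code above) that cross-validates the choice of n_neighbors:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 5-fold cross-validation over a few values of k instead of one fixed split
for k in (1, 3, 5, 7):
    clf = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(clf, user_feature, labels, cv=5)
    print(k, scores.mean())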
Summary
① I take the 50 most frequent commands, which is not exactly how the book does it; I changed the code there. ② The labels and the data do not match up: there are 15,000 commands in total but only 100 labels. The book's approach is to treat every 100 commands as one operation sequence, giving 150 sequences, and then prepend 50 labels to the front of the label list (the first 5,000 commands are known to be normal). My code instead uses 150 commands per sequence, which works out exactly (see the quick check below).
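The alignment arithmetic is easy to verify (my addition, using the counts stated above):

# book: 15000 commands / 100 per sequence = 150 sequences, but only 100
# labels exist, so 50 "normal" labels are prepended to the label list;
# this post: 15000 / 150 = 100 sequences, matching the 100 labels exactly
assert 15000 // 100 == 150
assert 15000 // 150 == 100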
Finally, I recommend the following personal blog: https://unihac.github.io/