How to implement the isolation forest algorithm in Python

This article explains how to implement the isolation forest (iForest) anomaly detection algorithm in Python. It is fairly detailed and should serve as a useful reference; interested readers are encouraged to read it through.
1 Introduction
The isolation forest is an efficient anomaly detection algorithm. It resembles a random forest in structure, but every split attribute and split value is chosen at random rather than according to information gain or the Gini index.
2 Overview of the isolation forest algorithm

2.1 Name and background
"Isolation" means setting something apart; combined with "forest" the method's name translates roughly as "isolated forest" or "isolation forest". There is no single established Chinese name for it, so the English name isolation forest, abbreviated iForest, is commonly used.
The iForest algorithm was proposed by Zhou Zhihua of Nanjing University together with Fei Tony Liu and Kai Ming Ting of Monash University in Australia for data mining. It is suited to anomaly detection on continuous numerical data and defines an anomaly as an "outlier that is easy to isolate", that is, a point that is sparsely distributed and lies far from any high-density cluster. Statistically speaking, a sparsely populated region of the data space indicates that data are very unlikely to occur there, so points falling in such regions can be treated as anomalous. The method is typically used for attack detection and traffic anomaly analysis in network security, and financial institutions use it to detect fraud. Once anomalous data are found, they are either removed directly (for example, denoising during data cleaning) or analysed in depth (for example, studying the behavioural characteristics of attacks and fraud).
2.2 Introduction to the principle
iForest is a non-parametric, unsupervised method: it assumes no mathematical model of the data and needs no labelled training samples. It uses a very efficient strategy to find out which points are easy to isolate. Suppose we cut (split) the data space with a random hyperplane; each cut produces two subspaces (like cutting a cake into two pieces). We then keep cutting each subspace with further random hyperplanes until every subspace contains only a single data point. Intuitively, dense clusters need many cuts before their points are isolated, whereas points in low-density regions end up alone in a subspace after very few cuts.
The iForest algorithm borrows from the idea of the random forest: just as a random forest consists of many decision trees, an iForest consists of many binary trees. A tree in an iForest is called an isolation tree, or iTree for short. An iTree differs from a decision tree, and its construction is also simpler, because it is built by a completely random process.
Suppose the data set contains N records. To build one iTree, n samples are drawn uniformly from the N records (usually without replacement) as the training set of that tree. Within this sample, a feature is chosen at random, and a split value is chosen uniformly at random within the range of that feature (between its minimum and maximum). Samples whose value on that feature is less than the split value are sent to the left of the node, and samples whose value is greater than or equal to it are sent to the right. This yields a split condition and a left and a right subset; the same procedure is then repeated on each subset until a subset contains only one record or the tree reaches its height limit.
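To make the construction procedure concrete, below is a minimal sketch of building a single iTree by recursive random splitting. It is an illustrative implementation, not the article's original code: the names ITreeNode and build_itree, and the explicit height limit, are assumptions made for this sketch.

import numpy as np

class ITreeNode:
    # An iTree node: either an internal node (with a split) or an external leaf node.
    def __init__(self, size=0, split_feature=None, split_value=None, left=None, right=None):
        self.size = size                    # number of samples reaching an external node
        self.split_feature = split_feature  # index of the randomly chosen feature
        self.split_value = split_value      # randomly chosen split value
        self.left = left
        self.right = right

def build_itree(X, current_height, height_limit, rng):
    # Recursively split the sample matrix X (n_samples x n_features) at random until
    # a subset contains a single record or the height limit is reached.
    if current_height >= height_limit or len(X) <= 1:
        return ITreeNode(size=len(X))
    q = rng.randint(X.shape[1])             # random feature
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                            # constant feature: cannot split further
        return ITreeNode(size=len(X))
    p = rng.uniform(lo, hi)                 # random split value between min and max
    mask = X[:, q] < p                      # "< p" goes left, ">= p" goes right
    return ITreeNode(
        split_feature=q,
        split_value=p,
        left=build_itree(X[mask], current_height + 1, height_limit, rng),
        right=build_itree(X[~mask], current_height + 1, height_limit, rng),
    )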
Because anomalous records are few and their feature values differ greatly from those of normal records, anomalies tend to be isolated close to the root when an iTree is built, while normal records end up far from the root. The result of a single iTree is often unreliable, so the iForest algorithm draws multiple subsamples and builds multiple binary trees. Finally the results of all trees are combined: the average depth is taken as the output, and from it the degree of abnormality of each data point is computed.
2.3 Algorithm steps
How to cut the data space is the core design idea of iForest; this article covers only the most basic method. Because the cuts are random, an ensemble (Monte Carlo) approach is needed to obtain a converged value: the cutting process is repeated from scratch many times and the results are averaged. An iForest consists of t isolation trees (iTrees), each of which is a binary tree, so we first describe the construction of an iTree and then the construction of the iForest.
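Continuing the sketch above (and reusing build_itree and ITreeNode from it), the following illustrates how the forest and the anomaly score can be assembled. The normalisation term c(n) and the score 2^(-E(h(x))/c(n)) follow the iForest paper; the helper names path_length and isolation_forest_scores are assumptions made for this sketch.

import numpy as np

def c(n):
    # Average path length of an unsuccessful search in a binary search tree with n nodes,
    # used to normalise tree depths.
    if n <= 1:
        return 0.0
    return 2.0 * (np.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def path_length(x, node, current_height=0):
    # Depth at which point x is isolated in one iTree, with the usual leaf-size adjustment.
    if node.split_feature is None:          # external node
        return current_height + c(node.size)
    child = node.left if x[node.split_feature] < node.split_value else node.right
    return path_length(x, child, current_height + 1)

def isolation_forest_scores(X_train, X_query, n_trees=100, sample_size=256, seed=42):
    # Build n_trees iTrees on random subsamples and return anomaly scores in (0, 1);
    # scores close to 1 indicate likely anomalies.
    rng = np.random.RandomState(seed)
    sample_size = min(sample_size, len(X_train))
    height_limit = int(np.ceil(np.log2(sample_size)))
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(len(X_train), size=sample_size, replace=False)
        trees.append(build_itree(X_train[idx], 0, height_limit, rng))
    scores = []
    for x in X_query:
        avg_depth = np.mean([path_length(x, t) for t in trees])
        scores.append(2.0 ** (-avg_depth / c(sample_size)))
    return np.array(scores)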
3 Parameter explanation
(1) n_estimators: int, optional (default=100). How many iTrees to build, i.e. the number of random trees generated in the forest.
(2) max_samples: int or float, optional (default='auto', which corresponds to 256).
The number of samples used to train each random tree, i.e. the subsample size:
1) if an int is given, max_samples samples are drawn from the full sample set X to build each iTree;
2) if a float is given, max_samples * X.shape[0] samples are drawn from X, where X.shape[0] is the total number of samples;
3) if 'auto' is given, max_samples = min(256, n_samples), where n_samples is the total number of samples.
If max_samples is larger than the total number of samples, all samples are used to build every tree; this means there is no subsampling and all n_estimators iTrees are built from the same (complete) sample set.
(3) contamination: float in (0, 0.5) or 'auto', optional (default=0.1). The proportion of anomalies in the given data set, i.e. the amount of contamination. It is used to define the threshold of the decision function. If set to 'auto', the threshold is determined as in the original paper. Changed in version 0.22: the default value changed from 0.1 to 'auto'.
(4) max_features: int or float, optional (default=1.0). The number of features drawn from the sample set X to train each iTree; the default of 1.0 uses all features.
If an int is given, max_features features are drawn;
if a float is given, max_features * X.shape[1] features are drawn, where X.shape[1] is the total number of features.
(5) bootstrap: boolean, optional (default=False). Whether to sample with replacement when building each tree. If True, individual trees are fit on subsamples drawn with replacement; if False, sampling without replacement is performed.
(6) n_jobs: int or None, optional (default=None). The number of jobs to run in parallel for the fit() and predict() functions. None means 1 unless inside a joblib.parallel_backend context; -1 means using all available processors.
(7) behaviour: str, default='old'. The behaviour of the decision function decision_function; it can be 'old' or 'new'. Setting behaviour='new' makes decision_function conform to the API of other anomaly detection algorithms, which will become the default in the future. As explained in detail in the offset_ attribute documentation, decision_function then depends on the contamination parameter, with 0 as its natural threshold for detecting outliers.
New in version 0.20: the behaviour parameter was added for backward compatibility.
behaviour='old' is deprecated in version 0.20 and will not be available in version 0.22.
The behaviour parameter will be deprecated in version 0.22 and removed in version 0.24.
(8) random_state: int, RandomState instance or None, optional (default=None).
If an int is given, random_state is the seed used by the random number generator;
if a RandomState instance is given, it is used as the random number generator;
if None, the random number generator is the RandomState instance used by np.random.
(9) verbose: int, optional (default=0). Controls the verbosity of the tree-building process.
(10) warm_start: bool, optional (default=False). If True, reuse the solution of the previous call to fit and add more trees to the existing forest; otherwise, fit a whole new forest. A short example using these parameters follows.
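As a quick illustration of the parameters above, here is a small hedged example of constructing a scikit-learn IsolationForest on toy data. The exact defaults and accepted values depend on the installed scikit-learn version (for instance, the behaviour parameter only exists around versions 0.20-0.21, so it is omitted here); the data and parameter choices are for demonstration only.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(500, 2)          # toy training data

clf = IsolationForest(
    n_estimators=100,      # number of iTrees in the forest
    max_samples='auto',    # min(256, n_samples) drawn for each tree
    contamination=0.1,     # expected proportion of anomalies; sets the decision threshold
    max_features=1.0,      # fraction (float) or count (int) of features per tree
    bootstrap=False,       # sample without replacement
    n_jobs=-1,             # use all available processors
    random_state=rng,      # reproducible randomness
)
clf.fit(X_train)

labels = clf.predict(X_train)              # +1 = normal, -1 = anomaly
scores = clf.decision_function(X_train)    # lower (more negative) = more anomalous
print("points flagged as anomalies:", int(np.sum(labels == -1)))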
4 Python code implementation

# _*_ coding:utf-8 _*_
# ~~ Welcome to the official account: beauty of Power system and algorithm ~~

# ~~ Import related libraries ~~
import numpy as np
import matplotlib as mpl
mpl.use('TkAgg')
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest  # isolation forest

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

# Pseudorandom number generator; 42 is the seed, and the same seed always
# produces the same pseudorandom sequence.
rng = np.random.RandomState(42)

# ~~ Generate training data ~~
X = 0.3 * rng.randn(100, 2)  # randn: standard normal distribution (rand would give samples in [0, 1))
X_train = np.r_[X + 2, X - 2]
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# ~~ Train the model ~~
clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_outliers)
xx, yy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# ~~ Visualization ~~
plt.title("isolated random forest")
plt.contourf(xx, yy, Z, cmap=plt.cm.Blues_r)
b1 = plt.scatter(X_train[:, 0], X_train[:, 1], c='green', s=20, edgecolor='k')
b2 = plt.scatter(X_test[:, 0], X_test[:, 1], c='white', s=20, edgecolor='k')
c = plt.scatter(X_outliers[:, 0], X_outliers[:, 1], c='red', s=20, edgecolor='k')
plt.axis('tight')
plt.xlim((-5, 5))
plt.ylim((-5, 5))
plt.legend([b1, b2, c],
           ["training observations", "new regular observations", "new abnormal observations"],
           loc="upper left")
plt.show()

5 Results
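The original article's results figure is not reproduced here. As a small sketch of how the results can be inspected numerically (reusing clf, X_test and X_outliers defined in the code above), one can compare the predictions and decision-function scores of the regular and abnormal test points; the exact numbers depend on the random seed and the scikit-learn version.

# predict() returns +1 for points judged normal and -1 for anomalies.
print("regular test points flagged as anomalies:", int(np.sum(clf.predict(X_test) == -1)))
print("outlier points flagged as anomalies:     ", int(np.sum(clf.predict(X_outliers) == -1)))
# decision_function() returns a continuous score; more negative means more anomalous.
print("mean score, regular test points:", clf.decision_function(X_test).mean())
print("mean score, outlier points:     ", clf.decision_function(X_outliers).mean())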
That is all of the content of "How to implement the isolation forest algorithm in Python". Thank you for reading! I hope what has been shared here is helpful to you.