
R Language Data Mining Practice Series (5) -- Mining Modeling

I. Classification and prediction

Classification and prediction are the two main types of prediction problems. Classification mainly predicts categorical labels (discrete attributes), while prediction mainly establishes a continuous-valued function model to predict the value of the dependent variable corresponding to given values of the independent variables.

1. Realization process

(1) Classification

Classification means constructing a classification model that takes the attribute values of a sample as input and outputs the corresponding category, mapping each sample to a predefined class. The classification model is built on a data set whose class labels are already known, and the accuracy of the model on the existing samples can easily be calculated, so classification belongs to supervised learning.

(2) Prediction

Prediction means establishing a functional model between two or more mutually dependent variables and then using it for prediction or control.

(3) implementation process

The classification model is implemented in two steps. The first step is the learning step: the classification model is built by induction and analysis of the training sample set, yielding classification rules. The second step is the classification step: the accuracy of the classification rules is first evaluated on a known test sample set, and if the accuracy is acceptable, the model is used to predict the class labels of samples whose labels are unknown.

The prediction model is likewise implemented in two steps. The first step builds a functional model of the prediction attribute (numerical) from the training set; in the second step, after the model has passed testing, it is used for prediction or control.
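As a concrete illustration of the two-step process, the following minimal R sketch splits a data set into a training set and a test set, learns a classifier, and checks its accuracy on the test set; the choice of the built-in iris data, the k-nearest-neighbour classifier from the class package and k = 5 are illustrative assumptions, not the only possible setup.

library(class)                                   # provides knn()
set.seed(123)
idx   <- sample(nrow(iris), 0.7 * nrow(iris))    # learning step: build the model on the training set
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]                         # classification step: evaluate on a known test set
pred  <- knn(train, test, cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])                 # if this accuracy is acceptable, apply the model to unlabeled samples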

2. Commonly used classification and prediction algorithms

Table 5-1 A brief introduction to common classification and prediction algorithms

Algorithm name | Algorithm description
Regression analysis | the most commonly used statistical method for determining the quantitative relationship between the prediction attribute (numerical) and other variables; it includes linear regression, nonlinear regression, Logistic regression, ridge regression, principal component regression, partial least squares regression and other models
Decision tree | adopts a top-down recursive approach, compares attribute values at internal nodes and branches downward from a node according to the different attribute values; the final leaf nodes are the classes of the learned partition
Artificial neural network | an information processing system built by imitating the structure and functions of the brain's neural network; it represents a model between the input and output variables of the network
Bayesian network | also called a belief network, it is an extension of the Bayes method and one of the most effective theoretical models in the field of uncertain knowledge representation and reasoning
Support vector machine (SVM) | an algorithm that, through some nonlinear mapping, turns low-dimensional nonlinear separability into high-dimensional linear separability and performs linear analysis in the high-dimensional space

3. Regression analysis.

Table 5-2 Classification of commonly used regression models

Regression model | Applicable conditions | Algorithm description
Linear regression | the dependent variable and the independent variables are linearly related | models the linear relationship between one or more independent variables and the dependent variable; the model coefficients can be solved with the least squares method
Nonlinear regression | the relationship between the dependent variable and the independent variables is not all linear | models the nonlinear relationship between one or more independent variables and the dependent variable; if the nonlinear relationship can be turned into a linear one by a simple function transformation, it is solved with the ideas of linear regression, otherwise the nonlinear least squares method is used
Logistic regression | the dependent variable generally takes only the values 1 and 0 (yes or no) | a special case of the generalized linear regression model; the logistic function is used to restrict the range of the dependent variable (the probability of taking the value 1) to between 0 and 1
Ridge regression | there is multicollinearity among the independent variables used in modeling | an improved method of least squares estimation
Principal component regression | there is multicollinearity among the independent variables used in modeling | based on the idea of principal component analysis, it is an improvement of the least squares method; it gives a biased estimate of the parameters and can eliminate the multicollinearity among the independent variables
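A few of the rows in Table 5-2 can be sketched directly in R; the built-in cars and mtcars data sets and the ridge penalty values below are illustrative choices only.

fit_lm <- lm(dist ~ speed, data = cars)               # linear regression solved by least squares
summary(fit_lm)$coefficients

fit_logit <- glm(am ~ wt + hp, data = mtcars,         # Logistic regression: 0/1 dependent variable,
                 family = binomial)                   # the logistic link keeps fitted values in (0, 1)
head(predict(fit_logit, type = "response"))

library(MASS)                                         # ridge regression for multicollinear predictors
fit_ridge <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars, lambda = seq(0, 10, by = 1))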

4. Decision tree

A decision tree is a tree structure in which each leaf node corresponds to a class and each non-leaf node corresponds to a split on an attribute, dividing the samples into several subsets according to their values on that attribute. For impure leaf nodes, the label of the majority class gives the class of the samples that reach that node. The core problem in constructing a decision tree is how to choose an appropriate attribute for splitting the samples at each step. For a classification problem, learning and constructing a decision tree from training samples with known class labels is a top-down, divide-and-conquer process.

Table 5-3 decision tree algorithm classification

ID3 algorithm | its core is to use the information gain method as the attribute selection criterion at every node of the decision tree, to help determine the appropriate attribute for splitting
C4.5 algorithm | an important improvement of the C4.5 tree generation algorithm over ID3 is the use of the information gain ratio to select node attributes; C4.5 overcomes a shortcoming of ID3: ID3 is only suitable for discrete description attributes, while C4.5 can handle both discrete and continuous description attributes
CART algorithm | the CART decision tree is a very effective non-parametric classification and regression method; it constructs a binary tree through tree growing, tree pruning and tree evaluation; when the terminal node is a continuous variable the tree is a regression tree, and when it is a categorical variable the tree is a classification tree
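A minimal CART-style sketch with the rpart package (rpart grows binary trees in the spirit of CART); the iris data and the pruning rule below are illustrative assumptions.

library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")    # grow a classification tree
printcp(fit)                                                # complexity table used for pruning
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)                         # prune back to the best cross-validated size
predict(pruned, iris[1:3, ], type = "class")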

5. Artificial neural network

An artificial neural network (ANN) is a mathematical model that imitates biological neural networks to process information. The artificial neuron is the basic information processing unit of an artificial neural network. The learning of an artificial neural network, also called training, is the process in which the network adjusts its parameters under the stimulation of the external environment so that it responds to that environment in a new way. In classification and prediction, artificial neural networks mainly use supervised learning: according to the given training samples, the parameters of the network are adjusted so that the network output approaches the known class labels of the samples or other forms of the dependent variable.

Whether the training of a neural network is complete is usually measured with an error function (also called the objective function) E; training stops when the error function falls below a preset value.

To use the artificial neural network model, it is necessary to determine the topological structure of the network connection, the characteristics of neurons and learning rules. The commonly used artificial neural network algorithms for classification and prediction are as follows:

Table 5-4 artificial neural network algorithm

Algorithm name | Algorithm description
BP neural network | a multilayer feedforward network trained with the error back-propagation algorithm; its learning rule is the delta learning rule; it is one of the most widely used neural network models at present
LM neural network | a multilayer feedforward network based on gradient descent and Newton's method; its advantages are fewer iterations, fast convergence and high accuracy
RBF radial basis function network | can approximate any continuous function with arbitrary precision; the transformation from the input layer to the hidden layer is nonlinear, while the transformation from the hidden layer to the output layer is linear; it is particularly suitable for solving classification problems
FNN fuzzy neural network | a neural network whose weight coefficients or input signals are fuzzy quantities; it is the product of combining fuzzy systems with neural networks, bringing together the advantages of both and integrating association, recognition, self-adaptation and fuzzy information processing
GMDH neural network | also called a polynomial network, it is a feedforward neural network commonly used for prediction; its characteristic is that the network structure is not fixed and changes during training
ANFIS adaptive neuro-fuzzy inference system | the neural network is embedded in an overall fuzzy structure; during training it learns from the training data automatically, generating, adjusting and highly generalizing the membership functions and fuzzy rules of the best input and output variables; in addition, the structure and parameters of each layer of the network have clear and easily understood physical meanings

The characteristic of the BP (Back Propagation) algorithm is to use the error at the output to estimate the error of the layer immediately preceding the output layer, and then use that error to estimate the error of the layer before it; propagating backwards layer by layer in this way yields error estimates for all the other layers. Thus the error observed at the output layer is passed step by step towards the input layer, in the direction opposite to the forward propagation of the input.

The BP algorithm uses only the first-derivative (gradient) information of the mean squared error function with respect to the weights and thresholds, which gives the algorithm defects such as slow convergence and a tendency to get stuck in local minima.
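For illustration, a single-hidden-layer feedforward network can be fitted with the nnet package; note that nnet optimizes the weights with BFGS rather than plain back-propagation gradient descent, so this is only a hedged stand-in for a BP network, and the iris data, 5 hidden units, weight decay and iteration limit are illustrative settings.

library(nnet)
set.seed(1)
fit  <- nnet(Species ~ ., data = iris, size = 5,     # 5 hidden units
             decay = 5e-4, maxit = 200, trace = FALSE)
pred <- predict(fit, iris, type = "class")
table(pred, iris$Species)                            # confusion matrix on the training data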

6. Evaluation of classification and prediction algorithm

In order to judge the performance of a prediction model effectively, a data set that did not participate in building the model is needed, and the accuracy of the model is then assessed on this data set; this independent data set is called the test set. The prediction performance of a model is usually evaluated with the absolute and relative error, the mean absolute error, the mean squared error, the root mean squared error, and so on.

(1) absolute error and relative error

If Y is the actual value and Y^ the predicted value, then E = Y - Y^ is called the absolute error. The relative error is e = E / Y.

(2) mean absolute error (MAE)

(3) The mean squared error (MSE) is the average of the squared prediction errors; squaring avoids the problem of positive and negative errors cancelling each other.

(4) The root mean squared error (RMSE) is the square root of the mean squared error and represents the degree of dispersion of the prediction errors; it is also called the standard error. The best fit corresponds to RMSE = 0.

(5) The mean absolute percentage error (MAPE). It is generally considered that a MAPE below 10 indicates high prediction accuracy.
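The measures in (1) to (5) can be written out directly in base R; the two vectors below are toy numbers used only to show the formulas.

y     <- c(10, 12, 15, 20, 18)     # actual values Y
y_hat <- c(11, 11, 16, 19, 20)     # predicted values Y^
E     <- y - y_hat                 # absolute error
e     <- E / y                     # relative error
MAE   <- mean(abs(E))              # mean absolute error
MSE   <- mean(E^2)                 # mean squared error
RMSE  <- sqrt(MSE)                 # root mean squared error
MAPE  <- mean(abs(E / y)) * 100    # mean absolute percentage error (%)
c(MAE = MAE, MSE = MSE, RMSE = RMSE, MAPE = MAPE)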

(6) Kappa statistics

The Kappa statistic is an index for measuring whether the observations of two or more observers on the same object, or two or more observations of the same object, agree; it uses the difference between the agreement expected by chance and the actually observed agreement as the basis of evaluation. The Kappa statistic and the weighted Kappa statistic can not only test the consistency and reproducibility of unordered and ordered categorical variables, but also give a quantitative value that reflects the degree of agreement.

Kappa takes values in the interval [-1, 1], and different values have different meanings:

Kappa=1: indicates that the results of the two judgments are exactly the same.

Kappa=-1: indicates that the results of the two judgments are completely inconsistent.

Kappa=0: indicates that the agreement between the two judgments is due to chance.

Kappa>0: the result is meaningful; the larger the Kappa, the better the consistency.

Kappa ≥ 0.75: indicates that a fairly satisfactory degree of consistency has been achieved
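Kappa can be computed from an agreement (confusion) matrix in a few lines of base R; the 2 x 2 matrix below is a toy example, not data from the text.

conf <- matrix(c(45,  5,
                  8, 42), nrow = 2, byrow = TRUE)   # rows: judge 1, columns: judge 2
n  <- sum(conf)
po <- sum(diag(conf)) / n                           # observed agreement
pe <- sum(rowSums(conf) * colSums(conf)) / n^2      # agreement expected by chance
(po - pe) / (1 - pe)                                # Kappa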

III. Association rules

The probability that itemset A and itemset B occur simultaneously is the support of the association rule:

Support (A => B) = P(A ∩ B)

If itemset An occurs, the probability of itemset B occurrence is the confidence of the association rule:

Confidence (A => B) = P(B | A)

Minimum support and minimum confidence

The minimum support is a threshold, defined by the user or an expert, for measuring support; it indicates the minimum statistical importance of an itemset. The minimum confidence is a threshold, defined by the user or an expert, for measuring confidence; it indicates the minimum reliability of an association rule. A rule that satisfies both the minimum support threshold and the minimum confidence threshold is called a strong rule.

Itemset

An itemset is a collection of items, and an itemset containing k items is called a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset, also known as its absolute support or support count. If the relative support of an itemset I satisfies a predefined minimum support threshold, then I is a frequent itemset. The set of frequent k-itemsets is usually denoted Lk.

Support count

The support count of itemset A is the number of transactions in the transaction data set that contain itemset A; it is also called the frequency or count of the itemset.

Given the support counts, the support and confidence of the rule A => B can easily be derived from the total number of transactions and the support counts of itemset A and itemset A ∩ B.

Support (A => B) = (number of transactions containing both A and B) / (number of all transactions) = Support_count(A ∩ B) / Total_count

Confidence (A => B) = P(B | A) = Support(A ∩ B) / Support(A) = Support_count(A ∩ B) / Support_count(A)

That is, once the total number of transactions and the support counts of A, B, and A ∩ B are obtained, the association rules A => B and B => A can be derived, and it can be checked whether they are strong rules.
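The two formulas can be applied directly to a small transaction list in base R; the five toy transactions and the itemsets A = {a} and B = {b} below are illustrative only.

transactions <- list(c("a", "b", "c"), c("a", "b"), c("a", "d"),
                     c("b", "c"), c("a", "b", "d"))
total_count  <- length(transactions)
contains <- function(items) sum(sapply(transactions, function(t) all(items %in% t)))
support_count_AB <- contains(c("a", "b"))            # Support_count(A ∩ B)
support_count_A  <- contains("a")                    # Support_count(A)
support_AB    <- support_count_AB / total_count      # Support(A => B)
confidence_AB <- support_count_AB / support_count_A  # Confidence(A => B)
c(support = support_AB, confidence = confidence_AB)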

(2) Apriori algorithm: using candidates to generate frequent itemsets

The main idea of the Apriori algorithm is to find the maximal frequent itemsets in the transaction data set, and then use these maximal frequent itemsets together with the preset minimum confidence threshold to generate strong association rules.

The properties of Apriori

All non-empty subsets of a frequent itemset must also be frequent. From this property it follows that if an itemset I is not a frequent itemset, then adding an item A to it produces a new itemset I ∪ A that cannot be a frequent itemset either.

Two processes in the implementation of Apriori algorithm

a. Find all frequent itemsets (whose support is greater than or equal to the given minimum support threshold); in this process the join step and the pruning step alternate, finally producing the maximal frequent itemset Lk.

Join step: the purpose of the join step is to generate the candidate k-itemsets Ck.

Pruning step: follows the join step and serves to reduce the search space during the generation of the candidate set Ck.

b. Generate strong association rules from the frequent itemsets: in process a, itemsets whose support does not exceed the predetermined minimum support threshold have already been eliminated; if the remaining rules also satisfy the predetermined minimum confidence threshold, strong association rules have been mined.
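In R these two processes are packaged in the apriori() function of the arules package; a minimal sketch, in which the toy transactions and the thresholds supp = 0.4 and conf = 0.6 are illustrative assumptions:

library(arules)
trans <- as(list(c("a", "b", "c"), c("a", "b"), c("a", "d"),
                 c("b", "c"), c("a", "b", "d")), "transactions")
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(sort(rules, by = "confidence"))     # the strong rules that pass both thresholds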

IV. Time series patterns

A group of random variables arranged in chronological order, X1, X2, ..., Xt (t = 1, 2, ..., n), is commonly used to represent the time series of a random event, abbreviated {Xt, t = 1, 2, ..., n}; the n ordered observations of the random series, denoted x1, x2, ..., xn, are called an observation sequence of length n.

1. Time series algorithms

Table 5-10 Common time series models

Model name | Description
Smoothing method | often used for trend analysis and prediction; smoothing techniques are used to weaken the influence of short-term random fluctuations on the series and smooth it; depending on the smoothing technique used, it can be divided into the moving average method and the exponential smoothing method
Trend fitting method | takes time as the independent variable and the corresponding series observations as the dependent variable to establish a regression model; according to the characteristics of the series it can be divided into linear fitting and curve fitting
Combination model | see the description below

The change of time series is mainly affected by four factors: long-term trend (T), seasonal variation (S), periodic variation (C) and irregular variation (ε). According to the characteristics of the sequence, the addition model and multiplication model can be constructed.

Addition model: xt = Tt + St + Ct + εt

Multiplication model: xt = Tt·St·Ct·εt
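Both forms of the combination model can be explored with the base-R decompose() function, which estimates trend, seasonal and irregular components (it does not separate a cyclical term); the built-in AirPassengers series is used here only as an example.

dec_add  <- decompose(AirPassengers, type = "additive")        # trend + seasonal + irregular
dec_mult <- decompose(AirPassengers, type = "multiplicative")  # trend * seasonal * irregular
plot(dec_mult)    # the multiplicative form suits this series' growing seasonal swings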

AR model

xt = φ0 + φ1·x(t-1) + φ2·x(t-2) + ... + φp·x(t-p) + εt

A linear regression model is established with the series values x(t-1), x(t-2), ..., x(t-p) of the previous p periods as the independent variables and the value xt of the random variable Xt as the dependent variable.

MA model

xt = μ + εt - θ1·ε(t-1) - θ2·ε(t-2) - ... - θq·ε(t-q)

The value xt of the random variable Xt is unrelated to the series values of previous periods; a linear regression model is established between xt and the random disturbances ε(t-1), ε(t-2), ..., ε(t-q) of the previous q periods.

ARMA model

xt = φ0 + φ1·x(t-1) + φ2·x(t-2) + ... + φp·x(t-p) + εt - θ1·ε(t-1) - θ2·ε(t-2) - ... - θq·ε(t-q)

The value xt of the random variable Xt is related not only to the series values of the previous p periods but also to the random disturbances of the previous q periods.

ARIMA model

Many non-stationary series exhibit the properties of a stationary series after differencing; such series are called difference-stationary series, and a difference-stationary series can be fitted with an ARIMA model.

ARCH model

The ARCH model can accurately model the volatility of time series variables; it is suitable for heteroscedastic series whose heteroscedasticity function has short-term autocorrelation.

GARCH model and derived models

The GARCH model, also called the generalized ARCH model, is an extension of the ARCH model. Compared with the ARCH model, the GARCH model and its derived models can better reflect the long-term memory and information asymmetry present in real series.

2. Preprocessing of time series

After an observation sequence is obtained, its pure randomness and stationarity must first be tested; these two important tests constitute the preprocessing of the sequence.

A purely random sequence, also called a white noise sequence, has no correlation between its terms; the sequence fluctuates randomly in a completely disordered way, and the analysis of such a sequence can be terminated.

For a stationary non-white-noise series, the mean and variance are constant, and there is a very mature set of modeling methods for stationary series. Usually a linear model is established to fit the development of the series, and the ARMA model is the most commonly used model for fitting stationary series.

For a non-stationary series, because its mean and variance are not constant, the usual approach is to transform it into a stationary series so that the analysis methods for stationary time series can be applied. If a time series becomes stationary after differencing, it is called a difference-stationary series, and it can be analysed with an ARIMA model.

3. Stationary time series analysis

The full name of the ARMA model is the autoregressive moving average model; it is the most commonly used model for fitting stationary series. It can be subdivided into the AR model, the MA model and the ARMA model, all of which can be regarded as multiple linear regression models.
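A stationary ARMA series can be simulated and fitted with base R; the ARMA(1,1) parameters below are arbitrary illustrative values.

set.seed(42)
x   <- arima.sim(model = list(ar = 0.6, ma = -0.3), n = 300)  # simulate an ARMA(1,1) series
fit <- arima(x, order = c(1, 0, 1))                           # ARMA(p, q) is ARIMA(p, 0, q)
fit$coef                                                      # estimates should be near 0.6 and -0.3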

4. Non-stationary time series analysis

The analysis methods of non-stationary time series can be divided into two categories: deterministic factor decomposition time series analysis and random time series analysis.

5. Main time series algorithm functions in R

The time series algorithm implemented in R is mainly the ARIMA model. Modeling with it requires a series of judgment operations: stationarity test, white noise test, differencing, examination of the AIC and BIC values to determine the model order, and finally prediction.

Table 5-11 List of time series algorithm functions

Function name | Function | Package
acf() | computes the autocorrelation coefficients and draws the autocorrelation plot | general R function
pacf() | computes the partial autocorrelation coefficients and draws the partial autocorrelation plot | general R function
unitrootTest() | unit root test of the observation series | fUnitRoots
diff() | differencing of the observation series | general R function
armasubsets() | determines the model order for the time series modeling parameters and creates a regression time series model | TSA
arima() | sets the modeling parameters of the time series and creates an ARIMA time series model, or turns a regression time series model into an ARIMAX model | general R function
Box.test() | tests whether the residuals of the ARIMA model pass the white noise test | general R function
forecast() | applies the fitted time series model for prediction | forecast
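A sketch of the workflow behind Table 5-11, using several of the listed functions; the AirPassengers data, the log transform and the seasonal ARIMA(0,1,1)(0,1,1)[12] order are illustrative choices rather than a prescribed model.

x <- log(AirPassengers)
acf(x); pacf(x)                       # autocorrelation and partial autocorrelation plots
dx <- diff(x)                         # differencing toward stationarity
acf(dx); pacf(dx)                     # re-examine the differenced series to choose the order
fit <- arima(x, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
Box.test(residuals(fit), lag = 12, type = "Ljung-Box")   # white noise test of the residuals
library(forecast)                     # forecast() is provided by the forecast package
plot(forecast(fit, h = 24))           # predict the next 24 months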

V. Outlier detection

The task of outlier detection is to find objects that differ significantly from most of the other objects.

(1) Causes of outliers: data coming from different classes, natural variation, and errors in data measurement and collection.

(2) types of outliers

Table 5-12 rough classification of outliers

Classification criterion | Classification | Description
By data range | global outliers and local outliers | viewed globally, some objects do not show outlier characteristics, but viewed locally they show a certain degree of outlierness
By data type | numerical outliers and categorical outliers | divided according to the attribute types of the data set
By number of attributes | one-dimensional outliers and multi-dimensional outliers | an object may have outlying values on one attribute or on several attributes

1. Outlier detection methods

Table 5-13 commonly used outlier detection methods

Outlier detection method | Method description | Method evaluation
Statistics-based | most statistics-based outlier detection methods construct a probability distribution model and compute the probability that an object fits the model; objects with low probability are treated as outliers | its premise is knowing which distribution the data set follows; for high-dimensional data the test may work poorly
Proximity-based | a proximity measure can usually be defined between data objects, and objects that are far from most other points can be regarded as outliers | simple; two- or three-dimensional data can be inspected with scatter plots; not suitable for large data sets; sensitive to the choice of parameters; uses a global threshold and cannot handle data sets containing regions of different density
Density-based | considers that a data set may contain regions of different density; from the density point of view, outliers are objects in low-density regions; the outlier score of an object is the inverse of the density around it | gives a quantitative measure of how much an object is an outlier and handles data with different regions well; not suitable for large data sets; parameter selection is difficult
Clustering-based | one way of using clustering to detect outliers is to discard small clusters that are far from the other clusters; a more systematic approach is to first cluster all the objects and then assess how strongly each object belongs to its cluster (the outlier score) | clustering-based techniques for finding outliers may be highly effective; the quality of the clusters produced by the clustering algorithm has a great influence on the quality of the outliers it produces

Density-based outlier detection is closely related to proximity-based outlier detection, because density is commonly defined as the reciprocal of the average distance to the k nearest neighbours: if the distance is small, the density is high. Another definition, used by the DBSCAN algorithm, takes the density around an object to be the number of objects within a specified distance d of that object.

2. Model-based outlier detection algorithms

A data model is established by estimating the parameters of a probability distribution. If a data object cannot be fitted well by the model, that is, if it probably does not follow the distribution, then it is an outlier.

(1) outlier detection in univariate normal distribution
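For a univariate normal model a common rule is to flag points that lie more than a fixed number of standard deviations from the mean; in the minimal sketch below, the 3-sigma cutoff and the toy data with one planted outlier are illustrative assumptions.

set.seed(7)
x <- c(rnorm(100, mean = 50, sd = 5), 90)   # mostly normal data plus one planted outlier
z <- (x - mean(x)) / sd(x)                  # standardized scores
which(abs(z) > 3)                           # indices flagged as outliers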

(2) Outlier detection with mixture models

A mixture model is a special statistical model that uses several statistical distributions to model the data. Each distribution corresponds to a cluster, and the parameters of each distribution provide a description of the corresponding cluster, usually in terms of its center and spread.

The mixture model treats the data as a set of observations drawn from different probability distributions. The probability distributions can in principle be arbitrary, but they are usually taken to be multivariate normal.

Generally speaking, the mixed model data generation process is as follows: given several distributions of the same type but different parameters, randomly select a distribution and generate an object from it. Repeat the process m times, where m is the number of objects.

In clustering, the mixture model approach assumes that the data come from a mixture of probability distributions and that each cluster can be identified with one of these distributions. Similarly, for outlier detection the data are modeled with a mixture of two distributions, one for the normal data and one for the outliers.

The goal of clustering and outlier detection is to estimate the parameters of the distribution to maximize the total likelihood of the data.

This suggests a simple approach to outlier detection: first put all data objects into the normal set, with the outlier set empty, and then use an iterative procedure to transfer objects from the normal set to the outlier set, as long as each transfer increases the total likelihood of the data.

(3) outlier detection method based on clustering.

a. Discard small clusters far away from other clusters: in general, this procedure can be simplified to discarding all clusters smaller than a minimum threshold. This method can be used with any clustering technique, but it requires thresholds for the minimum cluster size and for the distance between a small cluster and the other clusters. Moreover, this scheme is highly sensitive to the choice of the number of clusters, and with this scheme it is difficult to attach an outlier score to objects.

b. Prototype-based clustering

Another more systematic approach is to first cluster all objects and then evaluate the extent to which objects belong to clusters (outlier scores). In this method, the degree of belonging to a cluster can be measured by the distance from the object to the center of its cluster. In particular, if the deletion of an object results in a significant improvement in the goal, the object can be considered an outlier.

For prototype-based clustering, there are two main ways to assess how strongly an object belongs to a cluster (its outlier score): one is to measure the distance from the object to the cluster prototype and use it as the object's outlier score; the other, which accounts for clusters of different density, is to measure the relative distance from the object to the prototype, where the relative distance is the ratio of the distance from the point to the centroid to the median distance from the points in the cluster to the centroid.

Carry out clustering: choose a clustering algorithm, cluster the samples into K clusters, and find the centroid of each cluster (a minimal R sketch of the whole procedure follows these steps).

Calculate the distance from each object to its nearest centroid.

Calculate the relative distance from each object to its nearest centroid.

Compare the relative distance with a given threshold.

If the relative distance of an object is greater than the threshold, the object is considered an outlier.
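A minimal R sketch of the five steps above, using kmeans(); the choice of the iris data, K = 3 and the threshold 2 are illustrative assumptions.

set.seed(10)
dat <- scale(iris[, 1:4])
km  <- kmeans(dat, centers = 3)                   # step 1: cluster into K clusters, get the centroids
cent <- km$centers[km$cluster, ]                  # centroid assigned to each object
dist_to_cent <- sqrt(rowSums((dat - cent)^2))     # step 2: distance to the assigned (nearest) centroid
med <- tapply(dist_to_cent, km$cluster, median)
rel_dist <- dist_to_cent / med[km$cluster]        # step 3: relative distance (ratio to the cluster median)
threshold <- 2                                    # step 4: compare with a given threshold
which(rel_dist > threshold)                       # step 5: objects above the threshold are outliers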

The improvements of outlier detection based on clustering are as follows:

The influence of outliers on the initial clustering: outliers are detected by clustering, but outliers themselves affect the clustering result. To deal with this problem the following approach can be used: cluster the objects, remove the outliers, and cluster the objects again (this does not guarantee optimal results).

A more sophisticated approach: keep a set of special objects that do not fit any cluster well; this set represents potential outliers. As the clustering process proceeds, the clusters change. Objects that no longer strongly belong to any cluster are added to the set of potential outliers, while objects currently in the set are re-tested; if such an object now strongly belongs to a cluster, it is removed from the set. The points remaining in the set at the end of the clustering process are classified as outliers.

Whether an object is considered an outlier may depend on the number of clusters. One strategy is to repeat the analysis with different numbers of clusters; another is to find a large number of small clusters, the idea being:

Smaller clusters tend to be more cohesive.

If an object is an outlier when there are a large number of small clusters, it is probably a true outlier.

The downside is that a group of outliers may form small clusters to evade detection.
