Detailed explanation of ten statistical techniques that data scientists need to master
https://mp.weixin.qq.com/s/eRBYjneWBTu6ep4UNGNUuw
Author: James Le
Compiled by: Lu Xue, Liu Xiaokun, Jiang Siyuan
This article is reproduced from Machine Heart (almosthuman2014). Reprint requires authorization.
"data scientists are better at statistics than programmers and better at programming than statisticians. This paper introduces ten statistical techniques that data scientists need to master, including linear regression, classification, resampling, dimensionality reduction, unsupervised learning and so on.
Whatever your attitude toward data science, it is impossible to ignore the importance of analyzing, organizing, and making sense of data. Glassdoor compiled a list of the "25 Best Jobs in America" based on feedback from employers and employees, and data scientist ranked first. Although the role already sits at the top of that ranking, the work of data scientists does not stop there. As technologies such as deep learning become more common and attract growing attention from researchers, engineers, and the companies that hire them, data scientists remain at the forefront of innovation and technological progress.
While strong programming skills are important, data science is not all about software engineering (in fact, familiarity with Python is enough to meet most programming needs). Data scientists need a combination of programming, statistics, and critical-thinking skills. As Josh Wills put it, "data scientists are better at statistics than programmers and better at programming than statisticians." I know many software engineers who want to become data scientists but blindly use machine learning frameworks such as TensorFlow or Apache Spark to process data without fully understanding the statistical theory behind them. They therefore need to study statistical machine learning systematically; it grew out of statistics and functional analysis and draws on many fields such as information theory, optimization theory, and linear algebra.
Why study statistics? It is important to understand the ideas behind the different techniques so that you know how and when to use them. It is equally important to be able to assess a method's performance accurately, because that tells us how well it works on a particular problem. In addition, statistical learning is a fascinating research area with important applications in science, industry, and finance. Finally, statistical learning is a fundamental part of training a modern data scientist. Classic research topics in statistical learning include:
Linear regression model
Perceptron
K-nearest neighbor method
Naive Bayesian method
Decision tree
Logistic regression and maximum entropy models
Support vector machine
Boosting methods
EM algorithm
Hidden Markov model
Conditional random field
Next, I will introduce 10 statistical techniques that help data scientists handle large data sets more efficiently. Before that, I would like to clarify the differences between statistical learning and machine learning:
Machine learning is a branch that leans toward artificial intelligence.
Statistical learning is a branch that leans toward statistics.
Machine learning places more emphasis on large-scale applications and prediction accuracy.
Statistical learning places more emphasis on models and their interpretability, as well as precision and uncertainty.
The distinction between the two is becoming increasingly blurred.
1. Linear regression.
In statistics, linear regression predicts a target variable by fitting the best linear relationship between the dependent variable and the independent variables. The best fit is the one that minimizes the sum of the distances between the fitted line and the actual observations; no other placement of the line produces less error, and in that sense the fit is "best". The two main types of linear regression are simple linear regression and multiple linear regression.
Simple linear regression uses a single independent variable to predict the dependent variable by fitting the best linear relationship. Multiple linear regression uses two or more independent variables to predict the dependent variable by fitting the best linear relationship.
Pick any two related quantities from everyday life. For example, suppose I have data on my monthly spending, monthly income, and number of trips per month over the past three years. Now I need to answer the following questions (a code sketch follows them):
How much will I spend next year?
Which factor (monthly income or number of trips per month) is more important in determining monthly expenditure?
What is the relationship between monthly income, number of trips per month and monthly expenditure?
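The following is a minimal sketch of how simple and multiple linear regression could address questions like these with scikit-learn's ordinary-least-squares LinearRegression; the monthly figures are made-up illustrative numbers, not data from the article.

```python
# Illustrative sketch of simple and multiple linear regression (OLS).
# The numbers below are hypothetical monthly records, not real data.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: monthly income, trips per month
X = np.array([[3000, 2], [3200, 3], [3500, 2], [4000, 4],
              [4200, 3], [4500, 5], [5000, 4], [5200, 6]], dtype=float)
y = np.array([1200, 1350, 1300, 1600, 1550, 1800, 1750, 2000], dtype=float)  # monthly spending

# Simple linear regression: spending explained by income alone
simple = LinearRegression().fit(X[:, [0]], y)
print("income coefficient:", simple.coef_[0])

# Multiple linear regression: spending explained by income and trips together
multiple = LinearRegression().fit(X, y)
print("coefficients (income, trips):", multiple.coef_)
print("predicted spending for income 5500 and 5 trips:",
      multiple.predict([[5500.0, 5.0]])[0])
```

To judge which factor matters more, one would typically standardize the predictors first so that the fitted coefficients are comparable in scale.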
2. Classification
Classification is a data mining technique that assigns categories to data in order to support more accurate prediction and analysis. It is an effective way to analyze very large data sets. The two main classification techniques are logistic regression and discriminant analysis (Discriminant Analysis).
Logistic regression is the appropriate regression analysis when the dependent variable is binary. Like all regression analyses, logistic regression is a form of predictive analysis. It is used to describe data and to explain the relationship between a binary dependent variable and one or more independent variables that describe the characteristics of the objects. Typical questions logistic regression can address include the following (a code sketch follows these examples):
How does every pound over the standard body weight, or every pack of cigarettes smoked per day, change the probability of lung cancer (yes or no)?
Do calorie intake, fat intake and age affect heart disease (yes or no)?
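As a hedged illustration of the second question, here is a minimal logistic regression sketch; the data, feature names, and numbers are illustrative assumptions rather than the article's.

```python
# Illustrative logistic regression on a binary outcome (heart disease yes/no)
# from calorie intake, fat intake and age. All values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# columns: daily calorie intake, fat intake (grams), age (years)
X = np.array([[1800, 50, 35], [2500, 90, 55], [2200, 70, 45],
              [3000, 110, 60], [1600, 40, 30], [2800, 100, 65],
              [2000, 60, 40], [3200, 120, 70]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = heart disease, 0 = none

# scaling the features first keeps the solver well behaved
clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
print("P(no disease), P(disease):", clf.predict_proba([[2400.0, 80.0, 50.0]])[0])
```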
In discriminant analysis, two or more known groups, clusters, or populations serve as prior categories, and new observations are assigned to one of them based on measured characteristics. Discriminant analysis models the distribution of the predictors X separately in each response class and then uses Bayes' theorem to estimate the probability of each class given a value of X. Such models can be linear discriminant analysis (Linear Discriminant Analysis) or quadratic discriminant analysis (Quadratic Discriminant Analysis).
Linear discriminant analysis (LDA): computes a "discriminant score" for each observation to decide which class of the response variable it belongs to. These scores are obtained from linear combinations of the independent variables. LDA assumes that the observations in each class are drawn from a multivariate Gaussian distribution and that the covariance of the predictors is common across all k levels of the response variable Y.
Quadratic discriminant analysis (QDA): provides an alternative. Like LDA, QDA assumes that the observations in each class of Y are drawn from a Gaussian distribution. Unlike LDA, however, QDA assumes that each class has its own covariance matrix; that is, the covariance of the predictors is not common across the k levels of Y.
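A minimal sketch contrasting LDA and QDA on synthetic two-dimensional data: both model each class as Gaussian, but LDA assumes a shared covariance matrix while QDA estimates one per class. The data here is randomly generated for illustration only.

```python
# Illustrative comparison of LDA and QDA on synthetic data.
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# two classes whose covariance structures differ
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 1.0], size=(100, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=[0.5, 2.0], size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance assumption
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariance matrices
print("LDA training accuracy:", lda.score(X, y))
print("QDA training accuracy:", qda.score(X, y))
```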
3. Resampling method
Resampling involves drawing repeated samples from the original data. It is a nonparametric approach to statistical inference; that is, it does not rely on a generic theoretical distribution to approximate p-values.
Resampling generates a sampling distribution that is unique to the actual data, using empirical rather than analytical methods, and it yields unbiased estimates because it is based on unbiased samples of all the possible outcomes present in the data. To understand resampling, you should first understand the bootstrap (Bootstrapping) and cross-validation (Cross-Validation):
The bootstrap (Bootstrapping) is useful in many situations, such as validating the performance of a predictive model, ensemble methods, and estimating the bias and variance of a model. It samples the original data with replacement and uses the data points that were not selected as the test set. This can be repeated many times, and the average score used as an estimate of model performance.
Cross-validation is used to validate model performance and is performed by splitting the training data into k folds. The model is trained on k-1 folds and the held-out fold is used as the test set. The process is repeated k times, and the average of the k scores serves as the performance estimate.
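A minimal sketch of both resampling ideas on synthetic data: a bootstrap estimate of model performance using the rows never drawn as the test set, and 5-fold cross-validation. The data-generating setup is an assumption for illustration.

```python
# Illustrative bootstrap and k-fold cross-validation on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Bootstrap: sample rows with replacement, test on the rows never selected,
# repeat B times and average the scores
B, scores = 100, []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))
    oob = np.setdiff1d(np.arange(len(X)), idx)   # "unselected" test points
    model = LinearRegression().fit(X[idx], y[idx])
    scores.append(model.score(X[oob], y[oob]))
print("bootstrap mean R^2:", np.mean(scores))

# k-fold cross-validation: train on k-1 folds, test on the held-out fold
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("5-fold CV mean R^2:", cv_scores.mean())
```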
Generally, for linear models, ordinary least squares is the standard criterion for fitting the data. The next three methods can provide better prediction accuracy and model interpretability.
4. Subset selection
This approach identifies a subset of the p predictors that we believe is most related to the response and then fits a model on that subset using least squares.
Best subset selection: fit a separate OLS regression for each possible combination of the p predictors and then examine the fit of each model. The algorithm has two stages: (1) fit all models that contain k predictors, for each k up to the maximum model size; (2) select a single model using cross-validated prediction error. It is important to use validation or test error rather than training error to assess model fit, because RSS decreases and R^2 increases monotonically as more variables are added. The best approach is to use cross-validation and choose the model with the highest R^2 and lowest RSS on held-out data.
Forward stepwise selection considers a much smaller set of models. It starts with a model containing no predictors and adds predictors one at a time until all predictors are in the model. The order in which predictors are added is determined by how much each variable improves the fit, and variables are added until no remaining predictor improves the cross-validation error (see the sketch below).
Backward stepwise selection starts with all p predictors in the model and iteratively removes the least useful predictor, one at a time.
The hybrid approach follows the forward stepwise procedure, but after adding each new variable it may also remove variables that no longer contribute to the model fit.
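A minimal sketch of forward stepwise selection as described above: start with no predictors and greedily add the one that most improves the cross-validated fit, stopping when no addition helps. The data and stopping rule are illustrative assumptions.

```python
# Illustrative forward stepwise selection driven by cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
p = 8
X = rng.normal(size=(150, p))
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(scale=0.5, size=150)  # only 2 real signals

selected, best_score = [], -np.inf
while True:
    candidates = [j for j in range(p) if j not in selected]
    if not candidates:
        break
    # cross-validated score of each model that adds one more predictor
    trial = {j: cross_val_score(LinearRegression(), X[:, selected + [j]], y, cv=5).mean()
             for j in candidates}
    j_best = max(trial, key=trial.get)
    if trial[j_best] <= best_score:   # stop when no candidate improves the CV score
        break
    selected.append(j_best)
    best_score = trial[j_best]
print("selected predictors:", selected, "CV R^2:", round(best_score, 3))
```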
5. Shrinkage
This approach fits a model using all p predictors, but the estimated coefficients are shrunk toward zero relative to the least-squares estimates. This shrinkage, also known as regularization, reduces variance and helps prevent overfitting. Depending on the type of shrinkage used, some coefficients may be estimated to be exactly zero, so these methods can also perform variable selection. The most common techniques for shrinking coefficients toward zero are ridge regression and lasso regression.
Ridge regression is very similar to least squares, except that it estimates the coefficients by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS, but it adds a shrinkage penalty that pulls the coefficients toward zero. Without going into the mathematics, it is intuitive that ridge regression shrinks the features into the smallest possible subspace. Like principal component analysis, ridge regression projects the data into a d-dimensional space and then shrinks the coefficients of the low-variance components more strongly while retaining the high-variance components.
Ridge regression has at least one disadvantage: it includes all p predictors in the final model, because the penalty pushes the coefficients of many predictors toward zero but never exactly to zero. This is usually not a problem for prediction accuracy, but it makes the model harder to interpret. The lasso overcomes this drawback: when the tuning parameter s is small enough it can force some predictor coefficients to be exactly zero (s = 1 yields ordinary OLS regression, and as s approaches 0 the coefficients shrink toward zero). Lasso regression is therefore also a good way to perform variable selection.
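A minimal sketch of the two shrinkage methods on synthetic data: ridge's L2 penalty shrinks every coefficient toward zero, while the lasso's L1 penalty can set some of them exactly to zero. The data and alpha values are illustrative assumptions (scikit-learn parameterizes the penalty with alpha rather than the constraint s used above).

```python
# Illustrative ridge vs. lasso fit; only the first two predictors carry signal.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("ridge coefficients:", np.round(ridge.coef_, 3))  # small but nonzero everywhere
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant ones typically set to zero
```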
6. Dimension reduction
Dimension reduction simplifies the problem of estimating p + 1 coefficients to the problem of estimating M + 1 coefficients, where M < p.
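As a minimal sketch of this idea, the following fits a principal components regression: ten correlated predictors are projected onto M = 3 principal components and the response is regressed on those components. The data-generating setup is an illustrative assumption.

```python
# Illustrative principal components regression (PCA followed by OLS).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)
latent = rng.normal(size=(120, 3))                   # 3 underlying factors
X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(120, 10))  # p = 10 correlated predictors
y = latent.sum(axis=1) + rng.normal(scale=0.5, size=120)

M = 3                                                # estimate M + 1 coefficients instead of p + 1
pcr = make_pipeline(PCA(n_components=M), LinearRegression()).fit(X, y)
print("PCR R^2 with", M, "components:", round(pcr.score(X, y), 3))
```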