How to Perform Data Mining on a Website


This article introduces the basics of how to perform data mining on a website. Many people run into this kind of difficulty in practical cases, so let the editor walk you through how to handle these situations. I hope you read it carefully and come away with something useful!

# What is machine learning?

As machine learning is applied more and more widely in industry, the term has taken on a variety of meanings. In this article, "machine learning" is used in the sense given on Wikipedia:

Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data.

Machine learning can be divided into unsupervised learning and supervised learning. Supervised learning is the more common and valuable approach in industry, and it is the focus of the rest of this article. As shown in the following figure, supervised machine learning involves two processes when solving practical problems. One is the offline training process (blue arrows), which includes data filtering and cleaning, feature extraction, model training, and model optimization. The other is the online application process (green arrows), which extracts features from the data to be estimated, applies the model obtained from offline training, and produces the estimated value used in the actual product. Of the two, offline training is the more technically challenging (much of the online prediction work can reuse the offline training work), so the rest of this article focuses on the offline training process.
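To make the two processes concrete, here is a minimal sketch, assuming a scikit-learn style workflow (the library, feature shapes and file name are illustrative, not the article's setup): the offline side trains and persists a model; the online side loads it and scores new data.

```python
# Minimal sketch of the offline-training / online-prediction split described above.
# Library choice (scikit-learn, joblib) and feature names are illustrative assumptions.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# ---- Offline training process ----
X_train = np.random.rand(1000, 5)             # cleaned, feature-extracted training data
y_train = (X_train[:, 0] > 0.5).astype(int)   # label, e.g. purchased / not purchased

model = LogisticRegression()
model.fit(X_train, y_train)                   # model training
joblib.dump(model, "deal_model.pkl")          # persist the model for the online side

# ---- Online prediction process ----
model = joblib.load("deal_model.pkl")         # reuse the offline artifact
x_new = np.random.rand(1, 5)                  # features extracted for a new DEAL
print(model.predict_proba(x_new)[:, 1])       # estimated value used by the product
```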

# What is a model?

A model is an important concept in machine learning. Simply put, it is a mapping from the feature space to the output space. It is generally composed of the model's hypothesis function and its parameters w (the formula referred to here is an expression of the Logistic Regression model, explained in detail in the training-model section). A model's hypothesis space refers to the set of all mappings the model can express over the possible values of w. The models commonly used in industry are Logistic Regression (LR), Gradient Boosting Decision Tree (GBDT), Support Vector Machine (SVM), Deep Neural Network (DNN), and so on.

Model training means using the training data to find a set of parameters w that optimizes a specific objective, that is, to obtain the best mapping from the feature space to the output space. See the training-model section for how this is done.

# Why use machine learning to solve problems?

At present, in the era of big data, terabytes to petabytes of data are everywhere, and simple rule-based processing cannot fully exploit the value of such data.

Cheap high-performance computing reduces the time and cost of learning from large-scale data.

Cheap large-scale storage makes it possible to store and process large-scale data faster and at lower cost.

There are many high-value problems, so the considerable effort of solving them with machine learning yields large returns.

# What problems should machine learning solve?

The target problem needs to be of high value, because solving a problem with machine learning has a certain cost.

A large amount of data should be available for the target problem; abundant data lets machine learning solve the problem better than simple rules or manual work.

The target problem should be determined by many factors (features), so that machine learning's advantage over simple rules or manual work can be realized.

The target problem should require continuous optimization, because machine learning can keep learning and iterating from data.

# Model the problem

Taking the estimation of DEAL (group-purchase deal) transaction volume as an example, this article shows how to use machine learning to solve the problem (that is, to estimate how much a given DEAL will sell within a period of time). First of all, you need to:

Collect information about the problem, understand the problem, and become an expert on the problem.

Break the problem down and simplify it, turning it into something a machine can predict.

After an in-depth understanding and analysis of DEAL transaction volume, it can be broken down into several sub-problems, as shown in the following figure:

# Single model? Multiple models? How to choose?

After the decomposition above, there are two possible ways to estimate DEAL transaction volume: one is to estimate the transaction volume directly; the other is to estimate each sub-problem, for example by building a visitor-count model and a visit-to-purchase rate model (how many orders users who visit the DEAL will place), and then compute the transaction volume from the estimates of these sub-problems. A minimal sketch of the second approach follows.
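Here is a minimal sketch of the multi-model approach, assuming two sub-models (a visitor-count regressor and a visit-to-purchase rate classifier) fused multiplicatively; the model types, synthetic data and the simple product fusion are illustrative assumptions, not the article's exact design.

```python
# Minimal sketch of the multi-model approach: combine sub-model estimates
# into a transaction-volume estimate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((500, 8))                         # DEAL features
visitors = (200 * X[:, 0] + 50).astype(int)      # observed visitor counts
purchased = (X[:, 1] > 0.5).astype(int)          # did a visit convert into a purchase?

visitor_model = GradientBoostingRegressor().fit(X, visitors)   # sub-model 1: visitor count
rate_model = LogisticRegression().fit(X, purchased)            # sub-model 2: visit-to-purchase rate

x_new = rng.random((1, 8))
est_visitors = visitor_model.predict(x_new)[0]
est_rate = rate_model.predict_proba(x_new)[0, 1]
est_volume = est_visitors * est_rate             # fuse the sub-estimates into transaction volume
print(round(est_volume, 1))
```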

Different methods have different advantages and disadvantages, as follows:

Which approach should you choose?

1) The predictability and difficulty of the problem: if the problem is hard to estimate directly, consider using multiple models.

2) The importance of the problem: if the problem is very important, consider using multiple models.

3) Whether the relationships among the sub-models are clear: if they are, multiple models can be used.

If multiple models are used, how should they be combined?

Linear fusion or more complex fusion can be used, depending on the characteristics and requirements of the problem. For the example in this article, there are at least the following two schemes:

# Model selection

For the DEAL transaction-volume problem, we consider direct estimation to be very difficult, so we prefer to break it into sub-problems, i.e. the multi-model approach. That requires building a visitor-count model and a visit-to-purchase rate model. Because machine learning solves these sub-problems in a similar way, the rest of the article uses only the visit-to-purchase rate model as the example. To solve the visit-to-purchase rate problem, we must first choose a model, with the following considerations:

Main considerations

1) Choose a model consistent with the business objective.

2) Choose a model consistent with the available training data and features:

With little training data and many High Level features, use a "complex" nonlinear model (the popular GBDT, Random Forest, etc.).

With a large amount of training data and many Low Level features, use a "simple" linear model (the popular LR, Linear-SVM, etc.).

Supplementary considerations

1) Whether the model is widely used in industry.

2) Whether there is a relatively mature open-source toolkit for the model (inside or outside the company).

3) Whether the amount of data the toolkit can handle meets the requirements.

4) Whether we understand the model's theory and have used it to solve problems before.

To select a model for a real problem, you need to translate the business objective into a model evaluation objective, and the evaluation objective into a model optimization objective; then choose an appropriate model according to the business objective. The specific relationships are as follows:

Generally speaking, predicting the target's real value (regression), its relative order (ranking), and its correct interval (classification) go from most to least difficult, so choose the least difficult target that still meets the application's needs. For the visit-to-purchase rate application we need at least the relative order or the real value, so we can choose Area Under Curve (AUC) or Mean Absolute Error (MAE) as the evaluation objective, and maximum likelihood as the model loss function (i.e. the optimization objective). In summary, we chose the Spark version of GBDT or LR based on the following considerations (a small evaluation sketch follows the list):

1) It can solve the ranking or regression problem.

2) We have implemented the algorithm ourselves and use it often with good results.

3) It supports massive data.

4) It is widely used in industry.
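To make the evaluation objectives concrete, here is a minimal sketch, assuming scikit-learn metrics (the article itself mentions a Spark implementation) and made-up estimates:

```python
# Minimal sketch of the two evaluation objectives mentioned above
# (AUC for ranking quality, MAE for regression error). Data is made up.
from sklearn.metrics import roc_auc_score, mean_absolute_error

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual purchase outcomes
y_score = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # model-estimated visit-to-purchase rates

print("AUC:", roc_auc_score(y_true, y_score))        # how well the relative order is predicted
print("MAE:", mean_absolute_error(y_true, y_score))  # how close the estimated values are
```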

# Prepare training data

After understanding the problem in depth and selecting a model for it, we need to prepare data. Data is the foundation of solving a problem with machine learning; if the data is not selected correctly, the problem cannot be solved. Extra care and attention are therefore needed when preparing training data:

# Note:

The distribution of the training data should be as consistent as possible with the distribution of the data to be predicted.

The distribution of the training set / test set should be as consistent as possible with the data distribution of the online prediction environment; here "distribution" refers to the joint distribution of (x, y), not just the distribution of y.

The noise in y should be as low as possible; try to eliminate samples whose y is noisy.

Sampling is not mandatory, and it can change the actual data distribution; but if the data is too large to train on, or the positive/negative ratio is severely imbalanced (for example worse than 100:1), sampling is needed, as in the sketch below.
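A minimal downsampling sketch, assuming NumPy arrays and an illustrative target ratio of 10:1 (the helper name and ratio are not from the article):

```python
# Minimal sketch of downsampling the majority class to fix a severe
# positive/negative imbalance (e.g. worse than 100:1).
import numpy as np

def downsample_negatives(X, y, target_ratio=10, seed=0):
    """Keep all positives; randomly keep enough negatives for a pos:neg ratio of 1:target_ratio."""
    rng = np.random.default_rng(seed)
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    keep_neg = rng.choice(neg_idx, size=min(len(neg_idx), target_ratio * len(pos_idx)), replace=False)
    keep = np.concatenate([pos_idx, keep_neg])
    return X[keep], y[keep]

X = np.random.rand(100000, 4)
y = (np.random.rand(100000) < 0.005).astype(int)   # ~0.5% positives, roughly 200:1 imbalance
X_bal, y_bal = downsample_negatives(X, y)
print(y.mean(), y_bal.mean())                      # class balance before and after sampling
```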

# Common problems and solutions

The data of the problem to be solved does not have a single consistent distribution:

1) DEAL data may vary greatly in visit-to-purchase rate; for example, the influencing factors and behavior of food DEALs and hotel DEALs are very different, which requires special treatment: normalize the data in advance, or treat the factors causing the inconsistent distribution as features, or train a separate model for each kind of DEAL.

The data distribution has changed over time:

1) Training a model on data from half a year ago and using it to predict current data may work very poorly, because the data distribution can change over time. Try to train on recent data when predicting current data; historical data can be given lower sample weights, or transfer learning can be used.

The data is noisy:

1) When building a CTR model, treating the Items a user did not see as negative examples is noisy: those Items were not clicked because the user never saw them, not necessarily because the user disliked them. Simple rules can remove such noisy negatives, for example the skip-above idea: only the unclicked Items ranked above the Items the user clicked are taken as negative examples (assuming the user browses Items from top to bottom). A small sketch follows.
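A minimal sketch of the skip-above rule, assuming impressions are available as (item_id, position, clicked) tuples (the data layout and helper name are illustrative):

```python
# Minimal sketch of the skip-above rule: keep as negatives only the unclicked
# Items ranked above the last clicked position (assumed to have been seen).
def skip_above_negatives(impressions):
    """impressions: list of (item_id, position, clicked); position 0 is the top of the page."""
    clicked_positions = [pos for _, pos, clicked in impressions if clicked]
    if not clicked_positions:
        return []                        # no click, so we cannot tell what was actually seen
    last_click = max(clicked_positions)
    return [item for item, pos, clicked in impressions
            if not clicked and pos < last_click]

session = [("A", 0, False), ("B", 1, True), ("C", 2, False), ("D", 3, True), ("E", 4, False)]
print(skip_above_negatives(session))     # ['A', 'C']; 'E' below the last click is dropped
```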

The sampling is biased and does not cover the whole population:

1) In the visit-to-purchase rate problem, if only single-store DEALs are used for training, multi-store DEALs cannot be estimated well. The training data should include both single-store and multi-store DEALs.

2) In a binary classification problem without objective labels, where positive/negative examples are produced by rules, the coverage of positives/negatives may be incomplete. The data should be randomly sampled and manually labeled to ensure that the distribution of the sampled data matches the actual data.

# Training data for the visit-to-purchase rate model

Collect N months of DEAL data (x) and the corresponding visit-to-purchase rate (y).

Collect the most recent N months, excluding unusual periods such as holidays (keep the distribution consistent).

Only collect DEALs with online duration > T and number of visitors > U (reduce noise in y).

Consider the sales life cycle of the DEAL (keep the distribution consistent).

Consider differences across cities, business districts and categories (keep the distribution consistent). A minimal filtering sketch follows.
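A minimal filtering sketch under the rules above, assuming a pandas DataFrame with illustrative column names and thresholds (the file name, columns, T and U are not from the article):

```python
# Minimal sketch of the collection rules above, written as pandas filters.
import pandas as pd

T_DAYS, U_VISITORS = 7, 100      # assumed thresholds for online duration and visitor count

deals = pd.read_csv("deal_history.csv", parse_dates=["online_date"])

recent = deals[deals["online_date"] >= deals["online_date"].max() - pd.DateOffset(months=6)]
recent = recent[~recent["is_holiday_period"]]          # drop unusual periods such as holidays
recent = recent[(recent["online_days"] > T_DAYS) &     # reduce noise in y
                (recent["visitors"] > U_VISITORS)]

X = recent[["city", "district", "category", "price_per_person", "poi_id"]]
y = recent["purchases"] / recent["visitors"]            # visit-to-purchase rate label
```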

# Feature extraction

After data filtering and cleaning are complete, features must be extracted from the data, that is, the conversion from the input space to the feature space (see the figure below). Linear and nonlinear models call for different feature extraction: a linear model needs more feature-extraction work and skill, while a nonlinear model requires relatively less.

In general, features can be divided into High Level and Low Level. High Level features have a more general meaning, while Low Level features have a specific meaning. For example:

DEAL A1 belongs to POI A, has per-capita spending below 50, and a high visit-to-purchase rate.

DEAL A2 belongs to POI A, has per-capita spending above 50, and a high visit-to-purchase rate.

DEAL B1 belongs to POI B, has per-capita spending below 50, and a high visit-to-purchase rate.

DEAL B2 belongs to POI B, has per-capita spending above 50, and the lowest visit-to-purchase rate.

From the data above, two features can be derived: the POI (store) and the per-capita spending. The POI feature is Low Level, and per-capita spending is High Level. The model might then estimate:

If DEAL x belongs to POI A (a Low Level feature), its visit-to-purchase rate is high.

If DEAL x has per-capita spending below 50 (a High Level feature), its visit-to-purchase rate is high.

Overall, Low Level features are more specific: a single feature covers little data (few samples contain it), and the number of features (dimensions) is very large. High Level features are more general: a single feature covers a lot of data, and the number of features (dimensions) is not large. Predictions for long-tail samples are driven mainly by High Level features, while predictions for high-frequency samples are driven mainly by Low Level features. A small encoding sketch follows.
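A small encoding sketch for the example above, assuming pandas: a one-hot POI id plays the Low Level role and a per-capita-spending bucket plays the High Level role (column names are illustrative):

```python
# Minimal sketch of Low Level vs High Level feature encoding: a one-hot POI id
# (Low Level, high-dimensional, sparse) versus a per-capita-spending bucket
# (High Level, low-dimensional, general).
import pandas as pd

deals = pd.DataFrame({
    "deal_id": ["A1", "A2", "B1", "B2"],
    "poi_id": ["POI_A", "POI_A", "POI_B", "POI_B"],
    "price_per_person": [35, 80, 42, 95],
})

low_level = pd.get_dummies(deals["poi_id"], prefix="poi")                       # one column per store
high_level = (deals["price_per_person"] < 50).astype(int).rename("per_capita_below_50")

features = pd.concat([deals["deal_id"], low_level, high_level], axis=1)
print(features)
```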

For the visit-to-purchase rate problem there are a large number of High Level and Low Level features; some of them are shown in the following figure:

Features for a nonlinear model

1) Mainly use High Level features; because the computational complexity is high, the feature dimension should not be too large.

2) The target can be fitted well through nonlinear mappings of High Level features.

Features for a linear model

1) The feature set should be as comprehensive as possible, containing both High Level and Low Level features.

2) High Level features can be converted into Low Level features to improve the model's fitting ability.

# Feature normalization

After feature extraction, if the value ranges of different features differ greatly, it is best to normalize the features to get better results. The common normalization methods are as follows:

Rescaling:

Normalize each feature to [0, 1] or [-1, 1], for example x' = (x - min) / (max - min).

Standardization:

Subtract the mean of the x distribution and divide by its standard deviation, i.e. x' = (x - μ) / σ, where μ is the mean and σ the standard deviation of x.

Scaling to unit length:

Normalize each feature vector to unit length, i.e. x' = x / ||x||. A combined sketch of the three methods follows.
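A combined sketch of the three normalization methods, applied with NumPy to a tiny made-up feature matrix:

```python
# Minimal sketch of the three normalization methods described above,
# applied to a small feature matrix.
import numpy as np

X = np.array([[10.0, 200.0], [20.0, 800.0], [30.0, 500.0]])

rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))    # Rescaling each column to [0, 1]
standardized = (X - X.mean(axis=0)) / X.std(axis=0)                 # Standardization: zero mean, unit std
unit_length = X / np.linalg.norm(X, axis=1, keepdims=True)          # Scaling each sample vector to unit length
print(rescaled, standardized, unit_length, sep="\n\n")
```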

# Feature selection

After feature extraction and normalization, if there are so many features that the model cannot be trained, or overfitting becomes likely, feature selection is needed to keep only the valuable features.

Filter:

Assume the features' effects on the model's estimate are independent of each other. Select a feature subset and analyze its relationship with the data's labels; if there is a clear correlation, the subset is considered valuable. There are many measures of the relationship between a feature subset and the labels, such as Chi-square and Information Gain.

Wrapper:

Add a candidate feature subset to the original feature set, train the model, and compare the effect before and after adding it. If the effect improves, the subset is considered useful; otherwise it is not.

Embedded:

Combine feature selection with model training, for example by adding an L1 or L2 norm penalty to the loss function. A small sketch of the Filter and Embedded approaches follows.
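A small sketch of the Filter and Embedded approaches, assuming scikit-learn (the chi-square scorer, k, and the L1 penalty strength are illustrative choices):

```python
# Minimal sketch of the Filter and Embedded approaches: a chi-square filter keeps
# the top-k features, and L1-regularized LR drives the weights of uninformative
# features toward zero. Data here is synthetic.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((300, 10))                       # 10 candidate features (non-negative, as chi2 requires)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)       # label depends only on the first two features

# Filter: score each feature against the label, keep the best 3
X_filtered = SelectKBest(chi2, k=3).fit_transform(X, y)

# Embedded: L1 penalty inside the training objective
l1_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print(X_filtered.shape, np.round(l1_lr.coef_, 2))
```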

# Training the model

With feature extraction and processing complete, we can start training the model. The rest of the article briefly introduces the process using the simple and commonly used Logistic Regression model (LR model) as the example.

There are m training examples, where x is the feature vector and y is the label.

w is the parameter vector of the model, i.e. what model training needs to learn.

Training a model means choosing a hypothesis function and a loss function, then repeatedly adjusting w on the training data (x, y) until the loss function is optimized; the resulting w is the learned result and, together with the hypothesis function, gives the final model.

# Model function

1) Hypothesis function: the assumed functional relationship between x and y:

2) Loss function: based on the hypothesis function above, the model's loss function (the optimization objective) is constructed; for LR it is usually the maximum likelihood of the training pairs (x, y):
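The original formulas are not reproduced in this copy, so below is a standard reconstruction of the usual LR hypothesis function and negative log-likelihood loss matching the description above (an assumption about what the missing formulas showed):

```latex
% A standard reconstruction of the formulas referred to above (the original
% images are unavailable); this is the usual Logistic Regression formulation.

% Hypothesis function: the assumed probability that y = 1 given features x
h_w(x) = \frac{1}{1 + e^{-w^{T} x}}

% Loss function: the negative log-likelihood of the m training pairs (x_i, y_i);
% minimizing it is equivalent to the maximum-likelihood objective in the text
L(w) = -\sum_{i=1}^{m} \left[ y_i \log h_w(x_i) + (1 - y_i) \log\big(1 - h_w(x_i)\big) \right]
```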

# Optimization algorithm

Gradient descent

That is, w is adjusted along the negative-gradient direction of the loss function (see the diagram below); the gradient is the first derivative. There are several variants of gradient descent, such as stochastic gradient descent and batch gradient descent.

Stochastic gradient descent (Stochastic Gradient Descent): at each step, randomly pick one sample, compute its gradient, and update w.

Batch gradient descent (Batch Gradient Descent): at each step, compute the gradient over all samples in the training data and move w along that direction. A minimal sketch of both variants follows.
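A minimal sketch of both variants for the LR loss above, assuming NumPy and illustrative step sizes and iteration counts:

```python
# Minimal sketch of batch vs. stochastic gradient descent for the LR
# negative log-likelihood. Learning rates, step counts and data are made up.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

def batch_gd(X, y, lr=0.1, steps=200):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y)   # gradient over all samples
        w -= lr * grad                               # move along the negative gradient
    return w

def sgd(X, y, lr=0.1, steps=5000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))                     # one randomly chosen sample per step
        grad = (sigmoid(X[i] @ w) - y[i]) * X[i]
        w -= lr * grad
    return w

print(batch_gd(X, y), sgd(X, y))
```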

Newton's method

The basic idea of Newton's method is to take a second-order Taylor expansion of the objective function around the current point and use it to estimate the minimum. Intuitively, a tangent is drawn at wk, and its intersection with L'(w) = 0 gives the next iterate wk+1 (see the diagram below). The update formula for w is as follows, where the matrix of second partial derivatives of the objective function is the well-known Hessian matrix.

Quasi-Newton methods (Quasi-Newton Methods): computing the second partial derivatives of the objective function is difficult, and worse, the Hessian may not be positive definite. Quasi-Newton methods construct a positive definite symmetric matrix that approximates the inverse of the Hessian without computing second derivatives, and optimize the objective under this "quasi-Newton" condition.

BFGS: uses the BFGS formula to approximate H(w); it keeps H(w) in memory, which requires O(m²) storage.

L-BFGS: stores only a limited number (say k) of update vectors and uses them to reconstruct H(w) when needed, reducing memory to the O(m) level.

OWLQN: when L1 regularization is added to the objective function, the objective is no longer differentiable everywhere; OWLQN introduces a virtual (pseudo) gradient to handle this. A small sketch using an off-the-shelf quasi-Newton optimizer follows.
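A minimal sketch that hands the same LR loss and gradient to an off-the-shelf quasi-Newton optimizer (SciPy's L-BFGS-B), as an illustration of this family of methods; the data and starting point are made up:

```python
# Minimal sketch of optimizing the LR loss with an off-the-shelf quasi-Newton
# optimizer (SciPy's L-BFGS-B) instead of hand-written gradient descent.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)

def loss(w):
    z = X @ w
    # numerically stable negative log-likelihood of logistic regression
    return np.sum(np.logaddexp(0.0, z) - y * z)

def grad(w):
    return X.T @ (1.0 / (1.0 + np.exp(-(X @ w))) - y)

result = minimize(loss, x0=np.zeros(X.shape[1]), jac=grad, method="L-BFGS-B")
print(result.x)     # learned parameter vector w
```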

Coordinate Descent

For each iteration, all but one of w's dimensions are held fixed, and a one-dimensional search along the remaining dimension determines the optimal descent step in that coordinate (see the diagram below); the coordinates are then cycled through. A minimal sketch follows:
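A minimal coordinate-descent sketch on the same LR loss, assuming SciPy's one-dimensional minimizer for the per-coordinate search (the sweep count is illustrative):

```python
# Minimal sketch of coordinate descent on the LR loss: optimize one coordinate
# of w at a time with a 1-D search, holding the others fixed.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X @ np.array([1.0, -1.5, 0.5]) > 0).astype(float)

def loss(w):
    z = X @ w
    return np.sum(np.logaddexp(0.0, z) - y * z)     # LR negative log-likelihood

w = np.zeros(X.shape[1])
for sweep in range(20):                             # a few full passes over the coordinates
    for j in range(len(w)):
        def loss_along_j(wj, j=j):
            w_try = w.copy()
            w_try[j] = wj                           # vary only coordinate j
            return loss(w_try)
        w[j] = minimize_scalar(loss_along_j).x      # 1-D search for the best value of w_j
print(w)
```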

# Optimizing the model

After the data screening and cleaning, feature design and selection, and model training described above, we obtain a model. But what if its performance is poor? What then?

[First]

Re-examine whether the target is predictable, and whether the data and features contain bugs.

[Then]

Analyze whether the model is Overfitting or Underfitting, and optimize from the angles of data, features and model.

# Underfitting & Overfitting

Underfitting means the model has not learned the internal relationships in the data; as shown on the left of the figure below, the learned classification surface cannot separate the X and O data well. The underlying cause is that the model's hypothesis space is too small, or the hypothesis space is biased.

Overfitting means the model fits the internal relationships of the training data too closely; as shown on the right of the figure below, the classification surface separates the X and O training data "too well", while the true classification surface may look nothing like it, so the model performs poorly on non-training data. The underlying cause is the mismatch between a huge model hypothesis space and sparse data.

In practice, you can determine whether the current model is Underfitting or Overfitting from its performance on the training set versus the test set; the judgment is sketched below:
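A minimal diagnostic sketch, assuming AUC as the metric and illustrative thresholds for "large gap" and "low score" (the helper and thresholds are not from the article):

```python
# Minimal sketch of the train/test diagnosis: a good training score but a much
# worse test score suggests Overfitting; poor scores on both suggest Underfitting.
from sklearn.metrics import roc_auc_score

def diagnose(model, X_train, y_train, X_test, y_test, gap=0.05, floor=0.7):
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    if train_auc < floor and test_auc < floor:
        verdict = "Underfitting: the model has not captured the data's structure"
    elif train_auc - test_auc > gap:
        verdict = "Overfitting: the model memorizes the training set"
    else:
        verdict = "Reasonable fit"
    return train_auc, test_auc, verdict
```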

# How to solve Underfitting and Overfitting problems?

That is the end of "how to perform data mining on a website". Thank you for reading. If you want to learn more about the industry, you can follow this site, where the editor will keep publishing practical articles for you!
