Understanding the Classic Recommender-System Model Wide & Deep


This article introduces Wide & Deep, a classic recommender-system model: the motivation behind it, how its wide and deep parts work, how Google implemented and evaluated it on Google Play, and a simple PyTorch implementation at the end.

Abstract

In large-scale feature settings, the common practice (before 2016) was to feed nonlinear features into a linear model, so the input becomes an extremely sparse vector. The desired nonlinearity can be achieved through feature transformations and feature crosses, but doing so takes a great deal of manual effort.

We actually touched on this problem when introducing the FM model, which solves the same problem by a different route: it introduces an n x k parameter matrix V to compute the weight of every pairwise feature cross, reducing the number of parameters and making training and prediction more efficient. The Wide & Deep paper instead attacks the problem with neural networks.

The core of the solution is the embedding. Literally, "embedding" means to embed, which is not very intuitive; in plain terms, an embedding is a vector representation of a feature. In Word2Vec, for instance, each word is represented as a vector, and those vectors are called word embeddings. An embedding has a fixed length, and its values are typically learned by a neural network.
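
As a minimal, hypothetical PyTorch illustration (not from the paper), an embedding table simply maps a sparse categorical id to a dense learned vector:

import torch
from torch import nn

# A vocabulary of 1000 categories, each mapped to a 16-dimensional learned vector
emb = nn.Embedding(num_embeddings=1000, embedding_dim=16)

ids = torch.tensor([3, 42, 999])  # sparse categorical ids
vectors = emb(ids)                # dense embeddings, shape (3, 16)
print(vectors.shape)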

In the same way we can train embeddings for features inside a neural network, which drastically reduces the feature-engineering workload. Embeddings alone are not enough, though, because they can overfit in some scenarios; we therefore combine linear, sparse features with them, so the model avoids overfitting while retaining enough capacity to learn good results.

Introduction

As shared in previous articles, a recommendation system can be viewed as a search ranking system: the input is the user's information plus the context being browsed, and the output is an ordered sequence of items.

Because of this, recommendation systems face a challenge similar to search ranking: the tradeoff between memorization and generalization. Memorization can be understood as learning frequent co-occurrences of items or features; since users' historical behavior is a strong signal, memorization brings good results. The typical downside, however, is that the model's generalization ability falls short.

Generalization, in turn, comes mainly from correlation and transitivity between features. Feature A and feature B may each correlate directly with the label, or feature A may correlate with feature B, which in turn correlates with the label; the latter is transitivity. By exploiting transitivity we can explore feature combinations that rarely appear in the historical data and thereby gain strong generalization.

Linear models such as LR are widely used in large-scale online recommendation and ranking systems because they are simple, scalable, powerful enough, and easy to interpret. They are usually trained on binarized one-hot features: for instance, if the user has installed Netflix, the feature user_installed_app=netflix is 1, otherwise 0. Second-order features built this way are likewise highly interpretable.

For example, if the user was also shown Pandora, the cross feature AND(user_installed_app=netflix, impression_app=pandora) is 1, and the weight learned for it is effectively the correlation between the two. Such features, however, demand heavy manual work, and because samples are sparse, the model cannot learn weights for combinations that never appear in the training data.
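
As a toy sketch (the feature names mirror the example above; the dict format is purely illustrative), such a cross feature is just the product of two binary indicators:

def cross(features, a, b):
    # 1 only when both one-hot indicator features fire at the same time
    return features.get(a, 0) * features.get(b, 0)

sample = {"user_installed_app=netflix": 1, "impression_app=pandora": 1}
print(cross(sample, "user_installed_app=netflix", "impression_app=pandora"))  # 1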

Embedding-based models, such as the FM model introduced earlier or deep neural networks, can solve this: they learn low-dimensional embeddings and use the embedding vectors to compute cross-feature weights. But when features are very sparse, good embeddings are hard to guarantee. For example, when user preferences are narrow or items are niche, most query-item pairs have no interaction at all, yet the weight computed from embeddings may still be nonzero, leading to overfitting and inaccurate recommendations. In this special case, a linear model fits and generalizes better.

In this article we introduce the Wide & Deep model, which accommodates both memorization and generalization in a single model by training a linear model and a neural network at the same time, achieving better results than either alone.

The main contents of this paper are as follows:

The Wide & Deep model, which jointly trains a feedforward neural network with embeddings and a linear model with feature transformations, applied to generalized recommender systems

The implementation and evaluation of Wide & Deep in the Google Play setting; Google Play is a mobile app store with more than one billion daily active users and over one million apps

Overview of recommendation system

Below is the classic architecture diagram of a recommendation system, taken from the paper:

When a user visits the app store, a request carrying the user's and the context's features is generated. The recommendation system returns a list of apps, all filtered by the models as ones the user may click or install. Whatever the user then does, browsing (no action), clicking, or purchasing, is recorded in the logs and becomes new training data.

Look first at the upper half, from Database to Retrieval. The database holds a huge number of apps, in the millions, so it is impossible to score every app with the model and rank them within the latency budget (on the order of 10 milliseconds). The request therefore first passes through Retrieval, i.e., candidate recall. Recall can rely on machine-learning models or on rules; typically a rule-based quick filter runs first, followed by a machine-learning model.

After retrieval and filtering, the Wide & Deep model is called to estimate the CTR of each candidate, and the apps are ranked by predicted CTR. This article skips the other engineering details and focuses solely on the Wide & Deep model itself.

Wide & Deep principle

First, let's take a look at the structure diagrams of commonly used models in the industry:

This figure, taken from the paper, shows the Wide model, the Wide & Deep model, and the Deep model from left to right. As it makes clear, the so-called Wide model is simply a linear model, while the Deep model is a deep neural network. Both parts are described in detail below with reference to this figure.

Wide part

The Wide part is a generalized linear model, shown on the left side of the figure above:

$$y = \mathbf{w}^T \mathbf{x} + b$$

Here $y$ is the prediction we want, $\mathbf{x} = [x_1, x_2, \ldots, x_d]$ is a $d$-dimensional feature vector where $d$ is the number of features, $\mathbf{w} = [w_1, w_2, \ldots, w_d]$ is a $d$-dimensional weight vector, and $b$ is the bias. All of these were introduced with the linear regression model in earlier articles, so they should be familiar.

The features come from two sources: raw features taken directly from the data, and features produced by feature transformations. The most important transformation is the cross-product transformation, defined as follows:

$$\phi_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \qquad c_{ki} \in \{0, 1\}$$

Here $c_{ki}$ is a boolean variable indicating whether the $i$-th feature takes part in the $k$-th transformation $\phi_k$. Because a product is used, the result is 1 only when every participating feature is 1, and 0 otherwise. For example, AND(gender=female, language=en) is a cross feature whose value is 1 only when the user's gender is female and the language used is English. In this way we capture interactions between features and add nonlinearity to the linear model.
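
A minimal sketch of this transformation in code (the feature layout is illustrative, not the paper's):

import numpy as np

def cross_product_transform(x, c_k):
    # phi_k(x) = prod_i x_i ** c_ki: 1 only if every selected binary feature is 1
    return int(np.all(x[c_k == 1] == 1))

x = np.array([1, 0, 1])    # e.g. [gender=female, language=es, language=en]
c_k = np.array([1, 0, 1])  # mask for AND(gender=female, language=en)
print(cross_product_transform(x, c_k))  # 1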

Deep part

The Deep part is a feedforward neural network, which is the right part of the image above.

Looking at the figure closely reveals several details. The input is a sparse feature vector, which can be understood as a multi-hot array. At the first layer of the network this input is converted into low-dimensional embeddings, which the rest of the network then trains. This module mainly handles categorical features such as item category and user gender.

Compared with traditional one-hot encoding, the embedding approach represents a discrete variable as a dense vector, which is far more expressive; and since the vector's values are left for the model to learn, generalization improves greatly as well. This is now common practice in deep neural networks.
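
A hedged sketch of such a deep part, assuming one categorical feature plus a few dense features (all dimensions here are illustrative, not the paper's):

import torch
from torch import nn

class DeepPart(nn.Module):
    def __init__(self, num_categories=100, emb_dim=8, dense_dim=4):
        super().__init__()
        # Discrete ids become dense, learnable vectors instead of one-hot columns
        self.emb = nn.Embedding(num_categories, emb_dim)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + dense_dim, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
        )

    def forward(self, cat_ids, dense):
        # Look up the embeddings and learn them jointly with the dense features
        h = torch.cat([self.emb(cat_ids), dense], dim=1)
        return self.mlp(h)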

Merging Wide & Deep

Once the Wide part and the Deep part are both in place, their outputs are merged by a weighted sum; this is the middle of the figure above.

Just before the top-level output sits a sigmoid layer (or a linear layer), fed by a simple weighted accumulation of the two parts. In English this is called joint training, and the paper spells out how joint training differs from an ensemble: the parts of an ensemble are trained independently and their parameters never affect one another, whereas the parts of a joint model are trained together, with all parameters updated at the same time.

The consequence is that, because each part of an ensemble is trained separately, each sub-model needs a very large parameter space of its own to reach good results. Joint training avoids this: the linear part and the deep part compensate for each other's weaknesses, achieving better results without artificially inflating the number of trained parameters.
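
Concretely, for the logistic-loss case the paper writes the joint prediction as:

$$P(Y=1 \mid \mathbf{x}) = \sigma\left(\mathbf{w}_{wide}^{T}\,[\mathbf{x}, \phi(\mathbf{x})] + \mathbf{w}_{deep}^{T}\,a^{(l_f)} + b\right)$$

where $\sigma$ is the sigmoid function, $\phi(\mathbf{x})$ are the cross-product transformations of $\mathbf{x}$, $a^{(l_f)}$ is the final activation of the deep network, and $b$ is the bias term.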

System implementation

The data flow of the app recommender consists of three stages: data generation, model training, and model serving, as the following figure shows:

Data generation

In the data generation stage, each time an app is exposed to a user within a time window, that impression becomes a sample. If the user clicks the app and installs it, the sample is labeled 1; otherwise it is labeled 0. This is also the standard practice in most recommendation scenarios.

At this stage the system also builds lookup tables that convert string categorical features into integer ids: for example, the entertainment category maps to 1 and the photography category to 2; likewise paid maps to 0 and free to 1, and so on. Numeric features are normalized and scaled into the range [0, 1].
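
A minimal sketch of this preprocessing step (the vocabularies and value ranges are made up for illustration):

# Hypothetical lookup tables mapping string categories to integer ids
category_vocab = {"entertainment": 1, "photography": 2}
price_vocab = {"paid": 0, "free": 1}

def min_max_scale(value, lo, hi):
    # Scale a numeric feature into the range [0, 1]
    return (value - lo) / (hi - lo)

sample = {"category": "photography", "price": "free", "num_installs": 3500.0}
features = [
    category_vocab[sample["category"]],
    price_vocab[sample["price"]],
    min_max_scale(sample["num_installs"], lo=0.0, hi=10000.0),
]
print(features)  # [2, 1, 0.35]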

Model training

The paper provides a structure diagram of the model:

As the figure shows, on one side are continuous features, such as age and the number of installed apps, and on the other are discrete features, such as device class and installed apps. The discrete features are converted into embeddings and fed into the neural network to be learned jointly with the continuous features. The paper uses 32-dimensional embeddings.

Each training run uses more than 500 billion samples, and the model is retrained whenever new training data are collected. But starting from scratch every time would obviously be slow and waste a great deal of compute, so the paper chooses incremental updating: when the model is refreshed, the old model's parameters are loaded and training continues on the latest data. Before a new model goes online, its quality is verified to confirm nothing is wrong.
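
A hedged PyTorch sketch of this warm-start scheme, reusing the WideAndDeep module defined in the code section at the end of this article (the file names and the elided training loop are illustrative):

import torch

model = WideAndDeep()  # same architecture as the previous run

# Warm start: load the previous model's parameters instead of a random init
model.load_state_dict(torch.load("wide_and_deep_prev.pt"))

# ... continue training on the newly collected data only ...

# After offline validation passes, save the new model for serving
torch.save(model.state_dict(), "wide_and_deep_new.pt")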

Model serving

Once the trained model is loaded, for every request the server fetches a list of candidate apps from the recall system along with the user's features, calls the model to score each app, and then ranks the candidates from the highest score to the lowest.
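
A minimal serving sketch under these assumptions (build_features, which joins the user's features with one candidate app's features into a single row, is hypothetical):

import torch

def rank_candidates(model, user_features, candidates):
    # One feature row per candidate app, scored in a single batch
    rows = torch.stack([build_features(user_features, app) for app in candidates])
    with torch.no_grad():
        scores = model(rows).squeeze(1)  # predicted CTR per candidate
    order = torch.argsort(scores, descending=True)
    return [candidates[i] for i in order.tolist()]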

To keep the server responsive and return results within 10 ms, the paper uses multithreaded concurrent execution. Honestly, I find this number a bit suspect: even with concurrency it is hard for deep-learning inference to reach that level of efficiency, so other optimizations were probably involved that the paper does not fully describe.

Model results

To verify the effectiveness of Wide & Deep, the paper ran a large number of tests in the production environment from two angles: app acquisitions and serving performance.

App acquisitions

A three-week online A/B test was run: one bucket served as the control, using the previous linear model; one bucket used the Wide & Deep model; and another bucket used only the Deep model, with the linear part removed. Each bucket received 1% of traffic, and the final results are as follows:

The Wide & Deep model not only achieved a higher AUC but also increased online app acquisitions by 3.9%.

Serving performance

Serving performance has always been a major concern for recommendation systems: the servers must carry enormous traffic while keeping latency low, and predicting CTR with machine-learning or deep-learning models is computationally expensive. According to the paper, their servers handle 10 million QPS at peak.

Processing one batch on a single thread took 31 milliseconds. To speed this up, they built a multithreaded scoring mechanism that splits each batch into several smaller ones scored concurrently, which cut client-side latency to 14 milliseconds.
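
A hedged sketch of that split-and-score idea with a Python thread pool (the paper gives no code; the shard count and the direct model call are illustrative, and PyTorch ops release the GIL so threads can overlap):

import torch
from concurrent.futures import ThreadPoolExecutor

def score_concurrently(model, batch, num_shards=4):
    # Split one large batch into smaller shards and score them in parallel
    shards = torch.chunk(batch, num_shards, dim=0)
    with torch.no_grad(), ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = list(pool.map(model, shards))
    return torch.cat(results, dim=0)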

Code implementation

Talk is cheap, so here is some code. Wide & Deep performed well in the recommendation field in its day, and the model is not complicated to implement. I once wrote a simple version in PyTorch and post it here as a starting point for reference.

import torch
from torch import nn

class WideAndDeep(nn.Module):
    def __init__(self, dense_dim=13, site_category_dim=24, app_category_dim=32):
        super(WideAndDeep, self).__init__()
        # Linear (wide) part: 13 dense features plus 6-dim fused embedding -> logit
        self.logistic = nn.Linear(19, 1, bias=True)
        # Embedding part: one table per categorical feature, 6 dims each
        self.site_emb = nn.Embedding(site_category_dim, 6)
        self.app_emb = nn.Embedding(app_category_dim, 6)
        # Fusion part: concatenated embeddings (12 dims) -> 6 dims
        self.fusion_layer = nn.Linear(12, 6)

    def forward(self, x):
        # The last two columns of x hold the categorical ids
        site = self.site_emb(x[:, -2].long())
        app = self.app_emb(x[:, -1].long())
        emb = self.fusion_layer(torch.cat((site, app), dim=1))
        # Concatenate the deep output with the dense features (19 dims total)
        return torch.sigmoid(self.logistic(torch.cat((emb, x[:, :-2]), dim=1)))
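
A quick smoke test of the module above, feeding a random batch of 13 dense columns plus the 2 categorical id columns (the values are illustrative):

model = WideAndDeep()

dense = torch.rand(4, 13)                        # 13 continuous features
site_ids = torch.randint(0, 24, (4, 1)).float()  # site category ids
app_ids = torch.randint(0, 32, (4, 1)).float()   # app category ids

x = torch.cat((dense, site_ids, app_ids), dim=1)  # shape (4, 15)
print(model(x).shape)  # torch.Size([4, 1]) -- predicted CTRs in (0, 1)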

My application scenario at the time was fairly simple, so the network has only three layers, but the principle is the same: to apply it to more complex scenarios, simply add more features and more layers.

This concludes our study of the classic Wide & Deep model. I hope it has cleared up your doubts; theory works best when paired with practice, so go and give it a try!
