The Application of Deep Learning in Ranking on the Meituan Dianping Recommendation Platform
Pan Hui
As the largest life-service platform in China, Meituan Dianping has seen its user and merchant numbers grow rapidly along with its business. Against this backdrop, building a recommendation system with high accuracy and high richness that delights users through continuous optimization of the recommendation algorithm has become a top priority. To achieve this goal, we keep trying to introduce new algorithms and technologies into the existing framework.
This article is forwarded, with authorization, from the official WeChat account of the Meituan Dianping technical team.
Introduction
Since the 2012 ImageNet competition, deep learning has become the most closely watched technology in machine learning and artificial intelligence. Before deep learning emerged, people used algorithms such as SIFT and HOG to extract discriminative features and then combined them with machine learning algorithms such as SVM for image recognition. However, the features extracted by algorithms like SIFT are limited, and the error rate of the best competition result at that time was above 26%. The debut of convolutional neural networks (CNNs) brought the error rate down from 26% to 15%. A later paper from a Microsoft team showed that deep learning could reduce the error rate on the ImageNet 2012 dataset to 4.94%.
In the following years, deep learning made remarkable progress in many application fields, such as speech recognition, image recognition, and natural language processing. Seeing its potential, the major Internet companies have invested resources in both research and application, because people realize that in the era of big data, more complex and more powerful deep models can reveal the rich information hidden in massive data and make more accurate predictions about future or unknown events.
As an Internet company committed to staying at the forefront of technology, Meituan Dianping has also explored deep learning. In natural language processing we apply it to text analysis, semantic matching, and search-engine ranking models; in computer vision we apply it to text recognition, image classification, image quality ranking, and so on. Drawing on the Wide & Deep Learning idea proposed by Google in 2016, this article shares our team's thinking and practical experience in applying it to the Dianping recommendation system, adapted to the characteristics of our own business.
Overview of the Dianping recommendation system
Unlike most recommendation systems, the Meituan Dianping scenario involves highly diverse businesses, which makes it hard to accurately capture users' points of interest or real-time intent, and the scenarios we recommend for also change with the user's interests, location, environment, time, and so on. The Dianping recommendation system mainly faces the following challenges:
Diversity of business forms: in addition to recommending merchants, we make real-time judgments according to the scenario and surface different forms of business, such as deal lists, hotels, scenic spots, free trial meals ("Ba Wang Can"), and so on.
Diversity of user consumption scenarios: users may consume at home (takeout), in store (group deals and coupons), or while traveling (hotel bookings), and so on.
To address these problems, we built a complete recommendation system framework, covering machine-learning-based multi-strategy recall and ranking as well as a recommendation engine spanning from offline computation over massive data to highly concurrent online serving. The strategy side of the recommendation system is divided into two stages, recall and ranking: recall is responsible for generating the candidate sets for recommendation, and ranking is responsible for the personalized ordering of the results from multiple algorithm strategies.
Recall layer: we make real-time judgments based on user behavior, scenario, and so on, and recall different candidate sets through multiple recall strategies; the recalled candidate sets are then fused. The candidate-set fusion and filtering layer has two functions: one is to improve the coverage and accuracy of the recommendation strategies, the other is to take on some filtering duties, applying rules formulated from a product and operations perspective to filter out unqualified Items. Below are some of the recall strategies we use most often (a small sketch of the collaborative-filtering variants follows the list):
User-based collaborative filtering: find the N users most similar to the current user X, and estimate X's score for an Item from those N users' scores. For the similarity measure we use the Jaccard similarity, J(X, Y) = |X ∩ Y| / |X ∪ Y|, computed over the sets of Items the two users have interacted with.
Model-based collaborative filtering: use a set of latent factors to connect users and items. Each user and each item is represented by a vector, and user u's rating of item i is obtained as the inner product of the two vectors. The key of the algorithm is to estimate the latent-factor vectors of users and items from the users' known behavior toward items.
Item-based collaborative filtering: we first use word2vec to obtain a latent-space vector for each Item, then use cosine similarity to compute the similarity between each Item that user u has used and each unused Item i, and finally recall the top-N results.
Query-based: strategies triggered by inferring the user's intent from the real-time information carried in the query (such as geographic location, in-store WiFi, keyword search, navigation search, etc.).
Location-based: the location of a mobile device changes frequently, and different geographic locations reflect different user scenarios, which can be fully exploited in specific businesses. For candidate-set recall, we also trigger the corresponding strategies based on the user's real-time location, workplace, place of residence, and other geographic information.
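To make the three collaborative-filtering strategies above concrete, here is a minimal, self-contained Python sketch; the toy data structures (user_items, the latent-factor dictionaries U and V, item_vectors) and the helper names are illustrative assumptions rather than our production code. In practice the item vectors come from word2vec trained on user behavior sequences, and the latent factors from matrix factorization on the user-item behavior matrix.

```python
import numpy as np

# --- User-based CF with Jaccard similarity --------------------------------
# user_items: user id -> set of item ids the user has interacted with (toy data)
user_items = {"u1": {"a", "b", "c"}, "u2": {"b", "c", "d"}, "u3": {"x", "y"}}

def jaccard(s1, s2):
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(s1 & s2) / len(s1 | s2) if (s1 | s2) else 0.0

def user_based_score(user, item, k=2):
    """Score an item for a user from the k most similar users who saw the item."""
    sims = sorted(((jaccard(user_items[user], items), u)
                   for u, items in user_items.items() if u != user), reverse=True)[:k]
    return sum(sim for sim, u in sims if item in user_items[u])

# --- Model-based CF: inner product of latent factors ----------------------
# U and V would normally be learned by matrix factorization on the rating matrix.
U = {"u1": np.array([0.3, 0.8])}          # user latent vectors (assumed)
V = {"d": np.array([0.5, 0.4])}           # item latent vectors (assumed)

def mf_score(user, item):
    return float(U[user] @ V[item])

# --- Item-based CF: cosine similarity on word2vec-style item vectors ------
item_vectors = {"a": np.array([1.0, 0.0]), "d": np.array([0.9, 0.1])}  # assumed

def cosine(v1, v2):
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def item_based_recall(user, candidates, top_n=10):
    """Rank unseen candidate items by their max similarity to the user's items."""
    seen = [item_vectors[i] for i in user_items[user] if i in item_vectors]
    scored = [(max(cosine(v, item_vectors[c]) for v in seen), c)
              for c in candidates if c not in user_items[user]]
    return [c for _, c in sorted(scored, reverse=True)[:top_n]]

print(user_based_score("u1", "d"), mf_score("u1", "d"), item_based_recall("u1", ["d"]))
```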
Ranking layer: each recall strategy returns some results, and after deduplication these results need to be ranked in a unified way. The recommendation ranking framework can be roughly divided into three parts:
Offline computing layer: mainly comprises the algorithm collection and the algorithm engine, and is responsible for data integration, feature extraction, model training, and offline evaluation.
Near-line real-time data stream: its main purpose is to subscribe to and process the behavior streams of different users, using various data-processing tools to clean the raw logs, turn them into formatted data, and land them in different types of storage systems for downstream algorithms and models.
Online real-time scoring: extract the relevant features according to the user's scenario, and use multiple machine learning algorithms to fuse and re-rank the results of the multi-strategy recall.
The overall recommendation flow is shown in the figure below:
From the perspective of the overall framework, every time a user makes a request, the system writes the data of the current request to the log, and various data-processing tools clean and format the raw logs and land them in different types of storage systems. For training, we use feature engineering to select training and test sample sets from the processed data and use them to train and evaluate offline models. We try multiple machine learning algorithms and evaluate their performance with offline metrics such as AUC, NDCG, and Precision. If, after training and evaluation, an offline model shows a significant improvement on the test set, it goes online for an online A/B test. We also have multi-dimensional reports that provide data support for the models.
Application of deep learning in the Dianping recommendation ranking system
For the candidate sets produced by the different recall strategies, deciding the position of the Items generated by an algorithm purely from that algorithm's historical performance is a bit simplistic, and within each algorithm the order of different Items is determined by just one or a few factors. Such ordering can only serve as a first-pass pre-selection; the final ranking needs machine learning, using a ranking model that integrates many factors.
1. The existing ranking framework
So far, the Dianping recommendation ranking system has tried a variety of linear, nonlinear, and hybrid machine learning methods, such as logistic regression, GBDT, and GBDT+LR. Online experiments showed that, compared with the linear model, traditional nonlinear models such as GBDT did not significantly improve CTR in online A/B tests, while linear models such as logistic regression, with their weak nonlinear expressiveness, cannot distinguish the nonlinear scenarios of real life and tend to over-memorize items that have appeared in the historical data. The figure below shows the linear model, relying on memorization, ranking some historically clicked merchants at the very top of the list:
As the picture shows, the system recommends some far-away merchants at very prominent positions: these merchants have been clicked by the user before and themselves have high click-through rates, so the system easily recommends them again. However, such recommendations do not surface novel Items that fit the current scenario. To solve this problem we would need to consider more, and more complex, features, such as combined features in place of the simple "distance" feature; defining and combining features is a costly process and depends heavily on human experience.
Deep neural networks can learn the relationships between Items and features from low-dimensional dense features and, compared with linear models, greatly reduce the need for feature engineering, which is what attracted us to explore and study them.
In practice, following the Wide & Deep Learning model proposed by Google in 2016 and combining it with the needs and characteristics of our own business, we fuse a linear model component with a deep neural network into a wide & deep learning framework that handles memorization and generalization in a single model. The following sections discuss how we perform sample selection, feature processing, the deep learning implementation, and so on.
2. Sample selection
Data and features are the two most important parts of machine learning, because they determine the upper bound of the whole model. Because of our multi-business (takeout, merchants, group deals, hotel and travel, etc.) and multi-scenario (user in store, user at home, remote requests, etc.) characteristics, our sample set is more diverse than that of other products. Our goal is to predict users' click behavior: impressions with clicks are positive samples and impressions without clicks are negative samples, and purchased samples are additionally up-weighted during training. To prevent over-fitting and under-fitting, we keep the ratio of positive to negative samples at about 10%. Finally, we clean the training samples and remove noise samples (samples whose feature values are similar or identical but whose labels are positive and negative respectively), as sketched below.
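As an illustration of the sample construction described above, here is a small pandas sketch; the column names (clicked, purchased), the exact handling of the 10% ratio, and the purchase weight of 2.0 are assumptions chosen for the example, not our actual configuration.

```python
import pandas as pd

def build_training_samples(df: pd.DataFrame, pos_neg_ratio: float = 0.10,
                           purchase_weight: float = 2.0, seed: int = 42) -> pd.DataFrame:
    """Label, de-noise, re-balance, and weight impression logs for CTR training.

    Expects one impression per row with boolean 'clicked' and 'purchased'
    columns plus feature columns (assumed schema).
    """
    df = df.copy()
    df["label"] = df["clicked"].astype(int)          # clicks are positives

    # Drop noise samples: identical feature values with conflicting labels.
    feature_cols = [c for c in df.columns if c not in ("clicked", "purchased", "label")]
    conflicting = df.groupby(feature_cols)["label"].transform("nunique") > 1
    df = df[~conflicting]

    # Down-sample negatives so positives : negatives is roughly pos_neg_ratio.
    pos, neg = df[df["label"] == 1], df[df["label"] == 0]
    n_neg = min(len(neg), int(len(pos) / pos_neg_ratio))
    df = pd.concat([pos, neg.sample(n=n_neg, random_state=seed)])

    # Up-weight purchased impressions during training.
    df["sample_weight"] = 1.0
    df.loc[df["purchased"], "sample_weight"] = purchase_weight
    return df.sample(frac=1.0, random_state=seed)    # shuffle
```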
At the same time, the recommendation business, as the core module of the App home page, has high requirements for novelty and diversity. When implementing the Dianping recommendation system, the first step is to determine the data of the application scenario. Meituan Dianping's data can be divided into the following categories:
User profile: gender, resident city, price preference, Item preferences, and so on.
Item profile: covers merchants, takeout, group deals, and other Items. Merchant features include price, rating, geographic location, and so on; takeout features include average price, delivery time, sales volume, and so on; group-deal features include the applicable number of people, the deal's visit-to-purchase rate, and so on.
Scenario profile: the user's current location and time, nearby business districts, context information about the user's scenario, and so on.
3. Feature processing in deep learning
Another core part of machine learning is feature engineering, including data preprocessing, feature extraction, feature selection, and so on.
1. Feature extraction: the process of constructing new features from raw data. Methods include computing simple statistics of various kinds, principal component analysis, and unsupervised clustering. Once the construction method is decided, it can be turned into an automated data-processing pipeline, but the core of feature construction is still manual.
2. Feature selection: picking out the few useful features from many. Features irrelevant to the learning goal and redundant features need to be removed, and some less important features have to be dropped when computing resources are insufficient or the model's complexity is constrained. Commonly used feature selection methods include filter, wrapper, and embedded approaches.
Both feature selection and feature construction are costly. In the early stage of the recommendation business we did not feel this strongly, but as the business grew, the demands on the click-through-rate (CTR) prediction model kept rising, and the huge investment in feature engineering could no longer meet our needs, so we started looking for a new solution.
Deep learning can automatically combine and transform low-order input features into high-order features, which also pushed us toward exploring it. The advantage of deep learning's "automatic feature extraction" shows up differently in different fields. In image processing, pixels can be fed in as low-order features, and the high-order features produced automatically by the convolutional layers work very well. In natural language processing, however, some semantics come not from the data but from people's prior knowledge, and features built from prior knowledge are very helpful.
We therefore hope to use deep learning to save the huge investment in feature engineering, letting the CTR prediction model and auxiliary models complete feature construction and feature selection automatically while staying aligned with the business goals. Below are some of the feature-processing methods we use with deep learning:
Combined (cross) features
For feature processing we follow methods commonly used in the industry, such as normalization, standardization, and discretization. It is worth mentioning, though, that we introduce many combined features into model training, because crossing different features is very effective and easy to interpret. For example, we combine "whether the merchant is in the user's resident city", "whether the user is currently in their resident city", and "the current distance between the merchant and the user", and then discretize the distance; through such feature crosses we can capture the internal relationships among discrete features and add more nonlinear expressiveness to the linear model. A definition and a small sketch follow.
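In the Wide & Deep paper (reference 1), such combined features are formalized as cross-product transformations, $\phi_k(x) = \prod_i x_i^{c_{ki}}$ with $c_{ki} \in \{0, 1\}$, i.e. the logical AND of the participating binarized features. Below is a minimal Python sketch of one such cross; the bucket boundaries and field names are illustrative assumptions, not our production definitions.

```python
def distance_bucket(distance_km: float) -> str:
    """Discretize merchant-user distance into coarse buckets (illustrative boundaries)."""
    for bound, name in [(1, "lt1km"), (3, "1-3km"), (10, "3-10km")]:
        if distance_km < bound:
            return name
    return "gt10km"

def cross_feature(user_in_resident_city: bool, merchant_in_resident_city: bool,
                  distance_km: float) -> str:
    """Cross 'user in resident city' x 'merchant in resident city' x discretized distance.

    The resulting string is one value of a sparse categorical feature that can be
    one-hot or hash encoded and fed to the wide (linear) part of the model.
    """
    return "x".join([
        f"user_resident={int(user_in_resident_city)}",
        f"merchant_resident={int(merchant_in_resident_city)}",
        f"dist={distance_bucket(distance_km)}",
    ])

print(cross_feature(True, False, 2.4))   # user_resident=1xmerchant_resident=0xdist=1-3km
```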
Normalization
Normalization operates on the rows of the feature matrix; its purpose is to put sample vectors on a common scale when similarity is computed via dot products or other kernel functions, that is, to turn them into "unit vectors". In practical engineering we use two normalization methods:
Min-Max: x' = (x - Min) / (Max - Min), where Min is the minimum value of the feature and Max is its maximum value.
Cumulative Distribution Function (CDF): the CDF gives the probability that a random variable X takes a value less than or equal to x, i.e. F_X(x) = P(X ≤ x).
In our offline experiments, continuous features processed with the CDF gave an offline AUC less than 0.1% higher than with Min-Max. We suspect that some continuous features do not follow the uniform distribution on (0, 1) that the CDF transform assumes, so in this case the CDF is not as intuitive and effective as Min-Max, and we therefore use the Min-Max method online.
Fast aggregation
To make the model converge faster and give the network a better representation, we add a super-linear and a sub-linear version of each continuous feature; that is, for each feature x, two sub-features are derived:
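A typical choice for these two sub-features, following the YouTube DNN paper listed in the references, is the square (super-linear) and the square root (sub-linear) of the normalized feature value:

$$x_{\text{super}} = x^2, \qquad x_{\text{sub}} = \sqrt{x}$$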
Experiments show that introducing these two sub-features for every continuous variable improves the offline AUC, but considering the cost of online computation, the two sub-features were not added in the online experiment.
4. Choice of optimizer
In deep learning, choosing an appropriate optimizer not only speeds up training of the whole neural network but also helps avoid getting stuck at saddle points during training. Here, based on our own experience, we offer some observations about the optimizers we have used.
Stochastic Gradient Descent (SGD)
SGD is a common optimization method: each iteration computes the gradient on a mini-batch and then updates the parameters. The formula is as follows:
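In its standard mini-batch form, the update is:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J\big(\theta_t;\, x^{(i:i+n)},\, y^{(i:i+n)}\big)$$

where η is the learning rate and the gradient is computed on a mini-batch of n samples.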
Its disadvantage is that the loss oscillates severely and it easily converges to a local minimum.
Momentum
To overcome SGD's severe oscillation, Momentum introduces the physical notion of momentum into SGD and replaces the raw gradient with an accumulated momentum term. That is:
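In its usual formulation, a velocity term accumulates past gradients and replaces the raw gradient in the update:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t$$

where γ (typically around 0.9) is the momentum coefficient.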
Compared with SGD, Momentum is like rolling down a hillside: while there is no resistance, momentum keeps building, and when resistance is met, the speed drops. In training this means updates speed up along dimensions whose gradient direction stays the same and slow down along dimensions whose gradient direction keeps changing, which accelerates convergence and reduces oscillation.
Adagrad
Compared with SGD, Adagrad additionally imposes a constraint on the learning rate, namely:
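In its usual formulation, Adagrad accumulates squared gradients and scales each update by their square root:

$$n_t = n_{t-1} + g_t^2, \qquad \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}}\, g_t$$

where g_t is the gradient at step t and ε is a small smoothing constant.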
Adagrad's advantage is that early in training, when the accumulated squared gradients are still small, the constraint term amplifies the updates and accelerates training; later, as the accumulation grows, the denominator becomes larger and larger, and training effectively ends prematurely.
Adam
Adam is a combination of Momentum and Adagrad: it uses momentum to accelerate training while also constraining the learning rate, dynamically adjusting each parameter's learning rate with first-moment and second-moment estimates of the gradient. Adam's main advantage is that, after bias correction, the learning rate of each iteration stays within a definite range, which keeps the parameter updates relatively stable. The formulas are as follows:
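In its standard form, Adam keeps exponential moving averages of the gradient and of its square, bias-corrects them, and uses them in the update:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t$$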
where β1 and β2 are the exponential decay rates of the first- and second-moment estimates (typically 0.9 and 0.999) and ε is a small constant that prevents division by zero.
Summary
Practice shows that Adam combines Adagrad's strength at handling sparse gradients with Momentum's strength at handling non-stationary objectives, and it performs better than the other optimizers. We also notice that many papers use SGD or Adagrad as the optimization function; however, compared with the other methods, SGD needs more training time in practice and may get trapped at saddle points, which limits its performance on a lot of real-world data.
5. Choice of loss function
There are also many loss functions to choose from in deep learning, such as mean squared error (MSE), mean absolute error (MAE), cross entropy, and so on. In both theory and practice we find that cross entropy has a clear advantage over the squared-error loss that works well for linear models. The main reason is that when deep learning updates W and b through back-propagation, the derivative of the sigmoid activation falls into its left or right saturation region for most input values, which makes the parameters update very slowly. The derivation is as follows:
A general MSE is defined as:
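In the standard single-neuron, single-sample form:

$$C = \frac{1}{2}(y - a)^2$$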
where y is the expected output and a is the actual output of the neuron, a = σ(Wx + b). With back-propagation, the update formulas for the weight W and the bias b are:
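Differentiating the quadratic cost, with z = Wx + b, gives:

$$\frac{\partial C}{\partial W} = (a - y)\,\sigma'(z)\, x, \qquad \frac{\partial C}{\partial b} = (a - y)\,\sigma'(z)$$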
Because of the nature of the sigmoid function, σ'(z) is saturated (close to zero) for most values of z, so the gradients above, and hence the updates of W and b, become very small.
The cross-entropy loss is defined as:
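For a single sample, with target y_i and predicted output a_i for class i, the standard form is:

$$C = -\sum_i y_i \ln a_i$$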
If there are multiple samples, the average cross entropy of the entire sample set is:
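Averaged over n samples, the standard form is:

$$C = -\frac{1}{n}\sum_{x}\sum_{i} y_i \ln a_i$$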
where n is the number of samples and i indexes the categories. For logistic (binary) classification, the expression simplifies to:
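With a single sigmoid output a per sample and a label y in {0, 1}, this is:

$$C = -\frac{1}{n}\sum_{x}\big[\, y \ln a + (1-y)\ln(1-a) \,\big]$$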
Compared with the squared loss, the cross-entropy loss has a very desirable property:
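Differentiating the binary cross entropy with a sigmoid output a = σ(z) gives the standard result:

$$\frac{\partial C}{\partial W_j} = \frac{1}{n}\sum_{x} x_j\,\big(\sigma(z) - y\big), \qquad \frac{\partial C}{\partial b} = \frac{1}{n}\sum_{x}\big(\sigma(z) - y\big)$$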
As you can see, the σ'(z) term has dropped out, so the updates of W and b are not affected by saturation: when the error is large the weights update quickly, and when the error is small they update slowly.
6. Wide & Deep model framework
At the beginning of the experiments, we only compared a standalone 5-layer DNN model with the linear model. Comparing offline/online AUC, we found that a plain DNN brings no obvious CTR improvement, and a standalone DNN also has bottlenecks of its own: for an inactive user, the feature vector is very sparse because the user has interacted with few Items, and a deep learning model may over-generalize in this situation and recommend Items with little relevance to the user. We therefore combine a wide linear model with the deep learning model and also include some combined features, in order to better capture the joint relationships among Item, feature, and label. Our hope is that the wide linear part of the Wide & Deep model uses cross features to effectively memorize the interactions between sparse features, while the deep neural network part mines interactions between features to improve the model's generalization ability. The figure below shows the framework of our wide & deep learning model:
In the offline phase, we use Keras on top of Theano and TensorFlow as the model engine. During training we clean and re-weight the sample data. For features, continuous features are normalized with the Min-Max method, and for cross features we extract, based on business requirements, a number of crosses that are meaningful in our business scenarios. In the model we use Adam as the optimizer and cross entropy as the loss function. Unlike the Wide & Deep Learning paper, during training we feed the combined features as input layers into both the Deep component and the Wide component. In the Deep part, all input data passes through three ReLU layers and is finally scored by a sigmoid layer. Our Wide & Deep model is trained on more than 70 million training samples, and more than 30 million test samples are used for offline model evaluation. The batch size is set to 50,000 and the number of epochs to 20.
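For reference, here is a minimal sketch of such a network in Keras (written against tf.keras); the input dimensions, layer and input names, and the representation of the wide part as a pre-built sparse cross-feature vector are illustrative assumptions, while the 256 -> 128 -> 64 ReLU stack, the sigmoid scoring layer, the Adam optimizer, the cross-entropy loss, and the batch size and epoch settings follow the text.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

DEEP_DIM = 128   # dense user/item/scenario features fed to the deep part (assumed)
WIDE_DIM = 1000  # sparse cross features fed to the wide (linear) part (assumed)

deep_in = layers.Input(shape=(DEEP_DIM,), name="deep_features")
wide_in = layers.Input(shape=(WIDE_DIM,), name="wide_cross_features")

# Deep component: three ReLU layers (256 -> 128 -> 64), as described in the text.
x = layers.Dense(256, activation="relu")(deep_in)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)

# The wide component is linear; join it with the deep representation and
# score everything with a single sigmoid unit.
joint = layers.concatenate([x, wide_in])
output = layers.Dense(1, activation="sigmoid", name="ctr")(joint)

model = Model(inputs=[deep_in, wide_in], outputs=output)
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])

# Illustrative training call; the 50,000 batch size and 20 epochs follow the text.
# X_deep, X_wide, y, w = ...  # prepared features, click labels, sample weights
# model.fit([X_deep, X_wide], y, sample_weight=w, batch_size=50000, epochs=20)
```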
Offline / online results of deep learning
In the experimental stage we ran a series of comparisons among the DNN, the wide & deep model, and logistic regression, and the best-performing wide & deep model was put online for an A/B experiment against the original Base model. Judging from the results, the wide & deep learning model performs well both offline and online. The specific conclusions are as follows:
As the hidden layers get wider, the offline training results gradually improve; however, considering the performance of online real-time prediction, we currently adopt a 256 -> 128 -> 64 architecture.
The following figure compares the online experimental results of the wide & deep model with combined features against the Base model.
From the online results, the wide & deep learning model solves the problem of far-away but historically clicked Items being recalled to the top, and at the same time it recommends some novel Items that fit the current scenario.
Summary
Ranking is a very classical machine learning problem, and achieving both memorization and generalization in one model is a challenge for recommendation systems. Memorization can be defined as reproducing historical data in the recommendations, while generalization, based on the transitivity of data correlations, explores Items that have never or rarely occurred before. The wide linear part of the Wide & Deep model can use cross features to effectively memorize interactions between sparse features, while the deep neural network improves generalization by mining interactions between features. Online experimental results show that the Wide & Deep model clearly improves CTR. At the same time, we are trying to evolve the model in several directions:
1. Integrate RNNs into the existing framework. The existing Wide & Deep model only combines a DNN with a linear model and does not model changes over time, yet the chronological order of samples also matters for recommendation ranking. For example, when a user has recently browsed hotels and scenic spots in a remote city and then makes another request about that remote city, food around those scenic spots should be boosted.
2. Introduce reinforcement learning, so that the model can dynamically recommend content according to the user's scenario.
The integration of deep learning and logistic regression not only gives us the advantages of both, but also lays a solid foundation for further design and optimization of the CTR prediction model.
References
1. H. Cheng, L. Koc, J. Harmsen, et al. Wide & Deep Learning for Recommender Systems. ACM, 2016. https://arxiv.org/pdf/1606.07792.pdf
2. P. Covington, J. Adams, E. Sargin. Deep Neural Networks for YouTube Recommendations. RecSys '16: Proceedings of the 10th ACM Conference on Recommender Systems. https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf
3. H. Wang, N. Wang, D. Yeung. Collaborative Deep Learning for Recommender Systems. KDD '15: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.