Construction and Exploration of one-stop Machine Learning platform Deepthought

2025-03-29 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02 Report --

The origin of development

Beyond deep learning AI applications such as audio/video processing and recommendation, iQIYI also has many traditional machine learning scenarios in data mining and data analysis, such as user prediction and risk control. The traditional R&D model had several pain points that inconvenienced algorithm and business staff:

1. User code was implemented as stand-alone scripts with long processing chains and high coupling, which made it hard to modify and extend and eroded readability over time.

2. Multiple scenarios and models within the same business share similar technical steps, such as data processing and model training, yet it was difficult to reuse repeated steps and intermediate data across scenarios.

3. Different scenarios require periodic training, scheduled prediction, real-time prediction, and so on, placing high coding demands and maintenance costs on business and algorithm staff.

4. Algorithm and business staff face technical barriers to distributed machine learning development, so data volume and model complexity were limited by single-machine resources.

To address these pain points, iQIYI developed Deepthought, a one-stop machine learning platform for general machine learning scenarios. It offers visual interaction for building architectures suited to business scenarios more intuitively and conveniently, as well as real-time prediction services, an important step in deploying algorithm models to actual business.

Business requirements

Deepthought considered the following basic business requirements at the beginning of its development:

1. The core algorithms are encapsulated on top of distributed machine learning frameworks, mainly wrapping open-source implementations and supplemented by self-developed ones, to bring basic algorithms online quickly.

2. Each stage of machine learning and data mining is decoupled, and the outputs of different stages can be reused.

3. Deep integration with the big data platform Tongtian Tower, using the projects, data, and scheduling managed by Tongtian Tower to run machine learning tasks both online and offline.

4. Reduce the burden of user code development: visual interaction and configuration drive the scheduling of machine learning tasks and improve the efficiency of model construction.

Overall architecture and development history

Deepthought has iterated to version 3.0. The details are as follows.

Deepthought v1.0: a business-oriented machine learning platform

This version belonged to the anti-cheating business. It decoupled the stages of the machine learning process in anti-cheating and managed the business data involved, such as blacklists, samples, and features. The Deepthought v1 architecture is shown in the following figure.

Based on Spark ML/MLlib, Deepthought encapsulated the binary classification models commonly used by the business, as well as common data preprocessing steps such as missing-value filling and normalization.
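The platform's Spark implementation isn't public, but the two preprocessing steps named above can be illustrated in plain Python; the function names here are hypothetical stand-ins, not Deepthought's API.

```python
# Illustrative sketch (not Deepthought's code) of the two preprocessing
# steps mentioned above: missing-value filling and min-max normalization.

def fill_missing(rows, default=0.0):
    """Replace None values in each feature row with a default value."""
    return [[default if v is None else v for v in row] for row in rows]

def min_max_normalize(rows):
    """Scale each feature column to the [0, 1] range."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

data = [[1.0, None], [3.0, 10.0], [5.0, 20.0]]
clean = min_max_normalize(fill_missing(data))
```

In the real platform these steps run as Spark stages over distributed data; the single-machine version only shows the transformation itself.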

Deepthought v1 focused on feature management and data configuration and leaned toward anti-cheating itself. Its encapsulation of Spark and its decoupled, serial scheduling of process stages were inherited by subsequent versions.

Deepthought v2.0: a machine learning platform for general business

Based on the experience of v1.0, v2.0 was generalized. Through componentized, configurable building blocks it covers common machine learning needs; it inherits the core ideas of v1, retains the general machine learning workflow, and rebuilds the overall system drawing on mature user experiences in the industry. The main updates are as follows.

Component-based management and scheduling

All data processing and algorithm execution are managed and scheduled through components. Apart from the core logic and scheduling scripts, all component information and configuration items are managed through background configuration, and a component's configuration is dynamically rendered during front-end interaction.

Algorithm expansion

In v2 we added many machine learning algorithms: supervised binary classification, multi-class classification, and regression; unsupervised clustering; graph algorithms; a variety of data preprocessing algorithms; and various evaluation and data-analysis visualization components. These basically cover the needs of all traditional machine learning scenarios.

Visual interaction

Each machine learning step is managed and used as a component. By dragging and dropping components, users can build architectures that fit their business scenarios more intuitively and conveniently, and the business scenarios that can be realized are also more flexible and open. The front end also provides a series of report controls: report data read by the system can be dynamically rendered into visual reports.

Offline scheduled tasks

By integrating with the permission system of the big data platform Tongtian Tower, Tongtian Tower can read Deepthought task information and schedule it, enabling scheduled-prediction scenarios.

The Deepthought v2 architecture is shown in the following figure.

Deepthought v3.0: support for real-time prediction services

v3 extends v2. It supports automatic parameter tuning and real-time prediction services, and supports parameter-server training for very large-scale data and models.

Automatic parameter adjustment

This feature optimizes and trains over multiple parameters through a variety of tuning algorithms, finally finding the parameter combination with the best evaluation result and the corresponding optimal model. Automating this repetitive work improves productivity and lets users focus more on the problem than on the model.

The tuning methods implemented so far include random search, grid search, Bayesian optimization, and evolutionary algorithms.

Real-time estimation

The real-time prediction service is an important part of deploying algorithm models to actual business. v3 implements real-time prediction for several commonly used models and supports both HTTP and RPC protocols. The service loads the core prediction code through QAE, publishes the HTTP/RPC service through Skywalker/Dubbo, and connects to log monitoring through Hubble and Venus. On initialization, the core code loads the preprocessing module, reads the model, and starts listening on the service port. To maximize the use of computing resources and reduce prediction latency, prediction uses a thread pool plus pipeline mode to split the multiple rows of data in a single request.
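The "thread pool + split a multi-row request" idea can be sketched as follows. This is an assumption-laden illustration: `predict_row` stands in for the real model call, and the service framing (QAE, Dubbo, etc.) is omitted.

```python
# Sketch of splitting one multi-row prediction request across a thread pool,
# as described above: rows are scored concurrently, results keep request order.
from concurrent.futures import ThreadPoolExecutor

def predict_row(row):
    """Placeholder per-row scorer; a real service would invoke the model here."""
    return sum(row) / len(row)

def predict_request(rows, max_workers=4):
    """Score all rows of a single request concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict_row, rows))

scores = predict_request([[1.0, 3.0], [2.0, 4.0], [10.0, 0.0]])
```

`pool.map` preserves input order, which matters here: the caller must be able to match each score back to its row.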

Parameter server

For internal reasons, Spark-based machine learning could not support very large-scale training with more than a million feature dimensions and more than 100 million rows. The industry solves this problem with parameter servers. v3 integrates an open-source parameter server and implements parameter-server versions of some commonly used models.

The Deepthought v3 architecture is shown in the following figure.

Selected core implementations and encapsulations

Below we briefly introduce the implementation and encapsulation of some of the platform's core functions.

Spark ML/MLLib encapsulation

The algorithm components in Deepthought are encapsulated and improved on top of Spark's native ml/mllib packages. They inherit the advantages of parallel computing and pipelined processing, add a series of usability features, and add some algorithms that the original packages lack.

Take the GBDT encoding algorithm as an example. GBDT is a widely used basic machine learning algorithm that serves not only classification and regression but also feature construction: a trained GBDT model can discretize raw data and output it in a one-hot-like form, so the output can then be fed into FM, LR, and other models for further training. In Deepthought, the storage of GBDT models in ml was optimized so that the details of each tree are preserved. The GBDT encoding component reconstructs the trees from the trained model and the stored tree details, encodes the leaf nodes of each tree, feeds the data through all the trees in parallel, and splices the results to obtain the final encoding. By enriching the set of available models beyond the common ones, Deepthought gives users more choices when building business models and broadens the platform's functionality.

Data preprocessing

Beyond encapsulating Spark ML/MLlib, Deepthought includes substantial development driven by actual business and platform needs. To extend and optimize the platform, Deepthought adds data preprocessing components and an automatic label conversion function in the training components.

By dragging in different preprocessing components, such as stratified sampling, data compression, and data encoding, users obtain usable data that meets their needs. For convenience, all Deepthought components have automatic label conversion: Deepthought maps each distinct input label to an integer label starting from 0, saves the mapping, and uses it to keep the prediction component consistent and finally output the original labels as-is.
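The label conversion just described is simple to sketch. The function names are hypothetical; the behavior (map to integers from 0, save the mapping, restore originals after prediction) follows the text.

```python
# Sketch of automatic label conversion: raw labels -> integer ids from 0,
# with the saved mapping used to restore the original labels afterwards.

def fit_label_map(labels):
    """Assign each distinct label an integer id, in order of first appearance."""
    mapping = {}
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
    return mapping

def encode(labels, mapping):
    return [mapping[lab] for lab in labels]

def decode(ids, mapping):
    inverse = {v: k for k, v in mapping.items()}
    return [inverse[i] for i in ids]

raw = ["spam", "ham", "spam", "other"]
m = fit_label_map(raw)
ids = encode(raw, m)
restored = decode(ids, m)
```

Saving `m` alongside the model is what lets the prediction component emit the user's original labels rather than internal ids.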

Real-time loss output

A good, usable machine learning model requires constant tuning and experimentation by algorithm engineers, and model tuning takes a great deal of time and energy.

In machine learning, the loss function is tied to the optimization problem as the learning criterion: the model is solved and evaluated by minimizing the loss. The change of the loss value in each iteration indicates how the distance between the model and the target changes, and the loss curve reflects the convergence speed of training. From the curve, algorithm engineers can judge whether the loss function and hyperparameter settings are reasonable, and decide whether to stop training early, saving tuning time and improving training efficiency. Deepthought's algorithm components are built on Spark's ml and mllib libraries, which do not provide loss-curve output. To let users view the loss curve synchronously during training (similar to TensorFlow), Deepthought adds a loss-value output function to the algorithm components based on Spark's message communication and event bus, realizing synchronous display of the loss curve.
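Spark's event bus isn't reproduced here, but the pattern the text describes can be sketched generically: the training loop posts a loss value each iteration, and a subscribed listener collects the points for a live curve. All names below are hypothetical.

```python
# Sketch of the publish/subscribe pattern behind real-time loss output:
# training posts (iteration, loss) events; listeners build the curve.

class LossBus:
    def __init__(self):
        self.listeners = []

    def subscribe(self, fn):
        self.listeners.append(fn)

    def post(self, iteration, loss):
        for fn in self.listeners:
            fn(iteration, loss)

curve = []
bus = LossBus()
bus.subscribe(lambda i, l: curve.append((i, l)))

# Stand-in training loop: loss shrinks geometrically each iteration.
loss = 1.0
for i in range(5):
    loss *= 0.5
    bus.post(i, loss)
```

The front end plays the role of the listener here, rendering the accumulated points as the curve updates.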

Hot loading for real-time prediction

When using a model for real-time prediction online, users need to periodically replace the old model with a newly trained one. In earlier versions of the real-time prediction service, the prediction code, once loaded into QAE, loaded the model into memory a single time at initialization.

As a result, updating the model required gradually killing and replacing QAE instances so that new instances would pick up the new model. The drawbacks of this approach were:

1. The switching process is invisible to users, who cannot tell whether a given response came from the old model or the new one.

2. Because RPC uses persistent connections, killing a QAE instance breaks all connections on that instance. Although the interruption is short, it can still cause some request jitter.

Through a hot-loading transformation of the existing service, models can now be hot loaded; multiple models can share resources under the same QAE instance, and users can explicitly select a specific version when calling the service. After stripping out the HTTP and RPC frameworks themselves, we encapsulated the general service into a model prediction service that provides a common prediction interface and a management interface (for loading and viewing models). Loaded models are encapsulated as objects and stored in a global object pool. A model router in the prediction service parses each request to detect whether the user explicitly requested a specific model and version; if not, the router directs the request to the latest version in the object pool.
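The object-pool-plus-router idea can be sketched as below. The class and method names are assumptions; the real service wraps this in HTTP/RPC management and prediction interfaces.

```python
# Sketch of hot loading via a shared model pool: models are keyed by
# (name, version); requests may pin a version, otherwise the router serves
# the latest, so new models go live without restarting the process.

class ModelPool:
    def __init__(self):
        self.pool = {}    # (name, version) -> model object
        self.latest = {}  # name -> highest loaded version

    def load(self, name, version, model):
        """Management interface: register a newly trained model version."""
        self.pool[(name, version)] = model
        if version > self.latest.get(name, -1):
            self.latest[name] = version

    def route(self, name, version=None):
        """Prediction interface: explicit version if given, else the latest."""
        if version is None:
            version = self.latest[name]
        return self.pool[(name, version)]

pool = ModelPool()
pool.load("ctr", 1, lambda x: 0.1)   # old model (toy scorer)
pool.load("ctr", 2, lambda x: 0.9)   # newly trained model
latest_score = pool.route("ctr")(None)
pinned_score = pool.route("ctr", version=1)(None)
```

Because old versions stay in the pool until evicted, in-flight requests pinned to the old version keep working while new traffic moves to the latest one.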

Current application and follow-up work

A brief introduction to current applications

At present, Deepthought is used by many teams, such as traffic anti-cheating, user behavior analysis, and iQIYI Literature. Below is a brief introduction to some typical business scenarios already running on Deepthought.

Traffic anti-cheating business

The traffic anti-cheating business currently uses Deepthought for model training and offline prediction.

Using the data mining tools Deepthought provides, such as k-means clustering and isolation forest anomaly detection, anti-cheating engineers analyze abnormal traffic characteristics and mine gang characteristics through graph analysis tools. Based on these results, features and labels are fed, manually or semi-automatically, into the feature engineering workflow and label library on the big data platform Tongtian Tower.

For the models themselves, anti-cheating typically uses binary classification models such as LR, GBDT, and XGBoost to classify the feature data. In practice, periodic model updates and offline prediction are realized by regularly running Deepthought tasks through Tongtian Tower. With Deepthought's automatic parameter tuning, model hyperparameters are optimized automatically, and the model's precision and recall in practice both exceed 97%.

Recommendation business

Deepthought is also widely used and well received in the recommendation business.

The Deepthought platform helps make the recommendation model training process clearer, so that recommendation engineers can focus more on model selection and business understanding. The recommendation team builds and trains models through the algorithm components Deepthought provides, such as GBDT encoding, SQL customization, LR, collaborative filtering, and FM.

Within the recommendation business, Deepthought focuses on model training. Data prepared by the big data platform Tongtian Tower is fed into Deepthought; users assemble data preprocessing, data splitting, model training, and evaluation by dragging components, then deploy the trained model online for real-time prediction. Once the whole pipeline is built, users only need to run it on a schedule to update the model.

Among these, the combination of GBDT encoding and FM training is one of the most frequently used. The recommendation business converts continuous data into discrete data through GBDT encoding and splices it with existing discrete data to obtain data suitable for FM model training.

When training completes, Deepthought notifies the user of the model's precision, recall, and other details via official account and email. Once the model is confirmed to be usable, it can be deployed for real-time prediction through the Deepthought platform to meet the recommendation business's real-time forecasting needs.

In addition, to handle the heavy feature combination and splicing in the recommendation business, Deepthought provides a feature-extract UDF that lets users perform atomic operations on raw data, such as splicing, addition, and taking logarithms, via configuration files. The algorithm components also provide an idMap mapping function, so teams that have already encoded or mapped their data can still use the Deepthought platform.
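A config-driven feature transform of the kind described can be sketched as follows. The operation names and config shape are assumptions for illustration, not the platform's actual configuration format.

```python
# Sketch of a configuration-driven feature UDF: each config entry names an
# atomic operation (add, log, concat) plus its input columns and output column.
import math

OPS = {
    "add":    lambda vals: sum(vals),
    "log":    lambda vals: math.log(vals[0]),
    "concat": lambda vals: "_".join(str(v) for v in vals),
}

def apply_config(row, config):
    """row: dict of column -> value; config: list of (out_col, op, in_cols)."""
    out = dict(row)
    for out_col, op, in_cols in config:
        out[out_col] = OPS[op]([row[c] for c in in_cols])
    return out

row = {"clicks": 3, "views": 7, "city": "bj", "os": "android"}
config = [
    ("total", "add", ["clicks", "views"]),
    ("ctx", "concat", ["city", "os"]),
]
feat = apply_config(row, config)
```

The point of the design is that new derived features require only a config change, not new user code.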

Summary and follow-up planning

Deepthought encapsulates and develops a large amount of machine learning functionality, enabling users to complete machine learning work through simple configuration and drag-and-drop. It helps non-algorithm business staff use machine learning easily and in a standardized way, greatly reducing repetitive work on code, algorithms, and model management. Together with deep integration with the big data platform Tongtian Tower, it closes the loop from data development to model training and offline/online prediction, and is an important part of the big data team's data middle platform.

Going forward, Deepthought will continue to improve ease of operation and system stability, and will extend model support in the real-time prediction service; at the algorithm level, it will gradually enrich the model types and try adding deep learning models.
