The architecture evolution of Zhihu's anti-cheating system "Wukong"


Hello everyone! It has been exactly three years since "Wukong" officially debuted in April 2015. As Zhihu has continued to develop and grow, Wukong has faced new tests and has been continuously optimized and upgraded. In this article I would like to share, in a systematic way, the experience and lessons we have accumulated while evolving and building Wukong's architecture over the past few years.

Business status

As of May this year, Zhihu had 160 million registered users. In recent years, beyond Q&A and column articles, the community has spawned new product lines and product forms, so the range of businesses that "Wukong" serves has also expanded: from an initial focus on controlling content spam to behavioral spam, transaction risk, and more. At present, "Wukong" covers 10 business lines and nearly 100 feature points.

Generally speaking, several typical types of spam have long existed on Zhihu:

Content spam: the core benefit behind this kind of spam is, on the one hand, spreading within the site, and on the other hand, ranking in search engines for SEO. Content spam is the mainstream spam type in the community and mainly takes four forms:

Diversion content: this kind of spam can account for 70%-80% of the spam in the community. Typical examples involve training institutions, beauty care, insurance, and overseas purchasing agents. Diversion content usually carries QQ numbers, mobile numbers, WeChat IDs, URLs, and even landline numbers, and special waves of spam appear around particular dates such as the World Cup, Double 11, and Double 12, which are prime money-making seasons for the underground industry.

Brand content: this kind of content shows typical SEO characteristics. There is usually no obvious diversion marker in the content itself; instead the cheating appears as a staged question and answer, for example a question asking "Which brand is good?" or "How is that training school?", followed by a planted recommendation in the corresponding answer.

Fraud content: this usually impersonates celebrities or institutions, for example the bicycle refund scam, where the content provides a fake customer-service phone number to commit fraud.

Harassment content: for example, bulk-posted bait and survey content, which seriously degrades the experience of Zhihu users.

Behavioral spam: this mainly includes fake likes, follows, thanks, shares, and page views. On the one hand it is used to "maintain" accounts and evade detection by the anti-cheating system; on the other hand, bulk actions are used to boost the spread of content within the site.

Governance experience

The key to dealing with the problems above is to discover and control risk in an agile and continuous way, to keep a dynamic balance between handling cost and benefit, and to build a multi-layered defense around the spammers' points of benefit. By multi-layered defense we mean strengthening the ability to discover and control risk through multiple control methods and multiple control stages.

Three control modes

Policy-based anti-cheating: in the early stage of anti-cheating, when spam features are relatively simple, policies (rules) are a crude but quick and effective way to solve problems, so they are a sharp weapon for the head of the problem distribution in an anti-cheating solution.

Product anti-cheating: on the one hand, changing the product form can effectively limit where risk can occur; on the other hand, product design can channel demand where the pain points of normal users and spammers overlap. Sometimes, when fighting spam, we hit a bottleneck on false positives and precision and find it hard to tell normal users and spammers apart; in such cases a product-level solution may work better.

Model-based anti-cheating: machine learning models can greatly improve the generalization ability of the anti-cheating system and reduce the cost of hand-crafting rules. When applying models, consider adding manual review to guarantee quality, especially for models whose output directly punishes content or users, and pay attention to model interpretability. In our experience, some unsupervised clustering algorithms can deliver good results in a relatively short time, while supervised classification algorithms consume more time and manpower, and the completeness of the samples and the quality of feature engineering both affect the results.

Three control stages

Pre-event: the pre-event stage involves risk education, participation in business decisions, monitoring and alerting, and synchronous interception. Anti-cheating needs to raise the business teams' risk awareness and clearly communicate the services anti-cheating can provide; it should join business decision-making early to avoid larger risks in the product design; once a business is connected, its new-content volume, handling volume, report volume, and false positives need to be monitored so risks are found in time. At the policy level, obvious head-of-distribution cheating should be intercepted in advance with frequency limits and resource blacklists to reduce the pressure on in-event detection.

In-event: this stage targets the middle of the long-tail curve, mainly cheating behaviors with lower frequency and less obvious patterns. Behaviors and accounts with different degrees of suspicion are handled at different levels: submitted for manual review, restricted, or punished at the content and account level.

Post-event: this stage targets the tail of the long-tail curve, that is, cheating that is very low frequency or low impact but relatively expensive to compute. Offline models and policies are responsible for detection and control here. The post-event stage also includes tracking policy effectiveness and tuning rules, combined with user feedback and reports to close the detection loop.

Wukong V1

In its early stage, Wukong consisted mainly of a pre-event module and an in-event module.

The pre-event module runs serially with the business request and is suited to checks such as frequency detection and keyword and blacklist interception. Because it is a synchronous interface, to minimize the impact on the business most of the complex detection logic is handled by the in-event module.

The in-event module detects on a bypass alongside the business and is suitable for relatively complex, time-consuming detection. It mainly consists of a Parser and a series of Checkers: the Parser parses business data into a fixed format and lands it in the basic event store, and the Checkers fetch recent behaviors from the basic event store and run policy detection over them.

Event access

Event data landed in anti-cheating scenarios generally covers these dimensions: who did what, to whom, at what time, and in what environment. Mapped to concrete fields, that is who (UserID) did what (ActionType, ObjID, Content) to whom (AcceptID), at what time (Created), and in what environment (UserAgent, UserIP, DeviceID, Referer). With this information, policies can filter on these dimensions, and extended data for them (for example, additional attributes of the user involved) can also be fetched for policy detection.
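As a rough illustration, a basic event record carrying these fields might look like the following sketch; the field names follow the article, while the dataclass and helper function are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass
class Event:
    """One row in the basic event store: who did what, to whom, when, and where."""
    user_id: str          # who (UserID)
    action_type: str      # what kind of action (ActionType), e.g. "create_answer"
    obj_id: str           # the object acted on (ObjID)
    content: str          # the content produced, if any (Content)
    accept_id: str        # to whom (AcceptID)
    created: datetime     # when (Created)
    user_agent: str       # environment (UserAgent)
    user_ip: str          # environment (UserIP)
    device_id: str        # environment (DeviceID)
    referer: str          # environment (Referer)

def to_document(event: Event) -> dict:
    """Convert an event into the dict that would be landed into the event store."""
    doc = asdict(event)
    doc["created"] = event.created.isoformat()
    return doc
```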

Policy engine

The design of Wukong's policy engine pays close attention to policy extensibility. On the one hand it supports horizontal extension: starting from the basic dimensions, it can pull in more business data as extended dimensions, such as user-related, device-related, and IP-related information. On the other hand it supports vertical extension: policies can backtrack along the time dimension and inspect correlated dimensions (for example, recent behavior of the same user or the same IP) to discover and attack spam more efficiently.

Here is a typical V1 strategy:
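As a rough, hypothetical reconstruction of the nested structure described below (the keys and operators are illustrative, not Wukong's actual DSL):

```python
# Hypothetical reconstruction of a V1 nested rule; the actual DSL keys and
# operators are not known, so everything here is illustrative only.
rule = {
    "event": "create_answer",
    "condition": {
        "and": [
            {"range": {"field": "created", "within": "10m"}},                # answers in the last 10 minutes
            {"eq":    {"field": "topic_id", "value": "$current.topic_id"}},  # under the same topic
            {"lte":   {"field": "abs(user.register_time - $current.user.register_time)",
                       "value": "1h"}},                                      # registered within an hour of each other
            {"gte":   {"field": "ip.registered_user_count", "value": 3}},    # >= 3 users registered on the same IP
        ]
    },
    "action": "submit_for_review",
}
```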

This strategy mainly implements the following logic:

Among the answers created under the same topic in the last 10 minutes, find those whose authors registered within one hour of the current user and for which the number of users registered under the same IP is greater than or equal to 3.

This pattern is basically enough for daily spam detection, but the nested structure is not very friendly to write or read; how we optimized it is described in detail in the V2 section.

Since policies change far more frequently than the basic modules, in the V1 architecture we deliberately split policy maintenance into its own service: on the one hand this allows smooth rollout, and on the other hand it reduces the impact of policy changes on overall stability.

Storage selection

For storage, we chose MongoDB as the basic event store and Redis as a cache for key RPC calls. We chose MongoDB because, on the one hand, the access pattern of our basic event store is relatively simple and does not need transactions; on the other hand, reads far outnumber writes, 90% of queries hit hot data from the recent period, and random reads and writes are rare, a scenario MongoDB fits very well. In addition, since requirements were still unstable at the start, being schema-free was another advantage that attracted us to MongoDB. Because policy detection calls many business interfaces, for features with relatively low real-time requirements we use Redis as a feature cache to reduce the call pressure on the business side.
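A minimal sketch of that kind of feature cache; the RPC client, key format, and TTL are assumptions, not Wukong's actual values.

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def get_user_features(user_id: str, rpc_client, ttl_seconds: int = 300) -> dict:
    """Read-through cache: serve user features from Redis, fall back to the business RPC."""
    key = f"antispam:feature:user:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    features = rpc_client.fetch_user_features(user_id)  # hypothetical business RPC
    r.setex(key, ttl_seconds, json.dumps(features))
    return features
```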

Wukong V2

"Wukong V1" has met the daily strategic needs, but we have also found a lot of pain points in the process of use:

Steep learning curve and high writing cost: the policies above use a nested structure. On the one hand, the learning cost is relatively high for product and operations colleagues; on the other hand, it is easy to make mistakes such as missing brackets while writing them.

Long policy launch cycle: launching a policy in "Wukong V1" roughly goes through these steps: product team drafts the policy -> engineers implement it -> engineers run it online in recall-only mode -> product team waits for the recall results -> product team confirms the policy's effect -> the policy goes live with actual handling. The whole process involves many people and environments, and verifying a policy is troublesome and time-consuming, so the cost of trial and error is very high.

In view of this, Wukong V2 focuses on improving the experience of configuring and launching policies in a self-service way. The following is the architecture diagram of Wukong V2:

Policy structure optimization

Borrowing from functional languages, the new policy structure introduces Spark-like operators: filter, mapper, reducer, flatMap, groupBy, and so on. Most basic policy requirements can be expressed with the first three.

For example, after the optimization, the policy shown earlier becomes the following format:
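A rough, hypothetical reconstruction of what the operator-style version of that policy might look like; only the operator names (filter, mapper, reducer) come from the text, and the expression syntax is an assumption.

```python
# Hypothetical reconstruction of the same policy in V2's operator style.
policy = [
    ("filter",  "event.action_type == 'create_answer'"),
    ("filter",  "event.topic_id == current.topic_id and event.created >= now() - 10m"),
    ("mapper",  "extend(event, user=load_user(event.user_id))"),   # pull in the user dimension
    ("filter",  "abs(user.register_time - current.user.register_time) <= 1h"),
    ("reducer", "count_distinct(event.user_id, group_by=event.user_ip) >= 3"),
]
```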

The structure is clearer and more extensible. On the engineering side, we only need to provide reusable operators to cover daily policy requirements; whether it is a machine learning model or business-related data, it can be wrapped as an operator.

Policy self-service configuration

After optimizing the policy structure, the next step was to eliminate the back-and-forth between the two roles, engineering and product, in the policy launch process. "Wukong V2" supports self-service policy configuration, which completely frees engineers from policy configuration and further improves the efficiency of getting policies online.

Policy launch process optimization

How can policies go live more nimbly? This is something we kept thinking about. Every policy that goes online has to balance precision and recall and hit as much spam as possible, so each one used to require a long verification period, a very time-consuming process that needed optimizing. The policy launch flow in "Wukong V2" was optimized into: create policy -> policy test -> policy trial run -> policy goes live with handling -> policy monitoring.

Policy testing is mainly used for preliminary verification of policies to avoid obvious syntax errors.

A policy trial run can be understood as snapshot replay: the policy is run against the past few days of data to quickly verify its effect, and the whole thing finishes at the minute level. In the implementation, we copy the resources the policy runs on and isolate them from the production environment, and a coordinator reads historical events out of MongoDB and pushes them into a queue. It is worth noting that the enqueue rate has to be throttled to keep the queue from being overwhelmed instantly.
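A minimal sketch of such a replay coordinator, assuming pymongo, a Redis list as the trial-run queue, and an events collection shaped like the basic event store described earlier; the collection name, queue key, and rate limit are illustrative.

```python
import json
import time
from datetime import datetime, timedelta

import redis
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
queue = redis.Redis(host="localhost", port=6379, db=1)

def replay_events(policy_id: str, days: int = 3, max_per_second: int = 500) -> None:
    """Replay the last `days` of events into an isolated trial-run queue, throttled."""
    events = mongo["antispam"]["events"]          # hypothetical database/collection names
    since = datetime.utcnow() - timedelta(days=days)
    cursor = events.find({"created": {"$gte": since}}).sort("created", 1)

    sent_this_second, window_start = 0, time.monotonic()
    for doc in cursor:
        doc["_id"] = str(doc["_id"])
        queue.rpush(f"trial_run:{policy_id}", json.dumps(doc, default=str))
        sent_this_second += 1
        if sent_this_second >= max_per_second:     # throttle the enqueue rate
            elapsed = time.monotonic() - window_start
            if elapsed < 1.0:
                time.sleep(1.0 - elapsed)
            sent_this_second, window_start = 0, time.monotonic()
```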

Once the trial run is verified, the policy is ready to go online. After launch, the policy monitoring module provides comprehensive metrics, including policy execution time, policy error counts, policy hits, handling volume, and so on. Only with data to back us up can we stay on solid footing.

Wukong V3

In mid-2016, the services of the main Zhihu site began to be split vertically, and accordingly, reducing the cost of connecting a service to Wukong was put on the agenda.

Gateway

Gateway is a general-purpose component that works with Nginx to block risky online traffic. At present, Gateway handles all abnormal-user-state interception for anti-cheating and account security, anti-cheating feature interception, and anti-crawler interception. This separates the interception logic from the business; especially as services are split out independently, it greatly reduces duplicated work on the business side, and as a shared component it also makes the interception logic more stable. Gateway's current architecture is shown in the following figure:

Because Gateway sits on the serial path, every request must finish within 10ms, so all state is cached in Redis. Gateway exposes an RPC interface (Robot), and related services call Robot to write user, IP, device, and other status into Redis. When a user request arrives, Nginx calls Gateway; Gateway extracts the IP, user ID, and other information from the request, queries Redis, and returns the result to Nginx. When an abnormal state is returned, Nginx blocks the request and returns an error code to the front end and the client.
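A minimal sketch of that lookup path, assuming the state is stored as simple Redis keys; the key layout, TTL, and status codes are assumptions.

```python
from typing import Optional

import redis

r = redis.Redis(host="localhost", port=6379, db=0, socket_timeout=0.005)

def check_request(ip: str, user_id: Optional[str] = None, device_id: Optional[str] = None) -> dict:
    """Return a verdict for Nginx: allow, or block with an error code."""
    keys = [f"gw:status:ip:{ip}"]
    if user_id:
        keys.append(f"gw:status:user:{user_id}")
    if device_id:
        keys.append(f"gw:status:device:{device_id}")
    statuses = r.mget(keys)                   # one round trip for all dimensions
    for status in statuses:
        if status is not None:                # any abnormal state blocks the request
            return {"allow": False, "error_code": int(status)}
    return {"allow": True}

def mark_abnormal(dimension: str, value: str, error_code: int, ttl: int = 3600) -> None:
    """What the Robot RPC side would do: write an abnormal status into Redis with a TTL."""
    r.setex(f"gw:status:{dimension}:{value}", ttl, error_code)
```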

TSP-Trust & Safety Platform

TSP mainly serves anti-crawler and anti-cheating. On the one hand, it parses mirrored bypass traffic, cleans it and does basic counting with Spark, and then sends the counts through Kafka to the anti-crawler policy engine for detection and handling, so businesses are covered at essentially zero integration cost. On the other hand, because anti-cheating depends on more business data that is hard to recover from raw traffic, businesses send events over Kafka rather than RPC, which further decouples them from the anti-cheating system and reduces the impact on the business services.

As it became easier to get "Wukong" policies online and the number of live policies grew, we began to optimize Wukong's detection performance and capability.

Full policy parallelization

"Wukong V2" policy detection is distributed on a behavior-by-behavior basis, and the problem is that with the increase of policies, the time for single-line detection will be greatly enhanced. In V3, we optimize this part of the logic, reduce the policy detection distribution to policy granularity, further improve the parallelism of policy operation, and achieve business-level container isolation. After optimization, the event detection module evolved into a three-level queue architecture. The first level is the event queue, and the downstream policy distribution worker will land the data and distribute the policy according to the business type of the event. The policy executes the worker, obtains the task from the second-level queue, carries on the policy detection, and distributes the hit event to the corresponding third-level queue. The third level queue is the processing queue, which is responsible for processing the content of the hit rule or the user.

Cache optimization

Because every policy detection backtracks over historical data, repeated queries and storage pressure grow naturally, so we added multi-level storage: on top of MongoDB, recent business data is also kept in Redis and a local cache. A previous technical article covered this in more detail; interested readers can take a look at: Wukong anti-cheating system cache optimization.
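A minimal read-through sketch of that multi-level layout (local cache -> Redis -> MongoDB); the cache sizes, TTLs, and collection layout are assumptions.

```python
import json

import redis
from cachetools import TTLCache
from pymongo import MongoClient

local_cache = TTLCache(maxsize=100_000, ttl=60)          # in-process cache, seconds
r = redis.Redis(host="localhost", port=6379, db=0)
events = MongoClient("mongodb://localhost:27017")["antispam"]["events"]

def recent_events_for_user(user_id: str) -> list:
    """Read-through lookup: local cache -> Redis -> MongoDB."""
    if user_id in local_cache:
        return local_cache[user_id]
    cached = r.get(f"recent:{user_id}")
    if cached is not None:
        result = json.loads(cached)
    else:
        result = list(events.find({"user_id": user_id}, {"_id": 0}).sort("created", -1).limit(100))
        r.setex(f"recent:{user_id}", 300, json.dumps(result, default=str))
    local_cache[user_id] = result
    return result
```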

Enhanced image recognition ability

As text content detection improved, a lot of spam shifted to images. In "Wukong V3" we strengthened image-related detection: image OCR, advertising image recognition, pornographic image recognition, illegal image recognition, and politically sensitive image recognition. Detecting advertising-image spam had always been a gap for us, and training our own models would have required a lot of manpower, so we used a third-party service to close that gap quickly. Since the integration it has genuinely improved our ability to handle on-site advertising and fraudulent image spam.

Further accumulation of risk data

In the early days, when the system was not yet mature, we spent most of our time firefighting spam problems and rarely accumulated risk data across dimensions. In Wukong V3 we began to accumulate risk data on the content, account, IP, and device dimensions for policy backtracking and model training. We currently have three data sources: policies, third-party interfaces, and manual labeling. Because offline manual labeling was inefficient and extracting the data items was complicated, we built a dedicated labeling back end to improve the labeling efficiency of the operations team and make the labeled data reusable and traceable. Here are some of the more common risk dimensions:

Content dimension: e.g. diversion advertising, brand advertising, content violating laws and regulations

Account dimension: e.g. batch behavior (batch registration, liking, following, etc.), risky accounts (credentials leaked in "social engineering" databases, etc.), junk mobile numbers, risky number ranges

IP dimension: e.g. risky IPs, proxy IPs

Device dimension: e.g. emulators, headless browsers

Enhanced backtracking ability

In Wukong V3 we also strengthened policy backtracking. On the one hand, we built an untrusted-content library so that new content similar to known untrusted content can be caught; the similarity algorithms currently used are cosine similarity and Jaccard similarity. On the other hand, backed by Redis, we support fast backtracking keyed on diversion words, tags, and communities. This makes it easier to gather related behaviors together, so we can break through the time window and sweep up similar spam.
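A minimal sketch of that kind of similarity check against an untrusted library, using Jaccard over character shingles and cosine over shingle-count vectors; the shingle size and threshold are assumptions.

```python
from collections import Counter
from math import sqrt

def shingles(text: str, k: int = 3) -> set:
    """Character k-grams of the text."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(shingles(a)), Counter(shingles(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def is_untrusted(text: str, untrusted_library: list, threshold: float = 0.8) -> bool:
    """Flag new content that is close to any known untrusted content."""
    return any(max(jaccard(text, u), cosine(text, u)) >= threshold for u in untrusted_library)
```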

In addition, our engineering and algorithm teams have made many attempts to introduce algorithm models.

"networking-ZNAP (Zhihu Network Analysis Platform)"

For a long time, we spent most of our effort solving spam problems at the behavior and content level. But looking at it from another angle, although underground-industry groups have a lot of resources, they too must weigh cost against return, so resources inevitably get reused. What is a good way to represent how those resources are used? We thought of graphs, and that became the starting point of our "networking" project. We divided the project into several phases:

Phase one, graph-based analysis: this phase aims to provide a channel for analyzing problems through a network graph, to improve the efficiency of operations and product colleagues, and to support fast community identification (by device, IP, ...), group-behavior identification, and propagation analysis. It also helps us build the habit of mining problems from a graph point of view, so that in daily analysis we accumulate experience and turn it into policies. The graph analysis platform's data is built from users' write behaviors: users, devices, IPs, and objects (questions, answers, ...) are nodes, and concrete behaviors are edges. When a behavior occurs, the user is linked to the device, the IP, and the corresponding object, and the degree of a node represents its number of associations. For graph storage, we evaluated Titan, Neo4j, and TinkerPop at the time and finally chose TinkerPop as the graph framework, with HBase as the underlying storage. TinkerPop is an Apache top-level project, a graph computing framework for both OLTP and OLAP with very strong extensibility: as long as a storage system implements the APIs TinkerPop defines, it can act as a driver and support graph queries, which reduces the cost of extra storage maintenance and migration. TinkerPop currently supports HBase, Neo4j, OrientDB, and so on, and it also supports querying and computation with Spark through GraphComputer. Gremlin is the DSL defined by TinkerPop and can be used to query graph data flexibly.
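As a small illustration of what querying such a graph looks like, here is a hedged gremlinpython sketch that finds other users connected to the same device nodes as a given user; the server URL, vertex label, edge label, and property names are assumptions.

```python
# pip install gremlinpython
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")  # hypothetical Gremlin server
g = traversal().withRemote(conn)

def users_sharing_device(user_id: str) -> list:
    """Users linked to the same device vertices as the given user."""
    return (
        g.V().has("user", "user_id", user_id)
         .out("used_device")        # user -> device edges (hypothetical label)
         .in_("used_device")        # device -> other users
         .values("user_id")
         .dedup()
         .toList()
    )
```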

Phase two, graph-based community discovery: similar users are grouped into circles in the form of communities, which makes it convenient to analyze and apply policies at the circle level. We implemented community discovery with modularity plus Fast Unfolding. Taking device communities as an example, the input of the algorithm is the associations between devices and users, and the output is every device node and user node together with its community id. Modularity is a very common measure of how well a network is partitioned: the larger the modularity, the more edges fall inside communities compared with what would be expected by chance, and the better the partition. Fast Unfolding is an iterative algorithm whose main goal is to raise the efficiency of community division so that the modularity of the partition keeps increasing; each iteration merges the nodes of the same community, so the amount of computation shrinks as the iterations proceed. Iteration stops when the communities become stable or the iteration limit is reached.
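A compact sketch of this phase using the Louvain method (the networkx implementation of Fast Unfolding style modularity optimization); the device-user edge list below is illustrative only.

```python
# pip install networkx
import networkx as nx

# Illustrative device-user associations; the real input comes from the event store.
edges = [
    ("device:aa", "user:1"), ("device:aa", "user:2"), ("device:aa", "user:3"),
    ("device:bb", "user:4"), ("device:bb", "user:5"),
    ("device:cc", "user:5"), ("device:cc", "user:6"),
]

graph = nx.Graph()
graph.add_edges_from(edges)

# Louvain / Fast Unfolding: iteratively merge nodes to increase modularity.
communities = nx.community.louvain_communities(graph, seed=42)

for community_id, members in enumerate(communities):
    for node in sorted(members):
        print(community_id, node)

print("modularity:", nx.community.modularity(graph, communities))
```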

Phase three, community classification on top of the communities: the goal is to effectively tell suspicious communities from non-suspicious ones and help daily analysis and policies hit spam groups more precisely. We use logistic regression for its strong interpretability, trained on a series of community-level and user-level features; the results are used both as auxiliary data for operations and in online policies, with very good results. Since June 2017 we have accumulated roughly 40,000 suspicious communities and 1.7 million normal communities.
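A minimal sketch of the classification step with scikit-learn; the feature names and sample values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [community_size, edges_per_node, share_of_new_accounts, actions_per_user_per_day]
X = np.array([
    [120, 5.2, 0.95, 300.0],   # suspicious-looking community
    [ 15, 1.1, 0.10,   4.0],   # normal-looking community
    [ 80, 4.0, 0.90, 150.0],
    [ 30, 1.3, 0.05,   6.0],
])
y = np.array([1, 0, 1, 0])     # 1 = suspicious, 0 = normal

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Interpretability: each learned weight shows what pushes a community toward "suspicious".
for name, weight in zip(
    ["size", "edges_per_node", "new_account_ratio", "actions_per_user"], clf.coef_[0]
):
    print(f"{name}: {weight:+.3f}")

print(clf.predict_proba([[100, 4.8, 0.85, 200.0]])[0, 1])  # P(suspicious) for a new community
```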

Text similarity clustering

To get quick results, spammers on Zhihu tend to generate large volumes of similar spam content, or to perform specific behaviors in concentrated bursts. Given this volume, similarity, and clustering tendency, we use Spark to cluster text with Jaccard similarity and SimHash; by clustering similar text we can sweep up the batch behavior in one go. For more information, see: Spark's practice in anti-cheating clustering scenarios.
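A compact SimHash sketch (pure Python, no Spark) to show the fingerprinting idea behind that clustering; the whitespace tokenizer and 64-bit hash are illustrative choices.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Weighted bit-voting over token hashes; near-duplicate texts get close fingerprints."""
    weights = [0] * bits
    for token in text.split():                        # a real system would use a proper tokenizer
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

doc_a = "add my wechat for cheap brand bags and watches"
doc_b = "add my wechat for cheap brand bags and shoes"
print(hamming(simhash(doc_a), simhash(doc_b)))        # small distance -> likely the same batch
```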

Unknown word discovery

Brand content also accounts for a large share of the spam on the site. Most of the malicious marketing on the site is done for SEO, exploiting Zhihu's PageRank to raise keyword weight in search engines. The hallmark of this kind of content is that it mentions a large number of keywords (brand-related and category-related words). Because these are niche or new brands, the keywords are usually not in our lexicon; they are what we call unknown words. So we mine unknown words from the left/right information entropy and mutual information of character sequences, and have achieved good results. For implementation details, see our article: anti-cheating new word mining based on left/right information entropy and mutual information.
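A toy sketch of the two signals on a single candidate word; real new-word mining scans all candidate n-grams over a large corpus, and the scoring details here are illustrative.

```python
from collections import Counter
from math import log

def candidate_score(corpus: str, word: str) -> dict:
    """Mutual information inside the candidate plus entropy of its left/right neighbors."""
    total = max(len(corpus), 1)
    word_count = corpus.count(word)

    # Pointwise mutual information between the two halves of the candidate.
    left_half, right_half = word[: len(word) // 2], word[len(word) // 2 :]
    p_word = word_count / total
    p_left = corpus.count(left_half) / total
    p_right = corpus.count(right_half) / total
    pmi = log(p_word / (p_left * p_right)) if p_word and p_left and p_right else float("-inf")

    # Entropy of the characters appearing to the left / right of the candidate.
    def side_entropy(neighbors: Counter) -> float:
        n = sum(neighbors.values())
        return -sum((c / n) * log(c / n) for c in neighbors.values()) if n else 0.0

    lefts, rights = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            lefts[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            rights[corpus[end]] += 1
        start = corpus.find(word, start + 1)

    return {"pmi": pmi, "left_entropy": side_entropy(lefts), "right_entropy": side_entropy(rights)}

# High PMI (the parts stick together) plus high left/right entropy (varied contexts)
# suggests the candidate is a real word missing from the lexicon.
```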

Diversion information recognition

For diversion content on the site, we initially used noise-character normalization plus regular-expression matching plus match backtracking to identify and control abnormal diversion information, with good results. As the crackdown intensified, we found that obfuscated (deformed) diversion information on the site was also getting worse, so we introduced a BiLSTM-CRF model to recognize deformed diversion text. The current recognition precision on questions and answers is 97.1% and 96.3% respectively. For details, see our article: the application of algorithms in the community atmosphere (1): identifying spam diversion information.

General spam content classification

For general spam content governance, although there are always policies online covering it, policies have limited generalization ability and new spam keeps finding ways around them. We therefore tried deep learning to build a general spam text classification model. The model takes word vectors as input, extracts text features with multiple layers of dilated convolution, reweights the convolved representation with attention to obtain higher-level features, and finally outputs the probability that the content is spam. For a batch of spam content we ran into recently, recall reached above 98% with a precision of 95.6%. We have also accumulated some experience and lessons on defining "general spam content", sample selection, and model training; our engineers will cover the details in a follow-up post. Please stay tuned.

That concludes the walkthrough of Wukong's architecture evolution. The current overall architecture consists of the following parts, as shown in the figure below:

Gateway: responsible for intercepting abnormal user states, synchronous business checks, and crawlers.

Business layer: the various business sides that connect to the system.

Data access layer: data arrives in two ways, either over RPC or as Kafka messages, the latter decoupling the business from the anti-cheating system.

Policy decision layer: divided into synchronous pre-event decisions and asynchronous post-event decisions, plus the policy management service and a series of risk analysis and operations tools. Depending on how suspicious a decision result is, it is either submitted for review or handled at different levels of severity; behavior confirmed as spam goes into the risk database and is fed back for policies to reuse.

Data storage layer: includes the basic event store, the risk database, offline data landed on HDFS, and so on. This data serves not only the anti-cheating system itself but is also made available for model training and online business use elsewhere.

Data computing layer: includes offline machine learning models that compute model results on a daily schedule and persist the output.

Data service layer: because anti-cheating relies not only on its own internal data but also on data pulled from the business, this layer covers interaction with business data, environment data, and model/algorithm services.

After three years of the team's efforts, Wukong has built a closed loop of model/policy identification -> decision -> handling -> evaluation -> improvement. Wukong will face greater challenges in the future; our goal is not only to deal with spam but to protect the user experience. There is no end to Wukong's upgrades. We will keep publishing about Zhihu anti-cheating in the "Zhihu Product" and "Hacker's Log" columns; you are welcome to follow us.

Authors: Zhou Ote, Chen Lei, Zhang Chunrong, Zhai Feng. Original link: https://zhuanlan.zhihu.com/p/39482667 (reprinted from Zhou Ote's Zhihu account).
