Real-time Risk Control Solution Based on Flink and a Rule Engine

2025-04-16 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Summary page of cases and solutions: Aliyun Real-time Computing product case and solution summary.

For an Internet product, typical risk control scenarios include registration, login, transaction, and promotional-activity risk control. Risk control works best when it prevents loss before it happens, so of the three implementation approaches, pre-event warning and in-event control are the most effective. This requires the risk control system to work in real time. This article introduces a real-time risk control solution.

1. Overall architecture

Risk control is a product of the business scenario. The risk control system directly serves the business system and is also connected to the penalty system and the analysis system. The relationships and roles of these systems are as follows:

The business system, usually an app plus its backend or a website, carries the Internet business; risk is triggered from the business system.
The risk control system supports the business system: based on the data or event-tracking information sent by the business system, it judges whether the current user or event is risky.
The penalty system is called by the business system according to the risk control system's result, and controls or punishes risky users or events, for example by adding a CAPTCHA, restricting login, or blocking orders.
The analysis system supports the risk control system by measuring its performance from data. For example, a sudden drop in a strategy's interception rate may mean the strategy has stopped working, and promotional goods suddenly selling out much faster may point to a problem with the overall promotion strategy. It should also help operators and analysts discover new strategies.

The risk control system and the analysis system are the focus of this article. To make the discussion concrete, we assume the following business scenario:

Business: e-commerce.
Risk control scope:
registration: fake registration;
login: login with a stolen account;
transaction: theft of customer balance;
activities: preferential (promotional) activities.

Risk control implementation approach: in-event risk control, whose goal is to intercept abnormal events.

2. Risk control system

Risk control has two technical routes: rules and models. Rules are simple, intuitive, highly interpretable, and flexible, so they have long been a staple of risk control systems. Their weakness is that they are easy to defeat: once the underground industry guesses a rule, it stops working. Real systems therefore usually combine rules with model-based risk control to increase robustness. For reasons of space, this article focuses only on a rule-based architecture; if model-based risk control is needed, the architecture supports it as well.

A rule is a conditional judgment about something. We assume a few rules for registration, login, transaction, and activities, for example:

the user name does not match the name on the ID card;
an IP has registered more than 10 accounts in the last hour;
an account has logged in more than 5 times in the last 3 minutes;
an account group has bought more than 100 discounted items in the last hour;
an account has claimed more than 3 coupons in the last 3 minutes.

Rules can be combined into rule groups; for simplicity, we discuss only individual rules here. A rule consists of three parts:

facts: the subject and attribute to be judged, such as the account and its login count, or the IP and its registration count in the rules above;
condition: the judgment logic, such as an attribute of a fact being greater than some metric;
metric threshold: the basis for the judgment, such as the critical value for the login count or for the number of registered accounts.
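The three-part structure of a rule (fact, condition, threshold) can be sketched as a small data structure. This is only an illustration in plain Python, not the article's implementation: the `Rule` class, its field names, and the `CONDITIONS` table are all hypothetical.

```python
import operator
from dataclasses import dataclass

# Comparison operators a rule condition may use (an assumed, minimal set).
CONDITIONS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "==": operator.eq}


@dataclass(frozen=True)
class Rule:
    name: str         # human-readable rule name
    subject: str      # fact subject, e.g. "ip" or "account"
    attribute: str    # fact attribute, e.g. "registrations_last_hour"
    condition: str    # key into CONDITIONS
    threshold: float  # metric threshold the attribute is compared against

    def is_hit(self, fact_value: float) -> bool:
        """Return True when the fact value satisfies the condition, i.e. the rule fires."""
        return CONDITIONS[self.condition](fact_value, self.threshold)


# Two of the example rules from the text, expressed as data.
ip_registration_rule = Rule("too_many_registrations", "ip", "registrations_last_hour", ">", 10)
login_rule = Rule("too_many_logins", "account", "logins_last_3_minutes", ">", 5)

print(ip_registration_rule.is_hit(15))  # True: 15 registrations exceeds the threshold of 10
print(login_rule.is_hit(5))             # False: 5 logins does not exceed 5
```

Keeping the threshold as data rather than code is what lets operators adjust it without redeploying the judgment logic.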
Rules can be written by operations experts based on experience or mined by data analysts from historical data. Because rules can be guessed and defeated in the ongoing contest with the underground industry, they all need to be adjusted dynamically.

Based on the above discussion, we design the risk control system as follows. The system has three data flows:

the real-time risk control data flow, marked by red lines: synchronous calls that form the core path of risk control;
the quasi-real-time metric data flow, marked by blue lines: asynchronous writes that prepare metric data for real-time risk control;
the quasi-real-time / offline analysis data flow, marked by green lines: asynchronous writes that provide data for analyzing the performance of the risk control system.

This section introduces the first two; the analysis system is introduced in the next section.

2.1 Real-time risk control

Real-time risk control is the core of the whole system. The business system calls it synchronously to obtain the corresponding risk control judgment. As mentioned earlier, rules are usually written by people and need to be adjusted dynamically, so we separate the rule judgment part from the rule management part. The rule management backend serves operations staff, who use it for:

scene management: deciding whether risk control is enabled for a scene, for example an activity scene that can be switched off after the event ends;
blacklists and whitelists: lists found manually or by programs, used for direct filtering;
rule management: adding, deleting, or modifying rules, for example adding an IP address check for login or a frequency check for placing orders;
threshold management: managing metric thresholds. For example, if the rule is that an IP may register no more than 10 accounts in the last hour, then both 1 and 10 are thresholds.
With the management backend covered, the logic of the rule judgment part is straightforward. It consists of pre-filtering, fact data preparation, and rule judgment.

2.1.1 Pre-filtering

After a specific event is triggered (such as registration, login, placing an order, or joining an activity), the business system calls the risk control system synchronously and passes the relevant context, such as the IP address and the event identifier. The rule judgment part decides, according to the management backend's configuration, whether the event should be judged at all. If so, it filters the event against the blacklist and whitelist and, if the event passes, proceeds to the next step. This part of the logic is very simple.

2.1.2 Real-time data preparation

Before judging, the system must prepare some fact data. In the registration scenario, if the rule is that a single IP may register no more than 10 accounts in the last hour, the system looks up in redis/hbase, by IP address, the number of accounts that IP has registered in the last hour, for example 15. In the login scenario, if the rule is that a single account may log in no more than 5 times in the last 3 minutes, the system looks up in redis/hbase the number of logins of that account in the last 3 minutes. How this data is produced in redis/hbase is described in Section 2.2 on the quasi-real-time data flow.

2.1.3 Rule judgment

After obtaining the fact data, the system judges it against the rules and thresholds and returns the result, which completes the process.
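The three steps of the judgment path (pre-filtering, fact preparation, rule judgment) can be sketched as one function. This is a minimal sketch under stated assumptions: `RiskConfig`, `judge`, the key format, and the order "whitelist before blacklist" are hypothetical, and a plain dict stands in for the Redis/HBase lookup.

```python
from dataclasses import dataclass, field


@dataclass
class RiskConfig:
    """Settings an operator would maintain in the management backend (hypothetical shape)."""
    enabled_scenes: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)
    # scene -> list of (fact_key_template, threshold); a rule fires when value > threshold
    rules: dict = field(default_factory=dict)


def judge(event: dict, config: RiskConfig, fact_store: dict) -> str:
    """Return "pass" or "reject" for one business event."""
    scene, subject = event["scene"], event["subject"]
    # Pre-filtering: scene switch first, then whitelist and blacklist.
    if scene not in config.enabled_scenes:
        return "pass"
    if subject in config.whitelist:
        return "pass"
    if subject in config.blacklist:
        return "reject"
    for key_template, threshold in config.rules.get(scene, []):
        # Fact preparation: look up the precomputed metric (dict stands in for Redis/HBase).
        value = fact_store.get(key_template.format(subject=subject), 0)
        # Rule judgment: compare the fact with the configured threshold.
        if value > threshold:
            return "reject"
    return "pass"


config = RiskConfig(
    enabled_scenes={"register"},
    blacklist={"6.6.6.6"},
    rules={"register": [("register_count_1h:{subject}", 10)]},
)
facts = {"register_count_1h:1.2.3.4": 15, "register_count_1h:5.6.7.8": 2}

print(judge({"scene": "register", "subject": "1.2.3.4"}, config, facts))  # reject
print(judge({"scene": "register", "subject": "5.6.7.8"}, config, facts))  # pass
print(judge({"scene": "login", "subject": "1.2.3.4"}, config, facts))     # pass (scene disabled)
```

Because the function only reads configuration and precomputed facts, it stays stateless, which matches the separation of judgment from data preparation described in the text.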
The logic of the whole process is clear, and this is where what we usually call the rule engine comes in. Generally there are two ways to implement it:

use a mature rule engine such as Drools. Drools integrates well with the Java environment and supports many features, but it is more cumbersome to use and has a higher barrier to entry; see article [1];
build it yourself on a dynamic language such as Groovy, which is not covered further here; see article [2].

Both approaches support dynamic rule updates.

2.2 Quasi-real-time data flow

This part is background logic that serves the risk control system by preparing fact data. Data preparation is separated from rule judgment for the sake of system performance and scalability. As mentioned earlier, rule judgment needs metrics about facts, such as the number of logins in the last hour or the number of accounts registered in the last hour. These metrics usually cover a time window and are some kind of state or aggregate. They are hard to compute from raw data during real-time risk control, because the risk control rule engine is usually stateless and does not remember earlier results. The raw data is also very large, since all raw user activity data would have to be transferred for the calculation. This part is therefore usually handled by a big data stream-processing system. Here we choose Flink, which we regard as the clear leader in stream computing today and which handles this work well in both performance and functionality.

The data flow is simple:

the business system sends event-tracking data to Kafka;
Flink subscribes to Kafka and performs atomic-granularity aggregation.

Note: the reason Flink performs only atomic-granularity aggregation has to do with the way rules change dynamically.
For example, in the registration scenario, operators may, depending on observed results, judge the number of accounts an IP has registered in the last 1 hour, the last 3 hours, or the last 5 hours. In other words, the N in "the last N hours" is adjusted dynamically. Flink should therefore compute only the one-hour account counts; during rule judgment, the system reads the counts for the last 3 or 5 hours as the rule requires, aggregates them, and makes the judgment. The reason is that a Flink job runs continuously once it is submitted. If changing the logic means stopping the job, modifying the code, and restarting it, that is quite troublesome, and because Flink keeps intermediate state, a restart also raises the question of whether that state can be reused. If the N-hour aggregation were done directly in Flink, every change of N would require repeating these steps and sometimes backfilling data, which is very tedious.
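The split between atomic aggregation and query-time aggregation can be illustrated without Flink. The sketch below is plain Python, not a Flink job: `HourlyCounters` is a hypothetical in-memory stand-in for the Redis/HBase table, `record` plays the role of the streaming job writing one-hour buckets, and `count_last_n_hours` plays the role of the rule judgment summing them for whatever N the rule currently uses.

```python
from collections import defaultdict

HOUR = 3600  # seconds in one atomic bucket


class HourlyCounters:
    """In-memory stand-in for the table the stream job writes to (assumed layout)."""

    def __init__(self):
        self._buckets = defaultdict(int)  # (key, hour_index) -> count

    def record(self, key: str, event_ts: int) -> None:
        """Stream-job side: add one event to its one-hour bucket."""
        self._buckets[(key, event_ts // HOUR)] += 1

    def count_last_n_hours(self, key: str, now_ts: int, n: int) -> int:
        """Judgment side: sum the n most recent one-hour buckets, including the current one."""
        current = now_ts // HOUR
        return sum(self._buckets.get((key, h), 0) for h in range(current - n + 1, current + 1))


counters = HourlyCounters()
now = 10 * HOUR + 120  # falls in hour bucket 10
for ts in (now, now - 60, now - HOUR, now - 2 * HOUR, now - 4 * HOUR):
    counters.record("ip:1.2.3.4", ts)

print(counters.count_last_n_hours("ip:1.2.3.4", now, 1))  # 2
print(counters.count_last_n_hours("ip:1.2.3.4", now, 3))  # 4
print(counters.count_last_n_hours("ip:1.2.3.4", now, 5))  # 5
```

Changing N from 1 to 3 or 5 only changes the argument at query time; the writer side is untouched. Note that buckets here are aligned to hour boundaries, so the result approximates a true sliding window rather than reproducing it exactly.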

Flink writes the aggregated metric results to Redis or HBase, where the real-time risk control system queries them. Either store works; choose according to the scenario. By separating data computation from rule judgment and handing the computation to Flink, the risk control system can cope with a very large number of users.

3. Analysis system

What we have described so far is a complete risk control system from a static point of view, but it is incomplete from a dynamic one. What is missing is not functionality but the ability to evolve. Viewed dynamically, a risk control system needs at least two more parts: one to measure the overall effect of the system, and one to provide the basis for upgrading its rules and logic.

To measure the overall effect, we need to judge:

whether a rule has become invalid, for example when its interception rate drops suddenly;
whether a rule is redundant, for example when it has never intercepted any event;
whether a rule has loopholes, for example when a promotion is held or coupons are issued and the benefits are all claimed, but the expected results are not achieved.

To provide a basis for upgrading rules and logic, we need to:

find global patterns. For example, someone's spending on electronics suddenly increases 100-fold, which looks suspicious in isolation; but if many people show the same pattern, it may simply turn out that Apple has released a new product;
identify combinations of behavior, where each behavior is normal on its own but the combination is abnormal. For example, buying a kitchen knife, buying a ticket, buying rope, and refueling at a gas station are each normal, but doing all of them within a short time is not;
identify groups, for example by finding a group through graph analysis and tagging all of its accounts, to catch the case where every account behaves normally but the behavior is concentrated within one group.
This is the role of the analysis system. Some of its work is deterministic and some is exploratory. To do this work, the system needs as much data as possible, such as:

business system data and business event-tracking data, which record detailed user, transaction, or activity data;
risk control interception data and event-tracking data from the risk control system. For example, when a user in a certain state with certain characteristics is intercepted by a certain rule, the interception itself is an event record.

This is a typical big data analysis scenario, and the architecture is flexible. I offer only one suggested approach.

Relatively speaking, this system is the most open: besides fixed metric analysis, it can also use machine learning and data analysis techniques to discover new rules or patterns.
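The fixed-metric side of the analysis system, checking whether a rule has become invalid or redundant, can be sketched as follows. This is an illustrative sketch only: the function names, the daily `(intercepted, evaluated)` input shape, and the `drop_ratio` cutoff of 0.5 are all assumptions, not something the article specifies.

```python
def interception_rate(intercepted: int, evaluated: int) -> float:
    """Share of evaluated events that a rule intercepted."""
    return intercepted / evaluated if evaluated else 0.0


def review_rule(history: list, today: tuple, drop_ratio: float = 0.5) -> str:
    """Classify a rule from daily (intercepted, evaluated) counts.

    history holds past days' counts and today the latest day's counts.
    drop_ratio is an assumed cutoff: a rate today below drop_ratio * baseline
    is reported as a sudden drop.
    """
    past_intercepted = sum(i for i, _ in history)
    past_evaluated = sum(e for _, e in history)
    if past_intercepted + today[0] == 0:
        return "redundant"  # the rule has never intercepted anything
    baseline = interception_rate(past_intercepted, past_evaluated)
    if baseline > 0 and interception_rate(*today) < drop_ratio * baseline:
        return "possibly invalid"  # interception rate dropped suddenly
    return "ok"


print(review_rule([(50, 1000), (48, 1000)], (5, 1000)))   # possibly invalid
print(review_rule([(0, 1000), (0, 1000)], (0, 1000)))     # redundant
print(review_rule([(50, 1000), (48, 1000)], (47, 1000)))  # ok
```

A flagged rule is only a prompt for an operator or analyst to investigate; a drop in interception rate can also mean that attackers have simply gone away.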

This article is original content of the Yunqi community and may not be reproduced without permission.
