
A Big Data Anti-Fraud Technical Framework


More than a year ago, a friend asked me to explain how our big data anti-fraud architecture was implemented, what pitfalls we hit along the way, and how we optimized real-time anti-fraud from a 30-minute delay down to 1 second. At the time I felt it was not quite appropriate to share, and also that the scenario was fairly narrow, so there seemed to be no need.

A lot of time has passed since then. Recently I saw some discussions in my circle and realized that this architecture is still not outdated and still has real reference value, so today I want to talk about how to build a big data anti-fraud system. The material comes mainly from my own work, plus exchanges with many experienced people in the industry; it is a modest but proven practice.

When I built this architecture, the domain was big data anti-fraud in the credit industry. Later I also looked at e-commerce and financial big data architectures and found that everyone uses roughly the same playbook, with different details at each step.

A senior colleague once said: use diagrams where you can and type as little as possible. So I will keep the words short and the pictures plentiful. Big data really comes down to a few steps: data source development, data extraction, data storage, data cleaning and processing, and data application. Let me walk through them one by one.

Data Sources

The data source is a very important point. After all, if the source data is garbage, the final output will without doubt be garbage too. So when selecting and integrating data sources, pay attention to whether the data produced by the organization is of reasonably high quality.

For example, the credit data of the People's Bank of China is extremely high-quality data, mainly covering credit cards, bank transaction flows, defaulters ("laolai"), breach-of-trust records, enforcement records, and so on. All of it is core data, and any single item can be a signal of bad debt. There is also paid, restricted data from various administrative agencies.

Other examples include telecom operator communication data, behavioral data from large e-commerce platforms, insurance data of all kinds, and loan records shared among institutions. These sources are all highly valuable and form the core data of anti-fraud today.

Of course, there is also a cruder but efficient approach: buy external blacklist data directly. That makes anti-fraud easier because a blacklist hit can be rejected outright, saving the manpower and resources otherwise spent on further checks.

Data Extraction

Once you have high-quality data sources, the next problem is how to extract the data. Organizations provide data in all kinds of formats: JSON and XML over HTTP interfaces, ETL from other internal data sources, Excel files from periodic manual reporting, and Sqoop + Oozie for database imports. For these extraction channels you only need to guarantee the stability of the channel and the idempotency of the data service; there is nothing special about it.
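
As a hedged sketch of what an idempotent extraction job can look like, here is a small Spark (Scala) job that ingests JSON files dropped by an upstream partner. The paths, the `order_id` business key, and the directory layout are assumptions for illustration, not details from the original article.

```scala
// Minimal idempotent ingest sketch: re-running the same day's job yields the same
// result, because duplicates are dropped by business key and the target directory
// is overwritten rather than appended to.
import org.apache.spark.sql.SparkSession

object IngestPartnerData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ingest-partner-json")
      .getOrCreate()

    // Read the raw JSON files the upstream partner dropped for this date.
    val raw = spark.read.json("hdfs:///landing/partner_a/dt=2019-06-03/")

    raw.dropDuplicates("order_id")                 // idempotency by business key
      .write
      .mode("overwrite")                           // idempotency by partition overwrite
      .parquet("hdfs:///dw/ods/partner_a_orders/dt=2019-06-03/")

    spark.stop()
  }
}
```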

Data Storage

Data storage essentially means building a data warehouse and a real-time store. The data warehouse holds the raw data from the major data sources, while the real-time store serves the core operations of the business systems. Data warehouse volumes are generally measured in terabytes; the real-time store is measured in megabytes and gigabytes.
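
A minimal sketch of the two layers, assuming Spark with Hive support and the elasticsearch-spark ("es-hadoop") connector on the classpath; all table, index, and column names are illustrative assumptions.

```scala
// Warehouse layer: keep the full raw history; real-time store: only the hot slice
// the online anti-fraud service actually queries.
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql._   // adds saveToEs on DataFrame (es-hadoop)

object StorageLayers {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("storage-layers")
      .config("es.nodes", "es:9200")
      .enableHiveSupport()
      .getOrCreate()

    // Data warehouse (terabytes): all raw history, partitioned by date.
    val raw = spark.read.parquet("hdfs:///dw/ods/partner_a_orders/")
    raw.write.mode("overwrite").partitionBy("dt").saveAsTable("dw.partner_a_orders")

    // Real-time store (megabytes to gigabytes): recent hot data pushed to ElasticSearch.
    val hot = spark.table("dw.partner_a_orders").where("dt >= '2019-05-01'")
    hot.saveToEs("antifraud_orders")

    spark.stop()
  }
}
```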

Offline Computing & Real-Time Computing

With the data in place, computing becomes the core of this architecture. Broadly, it splits into offline computing and real-time computing.

Offline computing does two main things: Hive/Spark data integration and cleansing, and offline data modeling. Hive data integration cleans and filters the data in each database and writes it into the standard tables we define, which are then handed to downstream computations. For very complex cleansing we write Spark programs, since some operations simply cannot be expressed in standard Hive SQL. Offline data modeling builds models on this batch data for later use in real-time computation and in applications. Master two basic algorithms and the models are stable enough: LogisticRegression and Decision Tree. Don't ask me why.
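
For the offline modeling step, here is a minimal sketch with Spark MLlib in Scala. The Hive table `dw.loan_features`, its feature columns, and the model output path are illustrative assumptions, not details from the original architecture.

```scala
// Train a LogisticRegression model on a cleansed feature table and persist the
// fitted pipeline so the real-time layer can load and score with it later.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object OfflineModeling {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("anti-fraud-offline-modeling")
      .enableHiveSupport()
      .getOrCreate()

    // The standardized table produced by the Hive/Spark integration step.
    val data = spark.table("dw.loan_features")

    // Assemble raw feature columns into the single vector column MLlib expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("overdue_cnt", "query_cnt_90d", "debt_ratio"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(100)

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)

    model.write.overwrite().save("hdfs:///models/anti_fraud_lr")
    spark.stop()
  }
}
```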

What does real-time computing do? We use Spark Streaming and Flink for real-time stream computing, mainly for statistics and for joining multiple data streams. What do we want to achieve here? A service that is near-real-time. What is near-real-time? A certain delay is allowed, as long as it stays within an acceptable range. For our initial version, the acceptable delay was 30 minutes.
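
As one hedged illustration of the near-real-time statistics, here is a sketch using Spark Structured Streaming (one of the two engines mentioned above). The Kafka topic, field names, and window sizes are assumptions for the example only.

```scala
// Count loan applications per user in sliding windows; a burst of applications from
// one user within a short window is a common fraud signal.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RealtimeStats {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("anti-fraud-realtime").getOrCreate()
    import spark.implicits._

    val schema = new StructType()
      .add("user_id", StringType)
      .add("event_time", TimestampType)

    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "apply_events")
      .load()
      .select(from_json($"value".cast("string"), schema).as("e"))
      .select("e.*")

    val counts = events
      .withWatermark("event_time", "30 minutes")
      .groupBy(window($"event_time", "10 minutes", "1 minute"), $"user_id")
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")   // in practice this would feed the real-time store
      .start()
      .awaitTermination()
  }
}
```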

What pitfalls did we hit?

At the beginning we hoped to achieve real-time computing with micro-batch stream processing. In practice, near-real-time and real-time are very different things: a serving system usually cannot tolerate minute-level delays, yet Spark GraphX inevitably has minute-level delays, which makes it suitable only for offline computing.

Hive + Oozie offline batch processing is a powerful weapon. Many people think Hive data cleaning is just a few lines of SQL, but behind hundreds or even thousands of tables sit complex cleansing rules, task dependencies, task reruns, data quality checks, and data lineage maintenance. Trust me, without care and tooling, these will break you.

An ElasticSearch cluster spread across multiple machines delivers higher load throughput than a single high-performance machine; after all, a single network card only goes so far.

After stepping in many pits and spending a lot of time, we finally decided to move all real-time operations onto an ElasticSearch + Neo4j architecture, because we need not only real-time full-text, full-field generation of social relationships, but also real-time, multi-dimensional, multi-level relationship search and anti-fraud analysis. These relationship networks can reach millions of nodes, so following the six-degrees-of-separation idea we decided not to go too deep and only extract three levels of social relationships. Settling on this architecture was central to our response time and ultimately to the availability of the service.
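
To make the "three levels of social relationship" concrete, here is a sketch of a bounded-depth lookup on Neo4j using the official Java driver from Scala. The `Person` label, `KNOWS` relationship type, and connection details are assumptions for illustration only.

```scala
// Expand at most 3 hops from a target person and collect the related individuals.
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}
import scala.jdk.CollectionConverters._

object RelationLookup {
  def main(args: Array[String]): Unit = {
    val driver = GraphDatabase.driver("bolt://neo4j:7687", AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      val result = session.run(
        """MATCH (p:Person {id_no: $idNo})-[:KNOWS*1..3]-(other:Person)
          |RETURN DISTINCT other.id_no AS idNo, other.name AS name""".stripMargin,
        Values.parameters("idNo", "id-123"))
      result.asScala.foreach { record =>
        println(s"${record.get("idNo").asString()} ${record.get("name").asString()}")
      }
    } finally {
      session.close()
      driver.close()
    }
  }
}
```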

The data produced in many of these places is just one detail along the decision chain, so we also need a rules engine such as Drools to help make the final decision.
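
Below is a minimal sketch of wiring Drools into the decision step from Scala. The fact class, the session name, and the rule behavior described in the comment are assumptions for illustration, not the article's actual rules.

```scala
// Insert a fact into a Drools session and let the rules set the final decision.
import org.kie.api.KieServices

// A simple fact the rules can read and update.
case class RiskResult(var userId: String, var hitBlacklist: Boolean, var decision: String)

object DecisionEngine {
  def main(args: Array[String]): Unit = {
    // Loads .drl rules packaged on the classpath via kmodule.xml.
    val kieServices = KieServices.Factory.get()
    val kieContainer = kieServices.getKieClasspathContainer
    val kieSession = kieContainer.newKieSession("antiFraudSession")

    val result = RiskResult("u_001", hitBlacklist = true, decision = "UNKNOWN")
    kieSession.insert(result)
    kieSession.fireAllRules()   // e.g. a rule: if hitBlacklist then decision = "REJECT"
    kieSession.dispose()

    println(s"decision for ${result.userId}: ${result.decision}")
  }
}
```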

Business Applications

How should the final business system be used, and how should it serve external callers? This is also a core problem, because this layer must be extremely stable and extremely efficient; it generally cannot tolerate high latency and must handle very high concurrency. That means we must first improve computational efficiency, and second build strong safeguards into the system architecture.

On computational efficiency, the main technique is to make sure that what systems exchange with each other is aggregated, processed, computed results rather than raw data. Network transfer is expensive when the target data volume is large: if a request needs hundreds of thousands of records, pulling them all back and recomputing locally would be foolish. Why not expose the computation as a data service inside the system that holds the data?
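
As one illustration of pushing computation to the system that holds the data, here is a sketch that asks ElasticSearch for an aggregation instead of pulling raw documents back. The index and field names are assumptions, and the high-level REST client is just one possible client.

```scala
// Ask ES to group applications by device and return only the buckets, not the hits.
import org.apache.http.HttpHost
import org.elasticsearch.action.search.SearchRequest
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.search.aggregations.AggregationBuilders
import org.elasticsearch.search.aggregations.bucket.terms.Terms
import org.elasticsearch.search.builder.SearchSourceBuilder
import scala.jdk.CollectionConverters._

object AggregateAtSource {
  def main(args: Array[String]): Unit = {
    val client = new RestHighLevelClient(RestClient.builder(new HttpHost("es", 9200, "http")))
    try {
      // size(0): we only want the aggregation, not hundreds of thousands of raw hits.
      val source = new SearchSourceBuilder()
        .size(0)
        .aggregation(AggregationBuilders.terms("by_device").field("device_id").size(20))
      val response = client.search(new SearchRequest("apply_events").source(source), RequestOptions.DEFAULT)

      val byDevice: Terms = response.getAggregations.get("by_device")
      byDevice.getBuckets.asScala.foreach { b =>
        println(s"${b.getKeyAsString}: ${b.getDocCount} applications")
      }
    } finally client.close()
  }
}
```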

Architectural safeguards are mostly infrastructure matters: dynamic load balancing, one master with multiple replicas, disaster recovery across data centers in different regions, network-partition drills, contingency plans for upstream service failures, and so on.

Modeling Social Networks

I introduced various community-detection algorithms a long time ago, so I won't repeat them here; if you are interested, look them up on your own.

Here's a look at how a standard knowledge graph is built.

1. Subject identification

2. Relationship establishment

3. Logical reasoning

4. Graph retrieval

Subject identification, from a graph perspective, means identifying the vertices: each has its own attributes and represents an individual in the network.

Relationship establishment derives relations from other data, and relations can also be produced by the logical reasoning of step 3. From the graph's perspective, this means identifying each edge; every edge has a start vertex, an end vertex, and its own attributes, and represents an association between individuals in the network.

Logical reasoning is a very important part. For example, Yao Ming's wife's mother is Yao Ming's mother-in-law; this kind of prior-knowledge inference, carried out over the graph, helps us solve many practical problems.
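
As a small illustration of this kind of inference over the graph, here is a sketch that materializes an inferred edge in Neo4j with a single Cypher statement run from Scala; the labels and relationship types are assumptions for the example only.

```scala
// If p has a WIFE who has a MOTHER, materialize an inferred MOTHER_IN_LAW edge.
import org.neo4j.driver.{AuthTokens, GraphDatabase}

object InferenceRules {
  def main(args: Array[String]): Unit = {
    val driver = GraphDatabase.driver("bolt://neo4j:7687", AuthTokens.basic("neo4j", "password"))
    val session = driver.session()
    try {
      session.run(
        """MATCH (p:Person)-[:WIFE]->(w:Person)-[:MOTHER]->(m:Person)
          |MERGE (p)-[:MOTHER_IN_LAW]->(m)""".stripMargin)
    } finally {
      session.close()
      driver.close()
    }
  }
}
```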

Graph retrieval is where we put the graph to work. We have four kinds of search: subject-attribute search, relationship-attribute search, breadth-first search, and depth-first search. Our general strategy is to first locate one vertex, such as the target person, and then expand outward until we have found all the individuals matching the criteria.

What pitfall did we hit here, and what optimization did we do? At first we pulled the entire search result back to the application side for computation. The graph-search results were always huge, since we were retrieving relationships across many dimensions, so we kept getting stuck on network transfer. After some exploration and consultation, we confirmed that Neo4j not only provides a query service but also supports plug-in development for customized social-network analysis. We deployed our anti-fraud analysis onto the Neo4j server as a plug-in, which removed a lot of network overhead and secured second-level response times for our service.

Complete Architecture Diagram

From data sources through acquisition, storage, processing, and application, everything falls into place step by step. I hope this helps; if anything is still unclear, read this article again from the bottom up.
