It is estimated that by 2025 the world will generate 180 ZB of data. This massive volume of data is a core production factor for enterprises undergoing digital transformation, yet less than 10% of it is effectively stored, used, and analyzed. Finding valuable information in ZB-scale data, analyzing it, and feeding the results back into business development is therefore the key challenge. On November 30, the UCan big data technology salon (Beijing session) invited five senior big data experts to share their explorations and hands-on practice.
The Normalization of Big Data Business and the Evolution of Its Architecture
Developers often struggle to choose a big data framework when solving a real business problem. For example, if a billion records need to be aggregated, should the data go on HBase + Phoenix, Kudu + Impala, or Spark? Which option delivers high performance while keeping development and operating costs down? Liu Jingze, a big data engineer at UCloud, shared his thoughts.
To analyze data and make decisions on it, you first need data sources, then storage for the collected data, then summarization, aggregation, and computation, and finally feedback to the data application layer. Hundreds of big data frameworks are on the market today, falling mainly into four layers: data acquisition, data storage, data computing, and data application. A complete big data stack also includes task scheduling, cluster monitoring, permission management, and metadata management.
Faced with such a large and complex set of technology stacks, you have great freedom of choice, provided that strongly coupled frameworks are not split apart. Liu Jingze offered a general-purpose architecture, shown in the figure below:
The OLTP SDK on the left of the figure represents the backend interfaces, which can call many big data services. Data collected from these interfaces or from Flume is sent directly to Kafka, then to Elasticsearch, where it is modeled. This pipeline amounts to using only the ELK stack; simple as it is, it is still a big data architecture. Companies with large data volumes and a wide range of businesses often need to keep the raw data as a cold backup, so HDFS can serve as a cold backup cluster, and HDFS + Hive is a very common cold backup solution.
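To make the Kafka leg of this ingestion path concrete, here is a minimal sketch that consumes the topic with Spark Structured Streaming. The broker address, topic name, and console sink are assumptions for illustration only; the talk does not prescribe a particular consumer, and a real ELK pipeline would write to Elasticsearch (for example via Logstash or the elasticsearch-hadoop connector) rather than the console.

import org.apache.spark.sql.SparkSession

object KafkaIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-ingest-sketch")
      .getOrCreate()

    // Read raw events from a Kafka topic; broker and topic are placeholders.
    // Requires the spark-sql-kafka connector on the classpath.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka1:9092")
      .option("subscribe", "app_events")
      .load()
      .selectExpr("CAST(value AS STRING) AS json")

    // A real deployment would sink this stream into Elasticsearch;
    // printing to the console keeps the sketch self-contained.
    val query = raw.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination()
  }
}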
Once the business grows large enough, aggregation is needed. If data pulled from a single framework is incomplete, several frameworks may have to be queried at once and then joined, which is very inefficient. The wide table idea solves this: first store the business data in MySQL or HBase; then, with Spark or Flink, fetch the required dimension data from MySQL or HBase via asynchronous I/O and join it, writing the joined result back to HBase. At that layer every data dimension is complete, so an important indicator analysis only needs to read from HBase. For indicators that are less critical to the business, the requirement can be served directly through Phoenix on HBase, Impala, or Trafodion to produce the desired result.
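As an illustration of the wide table idea, the following sketch joins a fact table with dimension attributes pulled from MySQL over JDBC and writes the joined result out once, so downstream queries hit a single wide table. Table names, paths, and connection details are hypothetical, and Parquet stands in for the HBase sink described in the talk.

import org.apache.spark.sql.SparkSession

object WideTableJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("wide-table-sketch").getOrCreate()

    // Business/fact data previously landed in the warehouse; the path is a placeholder.
    val orders = spark.read.parquet("/warehouse/ods/orders")

    // Dimension data pulled from MySQL over JDBC; connection details are placeholders
    // and the MySQL JDBC driver must be on the classpath.
    val users = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/crm")
      .option("dbtable", "users")
      .option("user", "reader")
      .option("password", "***")
      .load()
      .select("user_id", "age", "gender", "province")

    // Join once up front so that later indicator analysis reads one complete wide table
    // instead of joining several systems at query time.
    val wide = orders.join(users, Seq("user_id"), "left")

    // The talk stores this layer in HBase; Parquet keeps the sketch free of extra connectors.
    wide.write.mode("overwrite").parquet("/warehouse/dwd/orders_wide")
  }
}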
If the load is still too heavy, the data in the HBase detail layer can be fed into the stream computing frameworks Spark and Flink for pre-aggregation and then connected to the OLTP system to serve the backend.
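A rough sketch of that pre-aggregation step, assuming the detail layer can be read as a DataFrame: a heavy metric is rolled up per day and per province so that the OLTP-facing service only reads small, ready-made result rows. The talk does this continuously with Spark or Flink streaming; a batch version keeps the example short, and all paths and column names are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object PreAggregationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pre-agg-sketch").getOrCreate()

    // Detail-layer records; in the architecture above these come from HBase,
    // here a Parquet path stands in for that source.
    val detail = spark.read.parquet("/warehouse/dwd/orders_wide")

    // Pre-aggregate daily order count and amount per province so the
    // backend service never touches the raw detail data.
    val daily = detail
      .groupBy(to_date(col("order_time")).as("dt"), col("province"))
      .agg(count("*").as("order_cnt"), sum("amount").as("order_amt"))

    daily.write.mode("overwrite").parquet("/warehouse/ads/province_daily")
  }
}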
Clearly, there is no single standard for choosing a big data stack; different business scenarios call for different approaches. As Liu Jingze put it: "In many scenarios we have to face the frameworks squarely, work out where their real degrees of freedom lie, and not be constrained by them."
Storage-Compute Separation and Data Abstraction in Practice
In the early days of big data, many companies' clusters were huge arrays of servers in which compute power and storage capacity sat together in one data center. Networks were slow at the time, so moving data during task processing was expensive and local disks were faster than the network; the prevailing idea was therefore to bring computation to the data, reducing data migration and improving efficiency. The most typical representative of this approach is MapReduce.
In practice, this "resource pool" scheme cannot fully utilize storage and compute resources at the same time, which wastes a great deal of capacity. It also brings a series of problems: upgrading individual components is difficult, different data cannot be treated differently, faults are hard to locate, and resources cannot be provisioned on short notice. With networks now much faster, memory and disk capacities vastly larger, and big data software continuously iterating, how should the old coupled storage-plus-compute cluster be improved? Liu Baoliang, big data director at BLUECITY, proposed a storage-compute separation architecture, shown in the figure below:
To achieve storage-compute separation, storage must first be split from compute, and at the same time storage itself and compute itself should each be split internally. The storage cluster is the core of this architecture, because data is what matters most in big data; the compute cluster is its soul, because all of the flexibility comes from it. A non-blocking network is the architecture's most critical dependency: once network problems appear, reads and writes against the storage cluster can no longer be balanced.
On the advantages of storage-compute separation, Liu Baoliang particularly emphasized "flexibility": software and hardware upgrades across multiple clusters become easier, data can be tiered, new clusters can be spun up temporarily to handle emergencies, and so on, all of which further improves computing speed.
Data-Driven: From Method to Practice
Being data-driven means collecting large amounts of data through a variety of technical means, summarizing it into information, and then integrating and analyzing that information to guide decisions. Fu Li, co-founder and chief architect of Shenze, summed up the data-driven loop in four steps: data acquisition, data modeling, data analysis, and data feedback, with the four forming a closed loop in which data feedback ultimately returns to data collection.
Data acquisition is the foundation of every data application, and data can be collected from four sources: the client, the business systems, third parties, and offline data. However it is done, it is advisable to design the internal architecture with a single unified data-access API, receiving and processing data through SDKs or server-side collection tools so that subsequent data modeling is easier.
The second step is data modeling. A basic data model has three parts: events, users, and entities; on top of it, users can be segmented by attributes such as age, gender, province, and mobile device. One of the hard parts of data modeling is ETL. With multiple data sources it is difficult to find an off-the-shelf ETL product, so it helps to build general capabilities such as scheduling, a computing framework, quality management, and metadata management, establishing the data sources as well as possible and reducing operating costs.
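A small sketch of what segmentation on such an event/user model might look like, assuming events and user profiles are already landed as tables; the paths and column names are made up for illustration and are not from the talk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object UserSegmentationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("event-user-model-sketch").getOrCreate()

    // Event table: one row per tracked event (who did what, when, with which properties).
    val events = spark.read.parquet("/warehouse/model/events")
    // User table: one row per user with profile attributes (age, gender, province, device...).
    val users  = spark.read.parquet("/warehouse/model/users")

    // Group users by profile attributes and measure how active each group is,
    // the kind of segmentation the event/user model is meant to support.
    events.join(users, Seq("user_id"))
      .groupBy(col("province"), col("gender"))
      .agg(countDistinct("user_id").as("active_users"), count("*").as("event_cnt"))
      .orderBy(desc("active_users"))
      .show(20, truncate = false)
  }
}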
For the third step, data analysis, there are two typical approaches: one meets basic metric needs through routine reports, with ad hoc requests handled by new development; the other uses abstract models to cover the metric system and most analysis needs, letting the people who need data obtain it themselves through a friendly interface. The latter is far more flexible than the former, and in data analysis flexibility matters more than response time. The interpretability of the data and the simplicity of the overall architecture are also very important considerations.
Challenges and Opportunities of Business Risk Control in the Digital Era
Enterprises' business, marketing, ecosystems, and data face an increasingly serious threat from the online black market. With a complete underground industry chain and a clear division of labor on the attackers' side, what challenges do existing risk control schemes face?
Liang Yun, CTO of Mathematical Science and Technology, summed up three weaknesses: first, defense capability is weak, relying on blacklists, simple manual rules, and single points of defense (SDK, CAPTCHA); second, defense timeliness is poor, relying on T+1 offline mining, so strategies take a long time to become effective; third, defense evolution is slow, with no closed loop for strategy iteration and no self-learning mechanism. How, then, can these problems be addressed and a complete risk control system be established?
Liang believes a full-stack risk control system should include a deployment and control system, a strategy system, a profiling system, and an operations system. The deployment and control system can add a device risk SDK, login and registration protection, and business behavior protection. The strategy system identifies and detects high-risk fraud groups, flagging risks such as virtual machines and device farms, machine registration, credential stuffing, and so on. The profiling system supports data access across multiple scenarios and joint defense across industries, so that the fight against the underground economy is a shared one. The operations system runs through case analysis, research, strategy design, development, verification, launch, and day-to-day operation, forming a complete closed loop that keeps risk control continuously effective.
What kind of architecture do these systems run on? First, the risk control system should be decoupled from the business system, so that business rules can be upgraded at any time without affecting risk control, and risk control rules can change without affecting the business. The risk control platform also needs a multi-scenario strategy system, a real-time risk control platform, and a risk profile network, as shown in the figure below:
Overall, the risk control platform consists of seven global service clusters running on cloud infrastructure, handling 3 billion requests per day with a peak QPS of over 100,000. The architecture is divided into an access layer, a policy engine layer, a model engine layer, and a storage layer, with the nodes in each layer sitting behind load balancers to allow dynamic scale-out.
Applying Spark at MobTech
MobTech, which positions itself as a world-leading data intelligence platform, currently covers 12 billion devices, serves 320,000 developers, and is integrated into 500,000 apps. This volume of data brings many challenges: a huge number of Yarn/Spark jobs, massive data sizes, high resource overhead, long computation times, and so on.
Mob runs a large number of complex jobs, and business requirements have pushed it to migrate some slow jobs and Hive jobs to Spark for better performance, while also optimizing existing Spark jobs. Zhang Juntao, a big data technical architect at MobTech, shared two cases of non-trivial Spark usage. The first is MobTech's application of dynamic partition pruning in Spark.
Dynamic partition pruning uses information inferred at run time to prune partitions further. Suppose table A has 2 billion rows, table B has 10 million, and the two are joined: how can the useless rows in table A be filtered out first? This is where the Bloom filter comes in. Its main virtue is saving space: if a Bloom filter says a key does not exist, it definitely does not exist; if it says a key exists, it may or may not. In short, it is a data structure that trades a little precision for space. MobTech's specific use of the Bloom filter is shown below:
The logical SQL is as follows:
SELECT /*+ bloomfilter(b.id) */ a.*, b.* FROM a JOIN b ON a.id = b.id
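The hint above reflects MobTech's own integration. As a minimal, self-contained sketch of the same technique with stock Spark, one can build a Bloom filter over table B's join keys via DataFrameStatFunctions.bloomFilter, broadcast it, and drop the rows of A that cannot possibly match before the shuffle join. The paths, column name, and filter parameters below are assumptions, not MobTech's implementation.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BloomFilterPruningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bloomfilter-pruning-sketch").getOrCreate()
    import spark.implicits._

    // Large table A (billions of rows) and small table B (tens of millions); paths are placeholders,
    // and id is assumed to be a non-null long.
    val a = spark.read.parquet("/warehouse/big/a")
    val b = spark.read.parquet("/warehouse/small/b")

    // Build a Bloom filter over B's join keys: 10M expected items, 3% false-positive rate.
    val bf = b.stat.bloomFilter("id", 10000000L, 0.03)
    val bfBroadcast = spark.sparkContext.broadcast(bf)

    // Drop A rows whose id definitely does not appear in B before the join.
    // mightContain can return false positives but never false negatives, so no matches are lost.
    val mightMatch = udf((id: Long) => bfBroadcast.value.mightContain(id))
    val pruned = a.filter(mightMatch($"id"))

    pruned.join(b, Seq("id")).write.mode("overwrite").parquet("/warehouse/out/a_join_b")
  }
}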
The second case is Spark retrieval and computation over hundreds of billions of records. MobTech has more than 4,000 tags that require historical backtracking, the backtracking window is as long as two years, and the backtracking frequency is very low. For cold data like this, how can business retrieval be served with low resource overhead? The data is scattered: the 4,000-plus tags sit in different tables (horizontally), and the history is spread across daily tables (vertically), so a search effectively has to look through hundreds of billions of records. Two indexing ideas were applied:
Horizontal consolidation: merge the daily data of the 4,000-plus tags into a single table.
Vertical consolidation: roll daily data up to the weekly or monthly level.
The horizontally consolidated daily tables were still too large, so an index table keyed on date and data ID was built to speed up queries on the daily tables, letting an ID be located directly, down to the file in the fact table and the line within that file holding its information. From the daily table, the Spark RDD API is used to extract each ID together with its ORC file name and line number, producing an incremental index; the incremental indexes are then merged into the full index through a UDAF. The specific plan is as follows:
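Below is a hedged sketch of how such an incremental index might be produced with Spark's DataFrame API, recording each ID's ORC file via input_file_name() plus an approximate position inside that file; the talk derives the exact line number through the lower-level RDD API and merges increments with a custom UDAF, so the paths, column names, and the approximation here are illustrative assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object TagIndexBuilderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tag-index-sketch").getOrCreate()

    // One day's horizontally consolidated tag table, stored as ORC; path and columns are placeholders.
    val daily = spark.read.orc("/warehouse/tags/dt=2019-11-30")

    // Record, for every ID, which ORC file it sits in plus an approximate row position,
    // so a later lookup can jump straight to the right file instead of scanning the whole day.
    val withFile = daily.select(
      col("id"),
      input_file_name().as("orc_file"),
      monotonically_increasing_id().as("mono_id"))

    val perFile = Window.partitionBy(col("orc_file")).orderBy(col("mono_id"))
    val incrementalIndex = withFile
      .withColumn("row_no", row_number().over(perFile))
      .drop("mono_id")

    // Write the daily increment; in the talk these increments are merged
    // into the full index with a custom UDAF.
    incrementalIndex.write.mode("overwrite").orc("/warehouse/tags_index/dt=2019-11-30")
  }
}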
Space here is limited; for more of the technical content, follow the "UCloud Technology" account and reply "big data" to receive the speakers' slides.