2025-01-18 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Author: Chen Yuechen
Edited by: Liu He
This article introduces the production practice of Apache Flink at iQIYI. You will learn about the background and challenges behind iQIYI's adoption of Apache Flink, as well as the platform construction process. The main contents are as follows:
- iQIYI's evolution in real-time computing and some challenges encountered
- iQIYI's use cases for Flink
- iQIYI's Flink platform construction process
- iQIYI's improvements to Flink
- A brief introduction to iQIYI's future work
iQIYI officially launched in 2010 and listed on NASDAQ in March 2018. We have a large and highly active user base, with 565 million monthly active users, ranking first in the online video field. On the mobile side, iQIYI's total monthly effective usage time is 5.908 billion hours, ranking third among China's apps.
I. iQIYI's Evolution in Real-time Computing and Challenges Encountered
1. The Evolution of Real-time Computing at iQIYI
Real-time computing is online computation over time-series data that arrives in real time, at an uncontrollable rate, with no guarantee of arrival order, and that cannot be replayed once processed unless deliberately persisted.
As a result, real-time computing must deal with problems such as out-of-order data, delayed data, and the gap between event time and processing time. With a peak of 11 million events per second, iQIYI has faced great challenges in correctness, fault tolerance, performance, latency, throughput, and scalability.
iQIYI began using Storm on a small scale in 2013, deploying three separate clusters. In 2015, Spark Streaming was introduced and deployed on YARN. In 2016, after Spark Streaming was turned into a platform, stream computing began to be used at scale inside iQIYI; building the stream computing platform reduced user costs. In 2017, because of inherent shortcomings of Spark Streaming, Flink was introduced and deployed both on independent clusters and on YARN. In 2018, we built a Streaming SQL and real-time analysis platform to further lower the threshold for users.
2. From Spark Streaming to Apache Flink
iQIYI mainly uses Spark Streaming and Flink for stream computing. Spark Streaming is very simple to adopt: it divides real-time data into micro-batches and completes each sub-batch with batch processing. Its API is also simple and flexible, allowing processing logic to be defined either with the DStream Java/Scala API or with SQL. However, Spark Streaming is limited by the micro-batch model, which makes many real-time computations very difficult for the business side: for example, handling data by event time or processing late-arriving data requires a great deal of extra programming. iQIYI's use of Spark Streaming has therefore mostly been limited to real-time data collection.
The real-time computing model of the Apache Flink framework is based on the Dataflow Model and fully answers the Dataflow Model's four questions: What — supporting the definition of DAG graphs; Where — defining various windows (fixed, sliding, and session windows); When — flexibly defining when computation is triggered; and How — supporting rich functions that define how results are updated. Like Spark Streaming, Flink offers layered APIs: the DataStream API, ProcessFunction, and SQL. Flink's biggest strength lies in the correctness guarantees of its real-time computation: exactly-once semantics, native support for event time, and support for late-data processing. Because Flink is natively stream-based, it can achieve millisecond-level low latency.
As measured by iQIYI, compared with Spark Streaming, Apache Flink achieves lower latency at similar throughput, along with better real-time computing capabilities such as native event-time support and late-data processing.
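The Dataflow concepts mentioned above (event time, fixed windows, late data) can be illustrated with a minimal plain-Python simulation, independent of Flink itself; the window size, lateness bound, and function names are made up for illustration:

```python
from collections import defaultdict

WINDOW_MS = 1000           # fixed (tumbling) 1-second event-time windows
ALLOWED_LATENESS_MS = 500  # late events within this bound still update windows

def window_start(event_time_ms):
    """Assign an event to its tumbling window by event time, not arrival time."""
    return event_time_ms - (event_time_ms % WINDOW_MS)

def aggregate(events):
    """events: (event_time_ms, value) pairs in *arrival* order (may be out of order).
    Returns per-window sums plus the events dropped for being too late."""
    windows = defaultdict(int)
    watermark = 0  # max event time seen so far: a simplistic watermark
    dropped = []
    for ts, value in events:
        watermark = max(watermark, ts)
        if ts < watermark - ALLOWED_LATENESS_MS:
            dropped.append((ts, value))  # beyond allowed lateness: excluded
            continue
        windows[window_start(ts)] += value
    return dict(windows), dropped
```

Note how the out-of-order event still lands in the correct event-time window as long as it is within the allowed lateness — the property that makes event-time processing in Flink hard to emulate on a micro-batch model.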
II. Use Cases of Flink at iQIYI
The following introduces how iQIYI uses Flink through three use cases: real-time ETL of massive data, real-time risk control, and distributed call chain analysis.
1. Real-time ETL of massive data
At iQIYI, every user action on every client sends a log to the Nginx servers, totaling more than 10 million QPS. A given business line, however, only wants to consume its own data for later real-time analysis, so the data must be split.
Before Flink was introduced, the earliest data splitting logic worked like this: countless rules of the form `tail -f /xxx/nginx.log | grep "xxx"` were configured on the Nginx machines, and the matching data was routed into each business's Kafka according to these rules. However, as the business lines grew, there were more and more tail processes, and the servers gradually hit performance bottlenecks.
This led to the idea of splitting the data into each business's Kafka through real-time stream computing. Specifically, all the data on Nginx is collected into a first-level Kafka, and a real-time ETL program distributes it on demand into each business's Kafka. At the time, iQIYI's real-time stream computing was basically based on Spark Streaming, but considering Spark Streaming's relatively high latency, iQIYI began promoting Apache Flink starting from this case.
The implementation of real-time ETL for massive data mainly includes the following steps:
- Decoding: the log format delivered by each client is not uniform, so the logs must first be parsed into a standardized format using the appropriate decoding method; JSON is chosen here.
- Risk control: the splitting stage applies risk control rules to filter out a large proportion of fraudulent (traffic-brushing) logs. Because the volume is so high, running every log through the risk control rules directly would incur very large latency, so several optimizations were made. First, user data is partitioned by DeviceID, so that the same DeviceID always goes to the same task manager; each task manager uses local memory as the primary cache; and Redis is co-deployed with Flink, with the local Redis serving as the secondary cache. The end result is that Redis access drops to an average of 4k requests per second, and the P99 latency of real-time splitting stays under 500 ms.
- Splitting: the data is split, sampled, and filtered per business; depending on each business's splitting process, there may be sampling, further filtering, and other steps based on user needs.
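The DeviceID partitioning and two-level cache described in the risk-control step can be sketched as follows; this is an illustrative model, not iQIYI's code, and `redis_get` is a stand-in for the co-located local Redis client:

```python
import hashlib

NUM_TASKS = 4  # illustrative number of task manager slots

def task_for_device(device_id):
    """Key by DeviceID so the same device always lands on the same task slot,
    which is what makes a per-task local cache effective."""
    return int(hashlib.md5(device_id.encode()).hexdigest(), 16) % NUM_TASKS

class TwoLevelCache:
    """Local in-memory dict as the primary cache, Redis as the secondary."""
    def __init__(self, redis_get):
        self.local = {}            # primary cache: task-local memory
        self.redis_get = redis_get # secondary cache: co-located Redis lookup
    def get(self, device_id):
        if device_id in self.local:
            return self.local[device_id]
        value = self.redis_get(device_id)  # only on a local-cache miss
        self.local[device_id] = value
        return value
```

Because repeated lookups for the same DeviceID are served from local memory, the secondary cache is hit only once per device, which is how the Redis load drops so sharply.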
2. Real-time Risk Control
Preventing bots from credential stuffing and account theft is a common security risk control requirement, focused mainly on detection during and after the event. During the event, ultra-high-frequency anomaly detection filters out abnormal user behavior; after the event, real-time analysis across businesses generates IP and device ID blacklists for anti-fraud use.
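The in-event, ultra-high-frequency detection amounts to counting one user's requests inside a short sliding window and flagging counts over a threshold. A minimal sketch (the window size and threshold here are invented for illustration):

```python
from collections import defaultdict, deque

WINDOW_MS = 1000  # look at requests within the last second
THRESHOLD = 5     # illustrative cutoff for "abnormal" frequency

class FrequencyDetector:
    def __init__(self):
        self.events = defaultdict(deque)  # user -> request timestamps in window
    def observe(self, user, ts_ms):
        """Record one request; return True if the user exceeds the threshold."""
        q = self.events[user]
        q.append(ts_ms)
        while q and q[0] <= ts_ms - WINDOW_MS:
            q.popleft()  # evict timestamps that fell out of the sliding window
        return len(q) > THRESHOLD
```

In the real system this per-key windowed count runs inside Flink's window operators rather than a hand-rolled deque, but the detection logic is the same shape.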
Here are two examples of Flink features being put to use:
- CEP: many black-market users follow fixed routines; for example, a newly registered account may perform only one or two specific operations within a short period. We use CEP pattern matching to filter out black-market behavior with such fixed patterns.
- Multi-window aggregation: some risk control requirements span several time windows at once, and some of the windows are very demanding; for instance, it may be necessary to count a user's visits within one second or even sub-second, and judge the user as abnormal if the count exceeds a threshold. Using Flink's low latency and multi-window support, we perform ultra-high-frequency anomaly detection, such as counting a single user's requests within one second and flagging the user as black-market if the count exceeds a threshold.
3. Distributed Call Chain Tracing
A distributed call chain tracing system, i.e., full-link monitoring, is something almost every company has. In a microservice architecture, the call relationships between services are complex, making it hard to troubleshoot problems and locate performance bottlenecks; hence the need for distributed call chain tracing.
The figure above is a call chain tracing topology: each node is a specific application that the request passed through, and each edge shows how long it took to go from one application to the next.
Beyond this macro-level analysis, businesses also want to examine individual requests: for a particular call, where was it slow and where was it fast? So there is another requirement for the call chain: for a specific call, see its detailed timing.
The system's simple architecture is shown above: the upper half covers instrumentation, the lower half analysis. In short, all system call logs are sent into Kafka through client SDK instrumentation and Agent collection, and we analyze them with Flink. For statistical analysis, Flink computes results into HBase, which backs monitoring alerts, call chain Top-N queries, and so on. To meet these requirements, we use Flink's multi-window aggregation: a first-level analysis uses windows of one or more minutes to reassemble actual call chains from the massive log stream and construct the topological call relationships between applications. A second level, built on the results of the first, computes per-window statistics for each distinct edge, such as the average latency of each edge in the topology. In addition, we write the raw data into ES through Flink so that users can query it directly.
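The two-level analysis above (extract call edges from a window of spans, then average per-edge latency) can be sketched as follows; the span schema with `parent_app` is an illustrative simplification that assumes each span already records its caller:

```python
from collections import defaultdict

def build_edges(spans):
    """First level: from one window of spans, extract (caller, callee) call
    edges with their observed latencies. spans: dicts with trace_id, app,
    parent_app (None for the root span), and duration_ms."""
    edges = []
    for span in spans:
        if span["parent_app"] is not None:
            edges.append(((span["parent_app"], span["app"]), span["duration_ms"]))
    return edges

def average_edge_latency(edges):
    """Second level: per (caller, callee) edge, average the observed latencies."""
    total = defaultdict(float)
    count = defaultdict(int)
    for edge, ms in edges:
        total[edge] += ms
        count[edge] += 1
    return {edge: total[edge] / count[edge] for edge in total}
```

The keys of the resulting dict are exactly the edges of the topology diagram, and the values are the per-window average times shown on those edges.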
III. Flink Platform Construction
1. Overview
Next we introduce the construction of iQIYI's big data platform. The figure above is not limited to Flink; it is the overall architecture of the big data platform. At iQIYI, the storage layer is basically based on the Hadoop ecosystem, such as HDFS, HBase, and Kudu; the computing layer uses YARN and supports engines such as MapReduce, Spark, Flink, Hive, and Impala; the data development layer consists mainly of self-developed products: for batch development, iQIYI has workflow development and data integration tools, and for real-time development there are stream computing development, Streaming SQL, real-time analysis, and other platform tools.
Next, we briefly introduce iQIYI's real-time computing and analysis platforms.
2. Real-time Computing Platform
2.1 Streaming Task Platform
The streaming task platform is the underlying platform of iQIYI's real-time computing; it supports the submission, running, and management of streaming tasks. It supports multiple resource scheduling frameworks, including YARN, Mesos, and standalone Flink clusters, and hosts computing tasks such as Storm, Spark Streaming, Flink, and Streaming SQL. Functionally, users can directly package and upload programs to deploy streaming tasks, or write SQL with the Streaming SQL tools for stream computing development. To better manage computing tasks, the platform also provides JAR package and function management, task metric monitoring, and resource auditing.
2.2 Streaming SQL
Both Spark Streaming and Flink have good SQL optimization engines, but both lack DDL and DML semantics. As a result, before SQL can be used for development, the business has to define Sources and Sinks programmatically.
Therefore, the Streaming SQL developed by iQIYI defines its own set of DDL and DML syntax. In it, we define four kinds of tables:
Stream table: defines the input source and its specific decoding method; the system supports JSON decoding as well as user-defined decoding functions.
Dimension table: mainly static tables, with MySQL supported, used chiefly for joins against stream tables.
Temporary table: similar to Hive's temporary tables, for user-defined intermediate results.
Result table: defines the output destination and how it is accessed; supported sinks include Kafka, MySQL, Kudu, ES, Druid, HBase, and other common analytical databases.
To better support business requirements, Streaming SQL also ships with predefined functions by default, such as IP-library lookups, and supports user-defined functions.
The figure above is a Streaming SQL application case that computes response-time P99/P50 and prints them to the Console.
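The P99/P50 computation in such a query boils down to percentiles over the latencies collected in a window. A plain-Python equivalent of the aggregation (not the platform's SQL; the nearest-rank method and sample values are illustrative):

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the samples are at or below it."""
    s = sorted(values)
    rank = math.ceil(p / 100 * len(s))          # 1-based rank of the result
    return s[max(rank, 1) - 1]

# Illustrative response times (ms) collected within one window.
latencies = [12, 8, 30, 15, 9, 11, 250, 14, 10, 13]
report = {"p50": percentile(latencies, 50), "p99": percentile(latencies, 99)}
```

Note how a single outlier dominates P99 while leaving P50 untouched, which is why such queries report both.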
To make Streaming SQL easier for businesses to use, it provides a Web IDE with code highlighting, keyword hints, syntax checking, code debugging, and other features.
3. Real-time Analysis Platform
The real-time analysis platform is a minute-latency analysis platform that iQIYI built on Druid. Through wizard-style Web configuration, it supports multi-dimensional analysis of super-large-scale real-time data and generates visual reports with minute-level latency. Supported functions include: ingesting real-time data for OLAP analysis, real-time alerting, producing real-time data APIs, and configuring monitoring alerts.
Product advantages:
- Fully wizard-based configuration: going from real-time data to a generated report requires only wizard configuration.
- Transparent compute and storage: no need to manage big data processing tasks or data storage.
- Minute-level latency: only about one minute from data generation to report display.
- Second-level queries: analysis reports return in sub-second time.
- Flexible requirement changes: businesses can flexibly change dimensions, and republishing takes effect immediately.
3.1 User Wizard Configuration
The real-time analysis platform abstracts the whole analysis process into four steps: data ingestion, data processing, model configuration, and report configuration. The model configuration follows the OLAP model exactly: the real-time data must conform to a star schema, with timestamp, metric, dimension, and other fields.
3.2 Data Processing Configuration
At the data processing layer, the real-time analysis platform provides a wizard configuration page that lets users define the data processing flow purely through the UI. This mainly handles simple scenarios and offers a page-based option for novice users who are not yet familiar with SQL. At the same time, similar to Streaming SQL, real-time analysis also lets users define the data processing flow with custom SQL.
IV. Improvements to Flink
While building the Flink platform, we ran into several Flink problems and made some improvements.
1. Improvement: Graceful Checkpoint Recovery
The first improvement concerns graceful recovery from checkpoints. The starting point is that, as with Spark Streaming, the business wants to control in code which checkpoint to recover from; with Flink, however, the recovery point cannot be controlled through code, and a checkpoint must be specified manually at restart. We therefore wanted Flink to recover from a checkpoint directly through code, just as Spark Streaming does.
To solve this, we modified the source code so that when a Flink task starts, it finds the latest checkpoint under the actual checkpoint path and recovers from it directly. This is optional: users who prefer the native behavior can keep it, but they now have the option of recovering from the most recent checkpoint.
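Flink writes completed checkpoints as `chk-<N>` directories under the checkpoint path, so "find the latest checkpoint" reduces to scanning for the highest N. A sketch of that logic in Python (iQIYI's actual patch is in Flink's Java source; this only illustrates the idea):

```python
import os
import re

def latest_checkpoint(checkpoint_root):
    """Return the path of the highest-numbered chk-<N> directory under
    checkpoint_root, or None if no checkpoint exists yet."""
    best_n, best_path = -1, None
    pattern = re.compile(r"^chk-(\d+)$")
    for name in os.listdir(checkpoint_root):
        m = pattern.match(name)
        if m and int(m.group(1)) > best_n:
            best_n = int(m.group(1))
            best_path = os.path.join(checkpoint_root, name)
    return best_path
```

At job start, the result of such a scan would be passed to the restore logic instead of requiring the operator to paste a checkpoint path by hand.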
2. Improvement: Kafka Broker HA
The second improvement concerns Kafka broker high availability. When a Kafka broker fails, Kafka itself keeps working, but Flink programs often fail. To solve this, we handle the SocketTimeoutException that Flink hits after a Kafka broker exits, and let users configure the number of retries.
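The fix amounts to retrying the fetch on a socket timeout, up to a configurable count, instead of immediately failing the job. The retry pattern in a language-neutral Python sketch (function names and defaults are illustrative, not Flink's API):

```python
import socket
import time

def with_retries(fetch, max_retries=3, backoff_s=0.0):
    """Call fetch(); on socket.timeout, retry up to max_retries times
    (waiting backoff_s between attempts for the broker leader to fail over)
    before letting the failure propagate, which would fail the job."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except socket.timeout:
            if attempt == max_retries:
                raise  # retries exhausted: surface the original failure
            time.sleep(backoff_s)
```

With `max_retries` exposed as user configuration, a transient broker failover is absorbed by the retries while a genuine outage still fails fast once they are exhausted.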
V. Future Work on Flink
Finally, a look at iQIYI's future work on Apache Flink. At present, Streaming SQL supports only the Spark Streaming and Structured Streaming engines; Flink engine support will follow soon, which will greatly reduce businesses' Flink development costs. As the scale of Flink tasks grows, we will focus on improving Flink's maturity inside iQIYI, improving monitoring and alerting, and adding a resource audit process (currently only Spark Streaming has resource auditing). We will also study new features of Flink 1.6, try Kafka 2.0, and investigate exactly-once solutions. In addition, we will experiment with new Flink versions to push forward batch-stream unification.