Author: Liu Dishan
This article is compiled from a talk given by Liu Dishan at the Flink Meetup held in Beijing on August 11. Liu Dishan joined the Meituan data platform team in 2015 and works on building an efficient, easy-to-use real-time computing platform and on exploring enterprise-level solutions and integrated services for real-time applications across different scenarios.
Status and background of Meituan's real-time computing platform
Real-time platform architecture
[Figure: overall architecture of the Meituan real-time computing platform]
The figure above shows the overall architecture of Meituan's current real-time computing platform. The bottom layer is the data cache layer: all of Meituan's log data is collected into Kafka through a unified log collection system. As the largest data transfer layer, Kafka supports a large number of Meituan online services, including offline pulls as well as real-time processing.

Above the data cache layer is the engine layer. On its left are the real-time computing engines we currently provide, Storm and Flink. Storm was previously deployed in standalone mode; for Flink, given its runtime environment, Meituan chose the on-YARN mode. Alongside the computing engines we also provide real-time storage for intermediate state, results, and dimension data of the computation; this currently includes HBase, Redis, and ES.

Above the computing engines sits the layer aimed at data developers. Real-time data development faces many problems; for example, debugging and tuning a streaming program is much harder than ordinary program development. At this layer Meituan provides a real-time computing platform that not only hosts jobs but also offers tuning and diagnosis, monitoring and alerting, real-time data retrieval, and permission management. Besides the platform itself, Meituan is also building a metadata center, which is a prerequisite for offering SQL later on. The metadata center is an important part of a real-time streaming system; it can be regarded as the brain of the system, storing the schema and metadata of the data.

The top layer of the architecture is the business supported by the real-time computing platform. This includes real-time query and retrieval of online business logs as well as the currently very popular real-time machine learning. Machine learning often involves search and recommendation scenarios, whose most prominent characteristics are, first, massive real-time data and, second, very high traffic QPS. These scenarios need the real-time computing platform to extract real-time features and power the search and recommendation services. Other common scenarios include real-time feature aggregation, Zebra Watcher (which can be thought of as a monitoring service), the real-time data warehouse, and so on.
The above is the brief architecture of Meituan's current real-time computing platform.
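As a rough illustration of the data flow described above (Kafka as the data cache layer feeding a Flink job), here is a minimal DataStream sketch, not Meituan's actual code: the broker address, topic name, consumer group, and the print sink standing in for HBase/Redis/ES are all assumptions for illustration, and the exact Kafka connector class varies by Flink version.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class LogPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka acts as the unified data cache layer described above.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka-broker:9092"); // hypothetical address
        props.setProperty("group.id", "realtime-demo");              // hypothetical group

        DataStream<String> logs = env.addSource(
                new FlinkKafkaConsumer<>("app_logs", new SimpleStringSchema(), props)); // hypothetical topic

        // Downstream processing would extract features or aggregate and write to
        // HBase, Redis or ES; a print sink stands in for those here.
        logs.print();

        env.execute("log-pipeline-sketch");
    }
}
```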
Current situation of real-time platform
The current status of Meituan's real-time computing platform: the number of jobs has reached nearly 10,000, the cluster comprises thousands of nodes, daily message volume has reached the trillion level, and peak throughput reaches tens of millions of messages per second.
Pain points and problems
Before investigating Flink, Meituan had encountered a number of pain points and problems:
Accuracy of real-time computation: before investigating Flink, most of Meituan's large-scale jobs were developed on Storm. Storm's main processing semantics is at-least-once, which has real problems for correctness. Before Trident, Storm was stateless; Storm Trident does provide exactly-once semantics by maintaining state, but it is based on serial batch commits, so processing performance can bottleneck when problems occur. Moreover, Trident is micro-batch based, so it cannot satisfy services with stricter latency requirements.
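For context, Flink's exactly-once guarantee is driven by its checkpoint mechanism. A minimal sketch of enabling it follows; the 60-second interval and the placeholder topology are arbitrary choices for illustration, not Meituan's settings.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds with exactly-once guarantees,
        // instead of relying on at-least-once replays as in Storm.
        env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

        env.fromElements("a", "b", "c").print(); // placeholder topology
        env.execute("exactly-once-sketch");
    }
}
```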
State management in stream processing: with the previous stream processing stack, state management was a major problem. It affects not only the consistency of the computed state but also the performance of real-time processing and the ability to recover from failures. One of Flink's most prominent strengths is its state management.
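To make the state-management point concrete, here is a small sketch of Flink-managed keyed state: a per-key counter kept in ValueState, which Flink checkpoints and restores automatically on failure. The function and state names are illustrative; it would be applied as stream.keyBy(...).flatMap(new CountPerKey()).

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

/** Keeps a running count per key in Flink-managed state, so it is
 *  checkpointed and restored automatically on failure. */
public class CountPerKey extends RichFlatMapFunction<String, String> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        Long current = count.value();
        long next = (current == null ? 0L : current) + 1;
        count.update(next);
        out.collect(value + " seen " + next + " times");
    }
}
```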
Limited expressiveness for real-time computation: most data development in many companies is still oriented toward offline scenarios, and real-time scenarios have only become popular in recent years. Unlike offline processing, the expressiveness available for real-time data processing is limited; for example, developers have to build a lot of functionality themselves to get accurate computation and time windows.
High development and debugging cost: nearly 10,000 jobs running on clusters with thousands of nodes, a distributed processing engine, and hand-written code all impose high development and debugging costs on data developers, and the cost of operations and maintenance is also high.
Focus of the Flink exploration
Against the background of the above pain points and problems, Meituan began to explore Flink last year, focusing on the following four aspects:
Exactly-once computing capability
State management capability
Window / Join / time handling, etc.
SQL / Table API
The practice of Flink in Meituan
Next, let's look at the problems Meituan encountered when putting Flink into production last year, as well as the corresponding solutions, divided into the following three parts:
Stability practice-Resource isolation
1. Consideration of resource isolation: by scenario and by business
Different peak periods, different operation and maintenance time.
Different requirements for reliability and delay
Application scenarios are of different importance
2. Strategy for resource isolation:
YARN nodes are labeled so that nodes can be physically isolated.
Isolation of offline DataNode and Real-time Computing Node
Stability practice-Intelligent scheduling
The purpose of intelligent scheduling is to solve the problem of uneven resource distribution. The common scheduling strategy today is based on CPU and memory, but other problems have surfaced in production. For example, Flink relies on the local disk to store local state, so disk I/O and disk capacity also need to be considered, as does NIC traffic: because traffic patterns differ across businesses, job placement can cause traffic peaks that saturate a particular NIC and affect other businesses. The expectation is therefore to do some intelligent scheduling. For the time being scheduling considers only CPU and memory; better strategies along the other dimensions will be added in the future.
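Purely as an illustration of the idea (not Meituan's actual scheduler), a multi-dimensional node score might combine the resources mentioned above; all fields, weights, and the scoring formula below are hypothetical.

```java
/** Purely illustrative: scores a candidate node across several resource
 *  dimensions. All fields and weights are hypothetical. */
public class NodeScorer {
    public static double score(double cpuFree, double memFree,
                               double diskIoFree, double nicFree) {
        // Weighted sum: today the platform mainly looks at CPU and memory;
        // disk IO and NIC bandwidth are the dimensions it plans to add.
        return 0.35 * cpuFree + 0.35 * memFree
             + 0.15 * diskIoFree + 0.15 * nicFree;
    }
}
```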
Stability practice-Fault tolerance
1. Node / network failure
JobManager HA
Automatic job restart (pull-up)
Storm is different: its handling of exceptions is simple and crude. If an exception occurs and the user has no standard exception handling in the code, it does not matter much, because the worker restarts and the job keeps running, and the at-least-once semantics is still preserved; a network timeout exception, for example, has little impact. Flink, by contrast, is much stricter about exceptions. With node or network failures in mind, the single JobManager can become a bottleneck: if the JobManager goes down, the impact on the whole job may be unrecoverable, so we considered HA for it. We also have to consider failures caused by operations work. In addition, some user jobs do not enable checkpoints; if such a job fails because of a node or network failure, we want the platform layer to apply automatic restart policies to keep the job stable.
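As a concrete example of the automatic pull-up idea, Flink lets a job declare a restart strategy in code (it can also be set cluster-wide in flink-conf.yaml, where JobManager HA is likewise configured via the high-availability options). The attempt count and delay below are illustrative, not Meituan's settings.

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategySketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart the job up to 3 times with a 10 second pause between attempts,
        // so transient node or network failures do not kill it outright.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        env.fromElements(1, 2, 3).print(); // placeholder topology
        env.execute("restart-strategy-sketch");
    }
}
```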
2. Upstream and downstream fault tolerance
Retry on Flink Kafka 0.8 connector exceptions
Our data sources are mainly Kafka, and reading from and writing to Kafka is an unavoidable part of real-time stream processing. The Kafka clusters themselves are very large, so node failures are routine. On this basis we added fault tolerance for node failures: for example, when a node goes down or data is rebalanced, the leader switches, and Flink's own reads and writes are not very tolerant of leader switches. We therefore added retries and optimizations for specific scenarios and particular exceptions.
3. Disaster recovery
Multiple data centers
Stream hot standby
Disaster recovery may not get much attention: is it possible that all the nodes in one data center go down or become unreachable? It is a low-probability event, but it does happen. So we are now also considering multi-data-center deployment, including hot standby for Kafka.
Flink platform
Flink platform-Job management
In practice, to address job management problems and reduce users' development costs, we did some platform work. The following figure shows the job submission interface; job configuration, job lifecycle management, alarm configuration, and latency display are all integrated into the real-time computing platform.
[Figure: job submission interface of the real-time computing platform]
Flink platform-Monitoring and alerting
We have also done some work on monitoring. Real-time jobs have higher monitoring requirements: when a job is delayed, the impact on the business is larger. So we built delay alerts, as well as alerts on job status, such as job liveness and running state, and we will add custom-metrics alerts later. Custom metrics will allow configurable alerts based on what the job itself processes.
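As a sketch of what a custom metric looks like at the job level, the snippet below registers a counter through Flink's metrics API; the operator and the metric name are hypothetical, and the platform-side collection and alerting described above would sit on top of such metrics.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;

/** Registers a custom counter with Flink's metrics system; the platform can
 *  then collect it and alert on it like any built-in metric. */
public class CountedMap extends RichMapFunction<String, String> {
    private transient Counter processed;

    @Override
    public void open(Configuration parameters) {
        processed = getRuntimeContext()
                .getMetricGroup()
                .counter("processedRecords"); // hypothetical metric name
    }

    @Override
    public String map(String value) {
        processed.inc();
        return value.toUpperCase();
    }
}
```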
Flink platform-Tuning and diagnosis
The real-time computing engine provides unified logging and Metrics solutions
Provide conditional filtered log retrieval for business
Provide metrics query with custom time span for business
Provide configurable alarms for business based on logs and metrics
As just mentioned, tuning and diagnosis is a painful part of developing real-time jobs: it is not easy for users to view distributed logs, so the platform provides a unified solution covering logs and metrics. At the engine level, logs and metrics are reported and then collected into Kafka through a unified log collection system. Kafka has two downstream paths: on one side, log data is synchronized into ES, that is, into the log center for log retrieval; on the other side, data flows through aggregation processing and is written to OpenTSDB. The aggregated data backs queries: on one hand the display of metrics, and on the other hand related alerts.
The following figure shows the metrics query page for a job, which supports cross-day comparison. With a vertical comparison we can find what caused a job's delay at a certain point in time, which makes it easy for users to diagnose problems with their jobs. Besides the job's running state, we also collect basic node information for horizontal comparison.
The following figure shows the log query. Because the ApplicationID may change after a job dies and is restarted, all logs are collected under the job's unique primary key, its job name, from creation up to the currently running instance, so users can query logs across applications.
Ecosystem construction
Meituan's MQ falls into two categories, online Kafka and production (offline) Kafka, and different work was done to adapt to each. For online MQ, data is synchronized out once and consumed multiple times downstream, to avoid affecting the online business. For the production, offline Kafka, the platform shields cluster addresses and basic configuration from users, manages permissions, and collects metrics.
Application of Flink in Meituan
Below are two real cases of how Flink is used at Meituan. The first is Petra. Petra is a real-time metrics aggregation system, positioned as a unified, company-wide solution. The main business scenario it serves is computing statistics based on business time (event time) and producing real-time metrics with low latency. Because it serves general-purpose business, different businesses have their own dimensions; each may include the application, the channel, the data center, and other business-specific dimensions, and these dimensions can be numerous. Businesses may also need compound metrics, the most common being the transaction success rate, which requires computing the number of successful payments relative to the number of orders placed. Unified metrics aggregation may also serve a monitoring system, for example B-side monitoring systems; such systems require the metric aggregation to produce results as truthfully, as promptly, and as accurately as possible, so that downstream systems can genuinely monitor current information. The figure on the right is an example of such metrics: aggregated results covering different business dimensions.
Petra real-time metrics aggregation
1. Business scenario:
Based on business time (event time)
Multi-service dimensions: such as applications, channels, computer rooms, etc.
Compound index calculation: for example, transaction success rate = number of successful payments / number of orders placed
Low latency: second-level result output
2. Exactly-once accuracy guarantee
Flink checkpoint mechanism
3. Data skew in dimension computation
Hash-scatter hot keys (see the sketch after this list)
4. Tolerance for late data
Window settings and the trade-off with resources
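The following sketch shows the hot-key scattering idea from item 3: a two-stage aggregation that pre-aggregates on a salted key and then aggregates again on the real key. The 16-bucket salt, 5-second processing-time windows, and tuple layout are illustrative assumptions, not Petra's actual implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class TwoStageAggregation {

    /** (metricKey, value) -> (metricKey, total), spreading a hot key over
     *  16 random buckets before the final aggregation. */
    public static DataStream<Tuple2<String, Long>> aggregate(
            DataStream<Tuple2<String, Long>> events) {

        // Stage 1: pre-aggregate on (key, random bucket) in short windows.
        DataStream<Tuple2<String, Long>> preAggregated = events
                .map(new MapFunction<Tuple2<String, Long>, Tuple3<String, Integer, Long>>() {
                    @Override
                    public Tuple3<String, Integer, Long> map(Tuple2<String, Long> e) {
                        return Tuple3.of(e.f0, ThreadLocalRandom.current().nextInt(16), e.f1);
                    }
                })
                .keyBy(new KeySelector<Tuple3<String, Integer, Long>, String>() {
                    @Override
                    public String getKey(Tuple3<String, Integer, Long> e) {
                        return e.f0 + "#" + e.f1;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .reduce(new ReduceFunction<Tuple3<String, Integer, Long>>() {
                    @Override
                    public Tuple3<String, Integer, Long> reduce(Tuple3<String, Integer, Long> a,
                                                                Tuple3<String, Integer, Long> b) {
                        return Tuple3.of(a.f0, a.f1, a.f2 + b.f2);
                    }
                })
                .map(new MapFunction<Tuple3<String, Integer, Long>, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> map(Tuple3<String, Integer, Long> e) {
                        return Tuple2.of(e.f0, e.f2);
                    }
                });

        // Stage 2: full aggregation on the original key.
        return preAggregated
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> e) {
                        return e.f0;
                    }
                })
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> reduce(Tuple2<String, Long> a,
                                                       Tuple2<String, Long> b) {
                        return Tuple2.of(a.f0, a.f1 + b.f1);
                    }
                });
    }
}
```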
When using Flink to build this real-time metrics aggregation system, we focused on several aspects. The first is accurate computation, using Flink and its checkpoint mechanism to guarantee that nothing is lost or double-counted. The unified metrics first flow into a pre-aggregation module that does initial aggregation. The reason for splitting pre-aggregation from full aggregation is to deal with data skew (the question a listener just asked): when hot keys appear, the current solution is to buffer in the pre-aggregation step so that the hot key is scattered as much as possible, and then the full-aggregation module aggregates everything. This only solves part of the problem, so later optimizations also target performance, including exploring the performance of state storage. Next comes tolerance for late data. Since metric aggregation may include compound metrics, the data a compound metric depends on may come from different streams, and even data from the same stream may arrive late when reported, so late associations must be tolerated. Tolerance comes from two knobs: the allowed lateness can be configured, and the window length can be configured; but in real application scenarios, besides lengthening the time as far as possible, the actual computational cost must also be considered, so trade-offs were made here. Finally, after full aggregation, the aggregated results are written back to Kafka and then written by the data synchronization module to OpenTSDB; Grafana displays the metrics, and the results are also synchronized to the alarm system for metrics-based alerting.
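The lateness and window trade-off above maps onto Flink's watermark and allowed-lateness settings. A minimal sketch follows, assuming events carry their event timestamp in the second tuple field and that event time is enabled for the job; the 30-second out-of-orderness bound, 10-second window, and 1-minute allowed lateness are illustrative values only.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class LateDataSketch {

    /** events: (dimensionKey, eventTimestampMs, value).
     *  Waits up to 30s for out-of-order data via the watermark, and keeps the
     *  window state another minute so late records still update the result. */
    public static DataStream<Tuple3<String, Long, Long>> windowedSums(
            DataStream<Tuple3<String, Long, Long>> events) {

        return events
                .assignTimestampsAndWatermarks(
                        new BoundedOutOfOrdernessTimestampExtractor<Tuple3<String, Long, Long>>(
                                Time.seconds(30)) {
                            @Override
                            public long extractTimestamp(Tuple3<String, Long, Long> e) {
                                return e.f1; // event timestamp in milliseconds
                            }
                        })
                .keyBy(new KeySelector<Tuple3<String, Long, Long>, String>() {
                    @Override
                    public String getKey(Tuple3<String, Long, Long> e) {
                        return e.f0;
                    }
                })
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .allowedLateness(Time.minutes(1))
                .sum(2);
    }
}
```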
The following figure shows the configuration interface of Petra as it is now offered for production. Commonly used operators and dimension configurations are already defined, so users can configure and call them and directly obtain the display and aggregation result of the metrics they expect. We are also exploring a SQL-based approach for Petra, because many users are used to writing SQL to express such statistics, so we will rely on Flink's own support for SQL and the Table API and explore the SQL scenario as well.
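As a sketch of what this SQL direction could look like with Flink's Table API (targeting a recent Flink version, not Petra's actual implementation), the example below computes the transaction success rate mentioned earlier as a compound metric; the table name, field names, and sample data are hypothetical.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PetraSqlSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register a stream of (app, channel, paySuccess, orderPlaced) events
        // as a table; the fields, table name and sample rows are hypothetical.
        tEnv.createTemporaryView("orders",
                env.fromElements(
                        new OrderEvent("appA", "web", 1L, 1L),
                        new OrderEvent("appA", "app", 0L, 1L)));

        // Compound metric: transaction success rate per app and channel.
        Table successRate = tEnv.sqlQuery(
                "SELECT app, channel, SUM(paySuccess) * 1.0 / SUM(orderPlaced) AS successRate " +
                "FROM orders GROUP BY app, channel");

        tEnv.toRetractStream(successRate, Row.class).print();
        env.execute("petra-sql-sketch");
    }

    /** Simple POJO so the Table API can derive column names. */
    public static class OrderEvent {
        public String app;
        public String channel;
        public long paySuccess;
        public long orderPlaced;

        public OrderEvent() {}

        public OrderEvent(String app, String channel, long paySuccess, long orderPlaced) {
            this.app = app;
            this.channel = channel;
            this.paySuccess = paySuccess;
            this.orderPlaced = orderPlaced;
        }
    }
}
```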
[Figure: Petra configuration interface]
MLX machine learning platform
The second kind of application is machine learning, which may rely on both offline feature data and real-time feature data. One path is feature extraction in the existing offline scenario: after batch processing, data flows into the offline cluster. The other is the near-line path: its input is the existing unified logs from the log collection system, which Flink processes, including stream joins and feature extraction, before the data is transferred to the training cluster for model training; the training cluster produces the P features as well as the Delta features. These features are finally applied to an online feature service. This is a fairly common and general scenario; at present the main applications include search, recommendation, and some other businesses.
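To illustrate the stream-join step in the near-line path, here is a minimal windowed join of two streams keyed by a shared id; the stream shapes, field meanings, and 5-minute window are assumptions, and both inputs are assumed to already have timestamps and watermarks assigned upstream.

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class FeatureJoinSketch {

    /** Joins click events (userId, itemId) with order events (userId, amount)
     *  inside a 5 minute event-time window to assemble a training sample. */
    public static DataStream<Tuple3<String, String, Double>> join(
            DataStream<Tuple2<String, String>> clicks,
            DataStream<Tuple2<String, Double>> orders) {

        return clicks.join(orders)
                .where(new KeySelector<Tuple2<String, String>, String>() {
                    @Override
                    public String getKey(Tuple2<String, String> c) { return c.f0; }
                })
                .equalTo(new KeySelector<Tuple2<String, Double>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Double> o) { return o.f0; }
                })
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                .apply(new JoinFunction<Tuple2<String, String>, Tuple2<String, Double>,
                        Tuple3<String, String, Double>>() {
                    @Override
                    public Tuple3<String, String, Double> join(Tuple2<String, String> c,
                                                               Tuple2<String, Double> o) {
                        return Tuple3.of(c.f0, c.f1, o.f1);
                    }
                });
    }
}
```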
Prospects for the future
In the future we expect to do more in three areas. The first, as mentioned, is state management: unified state management, for example SQL-based unified management, with a unified configuration that helps users choose the restore point they want. Related to this is performance optimization for large state: when doing two-stream joins over traffic data we hit performance bottlenecks. We compared memory-based state handling with RocksDB-based state handling and found quite large differences in performance, so we hope to optimize the RocksDB backend as much as possible to improve job processing performance. The second area is SQL, which is a direction almost every company is pursuing. There had been some SQL exploration before, including SQL expressions built on Storm, but the previous approach had shortcomings in semantic expressiveness, and we hope to solve them on Flink. We also want configurable SQL job parallelism and optimization of SQL queries, and we hope Flink itself will keep optimizing so that SQL can really be used in production.
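As an example of the RocksDB direction mentioned above, Flink's RocksDB state backend (from the flink-statebackend-rocksdb dependency) can be enabled per job as below; the checkpoint path is hypothetical, and incremental checkpointing is just one of the knobs that matter for large state.

```java
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Keep state in local RocksDB instances and checkpoint it incrementally
        // to HDFS, trading some access latency for much larger state sizes.
        env.setStateBackend(
                new RocksDBStateBackend("hdfs:///flink/checkpoints", true)); // hypothetical path

        env.enableCheckpointing(60_000L);
        env.fromElements(1, 2, 3).print(); // placeholder topology
        env.execute("rocksdb-backend-sketch");
    }
}
```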
On the other hand, we are also exploring new scenarios: beyond stream processing, we expect to merge in data from offline scenarios and provide more services to the business through a unified SQL API, combining stream and batch processing.
For more information, please visit the Apache Flink Chinese Community website.