2025-01-28 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 06/02 Report
Author: Liu Kang
This article is based on a talk given at the Flink Meetup held in Shanghai on July 26. The speaker, Liu Kang, works in the big data platform department on model-lifecycle platforms and is currently responsible for developing a real-time model feature computing platform based on Flink. He is familiar with distributed computing, has extensive hands-on experience with and a deep understanding of model deployment and operations, and has some knowledge of model algorithms and training.
This article covers the following topics:
Based on the current state of real-time feature development at the company: the background, goals, and status of the real-time feature platform
Why Flink was chosen as the platform's computing engine
Flink in practice: representative usage examples, development work for Aerospike (the platform's storage medium), and pitfalls encountered
Current results & future plans

I. Background, goals, and current status of the real-time feature platform

1. Development and operation of the original real-time feature jobs
1.1 Selecting a real-time computing platform: based on the project's performance requirements (latency, throughput, etc.), choose from the existing real-time computing platforms: Storm, Spark, Flink
1.2 Main development and operations workflow:
More than 80% of jobs need message-queue data sources, but message-queue data is unstructured and there is no unified data dictionary, so developers must consume the corresponding topic, parse the messages, and determine whether they contain the required content.
Design and develop computing logic based on the scenarios in the requirements
Where real-time data alone cannot fully meet the data requirements, separate offline jobs and fusion logic are developed.
For example, a scenario may require 30 days of data, but the message queue only holds the last seven days (Kafka's default message retention time), so the remaining 23 days must be supplemented with offline data.
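The fusion step above can be sketched as follows. This is a minimal, illustrative Python sketch, not the platform's actual fusion logic: the daily-count dictionaries stand in for the real-time state and the offline table, and the 7/23-day split mirrors Kafka's default retention.

```python
from datetime import date, timedelta

def fuse_30d_count(realtime_daily, offline_daily, today):
    """Combine a 30-day feature from two stores.

    realtime_daily / offline_daily: dicts mapping date -> daily count.
    Hypothetical helper for illustration; a real job would read
    Kafka-derived state and an offline table instead of plain dicts.
    """
    total = 0
    for offset in range(30):
        d = today - timedelta(days=offset)
        if offset < 7:
            # Recent days are still within Kafka's default retention.
            total += realtime_daily.get(d, 0)
        else:
            # Older days must come from the offline data supplement.
            total += offline_daily.get(d, 0)
    return total
```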
Design and develop data validation and error-correction logic
Message delivery depends on the network; message loss and timeouts cannot be completely avoided, so validation and error-correction logic is needed.
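A minimal sketch of what such validation and correction logic can look like, assuming a simple policy of falling back to an offline recomputation when the real-time value deviates by more than a relative tolerance. The threshold and the fallback choice are illustrative assumptions, not the scheme described in the talk.

```python
def check_and_correct(rt_value, offline_value, rel_tol=0.05):
    """Return a corrected feature value.

    If the real-time value deviates from the offline recomputation by
    more than rel_tol, trust the offline value (illustrative policy).
    """
    if offline_value == 0:
        # Avoid division by zero; trust offline when it says zero.
        return rt_value if rt_value == 0 else offline_value
    if abs(rt_value - offline_value) / abs(offline_value) > rel_tol:
        return offline_value
    return rt_value
```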
Test and launch
Monitoring and alerting

2. Pain points of the original real-time feature job development
Message-queue data sources have no unified data dictionary for their message structures.
Feature computing logic is highly customized, and the development and testing cycle is long.
When real-time data cannot meet the requirements, offline jobs and fusion logic must be custom-built.
There is no established best practice for validation and error correction, so the actual effect depends on individual developers' skill.
Monitoring and alerting schemes must be customized around the business logic.

3. Platform goals derived from these pain points
Real-time data dictionary: provide unified data-source registration and management, supporting both topics with a single message structure and topics carrying multiple different message structures
Logic abstraction: abstract computation into SQL to reduce workload and lower the barrier to entry
Feature fusion: provide feature-fusion functionality for cases where real-time features alone cannot fully meet the data requirements
Data validation and error correction: provide validation and correction of real-time features against offline data
Real-time computing latency: ms level
Real-time computing fault tolerance: end-to-end exactly-once
Unified monitoring, alerting, and HA scheme

4. Feature platform system architecture
[Figure: feature platform architecture]
The current architecture is a standard Lambda architecture; the offline part consists of Spark SQL + DataX. Aerospike is used as the KV storage system. Its main difference from Redis is that it uses SSDs as primary storage; in our tests, its read and write performance in most scenarios is on the same order of magnitude as Redis.
Real-time part: Flink is used as the computing engine. The user workflow is as follows:
Register data sources: the currently supported real-time data sources are Kafka and Aerospike. Aerospike data consisting of offline or real-time features configured on the platform is registered automatically; Kafka data sources require uploading a corresponding schemaSample file.
Computing logic: expressed in SQL
Define the output: the Aerospike table to write to, any Kafka topic that may be needed, and the key used when pushing Update or Insert data
After the user completes the steps above, the platform writes all of this information into a JSON configuration file, then submits that file together with the pre-built flinkTemplate.jar (which contains all the Flink capabilities the platform needs) to YARN to start the Flink job.
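A hypothetical sketch of what such a JSON configuration file might contain, built here in Python for readability. All field names are illustrative assumptions, not the platform's actual schema.

```python
import json

# Hypothetical per-job config covering the three user steps above:
# source registration, SQL logic, and output definition.
job_config = {
    "source": {
        "type": "kafka",
        "topic": "user_events",                      # assumed topic name
        "schema_sample": "user_events.schemaSample.json",
    },
    "logic": {
        # Computation expressed in SQL, as on the platform.
        "sql": "SELECT user_id, COUNT(*) AS cnt "
               "FROM events GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)",
    },
    "sink": {
        "aerospike_table": "features.user_cnt_1m",   # output table
        "kafka_topic": "feature_updates",            # optional downstream topic
        "upsert_key": "user_id",                     # key for Update/Insert pushes
    },
}

# This text is what would be submitted alongside flinkTemplate.jar.
config_text = json.dumps(job_config, indent=2)
```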
5. Platform feature showcase
1) Data source registration
2) Real-time feature editing: basic information
3) Real-time feature editing: data source selection
4) Real-time feature editing: SQL computation
5) Real-time feature editing: output selection
II. Reasons for choosing Flink
Next, the reasons for choosing Flink as the feature platform's computing engine, considered along three dimensions: minimum latency, fault tolerance, and SQL feature maturity.
Latency: Storm and Flink are pure streaming engines with minimum latencies at the millisecond level. Spark's pure streaming mechanism, continuous mode, can also reach millisecond-level minimum latency.
Fault tolerance: Storm uses an XOR-based ack mechanism and supports at-least-once, so messages may be duplicated but are not lost. Spark provides exactly-once through checkpointing and the WAL. Flink achieves exactly-once through checkpoints and savepoints.
SQL maturity: in Storm's current version, SQL is still experimental and supports neither aggregation nor join. Spark now provides most SQL functionality but does not support distinct, limit, or order by on aggregated results. Flink's community version provides SQL but does not support distinct aggregates.
III. Flink in practice
1. A practical example
2. Compatibility development: Flink provides no read/write support for Aerospike, so secondary development was needed.
3. Pitfalls encountered
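To illustrate what the Aerospike secondary development involves, here is a minimal Python sketch of the write path a custom sink typically needs: buffer records, flush in batches, and flush again on checkpoint/close. The real connector would be a Flink SinkFunction in Java using the Aerospike client; the in-memory client below is purely a stand-in, and all names are illustrative.

```python
class BufferingAerospikeSink:
    """Sketch of a batching sink, mirroring the shape of a custom
    Flink sink for Aerospike. `client` only needs a put(key, record)
    method; here an in-memory stand-in replaces the real client."""

    def __init__(self, client, batch_size=3):
        self.client = client
        self.batch_size = batch_size
        self.buffer = []

    def invoke(self, key, record):
        # Called once per element, like SinkFunction.invoke in Flink.
        self.buffer.append((key, record))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Also called on checkpoint/close so no buffered writes are lost.
        for key, record in self.buffer:
            self.client.put(key, record)
        self.buffer.clear()


class InMemoryClient:
    """Stand-in for the Aerospike client (illustration only)."""
    def __init__(self):
        self.store = {}

    def put(self, key, record):
        self.store[key] = record
```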
IV. Current results of the platform & future plans
Current results: the launch cycle for a real-time feature has been reduced from an average of 3-5 days to the hourly level.
Future plans:
Improve the feature platform's functionality: feature fusion, etc.
Simplify the workflow and improve the user experience
Further extend the SQL functionality as requirements demand, e.g. supporting window start-time offsets and count-triggered windows
Later, describe model deployment and model training through SQL or a DSL
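The count-triggered window mentioned in the plans above can be sketched as follows. This is an illustrative Python model of the semantics only (fire an early partial aggregate every N elements, plus a final result when the window closes), not Flink's actual trigger API.

```python
class CountTriggerWindow:
    """Minimal model of a count trigger on a sum-aggregating window:
    emit an early partial result every `count` elements instead of
    waiting for the window to close (names are illustrative)."""

    def __init__(self, count):
        self.count = count
        self.elements = []
        self.fired = []          # early partial results

    def add(self, value):
        self.elements.append(value)
        if len(self.elements) % self.count == 0:
            # Count trigger fires: emit the aggregate so far.
            self.fired.append(sum(self.elements))

    def close(self):
        # Final result when the window ends.
        return sum(self.elements)
```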
For more information, please visit the Apache Flink Chinese Community website.