Article | Pan Guoqing, head of the real-time computing platform, Ctrip Big Data Platform
This article describes the architecture and practice of Ctrip's real-time computing platform in five parts: an overview of the Ctrip big data platform, architecture design and implementation, pitfalls encountered and how they were fixed, detailed real-time computing application scenarios, and future plans. We hope it can serve as a reference for companies and engineers who need to build a real-time data platform.
I. Overall architecture of the Ctrip big data platform
The Ctrip big data platform is structured in three layers:
Application layer: the development platform Zeus (comprising the scheduling system, the DataX data transmission system, the master data system and the data quality system), the query platform (ArtNova report system, Adhoc queries), machine learning (built on open-source frameworks such as TensorFlow and Spark, plus a GPU cloud platform based on K8S), and the real-time computing platform Muise.
Middle layer: built on open-source big data infrastructure, divided into distributed storage/computing frameworks and real-time computing frameworks.
The offline side is based on Hadoop: distributed storage on HDFS, distributed offline computing on Hive and Spark, KV storage on HBase, and Presto and Kylin for Adhoc queries and the report system.
The real-time computing frameworks sit on top of message queues: Hermes, Ctrip's system built on Kafka, and QMQ, a message queue developed in-house at Ctrip. QMQ is mainly used in the order and transaction systems, where it guarantees 100% that no message is lost.
Bottom layer: resource and operations monitoring, divided into the automated operations system, big data infrastructure monitoring and big data business monitoring.
II. Architecture design and implementation
1. Introduction to the Muise platform
1) What is Muise
Muise, named after the Muses, the goddesses of the arts in Greek mythology, is Ctrip's platform for real-time data analysis and processing. Under the hood, Muise is built on message queues and the open-source stream processing systems JStorm, Spark Streaming and Flink, and supports streaming data processing with second-level or even millisecond-level latency.
2) Functions of Muise
Data sources: Hermes Kafka/MySQL, QMQ
Data processing: Muise provides JStorm/Spark/Flink Core APIs for consuming Hermes or QMQ data; under the hood the data is processed by JStorm, Spark or Flink, and the platform exposes its own encapsulated API to users. The API connects to all the data source systems, so users can consume them directly.
Job management: the Portal manages JStorm, Spark Streaming and Flink jobs, including creating jobs, uploading jar packages and releasing to production
Monitoring and alerting: uses the metric frameworks provided by JStorm, Spark and Flink and supports custom metrics; metric information is managed centrally and fed into Ops's monitoring and alerting system, providing comprehensive monitoring and alerting so that users learn of job problems as soon as they occur.
2. Current status of the Muise platform
Current status of the platform:
JStorm 2.1.1, Spark 2.0.1, Flink 1.6.0, Kafka 2.0
Cluster size:
13 clusters, 200+ machines: 150+ JStorm, 50+ Yarn, 100+ Kafka
Job size:
11 business lines, 350+ JStorm jobs, 120+ Spark Streaming/Flink jobs
Message size:
1300+ topics, 100 TB+ of incremental data per day, 200K TPS on average, 900K TPS at peak
Message delay:
Within 200 ms inside Hermes, within 20 ms inside Storm
Message processing success rate:
99.99%.
3. Evolution of the Muise platform
2015 Q2-2015 Q3: built the real-time computing platform on Storm
2016 Q1-2016 Q2: migrated from Storm to JStorm and introduced StreamCQL
2017 Q1-2017 Q2: Spark Streaming evaluation and onboarding
2017 Q3-2018 Q1: Flink evaluation and onboarding.
4. Muise platform architecture
1) Muise platform architecture
Application layer: at present Muise Portal mainly manages Storm and Spark Streaming jobs, supporting features such as creating jobs, releasing jar packages, and starting and stopping jobs.
Middle layer: encapsulates the underlying infrastructure and provides users with APIs based on Storm, Spark and Flink, together with various supporting services.
Bottom layer: Hermes & QMQ serve as data sources; Redis, HBase, HDFS, DB, etc. act as external data stores; Graphite, Grafana and ES are mainly used for monitoring.
2) Muise real-time computing process
Producer side: users first apply for a Kafka topic and then write data to Kafka in real time.
Muise Portal side: users develop against the API we provide; once development is done, they configure, upload and start the job through the Muise Portal. After a job starts, its jar package is distributed to the corresponding cluster and begins consuming Kafka data.
Storage side: after consumption, data can be written back to QMQ or Kafka, or stored in external systems such as Redis, HBase, HDFS/Hive and DB.
5. Platform design: ease of use
First of all, a platform must be easy to use. We provide a comprehensive Portal that makes it easy for users to build and manage their jobs, so that real-time jobs can be developed and put online as quickly as possible.
Second, we encapsulate a rich set of Core APIs and support multiple real-time computing frameworks (a sketch of what a job written against such an API might look like follows this list):
Support for Hermes Kafka/MySQL and QMQ
Integration with JStorm, Spark Streaming and Flink
Job resource control
DB, Redis, HBase and HDFS output components
Multiple built-in metrics, based on each framework's metric system, for job monitoring and alerting
User-defined custom metrics for monitoring and alerting
At Least Once and Exactly Once semantics are supported.
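As an illustration only, the sketch below shows the general shape such a wrapped Core API could take: the user supplies per-record business logic while the platform owns sources, sinks, resources and metrics. All names here (StreamJob, ErrorFilterJob) are hypothetical and are not Muise's actual interfaces.

```scala
// Hypothetical sketch only: illustrates the shape of a wrapped "Core API",
// not Muise's real interfaces. All names are invented for illustration.
object CoreApiSketch {

  // A user job only supplies per-record business logic; the platform owns
  // source/sink wiring, resource limits, metrics and delivery semantics.
  trait StreamJob[IN, OUT] {
    def process(record: IN): Seq[OUT]                     // business logic
    def onMetric(name: String, value: Long): Unit = ()    // optional custom metric hook
  }

  // Example user job: keep only error events from a stream of log lines.
  class ErrorFilterJob extends StreamJob[String, String] {
    override def process(record: String): Seq[String] =
      if (record.contains("ERROR")) Seq(record) else Seq.empty
  }

  def main(args: Array[String]): Unit = {
    // On the real platform the job would be submitted through the Portal;
    // here we just run the logic on a few sample records.
    val job = new ErrorFilterJob
    val sample = Seq("INFO ok", "ERROR timeout", "WARN slow")
    sample.flatMap(job.process).foreach(println)           // prints: ERROR timeout
  }
}
```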
Ease of use was covered above; the next topic is the platform's fault tolerance, which ensures data correctness.
6. Platform design: fault tolerance
JStorm: At Least Once guaranteed via the Acker mechanism
Spark Streaming: Exactly Once based on checkpointing, and At Least Once based on replaying Kafka offsets
Flink: Exactly Once based on Flink's two-phase commit combined with Kafka 0.11 transactional support.
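As a concrete illustration of the Flink path, the sketch below enables checkpointing and writes to Kafka with the transactional (two-phase commit) producer from the Kafka 0.11 connector. The broker address and topic are placeholders; this is a minimal sketch rather than Muise's actual job code.

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

object ExactlyOnceSinkSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Two-phase commit is driven by checkpoints, so checkpointing must be enabled.
    env.enableCheckpointing(60000)

    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka-broker:9092")  // placeholder
    // The transaction timeout must cover the checkpoint interval and must not
    // exceed the broker's transaction.max.timeout.ms (15 minutes by default).
    props.setProperty("transaction.timeout.ms", "900000")

    val producer = new FlinkKafkaProducer011[String](
      "output-topic",                                            // placeholder topic
      new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),
      props,
      FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)               // transactional writes

    env.fromElements("a", "b", "c").addSink(producer)
    env.execute("exactly-once-sink-sketch")
  }
}
```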
7. Exactly Once
1) Direct Approach
Today most Spark Streaming users consume Kafka via the Direct Approach:
Advantages: the offsets consumed by each batch are recorded, so a job can be replayed from those offsets
Disadvantages: saving the data and saving the offsets are two separate, non-atomic steps:
the data is saved successfully but the application goes down before the offsets are saved (resulting in duplicate data)
the offsets are saved successfully but the application goes down before the data is saved (resulting in data loss)
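A minimal sketch of the Direct Approach with the spark-streaming-kafka-0-10 connector is shown below; broker, topic, group and output path are placeholders. Committing offsets only after the batch output has been written narrows the window described above, but because the two steps are still not atomic, duplicates remain possible after a crash.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object DirectApproachSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("direct-approach-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka-broker:9092",   // placeholder
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "muise-demo-group",    // placeholder
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("input-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Offsets of this batch, available because the RDD came from the direct stream.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // 1) write the batch out first ...
      rdd.map(_.value()).saveAsTextFile(s"/tmp/output-${System.currentTimeMillis()}")
      // 2) ... then commit offsets; a crash between the two steps re-reads the batch.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```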
2) CheckPoint
Advantages: by default the running state and source data of each batch are recorded, and after a crash the job can be recovered from the checkpoint directory.
Disadvantages:
1. Exactly Once is not 100% guaranteed:
https://www.iteblog.com/archives/1795 describes a scenario where Exactly Once cannot be guaranteed.
https://issues.apache.org/jira/browse/SPARK-17606 reports blocks being lost during doCheckPoint.
2. Enabling checkpointing adds extra performance overhead.
3. If the streaming job's logic changes, the job cannot be recovered from the checkpoint.
Applicable scenarios: more suitable for stateful computing scenarios
How to use: it is recommended that the application store offsets on its own. After an outage, if the Spark code logic has not changed, recreate the StreamingContext from the checkpoint directory; if it has changed, create a new context from the offsets the application stored itself and set up a new checkpoint directory.
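The sketch below illustrates the two recovery paths just described: rebuilding the context from the checkpoint directory when the code has not changed, and seeding a fresh context from offsets the application stored itself when it has. The checkpoint path, topic and stored offsets are placeholders, and the offset store itself (Redis, DB, etc.) is left out.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object CheckpointRecoverySketch {
  val checkpointDir = "/tmp/cp-demo"   // placeholder path

  // Path A: code unchanged -> rebuild the context from the checkpoint directory.
  // In a real job the full DStream DAG (sources, transforms, outputs) must be
  // defined inside createContext() for this to restore correctly.
  def recoverFromCheckpoint(): StreamingContext =
    StreamingContext.getOrCreate(checkpointDir, createContext _)

  // Path B: code changed -> build a fresh context and seed it with offsets the
  // application persisted itself (e.g. in Redis or a DB); values are placeholders.
  def recoverFromStoredOffsets(kafkaParams: Map[String, Object]): StreamingContext = {
    val ssc = createContext()
    val storedOffsets = Map(new TopicPartition("input-topic", 0) -> 12345L)
    KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("input-topic"), kafkaParams, storedOffsets))
    // attach the processing logic to the stream as usual before starting ssc
    ssc
  }

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("cp-sketch"), Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc
  }
}
```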
8. Platform design: monitoring and alerting
Helping users discover job problems as soon as they occur is the top priority.
Cluster monitoring
Server monitoring: memory, CPU, disk I/O and network I/O are covered
Platform monitoring: Ganglia
Job monitoring
The native metric systems of the real-time computing frameworks
Custom metrics that reflect job status
Both native and custom metrics are collected for monitoring and alerting
Storage: Graphite; display: Grafana; alerting: Appmon
Among the many metrics we have defined, the most common are:
Fail: the number of JStorm processing failures or failed Spark tasks within a fixed time window
Ack: the amount of data processed within a fixed time window
Lag: the delay between when data is produced and when it is consumed, within a fixed time window (with Kafka 2.0 this is based on the record's built-in bornTime).
Ctrip has built its own alerting system; metrics are fed into it and alerts are fired based on rules, while job monitoring dashboards are used to view the related indicators. Taking Flink as an example, the metrics we care about most are all written to Graphite and then displayed through Grafana. On the job dashboard we can directly see the Kafka-to-Flink delay (Lag), that is, the time from when data is produced until it is consumed by the Flink job; here it is 62 milliseconds, which is quite fast. We also monitor how long each fetch from Kafka takes: since data is pulled in small chunks, we set the fetch size to 2 MB per pull, and the dashboard shows an average pull latency of 25 milliseconds with a maximum of 760 milliseconds.
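As an illustration of how such a Lag metric could be wired into Flink's metric system (and from there into Graphite/Grafana), here is a minimal sketch; the Event type with its bornTime field is a hypothetical stand-in for whatever timestamp the record actually carries.

```scala
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Gauge

// Hypothetical event type: payload plus the timestamp attached when the record was produced.
case class Event(payload: String, bornTime: Long)

// Reports "lag" = now - bornTime of the latest record through Flink's metric system,
// which the platform can then scrape into Graphite/Grafana and alert on.
class LagReportingMap extends RichMapFunction[Event, Event] {
  @transient private var lastLagMs: Long = 0L

  override def open(parameters: Configuration): Unit = {
    getRuntimeContext.getMetricGroup
      .gauge[Long, Gauge[Long]]("lag", new Gauge[Long] {
        override def getValue: Long = lastLagMs
      })
  }

  override def map(value: Event): Event = {
    lastLagMs = System.currentTimeMillis() - value.bornTime
    value
  }
}
```

The function would simply be applied with .map(new LagReportingMap) somewhere in the job's pipeline.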
Next, we cover some of the pitfalls we have run into over the past few years and how we dealt with them.
III. Pitfalls and fixes
Pit 1: Hermes UBT has a huge volume of data and very large messages, putting both the server and the client under great pressure.
Solution: provide unified splitting jobs that route data to different topics based on specific rules and configuration.
Pit 2: Kafka cannot guarantee global ordering
Solution: where strict global ordering is required, use a single partition; where only partial ordering is needed, hash on a chosen field so that order is preserved within each partition.
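A minimal sketch of the partial-ordering approach: keying each record by the field whose order matters (here a hypothetical orderId) so that Kafka's default partitioner hashes it to a fixed partition.

```scala
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object KeyedOrderingSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")            // placeholder
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)
    val orderId = "order-42"                                        // hypothetical key field
    // Records sharing the same key hash to the same partition, so they stay in order
    // relative to each other even though the topic as a whole is not globally ordered.
    producer.send(new ProducerRecord("order-events", orderId, "CREATED"))
    producer.send(new ProducerRecord("order-events", orderId, "PAID"))
    producer.close()
  }
}
```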
Pit 3: Kafka cannot accurately rewind consumption to a specific point in time
Solution: the platform provides filtering that drops data earlier than the configured start time (since Kafka 0.10 every record carries its own timestamp, so this problem disappears after upgrading Kafka).
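A small sketch of such a filter, assuming the stream is the ConsumerRecord DStream produced by the Direct Approach sketch earlier; since Kafka 0.10 every record exposes timestamp(), which can be compared against the requested start time.

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

object TimeFilterSketch {
  // Drops records produced before `startMillis`, relying on the timestamp Kafka
  // attaches to every record since 0.10. `stream` would typically be the direct
  // stream created in the earlier sketch.
  def filterSince(stream: DStream[ConsumerRecord[String, String]],
                  startMillis: Long): DStream[ConsumerRecord[String, String]] =
    stream.filter(record => record.timestamp() >= startMillis)
}
```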
Pit 4: initially, all of Ctrip's Spark Streaming and Flink jobs ran on the host group, a large Hadoop cluster that now carries thousands of jobs with offline and real-time workloads mixed together. Whenever a large offline job kicks in, real-time jobs are affected; in addition, the Hadoop cluster is upgraded regularly, and restarting the NameNode or NodeManagers would sometimes kill jobs.
Solution: deploy separately. We built a dedicated real-time cluster that runs real-time jobs independently: offline stays with offline, real-time with real-time. The real-time cluster runs Spark Streaming and Flink jobs on its own Yarn, while the offline cluster handles only offline jobs.
With separate deployment a new problem appears: some real-time jobs need the output of offline jobs for joins or feature lookups, so they still have to reach host-group data, which amounts to a cross-cluster access problem.
Pit 5: the real-time Hadoop cluster needs cross-cluster access to the host group
Solution: configure dual namespaces, ns-prod and ns, in hdfs-site.xml, pointing to the local cluster and the host group respectively
On the Spark side, configure spark.yarn.access.namenodes or spark.yarn.access.hadoopFileSystems
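On the Spark side this can be expressed as a plain configuration entry so that YARN obtains delegation tokens for both clusters. A sketch with placeholder namespace names matching the ns/ns-prod convention above; spark.yarn.access.hadoopFileSystems is the newer name for the same setting in later Spark versions.

```scala
import org.apache.spark.SparkConf

object CrossClusterConfSketch {
  // Placeholder namespaces mirroring the ns / ns-prod naming above.
  val conf = new SparkConf()
    .setAppName("cross-cluster-sketch")
    // Older Spark releases use spark.yarn.access.namenodes; newer ones use
    // spark.yarn.access.hadoopFileSystems. Both list the remote namespaces.
    .set("spark.yarn.access.namenodes", "hdfs://ns,hdfs://ns-prod")
}
```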
Pit 6: both JStorm and Storm suffer from CPU over-use: a large, CPU-heavy job may be allocated one worker and one CPU core yet end up consuming three or even four cores.
Solution: enable cgroups to limit CPU usage.
IV. Application scenarios
1. Real-time report statistics
Real-time report statistics and display is also a widely used Spark Streaming scenario. Data can be aggregated by Processing Time or by Event Time. Because consecutive Spark Streaming batches can themselves be viewed as a tumbling window, and a single window may contain data from multiple time periods, there are some limitations when aggregating by Event Time with Spark Streaming. The common approach is therefore to compute, within each batch, the incremental values for the different time dimensions and write them to an external system such as ES; when the report is rendered, a secondary aggregation by time yields the complete cumulative values. The figure below shows the real-time dashboard that Ctrip IBU built on Spark Streaming.
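A sketch of the per-batch aggregation pattern described above: each batch produces partial counts keyed by (dimension, time bucket) and hands them to an external store (ES in the text), where the secondary aggregation happens at report time. The Booking event, the queue-based stand-in source and the println sink are placeholders.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RealtimeReportSketch {
  // Hypothetical event: a booking with an event time and an order-type dimension.
  case class Booking(orderType: String, eventTimeMillis: Long)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("report-sketch"), Seconds(30))

    // Stand-in source so the sketch is self-contained; in practice this would be
    // the Kafka direct stream shown earlier.
    val queue = mutable.Queue[RDD[Booking]]()
    val bookings = ssc.queueStream(queue)

    bookings
      // Bucket by (dimension, minute of event time) within the current batch.
      .map(b => ((b.orderType, b.eventTimeMillis / 60000 * 60000), 1L))
      .reduceByKey(_ + _)
      // Each batch emits partial counts; the report layer sums them again by time
      // (the "secondary aggregation") when the dashboard is rendered.
      .foreachRDD(_.foreach { case ((orderType, minute), count) =>
        println(s"$orderType $minute -> $count")   // stand-in for the ES write
      })

    ssc.start()
    ssc.awaitTermination()
  }
}
```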
2. Real-time data warehouse
1) Spark Streaming stores data in near real time
There are mature tools on the market, such as Camus and Flume, that consume data from Kafka in real time, filter and clean it, and finally land it in the corresponding storage system. Compared with these, the advantage of Spark Streaming is that it supports more complex processing logic, and resource scheduling on Yarn makes its resource allocation more flexible; users therefore use Spark Streaming to write data to HDFS or Hive in near real time.
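A small sketch of the near-real-time landing pattern: each micro-batch is written under a time-bucketed HDFS path that a Hive external table partitioned on the same layout can expose; the path scheme is a placeholder.

```scala
import org.apache.spark.streaming.dstream.DStream

object NearRealtimeLandingSketch {
  // Writes each micro-batch under a time-bucketed HDFS directory; a Hive external
  // table partitioned on the same path layout can expose the data for queries.
  def landToHdfs(cleaned: DStream[String], basePath: String): Unit =
    cleaned.foreachRDD { (rdd, batchTime) =>
      if (!rdd.isEmpty()) {
        val hourBucket = batchTime.milliseconds / 3600000 * 3600000
        rdd.saveAsTextFile(s"$basePath/dt=$hourBucket/batch=${batchTime.milliseconds}")
      }
    }
}
```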
2) Rule-based data quality checks
Based on Spark Streaming's custom metric capability, data quality is checked and monitored in terms of data volume, number of fields, data format and duplicate records.
3) Real-time alerting based on custom metrics
Rules are defined through our encapsulated metric registration system; every batch is checked against these rules, and the result is emitted through the metric sink, which then drives monitoring. We also use Flink to load TensorFlow models for real-time prediction; in terms of timeliness, an alert can usually be sent within two seconds of the data arriving, which gives users a very good experience.
V. Future planning
1. Flink on K8S
Ctrip runs several different computing workloads, including real-time computing, machine learning and offline computing, so a unified underlying framework is needed to manage them; Flink will therefore be migrated onto K8S for unified resource management.
2. Flink SQL on the Muise platform
Although the Muise platform already supports Flink, users still write code by hand. We have developed a real-time feature platform where users only need to write SQL, that is, Flink-based SQL, to compute the features they need for their models in real time. The real-time feature platform will later be merged with the real-time computing platform, so that eventually users only need to write SQL to implement all of their real-time jobs.
3. Fully enable cgroups for JStorm
For historical reasons many jobs still run on JStorm, where resource allocation is currently uneven, so cgroups will be fully enabled there as well.
4. Online model training
Some Ctrip departments need online model training in real time: a model is first trained with Spark, then loaded in Spark Streaming for real-time interception or control, applied in scenarios such as risk control.
-end-