Author: Yi Weiping (ele.me)
Edited by: Ji Ping (Alibaba Real-time Computing Department)
This article presents the real-time computing work done by the ele.me big data platform and the evolution of its computing engines, so that you can understand the strengths and weaknesses of Storm, Spark, and Flink. How should you choose a suitable real-time computing engine, and what advantages made Flink the first choice at ele.me? This article walks through these questions one by one.
Current situation of the platform
The following is the architecture diagram of the current ele.me platform:
[Figure: architecture diagram of the ele.me real-time computing platform]
Data from multiple sources is written to Kafka; the computing engines are mainly Storm, Spark, and Flink. The results from the computing engines are then written to various storage systems.
At present there are about 100 Storm tasks and about 50 Spark tasks; the number of Flink tasks is still relatively small.
Our cluster currently handles a daily data volume of 60 TB and about one billion computations, on 400 nodes. It is worth mentioning that both Spark and Flink run on YARN; Flink on YARN is mainly used for JobManager isolation between jobs, while Storm runs in standalone mode.
Application scenarios
1. Consistency semantics
Before discussing our application scenarios, let us first emphasize an important concept in real-time computing: consistency semantics.
1) At-most-once: fire and forget. When we write a Java application without considering source offset management or downstream idempotency, we get plain at-most-once: once data arrives, there is no ack mechanism, regardless of the intermediate state or whether the write succeeded.
2) At-least-once: a retransmission mechanism ensures that each piece of data is processed at least once.
3) Exactly-once: implemented through coarse-grained checkpoint control. Most of the time, the exactly-once we talk about refers to exactly-once inside the computing engine: whether each operator's internal state can be replayed, and whether a failed job can recover smoothly from its previous state. It does not cover the idempotency of the output sink.
4) At-least-once + idempotency = exactly-once: if the downstream guarantees idempotent operations, such as insert ... on duplicate key update in MySQL, or upsert semantics keyed on a primary key in ES or Cassandra, then at-least-once plus idempotency yields exactly-once, as the sketch below illustrates.
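As an illustration of that idempotent-sink idea, here is a minimal sketch in Scala over plain JDBC. The events table, its unique key on id, and the connection details are hypothetical placeholders; the point is only that replaying the same record overwrites the previous write instead of duplicating it:

```scala
import java.sql.DriverManager

// Minimal sketch of an idempotent sink: replaying the same (id, value)
// pair overwrites the previous row rather than duplicating it, so
// at-least-once delivery upstream still yields exactly-once results.
object IdempotentUpsert {
  // Table name, columns, and connection details are hypothetical.
  val upsertSql =
    """INSERT INTO events (id, value) VALUES (?, ?)
      |ON DUPLICATE KEY UPDATE value = VALUES(value)""".stripMargin

  def writeBatch(records: Seq[(String, String)]): Unit = {
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/demo", "user", "password")
    try {
      val stmt = conn.prepareStatement(upsertSql)
      records.foreach { case (id, value) =>
        stmt.setString(1, id)
        stmt.setString(2, value)
        stmt.addBatch()
      }
      stmt.executeBatch()
    } finally conn.close()
  }
}
```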
2. Storm
Ele.me used Storm in its early days: Storm was adopted in 2016, and Spark Streaming and Structured Streaming followed in 2017. Since Storm came first, it has the following characteristics:
1) Data is tuple-based.
2) Latency is at the millisecond level.
3) It mainly supports Java; Python and Go are now also supported through Apache Beam.
4) Its SQL functionality is incomplete. We encapsulated Typhon internally, so users only need to extend some of our interfaces to use many major features. Flux is a useful tool for Storm: you only need to write a YAML file to describe a Storm task. To some extent this meets certain needs, but it still requires users to be engineers who can write Java, so data analysts cannot use it.
★ 2.1 Summary
1) Ease of use: the high barrier to entry limited its adoption.
2) StateBackend: state requires external storage, such as KV stores like Redis.
3) Resource allocation: workers and slots are set in advance, and because few optimizations were done, the engine's throughput is relatively low.
3. Spark Streaming
One day a business team came to ask whether they could write a SQL statement and release a real-time computing task within a few minutes, so we started working on Spark Streaming. Its main concepts are as follows:
1) Micro-batch: you need to set a window in advance and then process the data within that window.
2) Latency is at the second level; in better cases it is around 500 ms.
3) The development languages are Java and Scala.
4) Streaming SQL: this is mainly our own work; we hope to provide a Streaming SQL platform.
Features:
1) Spark ecosystem and SparkSQL: this is where Spark excels. The technology stack is unified, and the SQL, graph computing, and machine learning packages are interoperable. Because Spark did batch processing first, unlike Flink, its real-time and offline APIs are naturally unified.
2) Checkpoints on HDFS (a minimal sketch combining this with the micro-batch setup follows this list).
3) On YARN: Spark belongs to the Hadoop ecosystem and integrates tightly with YARN.
4) High throughput: because of the micro-batch approach, throughput is relatively high.
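To make the micro-batch model and HDFS checkpointing concrete, here is a minimal Spark Streaming sketch. The broker address, topic name, consumer group, and checkpoint path are placeholder assumptions, not our platform's actual configuration:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object MicroBatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("micro-batch-sketch")
    // Each micro-batch collects 1 second of data before processing it.
    val ssc = new StreamingContext(conf, Seconds(1))
    // Checkpoint metadata (offsets, DAG) to HDFS for recovery.
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092", // placeholder
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sketch-group",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams))

    // Simple per-batch word count over the record values.
    stream.map(_.value)
      .flatMap(_.split("\\s+"))
      .map((_, 1L))
      .reduceByKey(_ + _)
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```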
To give you an overview, here are the steps a user of our platform follows to quickly publish a real-time task through the job page. Users do not write DDL and DML statements; instead, the UI presents a form.
On the page, the user is asked to select some necessary parameters, such as which Kafka cluster to consume from and the maximum consumption rate per partition; backpressure is also enabled by default. The consumption position must be specified by the user each time, so that when the task is resubmitted, the user can choose the offset to start from according to business needs.
In the middle, the user describes the pipeline in SQL. The input is one or more Kafka topics; the consumed Kafka DStream is registered as a table, and the user then writes a chain of SQL pipeline statements. Finally, the user chooses an output table. We encapsulate several external sinks for users (all of the storage systems mentioned earlier are supported; as long as a storage system can implement upsert semantics, we support it). A sketch of the underlying pattern follows.
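Behind such a page, each micro-batch can be registered as a temporary table and the user's SQL run against it. Here is a minimal sketch of that pattern; the Event case class and the SQL text are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

case class Event(userId: String, amount: Double)

object SqlOnDStream {
  // Turn each micro-batch of the DStream into a temp table and run the
  // user's SQL pipeline against it; the result would then go to a sink.
  def run(events: DStream[Event]): Unit = {
    events.foreachRDD { rdd =>
      val spark = SparkSession.builder
        .config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      rdd.toDF().createOrReplaceTempView("events")
      val result = spark.sql(
        "SELECT userId, SUM(amount) AS total FROM events GROUP BY userId")
      result.show() // in practice: write to an upsert-capable sink
    }
  }
}
```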
★ 3.1 MultiStream-Join
Although this met general stateless batch-style computing requirements, some users asked how to do a stream join. In the early Spark 1.5 days you could refer to the open-source spark-streamingsql project, which registered a DStream as a table and then performed join operations on that table, but it only supported versions before 1.5 and was abandoned after Spark 2.0 introduced Structured Streaming. So we used a somewhat tricky approach:
We let Spark Streaming consume multiple topics, but convert each batch's RDD in the consumed DStream into a DataFrame according to certain conditions, so it can be registered as a table. By splitting it into two tables based on specific conditions, we can then do a simple join. The join depends entirely on the data in the current batch, and the join conditions are not fully controllable, which is what makes this approach tricky. For example, as sketched below, we consume two topics, split the stream into two tables with a simple filter condition, and then join the two tables; essentially, however, it is still a single stream.
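A minimal sketch of that trick follows. The topic names ("orders" and "payments"), the topic column used as the filter condition, and the choice of the record key as the join key are all illustrative assumptions:

```scala
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

object BatchJoinSketch {
  // One DStream consumes both topics; within each micro-batch we split
  // the records by topic into two tables and join them. The join only
  // sees data that happens to arrive in the same batch -- exactly why
  // the approach is "tricky".
  def run(stream: DStream[ConsumerRecord[String, String]]): Unit = {
    stream.foreachRDD { rdd =>
      val spark = SparkSession.builder
        .config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      // (topic, key, value) rows; the key is assumed to be the join key.
      val df = rdd.map(r => (r.topic, r.key, r.value))
        .toDF("topic", "key", "value")

      // Hypothetical topic names used as the filter condition.
      val orders   = df.filter($"topic" === "orders")
      val payments = df.filter($"topic" === "payments")

      orders.as("o")
        .join(payments.as("p"), $"o.key" === $"p.key")
        .select($"o.key", $"o.value".as("order"), $"p.value".as("payment"))
        .show()
    }
  }
}
```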
★ 3.2 Exactly-once
Exactly-once requires special attention to one point:
The data must be written to external storage before the offset is committed. Whether the offsets are kept in ZooKeeper or MySQL, it is best to guarantee that the output write and the offset commit happen in a single transaction, and that the output to external storage has upsert semantics (implemented via a unique key). The driver on the source side then generates the Kafka RDD from the stored offsets, and the executors consume data according to the offset of each Kafka partition. If these conditions are met, end-to-end exactly-once can be implemented; this is the major premise.
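A minimal sketch of this premise, assuming hypothetical results and offsets tables in MySQL: both the upserted results and the per-partition offsets are written in one transaction, so a failure rolls back both together:

```scala
import java.sql.DriverManager
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka010.HasOffsetRanges

object TransactionalSink {
  def run(stream: DStream[ConsumerRecord[String, String]]): Unit = {
    stream.foreachRDD { rdd =>
      // Offset ranges covered by this batch, taken from the Kafka RDD
      // itself (must be the un-shuffled RDD from the direct stream).
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val rows = rdd.map(r => (r.key, r.value)).collect() // small batches only

      val conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/demo", "user", "password")
      try {
        conn.setAutoCommit(false)
        // Upsert the results, keyed on a unique column.
        val upsert = conn.prepareStatement(
          "INSERT INTO results (k, v) VALUES (?, ?) " +
          "ON DUPLICATE KEY UPDATE v = VALUES(v)")
        rows.foreach { case (k, v) =>
          upsert.setString(1, k); upsert.setString(2, v); upsert.addBatch()
        }
        upsert.executeBatch()

        // Store the offsets in the same transaction as the results.
        val saveOffset = conn.prepareStatement(
          "INSERT INTO offsets (topic, part, off) VALUES (?, ?, ?) " +
          "ON DUPLICATE KEY UPDATE off = VALUES(off)")
        offsetRanges.foreach { o =>
          saveOffset.setString(1, o.topic)
          saveOffset.setInt(2, o.partition)
          saveOffset.setLong(3, o.untilOffset)
          saveOffset.addBatch()
        }
        saveOffset.executeBatch()

        conn.commit() // results and offsets land atomically
      } catch {
        case e: Exception => conn.rollback(); throw e
      } finally conn.close()
    }
  }
}
```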
★ 3.3 Summary
1) Stateful Processing SQL (