Application case | from Storm to Flink, there are five years of practice of improving real-time computing efficiency. 07/15 Update SLTechnology News&Howtos

Application case | from Storm to Flink, there are five years of practice of improving real-time computing efficiency.

2025-07-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/03 Report--

Author | he Fei

Company introduction: Youzan is a merchant service company that provides e-commerce solutions for the whole industry and scene. In Youzang, a large number of business scenarios rely on the processing of real-time data. As a kind of basic technical components, it serves dozens of business products and hundreds of real-time computing tasks, including large screen of transaction data, real-time statistical analysis of commodities, log platform, call chain, risk control and other business scenarios. This article will introduce the current development of Youzan real-time computing and the current real-time computing technology architecture.

1. The development of real-time computing

From a technology stack perspective, our choices are consistent with most Internet companies, from early Storm to JStorm, Spark Streaming and recently emerging Flink. In terms of the development stage, it has mainly gone through two stages, the initial stage and the platform stage; the following will introduce the development process of real-time computing according to the timeline in the following figure.

Cdn.xitu.io/2019/7/4/16bbcbc639be0c43?w=800&h=427&f=png&s=70205 ">

1.1 initial stage

The basic characteristics of the initial stage here are the lack of overall real-time computing planning, the lack of platform-based task management, monitoring and alarm tools, and the fact that users submit tasks directly by logging in to the AG server and using command-line commands to submit tasks to the online cluster, which is difficult to meet the usability requirements of users. However, a large number of internal real-time computing scenarios have been accumulated in the initial stage.

1.1.1 Storm debut

At the beginning of 2014, the first Storm application began to be used in Youzai. The initial scenario is to decouple the statistics of real-time events from the business logic. The Storm application does real-time calculation by listening to MySQL's binlog update events, and then updates the results to MySQL or Redis cache for online system use. Similar scenarios have been recognized by business development, and gradually began to support a large number of business scenarios.

In the early days, users logged in to a group of AG servers in the online environment and submitted tasks to the Storm cluster through the client of Storm. In this way, Storm components accumulated nearly 100 real-time applications in more than two years. Storm also exposes a lot of problems, mainly reflected in the system throughput, which is huge for throughput, but not sensitive to latency.

1.1.2 introduction of Spark Streaming

At the end of 2016, as the Spark technology stack became more and more mature, and because the Storm engine itself had obvious disadvantages compared with Spark Streaming technology stack in throughput / performance, some business teams began to try new streaming computing engines. Because of the experience of offline computing and a large number of Spark tasks, Spark Streaming naturally becomes the first choice. With the access of real-time applications of business log system and buried point log system, a large number of business parties begin to access gradually. Like Storm, after completing the development of real-time computing tasks, the business side submits the tasks to big data Yarn cluster through a group of AG servers and using Spark clients.

The initial phase lasts a long time. Around the end of 2017, the deployment of Youzan real-time computing is shown below:

1.1.3 Summary

This architecture is not a big problem in the case of a small amount of business, but with the increase in the number of tasks on the application side, some problems in operation and maintenance are exposed, mainly in the following aspects:

Lack of business management mechanism. Big data team platform group, as a cluster manager, is very difficult to understand the business ownership of real-time tasks running on the current cluster, which leads to when the cluster has usability problems or the cluster needs to make changes and upgrades. unable to efficiently notify the business side to deal with, the communication cost is very high The monitoring and alarm of Storm and Spark Streaming are implemented respectively, and they are in the instrumental stage. For the sake of availability, many business parties will customize their own monitoring and alarm tools, resulting in many repeated wheels, affecting development efficiency; computing resources are not isolated. Resource management is rough, and there is no isolation between offline system and real-time system; early offline tasks and Spark Streaming tasks run on the same group of Yarn resources. At the peak of offline tasks in the early morning, although there is Queue isolation for CapacityScheduler in the Yarn layer, it is inevitable that the network card and disk IO layer of the common physical machine in the HDFS layer will interact with each other, resulting in a large number of delays in real-time tasks in the early morning; lack of flexible resource scheduling. The user starts the real-time task through the AG server, and the cluster resources used by the task are also specified in the startup script. This method has great drawbacks in system availability. When the Yarn resource pool where real-time computing is located fails, it is difficult to switch between clusters of real-time tasks.

Generally speaking, there is a lack of a unified real-time computing platform to manage all aspects of real-time computing.

1.2 platform phase 1.2.1 build a real-time computing platform

In the previous section, in the face of the four problems mentioned above, the initial requirements for a real-time computing platform are as follows:

Business management function. It mainly records the relevant information of real-time applications and correlates with business interfaces; provides task-level monitoring, task failure automatically pulls up, user-defined alarms based on indicators such as delay / throughput, traffic trend market, etc.; make cluster planning to build independent computing Yarn clusters for real-time applications to avoid the interaction between offline tasks and real-time tasks Provide task zero switching computing cluster to ensure that in case of cluster failure, it is convenient to migrate tasks to other clusters for temporary shelter.

So at the beginning of 18, we set up a project to do the first phase of the real-time platform, as an attempt, we only completed the support for Spark Streaming real-time computing tasks, and completed the migration of all Spark Streaming tasks in a short time. After 2 months of trial operation, it is obvious that the control over the business has become stronger. Then support for Storm tasks began and all Storm real-time computing tasks were migrated. All the AG servers are offline, and the business side no longer needs to log in to the server to submit tasks.

In 2018, it is praised that the real-time tasks of the two computing engines of Storm,Spark Streaming are running online, which can meet most of the business needs, but the two engines have their own problems. Storm itself has the limitation of throughput capacity. Compared with Spark Streaming, the choice seems more difficult. We mainly consider it from the following angles:

Delay, Flink wins, Spark Streaming is essentially a micro-batch computing framework, processing delay is generally the same as Batch Interval, generally at the level of seconds, in a good re-throughput scenario, the size of batch is about 15 seconds.

Throughput, after actual testing, under the same conditions, the throughput of Flink will be slightly lower than that of Spark Streaming, but there is almost no difference in state storage support. Flink wins in this respect. For state data with a large amount of data, Flink can choose to directly store local memory or RocksDB of computing nodes to make full use of physical resources.

For the support of SQL, the latest stable versions of SQL functions of the two frameworks at that time were investigated, and it was found that Flink also had great advantages in supporting SQL, mainly in supporting more syntax.

API flexibility, Flink's real-time computing API will be more friendly.

For the above reasons, Youzan began to add support for the Flink engine in the real-time platform. After the integration of the Flink engine is completed, the deployment of Youzan real-time computing is shown below:

1.2.2 New challenges

After the above is completed, stable / reliable real-time computing services can basically be provided; then, the problem of business development efficiency begins to stand out. The general access process for users consists of the following steps:

It takes about half a day to be familiar with the use of SDK in a specific real-time computing framework; it takes about a few hours to apply for upstream and downstream resources for real-time tasks, such as message queues, Redis/MySQL/HBase and other online resources; real-time task development, testing, depending on complexity, generally takes about 1 to 3 days For complex real-time development tasks, it is difficult to guarantee the code quality of real-time tasks, and it is very difficult for platform groups to do code review for each business party, so there are often improper applications released online after small traffic tests in the test environment are normal, causing a variety of problems.

According to the whole calculation, the whole process will take at least 2-3 days, and the efficiency of real-time application access has gradually become the most thorny problem at present. On this question. After doing a lot of research work, we finally determined two directions of real-time computing:

SQL of real-time tasks; for general real-time data analysis scenarios, other technology stacks are introduced to cover simple scenarios. 1.2.2.1 SQL of real-time tasks

SQL of real-time tasks can greatly simplify the cost of business development and shorten the online cycle of real-time tasks. SQL for real-time tasks is based on the Flink engine and is currently under construction. Our current plan is to support the following features first:

Stream-to-stream real-time task development based on Kafka stream support for UDF in HBaseSink-based stream-to-storage SQL task development

At present, the support of real-time tasks based on SQL is in progress.

1.2.2.2 introducing a real-time OLAP engine

Through the observation of the business, we find that in the real-time application of business, there are a large number of requirements for statistics of uv,pv class statistics in different dimensions, and the pattern is relatively fixed. For this kind of demand, we focus on supporting real-time data update and storage engine on real-time Olap query.

We mainly investigated two technology stacks of Kudu,Druid. The former is the C++ implementation and the distributed storage engine, which can efficiently do Olap query and support detailed data query. The latter is the pre-aggregate Olap query engine for event data implemented by Java.

After comprehensively considering the operation and maintenance costs, the integration with the current technology stack, query performance, and supporting scenarios, Druid is finally selected.

At present, the overall technical architecture of real-time computing is as follows:

two。 Future planning

The first thing to do is to SQL real-time tasks to improve the business scenarios that SQL tasks can cover (the target is 70%), so as to empower the business from the point of view of improving the efficiency of business development.

After the initial completion of the real-time task of SQL, the reuse of stream data has become the highest ROI measure to improve efficiency. It is planned to start the construction of real-time data warehouse, as shown in the following figure:

Of course, a complete real-time data warehouse is by no means so simple, not only for the infrastructure related to real-time computing to reach a certain level of platform, but also on the construction of supporting components such as real-time metadata management, real-time data quality management and so on. It's a long way to go.

3. Summary

Youzan real-time computing is pushed forward under the needs of the business side, and the technical direction is constantly adjusting towards the current direction of the highest input-output ratio at different stages. This article does not go into the technical details, but describes the development process of real-time computing along the timeline. In some places, it is inevitable to make mistakes because of the author's limited knowledge. Colleagues are welcome to point out.

4. Introduction to the author

He Fei, who joined Youzanda data team-basic platform Group in July 2017, has been responsible for the landing of Zanda HBase storage and the platform work of various components of the data foundation. Likes big data team is one of the core technical teams of likes sharing technology, the team is mainly composed of algorithms, data products, data warehouse and underlying basic platform four teams, a total of 50 excellent engineers.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.