2025-01-19 Update From: SLTechnology News & Howtos
Shulou(Shulou.com)06/03 Report--
As 2018 draws to a close, we have planned a series of "Interpreting 2018" year-end technology review articles, hoping to give readers a clear picture of the development of, and changes in, the important areas of technology over the past year. This article is the 2018 year-end review of real-time stream computing. The author analyzes the current state of real-time stream computing technology in depth, offers a comprehensive and even-handed comparison of today's popular stream computing frameworks, and closes with predictions about where stream computing may go next.
Why has real-time stream computing become so popular this year?
Besides AI, which continued landing in production this year, real-time stream computing has also entered the mainstream: major companies are sparing no effort to adopt new stream computing frameworks to upgrade and replace older systems such as Storm. The sudden collapse of the P2P lending bubble in the first half of the year pushed enterprises back toward value investment. The second half of the Internet era has begun: there is little money left to be squeezed out purely online, so technology and capital have started to empower offline businesses, and surprise successes like Pinduoduo are rare.
The Internet of Things, an early hot field connecting online and offline, has now accumulated enough momentum. The annual fee for an IoT SIM card has dropped below 100 yuan, and the rise of NB-IoT has shown great value in animal husbandry, new agriculture, and urban management. All the major players are racing into physical-world domains: smart cities, smart factories, smart healthcare, connected vehicles, and so on. But what does any of this have to do with real-time stream computing?
The areas above share one characteristic: real-time requirements. Urban traffic moves fast, factories do not wait, hospital queues keep shifting, and takeout orders race against the clock. Whether hailing a taxi, ordering food, or shopping online, people cannot tolerate long waits, and waiting means lost orders. Millisecond and sub-second big data analysis therefore delivers great value. Stream computing frameworks emerged at roughly the same time as batch computing frameworks, but stream computing can now unlock more value, which is why it is taking off.
A Survey of Real-Time Stream Computing Frameworks
At present the preferred stream computing engines are Flink and Spark; the second tier includes Kafka and Pulsar; niche options include Storm, JStorm, Heron, NiFi, Samza, and others. Let us briefly review the strengths and weaknesses of each.
Flink and Spark are the first choices for distributed stream computing and will be compared in detail below.
Storm, JStorm, Heron: the earlier stream computing platforms. Unlike MapReduce, Storm was born for stream processing and was the first choice among early distributed stream computing frameworks. But Storm is at best a half-finished product: its ack mechanism is inelegant, and it cannot guarantee exactly-once semantics. Not losing data, not duplicating data, and delivering each record exactly once are different levels of reliability, and Storm only reaches the weaker ones. Its Clojure codebase (a Lisp dialect) has famously unfriendly syntax, and the learning curve is steep. Later, Alibaba's middleware team developed JStorm, which improved on Storm's architectural design, throughput, reliability, and ease of use, and kept pace with the containerization trend. Unfortunately, Alibaba also had Blink (its improved fork of Flink); one mountain cannot hold two tigers, so the JStorm team "embraced change" and the project has essentially stalled. Twitter likewise made a fresh start with Heron, which reportedly replaced Storm inside Twitter and has been verified at scale, but Heron's community is noticeably inactive and there is little more to say about it. It is worth mentioning that Heron's storage uses DistributedLog, another open source framework from Twitter.
DistributedLog, BookKeeper, Pulsar, Pravega: anyone who has written a Spark Streaming job will recognize the code path where data received from Kafka is first saved to a WAL (write-ahead log). DistributedLog is a distributed WAL framework: it offers millisecond latency, stores multiple replicas to guarantee reliability and consistency, and optimizes read and write performance. It can run on Mesos and YARN and provides multi-tenancy, in line with the multi-tenant requirements of public and enterprise clouds. DistributedLog is itself layered on Apache BookKeeper, the low-level replicated ledger store, on top of which it exposes higher-level stream APIs and additional features. Pulsar focuses on computing and front-end data access, rides the serverless trend by providing lightweight Functions for stream processing, and delegates storage to BookKeeper. Pulsar is a newcomer in stream computing and today only complements heavyweight frameworks such as Flink and Spark; in my view, if Pulsar can win in IoT scenarios, it still has a chance. Pravega, from a team acquired by Dell EMC, does streaming storage, uses BookKeeper internally, and is aimed mainly at IoT scenarios. The relationship among the four is roughly: BookKeeper provides the replicated log, DistributedLog layers stream APIs on it, and Pulsar and Pravega build messaging and streaming storage on top.
Beam, Gearpump, Edgent: the giants' bets. All three projects entered the Apache Foundation: Beam from Google, Gearpump from Intel, Edgent from IBM, early positioning by the big three in stream computing. Gearpump is a distributed lightweight stream computing engine built around Akka. The akka-stream and akka-http modules are well known in the community; Spark used Akka for distributed messaging in its early days, and Flink has long used Akka for inter-module communication. Akka resembles Erlang: it uses the actor model, makes full use of thread pools, and its reactive, high-performance, resilient, message-driven design keeps CPUs responsive rather than saturated, making it something of a maverick in high-performance computing. Gearpump has made little progress since its core contributors left; it does not perform well in low-power IoT scenarios and cannot match Flink or Spark. Edgent was created for IoT: embedded in gateways or edge devices, it analyzes stream data in real time, and it is still incubating at the ASF. The Internet of Things and edge computing are games only top cloud vendors can win, and every major vendor already has a flagship IoT platform, so it seems impossible for Edgent to compete on its own.
Kafka Streams: Kafka is the standard message queue in big data, built on append-only logs. Thanks to zero-copy, Kafka became the first choice for high-throughput publish/subscribe in big data scenarios. Unwilling to be left out, Kafka now does stream computing too, and its SQL layer is sufficient for simple stream computing scenarios. However, separating compute and storage is the industry consensus; only resource-constrained edge scenarios call for fusing them. Having the heavyweight Kafka support stream analysis alongside storage is a bit of an everything-in-one. First, the boundary between storage and compute blurs: everything lives in Kafka. Second, Kafka's architecture is aging and bulky, with a gap to streaming storage systems based on DistributedLog, and it is not as lightweight as Pulsar. Third, reinventing the SQL wheel in Kafka leaves a big gap to Flink SQL and Spark SQL. Personally, I see more crisis than opportunity here.
Further development of real-time stream computing technology depends on the birth of new industry scenarios: IoT, industrial IoT, the "smart everything" series, connected vehicles, and so on.
Newcomer Flink
Flink did not come to the fore until around 2016, so its rags-to-riches history deserves a little gossip.
The Stratosphere project was launched in December 2010 by Volker Markl, a professor at the Technical University of Berlin; its main developers included Stephan Ewen and Fabian Hueske. Stratosphere aimed to surpass MapReduce, at the same time that Spark was emerging from the AMP Lab at UC Berkeley. Compared with Spark, Stratosphere was a failed project, so Professor Markl drew on Google's then-latest stream computing paper, MillWheel, and decided to build a stream-first distributed computing engine: Flink. Flink joined the Apache Incubator in March 2014 and graduated to top-level Apache project status in November 2014.
There are two routes to unifying stream and batch. One is stream-first: batch is a special case of streaming, or the two are unified only at the upper API layer. The other is batch-first: micro-batches are glued together to simulate streaming. Flink takes the former route, Spark the latter.
Spark vs. Flink
Ugly words up front: I have no intention of stirring up conflict between the Flink and Spark camps. Whether the communities learn from each other or copy each other matters little; what matters is what users gain.
At conferences we are often asked: what is the difference between Spark and Flink, and how should one choose?
The following compares the two, one aspect at a time: data model, runtime architecture, scheduling, latency and throughput, backpressure, state storage, SQL extensibility, ecosystem, and applicable scenarios.
Data model
(Figure: Spark RDD diagram, from JerryLead's SparkInternals project.)
(Figure: Flink architecture and runtime diagram.)
Data Model of Spark
Spark first adopted the RDD model, computing up to 100x faster than MapReduce and greatly upgrading the Hadoop ecosystem. An RDD (resilient distributed dataset) is partitioned into fixed-size batches, and RDD offers a rich set of low-level APIs over the dataset. To keep lowering the barrier to entry, the Spark community then built the high-level DataFrame/Dataset APIs, with Spark SQL as the unified entry point covering the lower layers; targeted logical and physical SQL optimization, along with off-heap storage optimization, also greatly improved performance.
The DStream model in Spark Streaming resembles the RDD model: it chops an unbounded real-time stream into small batch datasets (DStreams), and a timer tells the processing system to handle each micro-batch. The drawbacks are obvious. The API is thin and struggles with complex streaming business logic; raising throughput without triggering backpressure is a manual tuning exercise; and out-of-order processing is unsupported, so people resorted to the trick of giving the upstream Kafka topic a single partition to mitigate disorder. Spark Streaming fits only simple stream processing and will be fully replaced by Structured Streaming.
Spark Structured Streaming provides both a micro-batch engine and a continuous-processing engine. Its micro-batch API, though not as rich as Flink's, covers the common capabilities: windows, event time, triggers, watermarks, stream-table joins, stream-stream joins, and so on; latency still bottoms out around 100 milliseconds. The continuous-processing engine, currently experimental, reaches roughly 1 millisecond latency but guarantees only at-least-once rather than exactly-once semantics. Moreover, snapshots taken by micro-batch jobs are incompatible with restarting the job in continuous mode. Here Flink does things more completely.
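To see why a micro-batch engine cannot go below roughly one trigger interval of latency, here is a toy Python sketch (the function `micro_batches` and its event format are invented for illustration and are not Spark's API): records are held until their interval closes, so nothing can be emitted mid-batch.

```python
def micro_batches(stream, interval_ms, get_ts):
    """Group an ordered event stream into fixed-interval micro-batches, the
    way a micro-batch trigger fires: nothing is emitted until the batch
    boundary passes, so end-to-end latency is at least one interval."""
    batches, current, boundary = [], [], None
    for event in stream:
        ts = get_ts(event)
        if boundary is None:
            # First event fixes the first batch boundary.
            boundary = ts - ts % interval_ms + interval_ms
        while ts >= boundary:       # close every finished interval
            batches.append(current)
            current, boundary = [], boundary + interval_ms
        current.append(event)
    if current:
        batches.append(current)
    return batches
```

With a 100 ms interval, an event arriving at t=0 is not processed until the t=100 boundary fires, which matches the roughly 100 ms latency floor described above; a record-at-a-time engine has no such floor.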
To sum up, Spark Streaming and Structured Streaming do stream computing with a batch-computing mindset, when in fact the most elegant approach is the reverse: build batch computing on top of streaming. For Spark, a full blood transfusion is impossible; only local optimization remains. The four Spark modules of core, streaming, structured streaming, and graphx embody four separate design ideas, and unifying them through SQL at the top does not make the whole pure and harmonious.
Data Model of Flink
Flink adopts the Dataflow model, which is quite different from the Lambda architecture. A dataflow is a graph of pure nodes; a node can run batch computation, stream computation, or a machine learning algorithm. Stream data flows between nodes and is processed in real time by the functions applied on each node. Nodes are connected via Netty with keepalive between them, and the network buffers are the key to natural backpressure. After logical and physical optimization, the logical dataflow differs little from the physical topology at runtime. This is a purely streaming design, theoretically optimal in both latency and throughput.
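A toy sketch of the dataflow idea in Python (the `Node` class and `SINK` list are invented names, not Flink's API): operators form a graph, and each record is pushed through the whole graph one at a time rather than being collected into batches first.

```python
class Node:
    """One operator node in a tiny dataflow graph: applies its function to
    each record it receives and forwards every result downstream.  A node
    with no downstream writes its results to the global SINK."""

    def __init__(self, fn):
        self.fn, self.downstream = fn, []

    def to(self, node):
        """Wire this node to a downstream node; returns it for chaining."""
        self.downstream.append(node)
        return node

    def push(self, record):
        for out in self.fn(record):         # fn yields zero or more records
            if not self.downstream:
                SINK.append(out)
            for nxt in self.downstream:
                nxt.push(out)

SINK = []

# source -> flat_map(split into words) -> filter(len > 3)
source = Node(lambda rec: [rec])
words  = Node(lambda line: line.split())
long_w = Node(lambda w: [w] if len(w) > 3 else [])
source.to(words).to(long_w)
for line in ["flink streams flow", "no micro batch here"]:
    source.push(line)
```

Each record traverses the full graph before the next one enters, which is why per-record latency in this style is bounded by function-call cost, not by a batch interval.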
Flink has no baggage in batch computing and is on the right path from the start.
Runtime architecture
Spark Runtime Architecture
Batch computing divides the DAG into stages along the lineage between nodes; at runtime, the task set of one stage finishes and is destroyed before the next stage begins. Spark Streaming slices continuously arriving data into batches and executes batch operations on a timer. Structured Streaming keeps the unbounded input in state storage and computes over the stream either in micro-batches or continuously, which comes closer to the Dataflow model.
Flink Runtime Architecture
Flink has a unified runtime beneath the Batch API, Stream API, ML, Graph, CEP, and other libraries. The nodes of the DAG carry those modules' functions; the DAG is transformed step by step into an ExecutionGraph, a physically executable graph, which is finally handed to the scheduling system. The logic in each node runs as tasks drawn from a resource pool, similar to tasks in Spark, with each task corresponding to a thread in a thread pool.
In terms of the runtime architecture of streaming computing, Flink is significantly more unified and elegant.
Latency and throughput
Both camps have run the Yahoo streaming benchmark, and each claims victory; such benchmarks are of little value and cannot be fully trusted. Going by the published results, Flink and Spark are roughly comparable in throughput and latency.
Backpressure
In Flink, a downstream operator consumes data flowing in through network buffers. If it lacks processing capacity, the buffers fill up and writes block; the upstream operator notices it cannot write, and the pressure propagates hop by hop back to the source. This is a very reasonable form of natural backpressure. Spark Streaming instead configures a throughput threshold for backpressure and begins rate limiting once the threshold is reached, which is reasonable from a batch-computing point of view.
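The blocking-buffer mechanism can be reproduced with a bounded queue in a few lines of Python (a conceptual sketch, not Flink's implementation): when the consumer lags, `q.put()` blocks and the producer stalls automatically, with no configured rate limit anywhere.

```python
import queue
import threading

def run_pipeline(n_records, buffer_size):
    """A bounded buffer between producer and consumer: when the consumer is
    slow the buffer fills and q.put() blocks, stalling the producer.  This
    mirrors the 'natural backpressure' a full network buffer gives Flink,
    as opposed to an explicitly configured throughput threshold."""
    q = queue.Queue(maxsize=buffer_size)
    consumed = []

    def producer():
        for i in range(n_records):
            q.put(i)        # blocks while the buffer is full: backpressure
        q.put(None)         # end-of-stream marker

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```

No record is dropped and no rate is configured: the producer simply runs exactly as fast as the slowest stage allows.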
State storage
Flink provides file, memory, and RocksDB state backends and can persist running state asynchronously. Its snapshot mechanism injects a special savepoint or checkpoint barrier message at the source node, which then flows through every operator; a coordinator aligns the state across an operator's parallel instances and persists the state data asynchronously.
Flink's way of taking snapshots is the most elegant I have seen. Flink supports restoring from snapshots even after the job is modified: save a snapshot, change the job so the DAG changes, restart from the snapshot, and the state of the unchanged operators is still restored. Flink also supports incremental snapshots, which clearly reduce network and disk overhead for jobs with large in-memory state.
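The barrier idea can be sketched in a few lines of Python (the names `run_operator` and `recover_and_resume` are invented for illustration; real Flink aligns barriers across parallel channels and persists state asynchronously): a barrier flowing with the records marks a consistent point at which the operator's state is saved, and recovery replays only the records that arrived after the newest snapshot.

```python
BARRIER = "checkpoint-barrier"

def run_operator(stream, state=0):
    """Fold a summing operator over a stream that interleaves records and
    checkpoint barriers (Chandy-Lamport style); every barrier persists the
    state reached so far."""
    snapshots = []
    for item in stream:
        if item == BARRIER:
            snapshots.append(state)   # a real engine persists this async
        else:
            state += item
    return state, snapshots

def recover_and_resume(snapshots, tail):
    """Restart from the newest snapshot and replay only the records that
    arrived after it, restoring the operator's state exactly once."""
    state, _ = run_operator(tail, state=snapshots[-1])
    return state
```

After a failure, only the tail of the stream past the last barrier needs to be replayed, instead of the whole history.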
Spark's snapshot API is a basic capability of RDDs: when periodic snapshots are enabled, the entire in-memory dataset is persisted at once. Since Spark generally targets large-dataset computation with large in-memory state, snapshots should not be too frequent, or they add substantial load to the cluster.
SQL extensibility
For stream SQL, Flink relies on the Apache Calcite project, while Spark keeps its SQL stack entirely in its own hands and does more performance optimization. The consensus in big data is that SQL is a first-class citizen and SQL is the user interface. SQL's logical and physical optimizations, such as a cost-based optimizer, can be pushed deep into the lower layers, and UDXs on top of SQL can support online machine learning (StreamingML), streaming graph computation, streaming rule engines, and more. Because SQL dialects are everywhere, it is hard for any one SQL engine to fit every framework, and each SQL-like chimney adds learning cost for users.
Ecological and applicable scenarios
Spark has more advantages in these two aspects.
Spark has been battle-tested in major companies for years and has long been integrated with HBase, Kafka, cloud object storage, and the rest; it is the de facto standard big data computing framework, though it now feels pressure from TensorFlow. Back in 2014, most teams running machine learning in production chose Spark. Our team even submitted a ParameterServer PR at the time, but gave up when the community was slow to follow up; racing to build out SQL, the community missed its best window to cut into AI. In the past two years, though, the Spark+AI momentum has been strong. Professor Matei's Weld paper aims to glue batch, stream, graph, ML, TensorFlow and other systems together through a monad-style IR so the lower layers can be optimized uniformly, and the beta-stage MLflow project manages the full ML lifecycle. These are Spark's new breakthroughs.
In contrast, the Flink community supports the surrounding big data storage frameworks well, but investment in FlinkML and Gelly (graph computing) is lacking; proposals raised to the community back in 2016 around parameter servers and streaming machine learning went nowhere. Having spent more than two years at Huawei Cloud with Flink as the core of our stream computing platform, we simply developed advanced features such as StreamingML, streaming time-geospatial analysis, and CEP SQL on top of Flink ourselves.
For enterprises and developers, choosing a big data/AI framework is a heavy technology investment, and a wrong choice costs dearly. The decision depends not only on the framework itself but also on the company behind it.
Behind Spark stands Databricks, and behind Databricks stand Berkeley and experts such as Matei Zaharia, Reynold Xin, and Xiangrui Meng. The Databricks platform chose Azure; as early as 2014, Databricks reimagined big data development as a WYSIWYG notebook platform, which was forward-looking, and it supports AWS well too. Commercially and technically it is hard to fault.
Behind Flink is data Artisans, whose dA Platform also launched this year; it does not feel especially novel, and its support for public and private clouds is weak. data Artisans is a German company with a team of twenty or thirty people who are diligently active in the Flink community, but its commercial muscle may not be enough.
If the commercial company behind an open source project falls, the project tends to wither; a successful open source project cannot live on scattered enthusiasts alone. Databricks is valued at about $140 million and data Artisans at about $6 million, a gap of roughly 23x. The risk for data Artisans is monetization: with such a small base, the risk of being acquired is high. Fortunately, Flink has a solid Dataflow foundation. Balancing commercial backing with neutral governance is a hard problem for every open source project.
Comparative summary
With so much said, a head-to-head summary of Flink versus Spark:
In stream computing, Flink and Spark each have strengths and weaknesses, and the scores are close. In unified stream-batch computing, Flink is more mature, while Spark still has plenty of room to improve.
Opportunities for edge computing
The concept of edge computing has boomed over the past two years, and its big data capability is mainly stream computing. With public, private, and hybrid clouds so mature, why has edge computing emerged?
IoT technology has matured quickly, enabling offline scenarios such as connected vehicles, industry, smart cities, and O2O. Rapidly growing offline data, sensitive data that cannot go to the cloud, data volumes too large to ship to the cloud, and latency requirements at the millisecond level or below have together spawned computation close to where the business runs. On resource-constrained hardware, business data streams are generated in real time and must be processed in real time. Lightweight cases can run scripts through Lambda-style function services, while big data cases can run Flink in real time. Huawei Cloud's commercial IEF edge computing service runs a lite Flink at the edge, and Azure's stream analytics likewise supports pushing streaming jobs down to edge devices.
Not only scripts and Flink can run on edge devices; machine learning and deep learning inference can too. Cameras are everywhere, 4K HD cameras ever more common, and traffic tickets ever easier to issue. Uploading every video stream to a data center in real time is simply not cost-effective. If face recognition, object recognition, license plate recognition, motion detection, debris detection, sprinkler detection, and the like can run on or near the camera, with only the relevant video clips and detection results uploaded, the bandwidth savings are huge. This has driven the birth of low-power AI chips such as the Ascend 310, along with all kinds of smart cameras and edge boxes.
Stream computing frameworks like Flink, if they can slim down without losing capability, are well suited to low-power edge boxes, where they can run CEP rule engines, online (streaming) machine learning, real-time anomaly detection, real-time predictive maintenance, ETL data cleaning, real-time alerting, and so on.
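As one example of what fits on an edge box, here is a sketch of streaming anomaly detection in Python using Welford's online algorithm (the class name and the 3-sigma threshold are illustrative choices, not any framework's API): it keeps only a running mean and variance, so memory stays constant no matter how long the stream runs, which is exactly what a low-power device needs.

```python
import math

class StreamingAnomalyDetector:
    """Online anomaly detection suited to a low-power edge device:
    Welford's algorithm maintains a running mean and variance in O(1)
    memory, and a reading far from the mean (here > 3 sigma) is flagged
    without ever storing the history."""

    def __init__(self, threshold=3.0):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0          # sum of squared deviations from the mean
        self.threshold = threshold

    def observe(self, x):
        """Return True if x is anomalous, then fold it into the stats."""
        anomalous = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                anomalous = True
        # Welford's incremental update
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous
```

The same constant-memory pattern underlies many streaming aggregations: fold each record into a small summary, emit a decision immediately, discard the record.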
Industry application scenarios
Common application scenarios for real-time stream computing include log analysis, IoT, NB-IoT, smart cities, smart factories, connected vehicles, highway freight, railways, passenger transport, elevator networks, smart homes, ADAS (advanced driver assistance), shared bikes, ride hailing, takeout, ad recommendation, e-commerce search and recommendation, stock markets, real-time intelligent anti-fraud in finance, and more. Wherever data is generated in real time and analyzing it in real time produces value, real-time stream computing applies; simply writing scripts to build such applications can no longer meet the needs of these complex scenarios.
The more real-time the computation, the more valuable the data, and the batch computing value created by Hadoop has been largely mined out. Online machine learning, online graph computation, online deep learning, online AutoML, online transfer learning: real-time stream computing stands behind all of them. For any offline learning or offline analysis scenario, it is worth asking: if this were real-time, would it create more value?
Scan a QR code to order at Xinbailu and you get fast serving and online checkout; order takeout or hail a taxi, and a ten-minute wait means a cancelled order. As the Internet catalyzes every industry, real-time computing is the trend and already permeates every corner of life and production.
Comparing the stream computing services of cloud vendors
Not reinventing wheels is already an industry consensus; using public cloud serverless big data/AI services (fully managed, pay-per-use, zero ops) will become the next one. Building big data/AI infrastructure in-house costs high-growth enterprises dearly, both up front and in long-term maintenance.
Enterprises have three main concerns about moving to the cloud:
Data security, since data is a core asset of the enterprise.
Being locked in by a vendor.
Weakening their own technical capability.
On data security: China's Cybersecurity Law has formally taken effect, giving personal privacy protection a legal basis, and the EU's GDPR (General Data Protection Regulation) has also come into force. The law is moving to rein in data chaos.
Choosing a neutral cloud vendor is critical. Most cloud vendors build their services on open source systems, so users worried about lock-in should pay attention to the kernel when choosing a cloud service. Of course, this creates tension between open source communities and cloud vendors: a company offering an enterprise big data platform may find the public cloud taking its business, yet the community must survive. The Databricks-Azure partnership is a smart example of such cooperation.
There is no need to worry about weakening in-house technical capability. Big data frameworks will only grow more foolproof and their operational bar lower; enterprises would do better to focus on creating value with big data than on running data infrastructure for its own sake.
Current common stream computing services include:
AWS Kinesis
Azure Stream Analytics
Huawei Cloud Real-time streaming Computing Service
Alibaba Cloud Realtime Compute
AWS Kinesis launched early and is now relatively mature, offering serverless, pay-per-use, fully managed operation with dynamic scaling; it is a profitable product for AWS. Kinesis has four parts: Data Streams for data ingestion, Data Firehose for loading and delivery, Data Analytics for real-time stream analysis, and Video Streams for media ingestion, transcoding, and persistence. Azure Stream Analytics is also well done, focusing on IoT and edge computing scenarios. From Kinesis and Azure Stream Analytics it is clear that IoT is the main battlefield of stream analytics. Good as these products are, they see limited use in China, where their data centers are few and prices high.
Huawei Cloud's real-time stream computing service is a serverless offering with Flink and Spark at its core. Huawei started its own StreamSmart product as early as 2012 and delivered it widely overseas, but because of its closed ecosystem the team abandoned StreamSmart in favor of the Flink/Spark dual engine. On top of StreamSQL it provides advanced features such as CEP SQL, StreamingML, Time GeoSpatial analysis, and real-time visualization. It pioneered an exclusive-cluster mode offering physical isolation between tenants: even two direct competitors can use the service at the same time, and physical isolation removes any temptation to break out of a shared sandbox.
Alibaba Cloud's stream computing service, originally the Storm-based Galaxy system, is likewise built around StreamSQL; the product was lukewarm in its early years. Since its complete overhaul last year, the kernel has been switched to Flink, and having passed the test of Double 11 traffic, it is now much more active.
Summary and Outlook
Real-time stream computing technology is mature, and everyone can adopt it with confidence. The remaining work lies in spreading application scenarios, strengthening enterprises' trust in cloud vendors, and putting stream computing to broad use creating value. The combination of stream computing and AI is a likely direction for the future.