How to use Pravega in Flink


This article mainly shows "how to use Pravega in Flink". The content is simple and clear, and I hope it can help you resolve your doubts. Let's study and learn "how to use Pravega in Flink" together.

This article introduces Pravega, its advanced features, and an Internet of Vehicles use case, focusing on why Dell EMC develops Pravega, which pain points of big data processing platforms Pravega solves, and what sparks fly when it is combined with Flink.

Structural changes in big data architecture

The pain points of the Lambda architecture

How to ingest and serve data effectively is the key to the success of a big data application architecture. Because processing speed and frequency differ, data ingestion needs to be carried out through two strategies. The figure above shows a typical Lambda architecture: the big data processing architecture is split into two separate computing infrastructures, batch processing and real-time stream processing.

For real-time processing, data from sensors, mobile devices or application logs is usually written to a message queue system (such as Kafka), which provides a temporary buffer for streaming applications; Spark Streaming then reads the data from Kafka and performs real-time stream computation. However, because Kafka does not keep historical data forever, this pipeline cannot serve business logic that needs to analyze historical data and real-time data together. To compensate, an additional batch pipeline has to be opened up, the "Batch" part of the figure.

For the batch pipeline, there are many open source big data components to choose from, such as ElasticSearch, Amazon S3, HDFS, Cassandra and Spark. The main computation logic is large-scale Map-Reduce carried out through Spark. The advantage is that the results are more accurate, because all historical data can be combined in the calculation and analysis; the disadvantage is that the latency is relatively high.

Three problems can be summed up from this classic big data processing architecture:

1. The processing latencies of the two pipelines differ greatly, so the two pipelines cannot be combined for fast aggregation, and the performance of joint analysis over historical and real-time data is low.
2. The cost of data storage is high. In the architecture above, one or more copies of the same data exist in multiple storage components, and this redundancy undoubtedly greatly increases the storage cost for enterprise customers. Moreover, the fault tolerance and durability guarantees of open source storage have always been debatable; data-security-sensitive enterprise users must strictly ensure that no data is lost.
3. Duplicated development. The same processing flow is implemented twice, once per pipeline, and the same data has to be computed in different frameworks only because the processing time differs, which undoubtedly adds a burden of duplicated development for data developers.

Characteristics of streaming storage

Before formally introducing Pravega, let's briefly talk about some of the features of streaming data storage.

If we want to unify stream and batch processing in one big data architecture, the storage actually has to satisfy a hybrid set of requirements.

For historical data in the older part of the sequence, it needs to provide high-throughput read performance, that is, catch-up reads; for real-time data at the newer end of the sequence, it needs to provide low-latency, append-only tail writes and tail reads.

Reconstructed streaming storage architecture

For distributed storage components such as Kafka and Cassandra, the storage architecture follows, from top to bottom, the pattern of a dedicated log storage layer, on top of local files, on top of distributed storage across the cluster.

The Pravega team set out to redesign the streaming storage architecture, introducing the abstraction of a Pravega Stream as the basic unit of streaming data storage. A Stream is a named, durable, append-only, unbounded sequence of bytes.

As shown in the figure above, the lowest layer of the storage architecture is built on scalable, distributed cloud storage. The middle layer stores log data as Streams, which serve as the shared storage primitive; different capabilities can then be provided on top of Streams, such as message queues, NoSQL, full-text search over streaming data, and combined real-time and batch analytics with Flink. In other words, the Stream primitive provided by Pravega avoids the data redundancy caused by moving the same data across multiple open source storage and search products in existing big data architectures, and it realizes a unified data lake at the storage layer.

The reconstructed big data architecture

Our proposed big data architecture uses Apache Flink as the computing engine and unifies batch and stream processing through a unified model / API. Pravega is used as the storage engine, providing a unified abstraction for streaming data storage so that historical and real-time data can be accessed in a consistent way. Together they form a closed loop from storage to computation that can handle high-throughput historical data and low-latency real-time data at the same time. The Pravega team has also developed the Flink-Pravega Connector, which provides exactly-once semantics across the complete compute-storage pipeline.

Introduction to Pravega

Pravega is designed to provide a solution for the real-time storage of streams. Applications persist data to Pravega, and a Pravega Stream can be stored indefinitely, persisting data for any length of time. The same Reader API provides both tail reads and catch-up reads, which effectively unifies offline and real-time computation.

Basic concepts of Pravega

The basic concepts of Pravega are briefly introduced below, in combination with the figure above:

Stream

Pravega organizes written data into Streams. A Stream is a named, durable, append-only, unbounded sequence of bytes.
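To make the Stream concept concrete, here is a minimal sketch using the Pravega Java client to create a scope and a stream; the controller address, scope and stream names are illustrative placeholders, not part of the original article.

```java
import java.net.URI;

import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class CreateStreamExample {
    public static void main(String[] args) {
        // Illustrative controller endpoint; adjust to your deployment.
        URI controllerURI = URI.create("tcp://localhost:9090");
        try (StreamManager streamManager = StreamManager.create(controllerURI)) {
            // A scope is a namespace for streams.
            streamManager.createScope("examples");
            // Declare a named, durable, append-only stream with one segment to start.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1))
                    .build();
            streamManager.createStream("examples", "sensor-readings", config);
        }
    }
}
```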

Stream Segments

A Pravega Stream is divided into one or more Segments. A Segment is a shard of the Stream, an append-only block of data, and Pravega implements automatic scaling at the Segment level: the number of Segments is updated automatically and continuously according to the data traffic.

Event

Pravega's client API lets users write and read data in units of Events. An Event is a set of bytes within a Stream; for example, one temperature reading from an IoT sensor written into Pravega can be understood as one Event.

Routing Key

Each Event has a Routing Key, a user-defined string used to group related Events. All Events with the same Routing Key are written to the same Stream Segment, and Pravega provides its read and write ordering semantics based on the Routing Key.
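As an illustration of Events and Routing Keys, the sketch below writes one Event with a routing key using the Pravega Java client. The stream name, controller address and the UTF8StringSerializer choice are assumptions made for the example.

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class WriteEventExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")) // illustrative endpoint
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "sensor-readings", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            // The routing key ("sensor-42") groups related Events: all Events with the
            // same key land in the same Stream Segment and keep their write order.
            writer.writeEvent("sensor-42", "{\"temperature\": 21.5}").join();
        }
    }
}
```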

Reader Group

A Reader Group is used to load-balance reads. The read parallelism can be changed by dynamically adding or removing Readers in the Reader Group. For more details, please refer to the Pravega official documentation:

http://pravega.io/docs/latest/pravega-concepts
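Below is a minimal sketch, under the same assumptions as the earlier examples, that creates a Reader Group over the stream and reads Events from it; running several readers against the same group is what spreads the segments across them.

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class ReadEventsExample {
    public static void main(String[] args) {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();

        // A Reader Group load-balances the stream's segments across its Readers;
        // adding or removing Readers changes the read parallelism.
        try (ReaderGroupManager rgManager = ReaderGroupManager.withScope("examples", clientConfig)) {
            rgManager.createReaderGroup("sensor-group",
                    ReaderGroupConfig.builder()
                            .stream(Stream.of("examples", "sensor-readings"))
                            .build());
        }

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "sensor-group",
                     new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> event;
            // readNextEvent blocks up to the timeout; getEvent() is null when no data arrived.
            while ((event = reader.readNextEvent(2000)).getEvent() != null) {
                System.out.println("Read: " + event.getEvent());
            }
        }
    }
}
```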

Pravega system architecture

At the control plane, the Controller is the master node of the Pravega cluster. It manages the Segment Stores at the data plane, providing operations such as creating, updating and deleting streams. It is also responsible for monitoring cluster health in real time, obtaining stream metadata, collecting monitoring metrics, and so on. A cluster usually runs three Controller replicas to ensure high availability.

At the data plane, the Segment Store provides the API for reading and writing data within Streams. In Pravega, data is stored in tiers:

Tier 1 Storage

Tier 1 storage is usually deployed inside the Pravega cluster and mainly provides low-latency storage for short-term hot data. Each Segment Store node has a Cache to speed up reads, and Pravega uses Apache BookKeeper to provide a low-latency log storage service.

Long-term storage

Long-term storage is usually deployed outside the Pravega cluster and mainly provides long-term storage for stream data, that is, cold data storage. It supports not only HDFS and NFS, but also enterprise-class storage products such as Dell EMC's ECS and Isilon.

Advanced features of Pravega

Read-write separation

In the Tier 1 storage path, when data is written, BookKeeper guarantees that the data has been persisted to disk in all Segment Stores, which guarantees that the write has succeeded.

Read-write separation helps optimize read performance: reads are served only from the Tier 1 Cache and Long-term storage, never from BookKeeper in Tier 1.

When a client issues a read request to Pravega, Pravega decides whether the data should be served as a low-latency tail read from the Tier 1 Cache, or as a high-throughput catch-up read from Long-term storage (object storage / NFS); if the data is not in the Cache, it is loaded into the Cache on demand. Either way, the read operation is transparent to the client.

As long as there is no cluster failure, Tier 1's BookKeeper is never read, only written.

Elastic scaling

The number of Segments in a Stream scales automatically with the I/O load. The figure above is a simple example:

The data stream starts being written to Pravega at T0, and events are routed to Segment 0 and Segment 1 according to their routing keys. If the write rate stays constant, the number of Segments does not change. At T1, the system notices that the write rate to Segment 1 has increased, so it is split into two new segments, Segment 2 and Segment 3. At this point Segment 1 enters the Sealed state and no longer accepts writes, and its data is redirected to Segment 2 and Segment 3 according to the routing keys. In addition to this Scale-Up operation, the system can also perform Scale-Down when the write rate slows: for example, at T3 the write traffic to Segment 2 and Segment 5 decreases, so they are merged into a new Segment 6.
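Auto scaling is driven by the scaling policy attached to a stream when it is created. The sketch below shows such a policy using the Pravega Java client; the numeric thresholds are illustrative, and the resulting configuration would be passed to createStream as in the earlier sketch.

```java
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class AutoScalingPolicyExample {
    // With byEventRate, Pravega splits a segment when it sustains more than roughly the
    // target event rate, and merges segments when traffic drops, as in the T1/T3 example above.
    static StreamConfiguration autoScaledConfig() {
        return StreamConfiguration.builder()
                .scalingPolicy(ScalingPolicy.byEventRate(
                        100,  // target events per second per segment (illustrative)
                        2,    // scale factor used when splitting a hot segment
                        2))   // minimum number of segments
                .build();
    }
}
```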

End-to-end elastic scaling

Pravega uses a Kubernetes Operator to deploy each component of the cluster as a stateful application, which makes elastic scaling of the application more flexible and convenient.

Pravega has also recently been working closely with Ververica to achieve end-to-end auto scaling: Kubernetes Pod-level auto scaling on the Pravega side, and rescaling the number of Flink Tasks on the Flink side.

Transactional writing

Pravega also provides transactional write operations. Before a transaction is committed, data is written, according to its routing key, into separate Transaction Segments, which are not visible to Readers. Only after the transaction is committed are the Transaction Segments appended to the end of the Stream Segments, at which point the data becomes visible to Readers. Support for write transactions is also the key to implementing end-to-end exactly-once semantics together with Flink.
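A minimal sketch of a transactional write with the Pravega Java client follows; the stream name and serializer match the earlier examples and are assumptions, and exact method names may vary slightly between client versions.

```java
import java.net.URI;

import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.Transaction;
import io.pravega.client.stream.TransactionalEventStreamWriter;
import io.pravega.client.stream.TxnFailedException;
import io.pravega.client.stream.impl.UTF8StringSerializer;

public class TransactionalWriteExample {
    public static void main(String[] args) throws TxnFailedException {
        ClientConfig clientConfig = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090"))
                .build();
        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", clientConfig);
             TransactionalEventStreamWriter<String> writer = factory.createTransactionalEventWriter(
                     "sensor-readings", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            // Events written inside the transaction go to Transaction Segments that Readers
            // cannot see; commit() appends them to the Stream Segments atomically.
            Transaction<String> txn = writer.beginTxn();
            txn.writeEvent("sensor-42", "{\"temperature\": 21.7}");
            txn.writeEvent("sensor-42", "{\"temperature\": 21.9}");
            txn.commit();
        }
    }
}
```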

Pravega vs. Kafka

First of all, the most critical difference lies in their positioning: Kafka is positioned as a message queue, while Pravega is positioned as storage, so it pays more attention to storage characteristics of data such as dynamic scaling, security and integrity.

For streaming data processing, data should be regarded as continuous and unbounded. Kafka, as a message queue built on the local file system, simulates an unbounded data stream by appending to the end of log files and tracking their contents (the offset mechanism). However, this approach is inevitably constrained by the local file system's limits on file descriptors and disk capacity, so it is not truly unbounded.

The figure gives a more detailed comparison of the two, which will not be repeated here.

Pravega Flink Connector

To make Pravega easier to use with Flink, we also provide the Pravega Flink Connector (https://github.com/pravega/flink-connectors); the Pravega team also plans to contribute this Connector to the Flink community. The Connector provides the following features (a usage sketch follows the list):

- Exactly-once semantics for both Reader and Writer, guaranteeing end-to-end exactly-once across the whole pipeline
- Seamless integration with Flink's checkpoint and savepoint mechanisms
- Support for high-throughput, low-latency concurrent reads and writes
- A Table API that unifies stream and batch processing over Pravega Streams
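Here is a minimal sketch of a Flink job wired to Pravega through the connector's FlinkPravegaReader / FlinkPravegaWriter builder API; the scope, stream names and controller address are placeholders, and exact builder methods may differ between connector versions.

```java
import java.net.URI;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import io.pravega.connectors.flink.FlinkPravegaReader;
import io.pravega.connectors.flink.FlinkPravegaWriter;
import io.pravega.connectors.flink.PravegaConfig;
import io.pravega.connectors.flink.PravegaWriterMode;

public class PravegaFlinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoints drive the exactly-once guarantee

        PravegaConfig pravegaConfig = PravegaConfig.fromDefaults()
                .withControllerURI(URI.create("tcp://localhost:9090"))
                .withDefaultScope("examples");

        // Source: reads the "sensor-readings" stream (tail reads and catch-up reads alike).
        FlinkPravegaReader<String> source = FlinkPravegaReader.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-readings")
                .withDeserializationSchema(new SimpleStringSchema())
                .build();

        // Sink: writes results to another stream; EXACTLY_ONCE mode uses Pravega
        // transactions committed on Flink checkpoints.
        FlinkPravegaWriter<String> sink = FlinkPravegaWriter.<String>builder()
                .withPravegaConfig(pravegaConfig)
                .forStream("sensor-alerts")
                .withSerializationSchema(new SimpleStringSchema())
                .withEventRouter(event -> "sensor-42") // routing key per event
                .withWriterMode(PravegaWriterMode.EXACTLY_ONCE)
                .build();

        DataStream<String> readings = env.addSource(source);
        readings.filter(json -> json.contains("temperature")).addSink(sink);

        env.execute("pravega-flink-example");
    }
}
```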

Internet of Vehicles use case

Take as an example the connected autonomous-vehicle scenario, which generates massive, PB-scale data:

- Vehicle condition data must be processed in real time so that micro-level route prediction and planning can be done in time.
- Machine learning algorithms must be run over longer-term driving data for macro-level route prediction and planning; this is batch processing.
- Real-time processing and batch processing must be combined: the machine learning model built from historical data and the feedback from real-time data are used together to optimize the results.

The key indicators that customers focus on are as follows:

- How to ensure efficient end-to-end processing speed?
- How to minimize the training time of the machine learning model?
- How to reduce the cost of data storage as much as possible?

The following is a comparison of solutions before and after the introduction of Pravega.

Solution comparison

The introduction of Pravega undoubtedly greatly simplifies the big data processing architecture:

- Pravega serves as the abstract storage interface, so the Pravega layer realizes a data lake: batch processing, real-time processing and full-text search all obtain data from Pravega alone.
- Only one copy of the data is stored, in Pravega, instead of the redundant copies kept in Kafka, ElasticSearch and long-term storage in the first solution, which greatly reduces the data storage cost for enterprise users.
- Pravega provides automatic tier-down, so there is no need to introduce components such as Flume for additional ETL development.
- The components are streamlined from the original Kafka + Flume + HDFS + ElasticSearch + Kibana + Spark + Spark Streaming down to Pravega + Flink + Kibana + HDFS, reducing the burden on operations staff.
- Flink provides unified stream and batch processing, so two separate sets of processing code are no longer needed for the same data.

The above is the full content of "How to use Pravega in Flink". Thank you for reading! I hope the content shared here has helped you; if you want to learn more, welcome to follow the industry information channel.
