What is the development process of Kafka-on-Pulsar?


In this issue, the editor walks through the development process of Kafka-on-Pulsar (KoP). The article covers the topic in depth and from a professional point of view; I hope you get something out of it.

Protocol Handler is a mechanism introduced in Pulsar 2.5.0. The idea is to let developers build on Pulsar's existing infrastructure, treating Pulsar as a reliable, efficient, and stable streaming data store, and use it to develop pluggable message protocols.

Kafka-on-Pulsar is built on the Protocol Handler mechanism and supports the Kafka 2.0 protocol as a plug-in. Simply download the KoP plug-in and install it on an existing Pulsar broker, and Pulsar can speak the Kafka protocol. The advantage is that it simplifies migration from Kafka to Pulsar: no code changes, a direct and seamless move. Next, let's take a closer look at the development of KoP.

What is Apache Pulsar?

Apache Pulsar is an event streaming platform. From the start, Apache Pulsar adopted a cloud-native, layered, segment-based architecture. That architecture separates serving from storage, which makes the system easier to operate and scale.

Today, Pulsar is not just message middleware but a system that unifies messaging and streaming data: cloud-native event streaming.

We have written about Pulsar in detail before; if you are interested, see the Apache Pulsar introduction.

Why KoP?

Pulsar provides a unified messaging model for queue and stream workloads. Pulsar uses its own protobuf-based binary protocol to ensure high performance and low latency, and protobuf makes Pulsar clients easier to implement.

Moreover, the project provides clients for Java, Go, Python, and C++, as well as third-party clients maintained by the community. However, applications written against other messaging protocols must be rewritten; otherwise they cannot adopt Pulsar's unified messaging protocol.

To address this problem, the Pulsar community has previously built tools for migrating applications from other messaging systems to Pulsar. For example, Pulsar provides a Kafka wrapper over the Kafka Java API.

The Kafka wrapper lets users switch an application built on the Kafka Java client from Kafka to Pulsar without changing the code. Pulsar also provides a rich connector ecosystem for connecting Pulsar with other data systems.

However, there is still strong demand from users who want to move other kinds of Kafka applications to Pulsar.

How KoP was born

As a result, the idea of "supporting the Kafka protocol on Pulsar" came into being. The first idea was to add a proxy, much like the HTTP-style proxies many companies put in front of Kafka, and have it translate Kafka requests into the Pulsar protocol.

The second idea was to plug the Kafka protocol directly into the Pulsar broker, which is the shape KoP eventually took.

So what does the first, proxy-based approach look like in practice? OVHcloud made an attempt.

Previously, OVHcloud had been using Apache Kafka. Despite their experience running multiple Kafka clusters processing millions of messages per second, they still faced daunting operational challenges. So OVHcloud abandoned Kafka and decided to move the products it serves onto Pulsar and build on Pulsar going forward.

But to take care of users still on Kafka, they wanted to put a proxy in front of Pulsar to support the Kafka protocol. Their initial approach was to translate Kafka protocol frames into Pulsar protocol frames.

The proxy receives each frame from the Kafka client and converts it into the corresponding Pulsar interface through a set of state machines.

One state machine receives Kafka requests, a second handles Pulsar responses, and a third sits in the middle to synchronize the two.

Because these operations happen at the TCP layer, the proxy performs well, and thanks to Rust the whole thing runs smoothly. But the code still has to be written piece by piece, and some parts of the Kafka protocol simply cannot be implemented in a proxy, for example the group coordinator and offset management.

Another sticking point was that, being written in Rust, it was hard to open-source, and even if open-sourced it would be hard to plug into the Pulsar system as a component.

Just last year, a tweet from StreamNative caught OVHcloud's attention: the KoP demo that Jia Zhai shared at StreamNative's first offline Pulsar meetup.

After several exchanges of experience, the two sides jointly built a more complete KoP, taking advantage of Pulsar and BookKeeper's event-stream storage architecture and Pulsar's pluggable protocol-handler framework to deliver a concise and comprehensive solution.

How the KoP component collaborates with the broker

Going back to the Pulsar architecture, the core modules are the broker, BookKeeper, and ZooKeeper. Pulsar is a distributed streaming storage built on the managed ledger, covering how data is stored, how data loss is prevented, how a stream is replicated from the local data center to another, and so on.

The Pulsar protocol layer itself, the Pulsar protocol handler, is very lightweight. It mainly parses request frames arriving over TCP and translates them into reads and writes. The substantial parts of Pulsar live at the storage layer, the load-balancing layer, and so on.

The idea was to abstract the protocol handler into a framework / interface. Using this framework, a plug-in gets direct access to the storage system Pulsar has already built, and all that is left is protocol parsing and conversion.

Following this idea, the Kafka protocol was put into practice. Pulsar 2.5 introduced the pluggable protocol handler (PIP-41), which separates this interface out of the broker.
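For reference, the interface introduced by PIP-41 has roughly the following shape (a sketch paraphrased from the Pulsar 2.5 codebase; names and signatures may differ slightly across versions):

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import java.net.InetSocketAddress;
import java.util.Map;
import org.apache.pulsar.broker.ServiceConfiguration;
import org.apache.pulsar.broker.service.BrokerService;

// Sketch of org.apache.pulsar.broker.protocol.ProtocolHandler (PIP-41).
// The broker loads an implementation from a NAR file, initializes and
// starts it, then binds the Netty channel initializers it returns so the
// plug-in can speak its own wire protocol over TCP.
public interface ProtocolHandler extends AutoCloseable {
    // Protocol name, e.g. "kafka".
    String protocolName();

    // Whether this handler can serve the given protocol.
    boolean accept(String protocol);

    // Called once with the broker's configuration.
    void initialize(ServiceConfiguration conf) throws Exception;

    // Connection data advertised to clients (e.g. the Kafka listener address).
    String getProtocolDataToAdvertise();

    // Called on broker start; gives the handler access to broker internals.
    void start(BrokerService service);

    // One Netty channel initializer per listen address (e.g. port 9092 for Kafka).
    Map<InetSocketAddress, ChannelInitializer<SocketChannel>> newChannelInitializers();

    @Override
    void close();
}
```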

Using a Pulsar protocol handler is similar to using a Pulsar function or connector: simply drop it into the Pulsar broker to give Pulsar the ability to read and parse another protocol. The mechanism only requires adjusting two configuration settings:
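The two settings live in the broker configuration (broker.conf). Per the KoP README they look roughly like this; the Kafka listener address shown is an additional KoP-specific setting and a placeholder:

```
# Load the Kafka protocol handler (shipped as a NAR file)
messagingProtocols=kafka
# Directory containing protocol handler NAR files
protocolHandlerDirectory=./protocols

# KoP-specific: the address the Kafka protocol listens on
# (named `listeners` in early KoP releases)
kafkaListeners=PLAINTEXT://127.0.0.1:9092
```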

After the configuration is complete, restart the brokers and the cluster can process these "other types of protocols". Of course, this feature is only available from Pulsar 2.5 onward, so if you want to try it, upgrade your Pulsar deployment to 2.5 first.

Under this mechanism the process becomes much clearer and simpler. You only need to implement a Kafka protocol handler in Pulsar; native Kafka clients can then connect directly to the Pulsar cluster, and the brokers handle their Kafka requests.
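To illustrate what this enables: once KoP is installed, an unmodified Kafka Java client can produce straight to Pulsar by pointing its bootstrap servers at the broker's Kafka listener. A minimal sketch, assuming the listener above and a placeholder topic name:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KopProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The Pulsar broker's Kafka listener (see kafkaListeners above),
        // not an actual Kafka cluster.
        props.put("bootstrap.servers", "127.0.0.1:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send one record and wait for the broker acknowledgement.
            producer.send(new ProducerRecord<>("my-topic", "key",
                    "hello from a plain Kafka client")).get();
        }
    }
}
```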

Why choose Kafka as the first protocol to support?

Because Pulsar and Kafka have a lot in common. At the log layer, both use a very similar data model for publishing and subscribing to messages and event streams, and both are built on a distributed log.

Comparing Pulsar and Kafka, we find many similarities between the two systems. Both involve the following operations:

Topic lookup

Clients connect to any broker to look up a topic's metadata (that is, its owner broker). Once the metadata is obtained, the client establishes a persistent TCP connection to the owner broker.

Publish

The client talks to the owner broker of a topic partition to append messages to a distributed log.

Consume

The client talks to the owner broker of a topic partition to read messages from the distributed log.

Offset

Messages published to a topic partition are assigned an offset. In Pulsar, the offset is called a MessageId. Consumers can use offsets to seek to a given position in the log and read messages.

Consumption state

Both systems maintain the consumption state for consumers in a subscription (called a consumer group in Kafka). Kafka stores consumption state in the `__consumer_offsets` topic, while Pulsar stores it in `cursors`.

Implementation

1. Topic

Kafka stores all topics in a flat namespace, whereas Pulsar stores topics in a hierarchical, multi-tenant namespace. We added the `kafkaNamespace` setting to the broker configuration so that administrators can map Kafka topics to Pulsar topics.

To make it easier for Kafka users to use Apache Pulsar's multi-tenancy feature, a Kafka client that authenticates with the SASL mechanism can specify a Pulsar tenant and namespace as its SASL username; a sketch of the resulting mapping follows below.
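To picture the mapping, here is a hypothetical helper (not KoP's actual code); the tenant and namespace would come from configuration such as `kafkaNamespace` or from the SASL username described above:

```java
// Hypothetical sketch: map a flat Kafka topic name into Pulsar's
// hierarchical persistent://tenant/namespace/topic form.
public final class KafkaTopicNameMapper {
    public static String toPulsarTopic(String tenant, String namespace, String kafkaTopic) {
        return String.format("persistent://%s/%s/%s", tenant, namespace, kafkaTopic);
    }

    public static void main(String[] args) {
        // A Kafka client producing to "orders" under a default mapping
        // would land on this Pulsar topic:
        System.out.println(toPulsarTopic("public", "default", "orders"));
        // -> persistent://public/default/orders
    }
}
```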

2. Message ID and offset

Kafka assigns an offset to each message successfully published to a topic partition. Pulsar assigns each message a `MessageId`, composed of `ledger-id`, `entry-id`, and `batch-index`. We use the same method as the Pulsar-Kafka wrapper to convert a Pulsar message ID to an offset, and vice versa.
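One way to picture the conversion is packing the three components into a single 64-bit Kafka offset and unpacking it again. The bit widths below are illustrative assumptions, not necessarily the wrapper's exact layout:

```java
// Illustrative sketch: pack (ledgerId, entryId, batchIndex) into a 64-bit
// offset and back. The 28/12-bit split is an assumption for illustration.
public final class OffsetCodec {
    private static final int ENTRY_BITS = 28;
    private static final int BATCH_BITS = 12;

    public static long toOffset(long ledgerId, long entryId, int batchIndex) {
        return (ledgerId << (ENTRY_BITS + BATCH_BITS))
                | (entryId << BATCH_BITS)
                | batchIndex;
    }

    public static long[] fromOffset(long offset) {
        long ledgerId = offset >>> (ENTRY_BITS + BATCH_BITS);
        long entryId = (offset >>> BATCH_BITS) & ((1L << ENTRY_BITS) - 1);
        long batchIndex = offset & ((1L << BATCH_BITS) - 1);
        return new long[] {ledgerId, entryId, batchIndex};
    }
}
```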

3. Message

Both Kafka and Pulsar messages contain a key, a value, a timestamp, and headers (called "properties" in Pulsar). We convert these fields between Kafka messages and Pulsar messages automatically.
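A rough sketch of that field-by-field conversion, using the public Pulsar producer API to stand in for what the plug-in does internally (the plug-in itself writes at the storage layer; the types here are placeholders):

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.TypedMessageBuilder;

public final class RecordConverter {
    // Copy key, value, timestamp, and headers from a Kafka record onto a
    // Pulsar message (Kafka headers become Pulsar properties).
    public static void publishAsPulsar(ProducerRecord<String, byte[]> record,
                                       Producer<byte[]> producer) throws PulsarClientException {
        TypedMessageBuilder<byte[]> msg = producer.newMessage().value(record.value());
        if (record.key() != null) {
            msg.key(record.key());
        }
        if (record.timestamp() != null) {
            msg.eventTime(record.timestamp());
        }
        for (Header header : record.headers()) {
            msg.property(header.key(), new String(header.value(), StandardCharsets.UTF_8));
        }
        msg.send();
    }
}
```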

4. Topic lookup

We provide the same topic lookup method in the Kafka request handling plug-in as in Pulsar. The plug-in discovers topics, looks up the ownership of all requested topic partitions, and then returns a Kafka `TopicMetadata` response containing the ownership information to the Kafka client.

5. Publish a message

When it receives a message from a Kafka client, the Kafka request handling plug-in maps each field (key, value, timestamp, and headers) one by one, transforming the Kafka message into a Pulsar message.

The plug-in then uses the ManagedLedger append API to store the converted Pulsar messages in BookKeeper. Because Kafka messages are converted into Pulsar messages, existing Pulsar applications can receive messages published by Kafka clients.
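The storage step can be sketched as follows, reusing the illustrative `OffsetCodec` from earlier. This is a simplification: the real plug-in uses the asynchronous append API, and the cast to `PositionImpl` is an assumption about the managed-ledger implementation classes:

```java
import org.apache.bookkeeper.mledger.ManagedLedger;
import org.apache.bookkeeper.mledger.impl.PositionImpl;

public final class PublishSketch {
    // Append a converted (Pulsar-format) payload to the topic's managed
    // ledger and report the resulting position back as a Kafka offset.
    static long appendAsKafkaOffset(ManagedLedger ledger, byte[] pulsarFormatPayload)
            throws Exception {
        // Assumption: addEntry returns a PositionImpl carrying ledger/entry ids.
        PositionImpl pos = (PositionImpl) ledger.addEntry(pulsarFormatPayload);
        // Pack into the 64-bit offset sketched in OffsetCodec above;
        // batch-index 0 for a non-batched message.
        return OffsetCodec.toOffset(pos.getLedgerId(), pos.getEntryId(), 0);
    }
}
```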

6. Consume messages

When it receives a consumer request from a Kafka client, the Kafka request handling plug-in opens a non-durable cursor and reads entries starting from the requested offset.

Because the plug-in converts the Pulsar messages back into Kafka messages, existing Kafka applications can receive messages published by Pulsar clients.
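In managed-ledger terms, the read path can be sketched like this (a simplification of the plug-in's asynchronous code; the position arguments would come from decoding the requested Kafka offset):

```java
import java.util.List;
import org.apache.bookkeeper.mledger.Entry;
import org.apache.bookkeeper.mledger.ManagedCursor;
import org.apache.bookkeeper.mledger.ManagedLedger;
import org.apache.bookkeeper.mledger.impl.PositionImpl;

public final class FetchSketch {
    // Read up to maxEntries starting at the position decoded from the
    // requested Kafka offset, then release the entries' buffers.
    static void fetch(ManagedLedger ledger, long ledgerId, long entryId, int maxEntries)
            throws Exception {
        ManagedCursor cursor =
                ledger.newNonDurableCursor(new PositionImpl(ledgerId, entryId));
        List<Entry> entries = cursor.readEntries(maxEntries);
        for (Entry entry : entries) {
            // ... convert each stored Pulsar message back into a Kafka record ...
            entry.release();
        }
    }
}
```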

7. Group coordinator & offset management

The biggest challenge was implementing the group coordinator and offset management. Pulsar has no centralized group coordinator: it cannot assign partitions to the consumers in a consumer group, nor manage offsets for each consumer group.

The Pulsar broker manages partition assignment on a per-partition basis, and the owner broker of each partition manages offsets by storing acknowledgement information in cursors.

It is difficult to align the Pulsar model with the Kafka model. Therefore, to be fully compatible with Kafka clients, we implemented the Kafka group coordinator by storing the coordinator group changes and offsets in a Pulsar system topic named `public/kafka/__offsets`.

In this way we can bridge Pulsar and Kafka, and users can manage subscriptions and monitor Kafka consumers with existing Pulsar tools and policies. We added a background thread to the group coordinator implementation that periodically synchronizes offset updates from the system topic to the Pulsar cursors, as sketched below.
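The background synchronization can be pictured as a scheduled task; `readLatestCommittedOffsets` and `acknowledgeUpTo` are hypothetical placeholders for the real logic that reads the `public/kafka/__offsets` system topic and advances the matching Pulsar cursor:

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public final class OffsetSyncTask {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Periodically mirror committed Kafka offsets into Pulsar cursors so
    // that Pulsar tooling sees Kafka consumer groups as subscriptions.
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            // Hypothetical: group -> (partition -> committed offset),
            // read back from the public/kafka/__offsets system topic.
            Map<String, Map<Integer, Long>> committed = readLatestCommittedOffsets();
            committed.forEach((group, offsets) ->
                    offsets.forEach((partition, offset) ->
                            // Hypothetical: ack the subscription's cursor up to
                            // the MessageId decoded from this offset.
                            acknowledgeUpTo(group, partition, offset)));
        }, 0, 60, TimeUnit.SECONDS);
    }

    private Map<String, Map<Integer, Long>> readLatestCommittedOffsets() { return Map.of(); }
    private void acknowledgeUpTo(String group, int partition, long offset) { }
}
```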

In effect, a Kafka consumer group is treated as a Pulsar subscription, so all existing Pulsar tools can also be used to manage Kafka consumer groups.

KoP in production

To apply KoP in a real-world scenario, you need to consider the following aspects:

Multi-tenancy
Security
Cross-datacenter replication
Tiered storage
Schema
Integration with existing data systems such as Flink, Spark, and Presto

Q & A

1. Pulsar has a variety of extensions. Is there a unified way to manage them?

We are currently working on a project called Pulsar Registry, similar to Docker Hub. Think of it as an app store collecting components / plug-ins; stay tuned.

2. Can Kafka versions below 0.11 be smoothly upgraded to a higher version? If the message format changes, is a smooth upgrade impossible?

No; only version 0.10 and above can be upgraded smoothly.

The ultimate goal of KoP is to make it easy for users to migrate existing Kafka applications to Pulsar and to build products on top of KoP. Going forward, KoP will also add schema support and compatibility with more Kafka versions.

The above is the development process of Kafka-on-Pulsar as shared by the editor. If you have similar questions, the analysis above may help you understand; if you want to learn more, stay tuned for further coverage.
