2025-04-03 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article looks at the main differences between Apache Pulsar and Apache Kafka, comparing their architectures, operations, ecosystems, performance, and feature sets.
Components
Pulsar consists of three main components: the broker, Apache BookKeeper, and Apache ZooKeeper. The broker is a stateless service to which clients connect for core messaging. BookKeeper and ZooKeeper are stateful services: BookKeeper nodes (bookies) store messages and cursors, while ZooKeeper stores only metadata for brokers and bookies. In addition, BookKeeper uses RocksDB as an embedded database for its internal indexes; RocksDB is managed as part of BookKeeper rather than independently.
Kafka uses a monolithic architecture that couples serving and storage, whereas Pulsar uses a multi-layer architecture in which each layer can be managed separately: brokers handle compute on one layer, while bookies manage stateful storage on another.
Pulsar's multi-layer architecture may seem more complex than Kafka's monolithic design, but the trade-off is worthwhile: BookKeeper makes Pulsar more scalable, faster, more consistent, and less of an operational burden. We discuss each of these points below.
Storage architecture
Pulsar's multi-layer architecture shapes the way it stores data. Pulsar splits a topic partition into segments and distributes those segments across Apache BookKeeper storage nodes, which improves performance, scalability, and availability.
Pulsar's infinite distributed log is segment-centric: it is backed by scalable log storage (through Apache BookKeeper) and has built-in tiered storage support, so segments can be distributed evenly across storage nodes. Because the data for any given topic is not tied to a particular storage node, it is easy to replace storage nodes or to scale the storage layer in and out; the smallest or slowest node in the cluster never becomes a storage or bandwidth bottleneck.
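The segment placement described above can be sketched as a toy model (bookie names and the round-robin policy here are purely illustrative; BookKeeper's real ensemble placement is more sophisticated):

```python
# Toy model (not BookKeeper code): a topic's data is split into segments,
# and each segment's replica ensemble is chosen independently from the
# live bookies, so no single node holds a whole topic.
import itertools

def place_segments(num_segments, bookies, ensemble_size=2):
    """Assign each segment to `ensemble_size` bookies, round-robin."""
    ring = itertools.cycle(bookies)
    return [[next(ring) for _ in range(ensemble_size)]
            for _ in range(num_segments)]

bookies = ["bookie-1", "bookie-2", "bookie-3"]
print(place_segments(4, bookies))
# A newly added bookie starts receiving *new* segments immediately;
# existing segments never need to move:
print(place_segments(2, bookies + ["bookie-4"]))
```

This is why adding a storage node requires no data rebalancing: new segments simply start landing on the new node.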
Pulsar's architecture requires no partition rebalancing, which ensures timely scalability and high availability. These two properties make Pulsar particularly well suited to mission-critical services, such as billing platforms for financial use cases, transaction processing systems for e-commerce and retail, and real-time risk control systems for financial institutions.
Thanks to the powerful Netty framework, data transfer from producers to brokers and on to bookies is zero-copy: no intermediate copies are made. This benefits all streaming use cases, because data moves across the network and onto disk without avoidable performance loss.
Message consumption
Pulsar's consumption model uses stream pull, an improved form of long polling that not only achieves zero wait between individual calls and requests but also provides a bidirectional message stream. With the stream-pull model, Pulsar achieves lower end-to-end latency than long-polling messaging systems such as Kafka.
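The difference can be illustrated with a toy cost model (this is an illustration of the general idea, not a measurement of either system): long polling pays a request/response round trip per batch, while a stream pull opens one bidirectional stream and lets the broker push subsequent batches.

```python
# Toy latency model (illustrative only): compare the pure round-trip
# overhead of re-issuing a poll per batch vs. one persistent stream.
def long_polling_cost(batches, rtt_ms):
    return batches * rtt_ms   # a fresh poll request for every batch

def stream_pull_cost(batches, rtt_ms):
    return rtt_ms             # one request opens the stream; the rest is pushed

print(long_polling_cost(100, 5))  # 500 ms of round-trip overhead
print(stream_pull_cost(100, 5))   # 5 ms
```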
Operational simplicity
When assessing the operational simplicity of a technology, consider not only the initial setup but also long-term maintenance and scalability:
- Is it fast and convenient to expand the cluster to keep up with business growth?
- Is the cluster usable out of the box for multi-tenancy (multiple teams and users)?
- Will operational tasks (such as replacing hardware) affect the availability and reliability of the business?
- Can data be easily replicated for geographic redundancy or different access patterns?
Long-time Kafka users will find that none of these questions is easy to answer when operating Kafka. Most of these tasks require tools outside Kafka itself, such as Cruise Control for managing cluster rebalancing and Kafka MirrorMaker for replication.
Because Kafka is difficult to share among teams, many organizations have developed tools to support and manage multiple separate clusters. These tools are critical to using Kafka successfully at scale, but they also add complexity, and the most capable tools for managing Kafka clusters tend to be closed-source commercial software. It is not surprising, then, that the complexity of managing and operating Kafka drives many enterprises to buy Confluent's commercial services.
By contrast, Pulsar is designed to simplify operations and to scale. Based on Pulsar's capabilities, we can answer the questions above as follows:
Is it fast and convenient to expand the cluster to keep up with business growth?
Pulsar's automatic load balancing puts new compute and storage capacity in the cluster to use automatically and immediately: topics migrate between brokers to balance load, and new bookie nodes immediately accept write traffic for new data segments, with no manual rebalancing or broker management required.
Is the cluster usable out of the box for multi-tenancy (multiple teams and users)?
Pulsar has a hierarchical resource model in which tenants and namespaces map logically to organizations and teams. Within that hierarchy, Pulsar supports simple ACLs, quotas, self-service control, and even resource isolation, allowing users to manage a shared cluster with ease.
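The hierarchy is visible in Pulsar's topic naming, where the full topic name embeds the tenant and namespace. The sketch below uses Pulsar's real topic URL format, but the tenant names and the tiny ACL table are hypothetical, for illustration only:

```python
# Pulsar topics are addressed as scheme://tenant/namespace/topic, which is
# what makes per-team isolation and ACLs natural in a shared cluster.
def topic_name(tenant, namespace, topic, persistent=True):
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"

# Hypothetical per-namespace ACL table (illustrative, not a Pulsar API):
acls = {("billing", "invoices"): {"team-billing"}}

def can_produce(role, tenant, namespace):
    return role in acls.get((tenant, namespace), set())

print(topic_name("billing", "invoices", "created"))
# → persistent://billing/invoices/created
print(can_produce("team-billing", "billing", "invoices"))   # True
print(can_produce("team-analytics", "billing", "invoices")) # False
```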
Will operational tasks (such as replacing hardware) affect the availability and reliability of the business?
Replacing Pulsar's stateless brokers is straightforward, with no risk of data loss. Bookie nodes automatically re-replicate any under-replicated data fragments, and the tools for decommissioning and replacing nodes are built in and easy to automate.
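The bookie re-replication just described can be sketched as a toy model (illustrative only; BookKeeper's auto-recovery is far more involved):

```python
# Toy re-replication model (not BookKeeper code): when a bookie is
# decommissioned, every fragment it held is copied to another live bookie
# so the replication factor is restored.
def recover(placement, dead, live, replication=2):
    """placement: fragment name -> set of bookies holding it."""
    fixed = {}
    for frag, holders in placement.items():
        holders = set(holders) - {dead}          # drop the dead bookie
        for b in live:
            if len(holders) >= replication:
                break
            if b != dead and b not in holders:
                holders.add(b)                   # copy fragment to a new host
        fixed[frag] = holders
    return fixed

placement = {"frag-0": {"b1", "b2"}, "frag-1": {"b2", "b3"}}
print(recover(placement, dead="b2", live=["b1", "b3", "b4"]))
```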
Can data be easily replicated for geographic redundancy or different access patterns?
Pulsar has built-in replication that seamlessly synchronizes data across geographic regions or replicates it to other clusters for additional purposes, such as disaster recovery or analytics.
Compared with Kafka, Pulsar provides a more complete solution to the practical problems of streaming data: it has a fuller core feature set and is easier to use, letting users and developers focus on core business needs.
Documentation and Learning
Because Pulsar is a newer technology than Kafka, its ecosystem is less mature and its documentation and training resources are still being filled out. However, this has been a main focus of Pulsar's development over the past year and a half, with results including:
At Pulsar Summit Virtual Conference 2020, Pulsar's first global summit, speakers from more than 25 organizations delivered 36 talks to more than 600 registered attendees.
More than 50 original videos and training sections were created in 2020.
Weekly Pulsar live streams and interactive tutorials.
Professional training delivered by leading industry instructors.
Monthly webinars held with strategic business partners.
White papers published on use cases from Tuya, OVHCloud, Tencent, Yahoo! JAPAN, and others.
For more information about Pulsar documentation and training, see StreamNative's Resources website.
Enterprise support
Both Kafka and Pulsar offer enterprise-level support. Several large vendors, including Confluent, provide enterprise support for Kafka. StreamNative provides enterprise support for Pulsar, though StreamNative is still a young company; it offers enterprises a fully managed Pulsar cloud service as well as Pulsar enterprise support services.
The StreamNative team is experienced in messaging and event streaming and is growing rapidly. StreamNative was founded by core members of the Pulsar and BookKeeper projects. With the StreamNative team's help, the Pulsar ecosystem has grown by leaps and bounds in just a few years, and with the support of strategic partners this growth will further enable Pulsar to meet the needs of a wide range of use cases (described in more detail in the next article).
Among recent major developments, KoP (Kafka on Pulsar) was launched by OVHCloud and StreamNative in March 2020. By adding the KoP protocol handler to an existing Pulsar cluster, users can migrate existing Kafka applications and services to Pulsar without modifying code. In June 2020, China Mobile and StreamNative announced another important project, AoP (AMQP on Pulsar). AoP lets RabbitMQ applications take advantage of Pulsar features such as unlimited event stream storage backed by Apache BookKeeper and tiered storage. This will be described in detail in the next article.
Ecosystem integrations
As the number of Pulsar users grows rapidly, the Pulsar community has developed into a large, highly engaged, global community. The number of tools and plugins in the Pulsar ecosystem is growing quickly, with the active community playing a vital role: over the past six months, the number of officially supported connectors in the Pulsar ecosystem has increased dramatically.
To further support this growth, StreamNative recently launched StreamNative Hub, a platform where users can find and download integrations. It will help accelerate the development of the Pulsar connector and plugin ecosystem.
The Pulsar community also works closely with other communities on joint integration projects. For example, the Pulsar and Flink communities have been jointly developing the Pulsar-Flink connector (part of FLIP-72). With the Pulsar-Spark connector, users can process events from Apache Pulsar with Apache Spark. The SkyWalking Pulsar plugin integrates Apache SkyWalking and Apache Pulsar, letting users trace messages through SkyWalking. Many more integration projects are under way in the Pulsar community.
Multiple client libraries
Pulsar currently supports seven languages officially, while Kafka officially supports only one. The Confluent blog states that Kafka supports 22 languages, but the official clients do not cover that many, and some of those languages are no longer maintained. According to the latest statistics, the official Apache Kafka client supports only one language, while the official Apache Pulsar client supports seven:
- Java
- C
- C++
- Python
- Go
- .NET
- Node.js
Pulsar also has many community-developed clients, including:
- Rust
- Scala
- Ruby
- Erlang
Performance and availability
Throughput, latency, and capacity
Pulsar and Kafka are both widely used across many enterprise use cases, each with its own strengths, and both can handle large volumes of traffic on roughly the same amount of hardware. Some users mistakenly assume that because Pulsar has more components, it needs more servers to match Kafka's performance. While that holds for a few specific hardware configurations, in most comparable resource configurations Pulsar has the advantage and delivers more performance from the same resources.
For example, Splunk recently shared why they chose Pulsar over Kafka, noting that thanks to the layered architecture, Pulsar helped them cut costs by 1.5-2x, latency by 5-50x, and operating costs by 2-3x (slide 34). The Splunk team found this was because Pulsar makes better use of disk I/O, uses less CPU, and controls memory better.
Companies such as Tencent chose Pulsar largely for its performance characteristics. According to a white paper on Tencent's billing platform, the platform serves millions of users, manages around 30 billion third-party escrow accounts, and now uses Pulsar to process hundreds of millions of dollars in transactions daily. Given Pulsar's predictable low latency and stronger consistency and durability guarantees, Tencent chose Pulsar over Kafka.
Ordering guarantees
Apache Pulsar supports four subscription modes; the mode an application chooses is determined by its ordering and consumption-scalability requirements. The four modes and their ordering guarantees are:
The exclusive and failover subscription modes both provide strong ordering guarantees at the partition level and support parallel consumption across consumers on the same topic.
The shared subscription mode allows the number of consumers to exceed the number of partitions, making it well suited to worker-queue use cases.
The key_shared subscription mode combines the advantages of the others: it allows the number of consumers to exceed the number of partitions while still providing strong ordering guarantees at the key level.
For more about Pulsar subscription modes and their ordering guarantees, see the subscription documentation.
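The four modes can be contrasted with a toy dispatcher (illustrative only, not client or broker code; real Pulsar key_shared uses consistent hashing over key hash ranges, while this sketch uses a simple modulo):

```python
# Toy dispatcher contrasting Pulsar's four subscription modes.
# `consumers` is an ordered list of consumer names.
import itertools
import zlib

def dispatch(messages, consumers, mode):
    rr = itertools.cycle(range(len(consumers)))
    out = {c: [] for c in consumers}
    for key, payload in messages:
        if mode in ("exclusive", "failover"):
            target = consumers[0]          # one active consumer, strict order
        elif mode == "shared":
            target = consumers[next(rr)]   # round-robin, scales past partitions
        elif mode == "key_shared":
            # same key -> same consumer, so per-key order is preserved
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        out[target].append(payload)
    return out

msgs = [("user-1", "a"), ("user-2", "b"), ("user-1", "c")]
print(dispatch(msgs, ["c0", "c1"], "shared"))
print(dispatch(msgs, ["c0", "c1"], "key_shared"))
```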
Features
Built-in stream processing
Pulsar and Kafka take different approaches to built-in stream processing. For complex stream processing, Pulsar integrates with two mature frameworks, Flink and Spark, and provides Pulsar Functions for lightweight compute. Kafka developed its own stream processing engine, Kafka Streams.
However, using Kafka Streams is more involved: users must work out the scenarios and patterns for KStreams applications, and KStreams is too heavyweight for most lightweight compute use cases.
Pulsar Functions, by contrast, handles lightweight compute use cases easily and lets users create processing logic without deploying a separate system. Pulsar Functions supports multiple languages with easy-to-use APIs, so users can write event-streaming applications without learning a complex API.
A recently submitted Pulsar Improvement Proposal (PIP) introduces Function Mesh, a serverless event-streaming framework that combines multiple Pulsar Functions to simplify building complex event-streaming applications.
Exactly-Once processing
Currently, Pulsar supports an exactly-once producer through broker-side deduplication, and further work on exactly-once semantics is under active development.
Since PIP-31, Pulsar has been adding support for transactional message streams, which is still under development. This feature strengthens Pulsar's message delivery semantics and processing guarantees: in a transactional stream, each message is written and processed exactly once, with no duplication or loss even if a broker or Function instance fails. Transactional messaging not only simplifies writing applications with Pulsar or Pulsar Functions but also broadens the range of use cases Pulsar supports. Development is progressing smoothly, and the feature is expected in Pulsar 2.7.0, planned for release in September 2020.
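The broker-side deduplication behind the exactly-once producer can be sketched as a toy model (illustrative, not broker code): the broker remembers the highest sequence id seen per producer and silently drops retransmits.

```python
# Toy model of broker-side deduplication: a retried send with an
# already-seen sequence id is acknowledged but not stored again.
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer name -> highest sequence id seen

    def receive(self, producer, seq, payload):
        if seq <= self.last_seq.get(producer, -1):
            return False     # duplicate retry: dropped, not stored
        self.last_seq[producer] = seq
        self.log.append(payload)
        return True

b = Broker()
b.receive("p1", 0, "order-created")
b.receive("p1", 0, "order-created")   # network retry of the same message
b.receive("p1", 1, "order-paid")
print(b.log)                          # each message stored exactly once
```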
Topic (log) compaction
Pulsar is designed around letting users choose how they consume data: applications can read either the raw data or the compacted data. With this on-demand choice, Pulsar lets the raw data grow without bound under a retention policy while still running periodic compaction to produce an up-to-date materialized view. The built-in tiered storage feature lets Pulsar offload raw data from BookKeeper to cloud storage, reducing the cost of long-term event storage.
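The raw-versus-compacted choice can be sketched with a toy compaction pass (illustrative only): the raw log keeps every event, while the compacted view retains only the latest value per key.

```python
# Toy topic compaction: keep only the latest value per key
# (keys appear in first-seen order), leaving the raw log untouched.
def compact(log):
    latest = {}
    for key, value in log:
        latest[key] = value   # later writes overwrite earlier ones
    return list(latest.items())

raw = [("k1", "v1"), ("k2", "v1"), ("k1", "v2")]
print(compact(raw))   # → [('k1', 'v2'), ('k2', 'v1')]
print(raw)            # raw log unchanged; consumers choose which to read
```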
Kafka, by contrast, does not let users read the raw data: Kafka deletes the original data as soon as it has been compacted.
Use cases
Event streaming
Yahoo originally developed Pulsar as a publish/subscribe messaging platform (also known as cloud messaging). Today, however, Pulsar is more than a messaging platform: it is a unified messaging and event-streaming platform, with a set of tools that provide the foundation for building event-streaming applications. Pulsar supports the following event-streaming capabilities:
Infinite event stream storage: scalable log storage (through Apache BookKeeper) supports storing events at scale, and Pulsar's built-in tiered storage supports high-quality, low-cost systems such as S3 and HDFS.
A unified publish/subscribe messaging model that makes it easy to add messaging to applications and scales with traffic and user needs.
A protocol-handler framework with Kafka protocol compatibility (through Kafka-on-Pulsar, KoP) and AMQP compatibility (through AMQP-on-Pulsar), letting applications produce and consume events anywhere using their existing protocols.
Pulsar IO, a set of connectors integrated with the broader ecosystem that lets users pull data from external systems without writing code.
Integration between Pulsar and Flink for comprehensive event processing.
Pulsar Functions, a lightweight serverless framework for processing incoming events.
Integration between Pulsar and Presto (Pulsar SQL), letting data professionals analyze data and run business queries with ANSI-compatible SQL.
Message routing
Through Pulsar IO, Pulsar Functions, and Pulsar protocol handlers, Pulsar provides comprehensive routing capabilities, including content-based routing, message transformation, and message enrichment.
Compared with Kafka, Pulsar has more robust routing capabilities and a more flexible deployment model for connectors and Functions: simple deployments can run inside the brokers, while large-scale workloads can run in a dedicated node pool (similar to Kafka Streams). Pulsar also integrates natively with Kubernetes and can be configured to schedule Functions and connector workloads as pods, taking full advantage of Kubernetes elasticity.
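Content-based routing of the kind Pulsar IO and Functions enable can be sketched as follows (the topic names, fields, and rules here are hypothetical, for illustration only):

```python
# Toy content-based router: inspect each message and choose a destination
# topic from its content. In Pulsar this logic would live in a Function.
def route(message: dict) -> str:
    if message.get("amount", 0) > 10_000:
        return "persistent://risk/alerts/large-payments"
    if message.get("type") == "refund":
        return "persistent://billing/refunds/incoming"
    return "persistent://billing/payments/incoming"

print(route({"type": "payment", "amount": 50_000}))
# → persistent://risk/alerts/large-payments
print(route({"type": "refund", "amount": 10}))
# → persistent://billing/refunds/incoming
```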
Message queue
As mentioned earlier, Pulsar was developed as a unified messaging publish/subscribe platform. The Pulsar team had a deep understanding of the trade-offs in existing open-source messaging systems and designed Pulsar's unified messaging model from that experience. The Pulsar messaging API supports both queuing and streaming: it can implement worker queues that distribute messages among competing consumers (through shared subscriptions), and it supports event streams ordered either within a partition (through failover subscriptions) or within key ranges (through key_shared subscriptions). Users can build both messaging and event-streaming applications on the same data without copying it into separate systems.
In addition, the Pulsar community is working to make Apache Pulsar natively support additional messaging protocols (such as AoP and KoP) to extend Pulsar's reach.
This concludes our look at the differences between Pulsar and Kafka. We hope it has answered your questions; theory works best alongside practice, so go and try it for yourself.