Three Python Libraries You Should Know to Master Apache Kafka

This article covers the three Python client libraries you should know when working with Apache Kafka. It is practical material; I hope you come away with something useful after reading it.
Apache Kafka is a distributed streaming platform that can publish, subscribe to, store, and process messages in real time. Its pull-based architecture reduces pressure on heavily loaded services and makes it easy to scale. It moves large amounts of data from source to destination with low latency.
Kafka is a JVM-based platform, so the mainstream client language is Java. Thanks to a thriving community, however, high-quality open source Python clients are now available and are used in production.
In this article, I will introduce and compare the best-known Python Kafka clients: kafka-python, pykafka, and confluent-kafka. Finally, I will give my view of each library's pros and cons.
Why do we need Kafka?
First things first: why choose Kafka? Kafka is designed to enhance event-driven architectures by providing high throughput, low latency, high durability, and high availability. That does not mean you can have all of them at once; there is always a trade-off. (Read this white paper for more information.)
Besides its high performance, another attractive feature is the publish/subscribe model, in which senders do not address messages to specific receivers. Instead, messages are delivered, organized by topic, to a centralized location that receivers can subscribe to.
In this way, we can easily decouple applications and get away from a monolithic design. Let's look at an example of why decoupling works better.
Say the site you built needs to send user activity somewhere, so you write a direct connection from the site to a real-time monitoring dashboard. This is a simple solution, and it works well. One day, you decide to store user activity in a database for future analysis, so you write another direct connection from the site to the database. Meanwhile, the site gets more and more traffic, and you want to extend it with an alerting service, a real-time analytics service, and so on.
Your architecture ends up looking like this. Problems such as sprawling code repositories, security issues, scalability issues, and maintainability issues will hurt you.
> Architecture without decoupling (Created by Xiaoxu Gao)
You need a hub to decouple applications with different roles. Applications that create events are called producers; they publish events to a centralized hub. Each event (i.e. message) belongs to a topic. Consumers sit on the other side of the hub. They subscribe to the topics they need from the hub, without talking directly to the producers.
With this model, the architecture can be easily extended and maintained. Engineers can focus more on the core business.
> Architecture with decoupling (Created by Xiaoxu Gao)
Kafka setup in a nutshell
You can download Apache Kafka from the official website; the Getting Started guide helps you start a server in seconds.
You can also get Apache Kafka through the Confluent Platform, by far the largest streaming-data platform built around Kafka. It provides a range of infrastructure services for individuals and businesses to deliver data as real-time streams. Its founders are members of the team that originally created Apache Kafka.
Each Kafka server is called a broker, and brokers can run in standalone mode or form a cluster. Besides Kafka, we also need Zookeeper, which stores metadata about Kafka. Zookeeper acts as a coordinator, responsible for managing the state of each broker in the distributed system.
> Kafka setup (Created by Xiaoxu Gao)
Suppose we have set up the infrastructure with one Zookeeper instance and one Kafka broker. Now it's time to connect! The original Java client provides five APIs:
Producer API: publish messages to topics in a Kafka cluster.
Consumer API: consume messages from topics in a Kafka cluster.
Streams API: consume messages from topics and transform them into other topics in the Kafka cluster. Operations include filtering, joining, mapping, grouping, and so on.
Connect API: connect a Kafka cluster directly to a source or sink system without writing code. The system can be a file, a relational database, Elasticsearch, and so on.
Admin API: manage and inspect topics and brokers in a Kafka cluster.
Python libraries for Kafka
In the Python world, three of the five APIs have been implemented: the Producer API, the Consumer API, and the Admin API. There is no Kafka Streams API in Python yet, but a good alternative is Faust.
The tests in this section are based on one Zookeeper instance and one Kafka broker installed locally. This article is not about performance tuning, so I mostly use the default configurations provided by each library.
Kafka-Python
kafka-python is designed to function much like the official Java client, with a sprinkling of pythonic interfaces. It is best used with Kafka 0.9+. The first release came out in March 2014, and the library is under active maintenance.
Installation
pip install kafka-python
Producer
Each message is sent asynchronously via send(). When called, it adds the record to a buffer and returns immediately. This lets the producer send records to the Kafka broker in batches for efficiency. Asynchrony greatly improves speed, but we should also be aware of the following:
In asynchronous mode, ordering is not guaranteed. You have no control over when the Kafka broker acknowledges each message.
It is good practice to provide the producer with a success callback and a failure callback. For example, you can log an info message in the success callback and log an exception in the failure callback.
Since ordering is not guaranteed, additional messages may already have been sent out before you receive the exception in the callback.
If you want to avoid these issues, you can choose to send messages synchronously. send() returns a FutureRecordMetadata; by calling future.get(timeout=60), the producer blocks for up to 60 seconds until the broker acknowledges the message. The drawback is speed: it is relatively slow compared with asynchronous mode.
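Here is a minimal sketch of both sending modes with kafka-python. The broker address localhost:9092 and the topic name test-topic are illustrative assumptions, not values from this article's benchmarks.

```python
from kafka import KafkaProducer

# Assumed local setup: one broker on localhost:9092 and a topic "test-topic".
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def on_success(metadata):
    # Runs once the broker acknowledges the record.
    print(f"Delivered to {metadata.topic}[{metadata.partition}] @ offset {metadata.offset}")

def on_error(exc):
    # Runs if delivery fails; later records may already be in flight by then.
    print(f"Delivery failed: {exc}")

# Asynchronous send: returns a FutureRecordMetadata immediately;
# the record is buffered and shipped to the broker in a batch.
future = producer.send("test-topic", b"hello kafka")
future.add_callback(on_success)
future.add_errback(on_error)

# Synchronous send: block for up to 60 seconds until the broker acknowledges.
metadata = producer.send("test-topic", b"hello again").get(timeout=60)

producer.flush()  # drain the buffer before exiting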
Consumer
A consumer instance is a Python iterator. The core of the consumer class is the poll() method, which lets the consumer keep pulling messages from a topic. One of its input parameters, timeout_ms, defaults to 0, meaning the method immediately returns whatever records have been pulled into the buffer and are available. You can increase timeout_ms to return larger batches.
By default, each consumer is an infinite listener, so it does not stop running until the program is interrupted. On the other hand, you can stop the consumer based on the messages received; for example, exit the loop and close the consumer when a certain offset is reached.
You can also assign a consumer to a single partition, or to multiple partitions across multiple topics.
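A minimal consumer sketch in the same spirit; the broker address, topic name, and the stopping offset of 100 are assumptions for illustration.

```python
from kafka import KafkaConsumer

# Assumed local setup: one broker on localhost:9092 and a topic "test-topic".
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

# The consumer is an iterator: by default it listens forever.
for message in consumer:
    print(f"{message.topic}[{message.partition}] @ {message.offset}: {message.value}")
    if message.offset >= 100:
        break  # stop the consumer once a chosen offset is reached

consumer.close()
```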
Here are the test results for the kafka-python library. Each message is 100 bytes. The average producer throughput is 1.4 MB/s; the average consumer throughput is 2.8 MB/s.
Confluent-kafka
confluent-kafka is a high-performance Kafka client for Python that wraps the high-performance C client librdkafka. Since version 1.0 it has been distributed as self-contained binary wheels for OS X and Linux on PyPI. It supports Kafka 0.8+. The first release came out in May 2016, and the library is under active maintenance.
Installation
For OS X and Linux, librdkafka is included in the package and does not need to be installed separately.
pip install confluent-kafka
For Windows users, at the time of writing, confluent-kafka did not provide Python 3.8 binary wheels on Windows, so you will run into problems with librdkafka. Check their release notes; support is under active development. Another workaround is to downgrade to Python 3.7.
Producer
confluent-kafka has incredible performance in terms of speed. Its API is designed similarly to kafka-python's. You can make sends effectively synchronous by calling flush() after each produce(), at the cost of throughput.
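Here is a minimal sketch of the asynchronous pattern with a per-message delivery callback; the broker address and topic name are assumptions for illustration.

```python
from confluent_kafka import Producer

# Assumed local setup: one broker on localhost:9092 and a topic "test-topic".
producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Invoked from poll() or flush() once the broker has (n)acked the message.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}")

for i in range(10):
    producer.produce("test-topic", f"message {i}".encode(), callback=delivery_report)
    producer.poll(0)  # serve delivery callbacks for earlier messages

producer.flush()  # block until every outstanding message is delivered
```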
Consumer
The Consumer API in confluent-kafka requires more code. Instead of a high-level loop handling things for you (as with kafka-python's iterator), you need to write the while loop yourself. I suggest creating your own consume() function, essentially a Python generator: whenever a message has been pulled and is available in the buffer, it yields the message.
This way, the main function stays clean and you are free to control the consumer's behavior. For example, you can define a "session window" inside consume(): if no message is pulled within X seconds, the consumer stops. Or you can add a flag such as infinite=True as an input parameter to control whether the consumer should be an infinite listener. A sketch follows.
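In this sketch, consume() is my own helper, not part of the library; the group id, topic, broker address, and the 5-second session window are all assumptions.

```python
from confluent_kafka import Consumer

# Assumed local setup; the group id and topic are illustrative.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["test-topic"])

def consume(consumer, session_window=5.0, infinite=False):
    """Yield messages; stop when none arrive within session_window seconds,
    unless infinite=True makes this an endless listener."""
    while True:
        msg = consumer.poll(timeout=session_window)
        if msg is None:
            if infinite:
                continue
            return  # session window expired with no new messages
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        yield msg

for msg in consume(consumer):
    print(f"{msg.topic()}[{msg.partition()}] @ {msg.offset()}: {msg.value()}")

consumer.close()
```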
Here are the test results for the confluent-kafka library. Each message is 100 bytes. The average producer throughput is 21.97 MB/s; the average consumer throughput is 16.8~28.7 MB/s.
PyKafka
PyKafka is a programmer-friendly Kafka client for Python. It includes Python implementations of Kafka producers and consumers, optionally backed by a C extension built on librdkafka. It supports Kafka 0.8.2+. The first release came out in August 2012, but the library has not been updated since November 2018.
Installation
pip install pykafka
The package does not ship with librdkafka; you need to install it separately on every operating system.
pykafka has a KafkaClient interface that covers both the Producer API and the Consumer API.
Producer
Messages can be sent in both asynchronous and synchronous modes. I found that pykafka changes the default values of several producer configurations (such as linger_ms and min_queued_messages), which affects the sending of small amounts of data. You can compare them with the defaults on the Apache Kafka website.
If you want a delivery report for each message, make sure to set min_queued_messages to 1; otherwise you will not receive any reports when sending fewer than 70000 messages (the default value of min_queued_messages).
> pykafka-producer-config
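A minimal sketch of that min_queued_messages fix; the broker address and topic name are assumptions for illustration.

```python
from pykafka import KafkaClient

# Assumed local setup: one broker on localhost:9092 and a topic "test-topic".
client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"test-topic"]

# min_queued_messages defaults to 70000; set it to 1 so that small batches
# are flushed immediately and delivery reports are actually produced.
with topic.get_producer(
    delivery_reports=True,
    min_queued_messages=1,
    linger_ms=0,
) as producer:
    producer.produce(b"hello pykafka")
    msg, exc = producer.get_delivery_report(block=True)
    if exc is not None:
        print(f"Delivery failed: {exc}")
    else:
        print("Delivered successfully")
```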
Consumer
You can get a SimpleConsumer from the KafkaClient interface. This is similar to kafka-python, with the poll loop wrapped inside the SimpleConsumer class.
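A minimal SimpleConsumer sketch; the broker address, topic name, and the 5-second consumer_timeout_ms are assumptions.

```python
from pykafka import KafkaClient

# Assumed local setup: one broker on localhost:9092 and a topic "test-topic".
client = KafkaClient(hosts="localhost:9092")
topic = client.topics[b"test-topic"]

# SimpleConsumer wraps the poll loop; iteration stops after 5 seconds of silence.
consumer = topic.get_simple_consumer(consumer_timeout_ms=5000)
for message in consumer:
    if message is not None:
        print(f"offset {message.offset}: {message.value}")
```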
Here are the test results for the pykafka library. Each message is 100 bytes. The average producer throughput is 2.1 MB/s; the average consumer throughput is 1.57 MB/s.
Conclusion
So far, I have explained the Producer and Consumer APIs of each library. As for the Admin API, kafka-python and confluent-kafka do provide explicit Admin APIs. You can use them in unit tests where you want to create a topic and then delete it before the next test executes. Also, if you want to build a Kafka monitoring dashboard in Python, the Admin API can retrieve metadata about the cluster and topics.
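For example, here is a minimal sketch using kafka-python's KafkaAdminClient; the broker address and topic name are assumptions.

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumed local setup: one broker on localhost:9092.
admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create a topic for a unit test, inspect it, then clean it up afterwards.
admin.create_topics([NewTopic(name="test-topic", num_partitions=1, replication_factor=1)])
print(admin.list_topics())   # topic metadata, e.g. for a monitoring dashboard
admin.delete_topics(["test-topic"])
admin.close()
```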
Confluent-kafka:
There is no doubt that confluent-kafka performs best of the three libraries. Its API is carefully designed, with the same parameter names and default values as the original Apache Kafka clients, so you can easily map it back to the original configuration parameters. Personally, I like the flexibility of customizing consumer behavior. It is also actively developed and supported by Confluent.
The downside is that Windows users may need some time to get it working, and debugging can be tricky because of the C extension.
Kafka-python:
kafka-python is a pure Python library without C extensions. The API is carefully designed and easy for beginners to use. It is also an actively developed project.
The disadvantage of kafka-python is its speed. If you really care about performance, it is recommended that you use confluent-kafka instead.
Pykafka:
Compared with kafka-python and confluent-kafka, pykafka sees less development activity; its release history shows no update since November 2018. In addition, pykafka has a different API design and uses different default parameter values, which may not be intuitive at first.
These are the three libraries you should know when working with Apache Kafka from Python. I hope some of this knowledge proves useful in your daily work.