How to use Scala to develop Apache Kafka

This article introduces the essentials of "how to use Scala to develop Apache Kafka". Many people run into the same pitfalls in real-world projects, so let the editor walk you through how to handle these situations. I hope you read it carefully and get something out of it!

Apache Kafka is a popular distributed streaming platform. Thousands of companies, such as New Relic, Uber and Square, use it to build scalable, high-throughput, reliable real-time streaming systems. For example, New Relic's Kafka clusters process more than 15 million messages per second, with an aggregate data rate approaching 1 Tbps.

Kafka is popular with application developers and data scientists because it greatly simplifies working with data streams. However, putting Kafka into practice can be more complicated. A high-throughput publish/subscribe model with automatic data-retention limits is not much use if consumers cannot keep up with the stream and messages disappear before they are ever seen. Likewise, it is of little use if the system hosting the stream cannot scale to meet demand or is unreliable.

To reduce this complexity, the author divides the potential problem areas into 4 categories, 20 items in all, to make them easier to digest:

Partitions

Consumers

Producers

Brokers

Kafka is an efficient distributed messaging system that provides built-in data redundancy and resilience while retaining high throughput and scalability. It includes automatic data-retention limits, making it well suited to applications that treat data as a stream, and it also supports "compacted" streams that model a map of key-value pairs.

Before turning to the best practices, you should be familiar with some key terms:

Message: a record or unit of data in Kafka. Each message has a key and a value, and optionally headers.

Producer: a producer publishes messages to Kafka topics. The producer decides which topic partition to publish to, either randomly (round-robin) or via a partitioning algorithm based on the message key.

Broker: Kafka runs as a distributed system, or cluster, and each node in the cluster is called a broker.

Topic: a topic is the category to which data records, or messages, are published. Consumers subscribe to a topic to read the data written to it.

Topic partition: topics are divided into partitions, and each message is given an offset. Each partition is typically replicated at least once or twice. Each partition has a leader and one or more replicas (copies of the data) that live on followers, guarding against the failure of a broker. All brokers in the cluster are both leaders and followers, but a broker has at most one replica of any given topic partition; the leader handles all reads and writes.

Offset: each message within a partition is assigned an offset, a monotonically increasing integer that serves as the message's unique identifier within the partition.

Consumer: consumers read messages from Kafka topics by subscribing to topic partitions; the consuming application then processes the messages to accomplish whatever work is at hand.

Consumer group: consumers can be organized into consumer groups, which are assigned topic partitions so that work is balanced across all members of the group. Within a consumer group, consumers operate in load-balanced mode; in other words, each message is seen by exactly one consumer in the group. If a consumer leaves, its partitions are reassigned to the other consumers in the group, a process called rebalancing. If a group has more consumers than partitions, some consumers sit idle; if it has fewer consumers than partitions, some consumers read from multiple partitions.

Lag: a consumer lags when it cannot read from a partition as fast as messages are produced; lag is expressed as the number of offsets the consumer is behind the head of the partition. The time needed to recover from a lagging state depends on how quickly the consumer can consume messages per second:

time to catch up = lag in messages / (consume rate per second - produce rate per second)
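
For example, a consumer that is 1,000,000 messages behind, consuming 80,000 messages per second while producers write 30,000 per second, needs roughly 1,000,000 / (80,000 - 30,000) = 20 seconds to catch up.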

Part one: best practices for partitions!

1. Understand the data rate of your partitions to ensure you provision the right amount of retention space. The data rate of a partition is the rate at which data is produced; in other words, the average message size multiplied by the number of messages per second. The data rate dictates how much retention space, in bytes, is needed to guarantee retention for a given period. If you don't know the data rate, you cannot correctly calculate the space required to meet a time-based retention goal. The data rate also specifies the minimum performance a single consumer must sustain to avoid lagging.
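
As an illustrative sizing: a partition averaging 1 kB per message at 10,000 messages per second produces about 10 MB/s, or roughly 864 GB per day, so a 7-day retention goal needs on the order of 6 TB of disk per replica of that partition.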

2. Unless you have other architectural requirements, use random partitioning when writing to topics. When operating at scale, uneven data rates between partitions can be difficult to manage, for three reasons (a producer sketch follows this list):

First, consumers of the "hotspot" (higher-throughput) partitions must process more messages than the other consumers in the consumer group, which can lead to processing and network bottlenecks.

Second, topic retention must be sized for the partition with the highest data rate, which can increase disk usage across the other partitions in the topic.

Finally, achieving an optimal balance of partition leadership is more complex than simply spreading leadership across all brokers. A "hotspot" partition may carry ten times the weight of another partition in the same topic.
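
To make the distinction concrete, here is a minimal Scala sketch against the Java client, assuming a broker at localhost:9092 and an illustrative topic named events. Records sent without a key are spread across partitions by the default partitioner; records sent with a key always land in that key's partition.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object PartitioningSketch extends App {
  val props = new Properties()
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // No key: the default partitioner spreads records across partitions,
  // keeping per-partition data rates roughly even.
  producer.send(new ProducerRecord[String, String]("events", "payload-without-key"))

  // With a key: every record with the same key lands in the same partition.
  // Per-key ordering is preserved, but skewed keys create "hotspot" partitions.
  producer.send(new ProducerRecord[String, String]("events", "user-42", "payload-with-key"))

  producer.close()
}
```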

Part two: best practices for consumers!

3. If your consumers are running a version of Kafka older than 0.10, upgrade. In version 0.8.x, consumers use Apache ZooKeeper for consumer-group coordination, and a number of known bugs can result in long-running rebalances or even failures of the rebalance algorithm (we call these "rebalance storms"). During a rebalance, one or more partitions are assigned to each consumer in the group; in a rebalance storm, partition ownership bounces continually among consumers, preventing any consumer from making real progress.

4. Tune your consumer socket buffers for high-speed ingest. In Kafka 0.10.x, the parameter is receive.buffer.bytes and the default is 64 kB. In Kafka 0.8.x, the parameter is socket.receive.buffer.bytes and the default is 100 kB. Both defaults are too small for high-throughput environments, especially if the network's bandwidth-delay product between the broker and the consumer is larger than that of a local area network (LAN). For high-bandwidth networks (10 Gbps or more) with latencies of 1 millisecond or more, consider setting the socket buffer to 8 or 16 MB. If memory is scarce, consider 1 MB; you can also use a value of -1, which lets the underlying operating system tune the buffer size to network conditions. However, automatic tuning may ramp up too slowly for consumers that need to start "hot".
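
As a hedged Scala sketch of this tuning using the 0.10.x-and-later property names (the broker address and group id are illustrative):

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

val consumerProps = new Properties()
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "high-throughput-group")   // illustrative group id
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
// 16 MB socket buffer for a high-bandwidth, higher-latency link;
// "-1" would delegate sizing to the operating system instead.
consumerProps.put(ConsumerConfig.RECEIVE_BUFFER_CONFIG, (16 * 1024 * 1024).toString)
```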

5. Design high-throughput consumers to apply back pressure when warranted: consume only what you can process efficiently, rather than ingesting so much that the process stalls and drops out of the consumer group. Consumers should use fixed-size buffers (see the Disruptor pattern), preferably off-heap if running on a Java virtual machine (JVM). A fixed-size buffer keeps a consumer from dragging so much data onto the heap that the JVM spends all its time on garbage collection instead of doing what you actually want: processing messages.
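
One way to get that back pressure, sketched below with an illustrative topic name: the poll loop hands records to downstream workers through a bounded queue, so a full queue throttles fetching instead of letting data pile up. (ArrayBlockingQueue is on-heap; an off-heap ring buffer, as in Disruptor, follows the same principle.)

```scala
import java.time.Duration
import java.util.Collections
import java.util.concurrent.ArrayBlockingQueue
import org.apache.kafka.clients.consumer.{ConsumerRecord, KafkaConsumer}
import scala.jdk.CollectionConverters._

// Fixed-size hand-off buffer: when downstream workers fall behind,
// put() blocks and the poll loop stops fetching, instead of piling
// unprocessed records onto the heap.
val buffer = new ArrayBlockingQueue[ConsumerRecord[String, String]](10000)

def pollLoop(consumer: KafkaConsumer[String, String]): Unit = {
  consumer.subscribe(Collections.singletonList("events")) // assumed topic name
  while (true) {
    val records = consumer.poll(Duration.ofMillis(500))
    records.asScala.foreach(r => buffer.put(r)) // blocks when the buffer is full
  }
}
```

Note that if put blocks for longer than max.poll.interval.ms, the consumer will still be evicted from the group, so size the queue and worker pool accordingly.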

6. When running consumers on the JVM, beware of the impact garbage collection can have on them. For example, long GC pauses can drop ZooKeeper sessions or cause the consumer group to rebalance. The same applies to brokers: a broker whose GC pauses run too long risks falling out of the cluster.
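
The consumer settings most sensitive to such pauses can be given headroom; a hedged sketch with illustrative values:

```scala
import java.util.Properties
import org.apache.kafka.clients.consumer.ConsumerConfig

val pauseTolerantProps = new Properties()
// If a GC pause approaches session.timeout.ms, the group coordinator
// assumes the consumer died and triggers a rebalance; leave headroom.
pauseTolerantProps.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000")
// Heartbeats come from a background thread; keep the interval well
// below the session timeout (one third is a common rule of thumb).
pauseTolerantProps.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "10000")
// Upper bound on the gap between poll() calls before eviction from the group.
pauseTolerantProps.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000")
```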

Part three: best practices for producers!

7. Configure your producer to wait for acknowledgements. This is how the producer knows a message has actually made it to the partition on the broker. In Kafka 0.10.x the setting is acks; in 0.8.x it is request.required.acks. Kafka provides fault tolerance through replication, so the failure of a single broker or a change of partition leader does not affect availability. If you configure the producer without acks (known as "fire and forget"), messages can be silently lost.

8. Configure the number of producer retries. The default value is 3, which is often too low. The right value depends on your requirements; for applications that cannot tolerate data loss, consider Integer.MAX_VALUE (effectively infinite). This guards against the situation where the broker leading a partition cannot respond to a produce request right away.

9. For high-throughput producers, tune the buffer sizes, particularly buffer.memory and batch.size (counted in bytes). Because batch.size applies per partition, producer performance and memory usage correlate with the number of partitions in the topic. The right values depend on several factors: the producer's data rate (both the size and number of messages), the number of partitions being produced to, and the amount of memory available. Keep in mind that bigger buffers are not always better: if the producer stalls for some reason (say, one leader is slow to respond with acknowledgements), more data buffered on the heap means more garbage collection.
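
Items 7 through 9 combined into one hedged producer configuration sketch; every value is illustrative and should be sized against your own data rate and partition count:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

val producerProps = new Properties()
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
// Item 7: wait for acknowledgement. "all" waits for the full in-sync
// replica set, the strongest setting; "1" waits for the leader only.
producerProps.put(ProducerConfig.ACKS_CONFIG, "all")
// Item 8: effectively infinite retries for loss-intolerant applications.
producerProps.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE.toString)
// Item 9: batch and buffer sizes in bytes. batch.size applies per partition,
// so total memory pressure scales with the partition count.
producerProps.put(ProducerConfig.BATCH_SIZE_CONFIG, (64 * 1024).toString)
producerProps.put(ProducerConfig.BUFFER_MEMORY_CONFIG, (64 * 1024 * 1024).toString)
```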

10. Instrument your application to track metrics such as the number of messages produced, the average produced message size, and the number of messages consumed.
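
The Java client already exposes many such numbers. A small Scala sketch that dumps producer-level metrics (metric names such as record-send-rate come from the client's standard producer-metrics group; verify them against your client version):

```scala
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.producer.KafkaProducer

def logProducerMetrics(producer: KafkaProducer[String, String]): Unit =
  producer.metrics().asScala.foreach { case (name, metric) =>
    // e.g. record-send-rate and outgoing-byte-rate in the producer-metrics group
    if (name.group() == "producer-metrics")
      println(s"${name.name()} = ${metric.metricValue()}")
  }
```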

Part four: best practices for brokers!

11. Compacted topics require memory and CPU resources on the brokers: log compaction needs both heap and CPU cycles on the broker to complete successfully, and failed compaction puts the broker at risk of a partition with unbounded growth. You can tune log.cleaner.dedupe.buffer.size and log.cleaner.threads on the broker, but keep in mind that both values affect broker heap usage. If a broker throws an OutOfMemoryError, it shuts down and may lose data. The buffer size and thread count depend on the number of topic partitions to be cleaned and on the data rate and key size of the messages in those partitions. As of Kafka 0.10.2.1, monitoring the log-cleaner log file for ERROR entries is the surest way to detect problems with the log cleaner threads.
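
A hedged server.properties excerpt for that tuning; both sizes are illustrative, and both settings consume broker heap:

```properties
# Dedupe buffer shared by the cleaner threads (bytes); allocated from broker heap.
log.cleaner.dedupe.buffer.size=134217728
# More threads clean more compacted partitions in parallel, at CPU cost.
log.cleaner.threads=2
```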

12. Monitor broker network throughput. Make sure you track both transmit (TX) and receive (RX), as well as disk I/O, disk space, and CPU usage. Capacity planning is a key part of maintaining cluster performance.

13. Distribute partition leadership across the brokers in the cluster; leadership demands a great deal of network I/O. For example, when running with replication factor 3, a leader must receive each partition's data, pass it to all replicas, and then transmit it to every consumer that wants it. So in this example, being a leader costs at least four times as much network I/O as being a follower. Moreover, the leader may also have to read from disk, whereas the follower only writes.

14. Don't neglect to monitor brokers for in-sync replica (ISR) shrinks, under-replicated partitions, and unpreferred leaders. These are signs of potential trouble in the cluster. For example, frequent ISR shrinks on a single partition may indicate that the partition's data rate exceeds the leader's ability to serve its consumer and replication threads.

15. Modify the Apache Log4j properties as needed. Kafka broker logging can eat excessive disk space. However, do not give up logging entirely: broker logs may be the best, and sometimes the only, way to reconstruct a sequence of events after an incident.
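
As an illustrative Log4j 1.x excerpt (standard Log4j syntax; check the logger names against the log4j.properties your broker ships with), raising levels shrinks volume without going silent:

```properties
# Keep broker logs informative but not exhaustive.
log4j.logger.kafka=INFO
# The request logger is especially chatty; warn-level is usually enough.
log4j.logger.kafka.request.logger=WARN
```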

16. Disable automatic topic creation, or establish a clear policy for it, and clean up unused topics regularly. For example, if a topic sees no messages for x days, consider it defunct and remove it from the cluster, so you avoid accumulating extra metadata that the cluster must manage.
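
The broker-side switch for the first half of this advice is auto.create.topics.enable=false. For the cleanup half, a hedged Scala sketch using the AdminClient API (available since Kafka 0.11; the topic name is illustrative):

```scala
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}

val adminProps = new Properties()
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address
val admin = AdminClient.create(adminProps)
try {
  // Remove a topic that has seen no traffic for the agreed number of days.
  admin.deleteTopics(Collections.singletonList("stale-topic")).all().get()
} finally admin.close()
```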

17. For sustained, high-throughput brokers, provision enough memory to avoid reads hitting the disk subsystem; partition data should be served straight from the operating system's file-system cache whenever possible. However, this means consumers must be able to keep up: a lagging consumer forces the broker to read from disk.

18. For large clusters with high-throughput service-level objectives (SLOs), consider isolating topics onto a subset of brokers. How you decide which topics to isolate depends on business requirements; for example, if multiple online transaction processing (OLTP) systems share a cluster, isolating each system's topics onto its own subset of brokers helps limit the potential blast radius of an incident.

19. Older clients reading newer topic message formats (and vice versa) place an extra burden on the brokers, which must convert the formats on the client's behalf; avoid this whenever possible.

20. Don't assume that testing a broker on a local desktop represents the performance you will see in production. Testing over the loopback interface against a partition with replication factor 1 is a very different topology from most production environments: the loopback hides network latency, and without replication the time needed to receive a leader's acknowledgement bears little resemblance to the real thing.

That concludes "how to use Scala to develop Apache Kafka". Thank you for reading. If you want to learn more about the industry, you can follow the site, where the editor will keep publishing practical, high-quality articles for you!
