
How to build a large-scale Kafka cluster

2025-02-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article introduces how to build a large-scale Kafka cluster, using Uber's multi-region Kafka infrastructure as the running example. It walks through the architecture, the two consumption modes it supports, and the offset-management service that makes failover possible.

-Uber's Kafka ecosystem-

Uber operates one of the largest Kafka deployments in the world, processing trillions of messages and several petabytes of data every day. As shown in figure 1, Kafka is the cornerstone of Uber's technology stack, and on top of it we have built a complex ecosystem supporting many different workflows. It includes a publish/subscribe message bus that delivers event data from the rider and driver apps, support for stream-processing platforms such as Apache Samza and Apache Flink, streaming of database changelogs to downstream subscribers, and ingestion of all kinds of data into Uber's Hadoop data lake.

Figure 1: Uber's Kafka ecosystem

To build a scalable, reliable, high-performance, and easy-to-use messaging platform on top of Kafka, we have had to overcome many challenges. In this article, we highlight one of them, disaster recovery in the face of cluster outages, and share how we built a multi-region Kafka infrastructure.

-Uber's multi-region Kafka deployment-

Providing business resilience and continuity is a top priority for Uber. We maintain detailed disaster recovery plans to minimize the impact of natural and man-made disasters such as power outages, catastrophic software failures, and network outages. We use a multi-region deployment strategy, running services along with their backups across distributed data centers. When the physical infrastructure of one region is unavailable, services can still run from other regions.

We built a multi-region Kafka architecture to provide data redundancy and support region-level failover. Many services in Uber's technology stack rely on Kafka for this failover: they sit downstream of Kafka and assume that the data in Kafka is available and reliable.

Figure 2 depicts the multi-region Kafka architecture. There are two types of clusters: producers publish messages locally to regional clusters, and messages are replicated from the regional clusters into aggregation clusters that provide a global view. For simplicity, figure 2 shows clusters in only two regions.

Figure 2: Kafka replication topology across two regions

Within each region, producers always produce locally for better performance. When the local Kafka cluster is unavailable, the producer fails over to another region and produces to that region's regional cluster instead.
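The produce-locally-then-fail-over behavior can be sketched in Python. This is a minimal, illustrative sketch with an in-memory stand-in for a Kafka cluster; the class and method names are hypothetical, not Uber's actual client API.

```python
class InMemoryCluster:
    """Stand-in for a regional Kafka cluster (illustrative only)."""
    def __init__(self, up=True):
        self.up = up
        self.log = []  # (topic, message) pairs this cluster accepted

    def produce(self, topic, message):
        if not self.up:
            raise ConnectionError("cluster unreachable")
        self.log.append((topic, message))


class RegionalProducer:
    """Produce to the local regional cluster first; on failure, fail
    over to another region's regional cluster."""
    def __init__(self, local_region, clusters):
        self.local_region = local_region
        self.clusters = clusters  # region name -> cluster

    def produce(self, topic, message):
        # Try the local region first for better performance, then the rest.
        order = [self.local_region] + [r for r in self.clusters
                                       if r != self.local_region]
        for region in order:
            try:
                self.clusters[region].produce(topic, message)
                return region  # the region that accepted the message
            except ConnectionError:
                continue  # this region is down: try the next one
        raise RuntimeError("no regional cluster available")


clusters = {"region-A": InMemoryCluster(up=False),  # local cluster is down
            "region-B": InMemoryCluster()}
producer = RegionalProducer("region-A", clusters)
accepted_in = producer.produce("rider-events", "trip-requested")
```

Here the local cluster in region-A is down, so the message lands in region-B's regional cluster, mirroring the failover path described above.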

A key part of this architecture is message replication. Messages are replicated asynchronously from regional clusters to aggregation clusters in other regions.

-Consuming messages from multi-region Kafka clusters-

Consuming messages from multi-region clusters is more complex than producing them. Multi-region Kafka clusters support two consumption modes.

Active/Active mode

The most common mode is Active/Active consumption, in which consumers in each region consume the same topic from their local aggregation cluster. Many Uber applications use this pattern to consume from the multi-region Kafka cluster rather than connecting directly to other regions. When one region fails, consumers can switch to another region, because the Kafka streams in both regions are available and contain the same data.
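The consumer side of this pattern can be sketched the same way, assuming both regions' aggregation clusters hold the same replicated data. The classes and names are in-memory stand-ins for illustration only.

```python
class AggregateCluster:
    """Stand-in for an aggregation cluster (illustrative only)."""
    def __init__(self, messages, up=True):
        self.messages = messages
        self.up = up

    def read(self, offset):
        if not self.up:
            raise ConnectionError("aggregate unreachable")
        return self.messages[offset:]


def consume_with_failover(aggregates, preferred_region, offset=0):
    """Consume from the aggregation cluster in the preferred (local)
    region; if it is down, switch to another region. This only works
    because every aggregate holds the same replicated data."""
    order = [preferred_region] + [r for r in aggregates
                                  if r != preferred_region]
    for region in order:
        try:
            return region, aggregates[region].read(offset)
        except ConnectionError:
            continue  # this region's aggregate is down: try another
    raise RuntimeError("no aggregation cluster reachable")


data = ["A1", "B1", "A2", "B2"]  # same replicated data in both regions
aggregates = {"region-A": AggregateCluster(data, up=False),
              "region-B": AggregateCluster(data)}
region, messages = consume_with_failover(aggregates, "region-A")
```

With region-A's aggregate down, the consumer transparently reads the identical stream from region-B.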

For example, figure 3 shows how Uber's dynamic pricing service (surge pricing) uses Active/Active mode to build its disaster recovery plan. Prices are computed from recent ride-request data in nearby areas. All ride-request events are sent to the regional Kafka clusters and then replicated into the aggregation clusters. In each region, a complex, memory-intensive Flink job computes prices for different areas. An active-active coordination service then designates one region as primary, and the update service in the primary region writes the pricing results to an active-active database for fast lookup.

Figure 3: Active/Active consumption architecture

When a disaster hits the primary region, the active-active service designates another region as primary, and surge-price computation fails over to it. Note that the Flink job's computational state is too large to replicate synchronously across regions, so its state must be recomputed from the input messages in the aggregation cluster.

A key lesson from practice: reliable multi-region infrastructure services such as Kafka greatly simplify the development of business-continuity plans for applications. Applications can store their state in the infrastructure layer and thereby become stateless, leaving the complexities of state management, such as cross-region synchronization and replication, to the infrastructure services.

Active/Passive mode

The other multi-region consumption pattern is Active/Passive mode: only one consumer (identified by a unique name) is allowed to consume from the aggregation cluster in one region (the primary region) at a time. The multi-region Kafka cluster tracks the consumer's progress in the primary region (represented by offsets) and replicates the offsets to other regions. If the primary region fails, the consumer fails over to another region and resumes from its recorded progress. Active/Passive mode is typically used by services that require strong consistency, such as payment processing and auditing.

With Active/Passive mode, synchronizing consumer offsets across regions is a key problem. When a consumer fails over to another region, it must reset its offsets so it can resume where it left off. Because many Uber services cannot tolerate data loss, the consumer cannot simply resume from the high watermark (the latest message); and to avoid a large backlog, it cannot resume from the low watermark (the earliest message) either. Moreover, messages may arrive in the aggregation clusters out of order: because of cross-region replication lag, a message reaches the local aggregation cluster faster than a remote one, so the message order in the two aggregation clusters can differ. For example, in figure 4a, messages A1, A2, B1, and B2 are published almost simultaneously to the regional clusters of regions A and B, but after aggregation they appear in different orders in the two aggregation clusters.

Figure 4: a. Cross-region message replication; b. Message-replication checkpoints
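The reordering in figure 4a can be reproduced with a toy simulation: replication to the local aggregate is faster than to the remote one, so each aggregate sees its own region's messages first. The delay constants below are made up purely for illustration.

```python
def aggregate_order(events, local_region, local_delay=1, remote_delay=5):
    """events: (publish_time, source_region, message) tuples.
    Returns messages in the order they arrive at this region's
    aggregation cluster, given per-route replication delays."""
    arrivals = []
    for t, region, msg in events:
        # Local messages replicate quickly; remote ones lag behind.
        delay = local_delay if region == local_region else remote_delay
        arrivals.append((t + delay, msg))
    return [msg for _, msg in sorted(arrivals)]


# A1/A2 and B1/B2 are published almost simultaneously in regions A and B.
events = [(0, "A", "A1"), (1, "A", "A2"), (0, "B", "B1"), (1, "B", "B2")]
order_in_a = aggregate_order(events, "A")
order_in_b = aggregate_order(events, "B")
```

The two aggregates end up with different orders for the same four messages, which is exactly why a simple offset copy between regions would not work.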

To manage offset mappings across regions, we developed a sophisticated offset-management service whose architecture is shown in figure 5. When uReplicator copies messages from a source cluster to a destination cluster, it periodically checkpoints the source-to-destination offset mapping. For example, figure 4b shows the offset mappings for the replication in figure 4a. The first row of the table records that message A2, at offset 1 in region A's regional cluster, was replicated to offset 1 in region A's aggregation cluster. The remaining rows record checkpoints for the other replication routes.
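A replicator might record such checkpoints along the lines of the sketch below, in the spirit of figure 4b. This is not uReplicator's actual code; the route naming and the checkpoint interval are assumptions made for illustration.

```python
def replicate_with_checkpoints(source_log, dest_log, checkpoints, route,
                               interval=2):
    """Copy every message from source_log onto the end of dest_log,
    recording a (source_offset, dest_offset) checkpoint every
    `interval` messages for the given replication route."""
    for src_offset, message in enumerate(source_log):
        dest_log.append(message)
        dst_offset = len(dest_log) - 1
        if src_offset % interval == 0:
            # One row of the figure 4b table: "when the destination had
            # reached dst_offset, the source was consumed up to src_offset".
            checkpoints.setdefault(route, []).append((src_offset, dst_offset))


# Regional cluster A is copied into region B's aggregate, which has
# already received B1 from its local regional cluster.
agg_b = ["B1"]
checkpoints = {}
replicate_with_checkpoints(["A1", "A2", "A3"], agg_b, checkpoints,
                           route=("regional-A", "agg-B"))
```

After the copy, the checkpoint list for the route holds (source offset, destination offset) pairs that the offset-management service can later use for failover mapping.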

The offset-management service stores these checkpoints in an active-active database and uses them to compute the offset mapping for a given Active/Passive consumer. Meanwhile, an offset-sync job periodically synchronizes the offsets between the two regions. When an Active/Passive consumer fails over from one region to another, it can fetch the latest mapped offset and use it to resume consumption.

Figure 5: Offset-management service architecture

The offset-mapping algorithm works as follows: in the aggregation cluster that the active consumer is consuming, find the most recent checkpoint for each regional cluster. Then, for the source offset in each of those regional checkpoints, find the corresponding checkpoint into the other region's aggregation cluster. Finally, take the smallest of the resulting offsets in the other region's aggregation cluster.

In figure 6, suppose the active consumer's current progress is message A3 in region B's aggregation cluster (offset 6). According to the checkpoint table on the right, the two most recent checkpoints are A2 at offset 3 (blue) and B4 at offset 5 (red), corresponding to offset 1 (blue) in regional cluster A and offset 3 (red) in regional cluster B, respectively. These source offsets map to offset 1 (blue) and offset 7 (red) in region A's aggregation cluster. Following the algorithm, the passive consumer (black) takes the smaller of the two, offset 1.

Figure 6: An Active/Passive consumer fails over from one region to another
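The three steps of the algorithm can be sketched in Python using the figure 6 numbers. Checkpoints are modeled as (source offset, destination offset) pairs per source regional cluster; this data layout is our own simplification of figures 4b and 6, not the service's actual schema.

```python
def map_offset(active_ckpts, passive_ckpts, consumed_offset):
    """active_ckpts: per source region, (src, dst) checkpoints into the
    aggregate the ACTIVE consumer reads; passive_ckpts: the same, into
    the aggregate in the failover region; consumed_offset: the active
    consumer's progress. Returns the offset to resume from."""
    candidates = []
    for region, ckpts in active_ckpts.items():
        # Step 1: most recent checkpoint at or below the consumed offset.
        reached = [(src, dst) for src, dst in ckpts if dst <= consumed_offset]
        if not reached:
            continue
        src_reached, _ = max(reached, key=lambda c: c[1])
        # Step 2: corresponding checkpoint into the other region's aggregate.
        mapped = [dst for src, dst in passive_ckpts.get(region, [])
                  if src <= src_reached]
        if mapped:
            candidates.append(max(mapped))
    # Step 3: take the smallest offset, so no message is skipped
    # (some may be re-read, trading duplicates for zero data loss).
    return min(candidates)


# Figure 6: the active consumer is at offset 6 in region B's aggregate.
agg_b_ckpts = {"regional-A": [(1, 3)],  # A2: regional-A off. 1 -> agg-B off. 3
               "regional-B": [(3, 5)]}  # B4: regional-B off. 3 -> agg-B off. 5
agg_a_ckpts = {"regional-A": [(1, 1)],  # regional-A off. 1 -> agg-A off. 1
               "regional-B": [(3, 7)]}  # regional-B off. 3 -> agg-A off. 7
resume_at = map_offset(agg_b_ckpts, agg_a_ckpts, consumed_offset=6)
```

The candidate offsets in region A's aggregate are 1 and 7; taking the minimum yields offset 1, matching the walkthrough above. Choosing the minimum is what guarantees at-least-once delivery across the failover.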

This concludes our study of how to build a large-scale Kafka cluster. Pairing the theory above with hands-on practice is the best way to make it stick, so give it a try. To keep learning, continue to follow this site for more practical articles.
