How to Use Connect for Data Migration in EMR Kafka


Updated: 2025-04-22. From: SLTechnology News&Howtos


This article shows how to use Kafka Connect to migrate data between Kafka clusters in EMR.

1. Background

In stream processing, you often need to synchronize data between Kafka and other systems, or to migrate data between Kafka clusters. EMR Kafka Connect makes both tasks quick and easy.

Kafka Connect is a scalable, reliable tool for streaming data quickly between Kafka and other systems. For example, you can use Kafka Connect to capture a database's binlog data and move it into a Kafka cluster, which keeps the database's data synchronized and makes it available to downstream stream-processing systems. Kafka Connect also provides a REST API that makes it easy to create and manage connectors.

Kafka Connect has two operating modes: standalone and distributed. In standalone mode, all workers run in a single process. Distributed mode is more scalable and fault-tolerant; it is the most commonly used mode and is recommended for production environments.

This article describes how to use the REST API of EMR Kafka Connect, running in distributed mode, to migrate data between Kafka clusters.

2. Environment preparation

Create two EMR clusters of cluster type Kafka. EMR Kafka Connect is installed on task nodes, so the destination Kafka cluster for the migration must be created with a task node. After the cluster is created, the EMR Kafka Connect service on the task node starts by default and listens on port 8083.
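One quick way to confirm that the Connect service is listening on port 8083 is to query the root path of its REST API, which returns the worker's version information as JSON. The sketch below assumes the task node is reachable as emr-worker-3 (the host name used later in this article). CONNECT_URL and connect_info are illustrative names, not part of EMR.

```shell
#!/usr/bin/env bash
# Base URL of the Kafka Connect REST API. The host name is an assumption;
# replace it with the address of your own task node.
CONNECT_URL="${CONNECT_URL:-http://emr-worker-3:8083}"

# GET / on a Connect worker returns a small JSON document that
# includes the worker version.
connect_info() {
  curl -s "${CONNECT_URL}/"
}
```

Running connect_info from a node that can reach the task node should print the JSON document; an empty result suggests the service is not running or the port is blocked.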

Make sure that the two clusters can reach each other over the network. For details on creating clusters, see Create Cluster: https://help.aliyun.com/document_detail/28088.html.
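A simple way to check connectivity from the destination task node is to test the ports this article relies on: 9092 for the source brokers and 2181 for ZooKeeper. This is a minimal sketch that uses bash's built-in /dev/tcp redirection and the coreutils timeout command. port_open and check_endpoint are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# /dev/tcp is a bash feature, so the probe is run through bash explicitly.
port_open() {
  local host="$1" port="$2"
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print a one-line result for a single endpoint.
check_endpoint() {
  if port_open "$1" "$2"; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 NOT reachable"
  fi
}
```

For example, call check_endpoint with the source broker IP and 9092, then with the source ZooKeeper IP and 2181.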

3. Data migration

3.1 Preparation

The configuration file path for EMR Kafka Connect is /etc/ecm/kafka-conf/connect-distributed.properties.

In the source Kafka cluster, create the topic that needs to be synchronized, for example a topic named connect.
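The topic can be created with the kafka-topics.sh tool that ships with Kafka. The sketch below is only an illustration: the ZooKeeper address, the partition count of 10, and the replication factor of 1 are assumed values, and the --zookeeper option applies to older, ZooKeeper-based Kafka releases (newer releases take --bootstrap-server with a broker address instead).

```shell
#!/usr/bin/env bash
# Placeholder values for illustration; adjust them to your source cluster.
ZK_CONNECT="${ZK_CONNECT:-emr-header-1:2181}"   # assumed ZooKeeper address
TOPIC="${TOPIC:-connect}"                       # example topic name
KAFKA_TOPICS="${KAFKA_TOPICS:-kafka-topics.sh}" # path to the Kafka CLI tool

# Create the source topic. Partition count and replication factor
# are arbitrary example values.
create_source_topic() {
  "${KAFKA_TOPICS}" --create \
    --zookeeper "${ZK_CONNECT}" \
    --partitions 10 \
    --replication-factor 1 \
    --topic "${TOPIC}"
}
```

Run create_source_topic on a node of the source cluster where the Kafka command-line tools are on the PATH.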

In addition, Kafka Connect stores offsets, configs, and task status in topics. The topic names are set by the three configuration items offset.storage.topic, config.storage.topic, and status.storage.topic in the configuration file. By default, Kafka Connect creates these three topics automatically, using the default partition count and replication factor.
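To see which internal topic names and storage settings a worker uses, you can filter the configuration file for the *.storage.* keys. This is a small sketch that assumes the file path given above; show_storage_settings is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Configuration path from this article; override it if yours differs.
CONNECT_CONF="${CONNECT_CONF:-/etc/ecm/kafka-conf/connect-distributed.properties}"

# Print every uncommented offset/config/status storage setting, which
# covers the topic names and any partition or replication overrides.
show_storage_settings() {
  grep -E '^(offset|config|status)\.storage\.' "${1:-$CONNECT_CONF}"
}
```

Calling show_storage_settings without arguments reads the default path; pass another file name to inspect a different configuration.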

3.2 Create a Kafka Connect connector

On a task node of the destination Kafka cluster (for example, the emr-worker-3 node), use the curl command to create a Kafka Connect connector from JSON data.

curl -X POST -H "Content-Type: application/json" --data '{
  "name": "connect-test",
  "config": {
    "connector.class": "EMRReplicatorSourceConnector",
    "key.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "src.kafka.bootstrap.servers": "${src-kafka-ip}:9092",
    "src.zookeeper.connect": "${src-kafka-curator-ip}:2181",
    "dest.zookeeper.connect": "${dest-kafka-curator-ip}:2181",
    "topic.whitelist": "${source-topic}",
    "topic.rename.format": "${dest-topic}",
    "src.kafka.max.poll.records": "300"
  }
}' http://emr-worker-3:8083/connectors

In the JSON data, the name field is the name of the connector to create, here connect-test. The config field must be set according to your environment. Its variables are described in the following table.

Field | Description
topic.whitelist | Topics to synchronize in the source Kafka cluster. Separate multiple topics with commas. Example: connect
topic.rename.format | Optional. Name of the topic in the destination Kafka cluster after synchronization. The default is ${topic.whitelist}.replica. For example, if the source topic is connect, the synchronized topic is connect.replica
src.kafka.bootstrap.servers | Broker address of the source Kafka cluster
src.zookeeper.connect | Private network IP of the node where the ZooKeeper service is installed in the source Kafka cluster
dest.zookeeper.connect | Private network IP of the node where the ZooKeeper service is installed in the destination Kafka cluster

3.3 View Kafka Connect

View all Kafka Connect connectors.
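Listing connectors corresponds to a GET request on the /connectors path of the REST API, which returns a JSON array of connector names. The sketch below reuses the emr-worker-3:8083 address from the create command; CONNECT_URL and list_connectors are illustrative names.

```shell
#!/usr/bin/env bash
# Connect REST endpoint used in this article; change the host if needed.
CONNECT_URL="${CONNECT_URL:-http://emr-worker-3:8083}"

# GET /connectors lists the names of all connectors on this Connect cluster.
list_connectors() {
  curl -s "${CONNECT_URL}/connectors"
}
```

After the connector from section 3.2 has been created, the list returned by list_connectors should include connect-test.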

View the status of the created connect-test connector.
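The status of a single connector is available from the /connectors/{name}/status path of the REST API. A minimal sketch, again assuming the emr-worker-3:8083 endpoint; connector_status is a hypothetical helper that defaults to the connect-test connector.

```shell
#!/usr/bin/env bash
# Connect REST endpoint used in this article; change the host if needed.
CONNECT_URL="${CONNECT_URL:-http://emr-worker-3:8083}"

# GET /connectors/<name>/status reports the state of the connector
# and of each of its tasks.
connector_status() {
  local name="${1:-connect-test}"
  curl -s "${CONNECT_URL}/connectors/${name}/status"
}
```

Call connector_status with no argument for connect-test, or pass another connector name.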

View the task information of the created connect-test connector.
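Task details come from the /connectors/{name}/tasks path, which returns the tasks of a connector together with their configuration. The sketch below makes the same endpoint assumption as before; connector_tasks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Connect REST endpoint used in this article; change the host if needed.
CONNECT_URL="${CONNECT_URL:-http://emr-worker-3:8083}"

# GET /connectors/<name>/tasks lists the tasks currently running
# for the connector, including each task's configuration.
connector_tasks() {
  local name="${1:-connect-test}"
  curl -s "${CONNECT_URL}/connectors/${name}/tasks"
}
```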

3.4 Data synchronization

Produce the data to be synchronized in the source Kafka cluster.
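One way to generate test data is the kafka-producer-perf-test.sh tool that ships with Kafka. This sketch sends 100000 records to the connect topic. The broker address, the throughput limit of 5000 records per second, and the record size of 1000 bytes are assumed example values, not settings taken from this article.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration; adjust them to your source cluster.
SRC_BROKERS="${SRC_BROKERS:-emr-worker-1:9092}"  # assumed source broker address
SRC_TOPIC="${SRC_TOPIC:-connect}"                # example source topic
PERF_PRODUCER="${PERF_PRODUCER:-kafka-producer-perf-test.sh}"

# Send 100000 generated records to the source topic.
# Throughput and record size are arbitrary example values.
produce_test_data() {
  "${PERF_PRODUCER}" \
    --topic "${SRC_TOPIC}" \
    --num-records 100000 \
    --throughput 5000 \
    --record-size 1000 \
    --producer-props "bootstrap.servers=${SRC_BROKERS}"
}
```

Run produce_test_data on a node of the source cluster where the Kafka command-line tools are on the PATH.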

3.5 View synchronization results

Consume the synchronized data in the destination Kafka cluster.
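A rough way to check the result is to read the destination topic from the beginning with kafka-console-consumer.sh and count the lines it prints. The sketch below assumes a destination broker at emr-worker-1:9092 and the default renamed topic connect.replica; substitute the value you set in topic.rename.format. The count is only meaningful if record payloads contain no newlines, and the --timeout-ms option may differ between Kafka versions.

```shell
#!/usr/bin/env bash
# Placeholder values for illustration; adjust them to your destination cluster.
DEST_BROKERS="${DEST_BROKERS:-emr-worker-1:9092}"  # assumed destination broker
DEST_TOPIC="${DEST_TOPIC:-connect.replica}"        # assumed destination topic
CONSOLE_CONSUMER="${CONSOLE_CONSUMER:-kafka-console-consumer.sh}"

# Read the topic from the start, stop after 10 seconds without new
# messages, and count the records printed to standard output.
count_synced_records() {
  "${CONSOLE_CONSUMER}" \
    --bootstrap-server "${DEST_BROKERS}" \
    --topic "${DEST_TOPIC}" \
    --from-beginning \
    --timeout-ms 10000 | wc -l
}
```

Run count_synced_records on a node of the destination cluster after the producer has finished.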

You can see that the 100000 records sent in the source Kafka cluster have been migrated to the destination Kafka cluster.

The above is how to use Connect to migrate data between Kafka clusters in EMR.

