2025-04-05 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
How do you synchronize Kafka data to MaxCompute? This article presents a detailed analysis and solution, in the hope of helping readers who want to solve this problem find a simpler approach.
First, background introduction
1. Experimental purpose
In daily work, many enterprises collect the behavior logs and business data generated by their apps or websites through Kafka and then process that data in two ways: offline processing and real-time processing. The data is generally delivered to MaxCompute, where models are built for business processing such as user profiling, sales rankings, and the regional distribution of orders. Once the data has been modeled, it is displayed in data reports.
2. Scheme description
There are two links for synchronizing Kafka data to MaxCompute. In the first, business data and behavior logs pass through Kafka, are uploaded to DataHub via Flume, then loaded into MaxCompute, and are finally displayed in QuickBI. In the second, business data and behavior logs pass through Kafka, then DataWorks, then MaxCompute, and are finally displayed in QuickBI.
This article demonstrates the second link: uploading Kafka data to MaxCompute through DataWorks. The upload from DataWorks to MaxCompute can be synchronized through two schemes: a custom resource group or an exclusive resource group. Custom resource groups are generally suitable for migrating data to the cloud over complex networks, while exclusive resource groups are mainly aimed at situations where the shared data-integration resources are insufficient.
Second, the specific operation process
1. Use and principle of the Kafka message queue
Kafka product overview: Message Queue for Apache Kafka is a distributed, high-throughput, scalable message queue service provided by Alibaba Cloud. It is commonly used in big data scenarios such as log collection, monitoring data aggregation, streaming data processing, and online and offline analysis. Message Queue for Apache Kafka provides a fully managed service for open-source Apache Kafka, addressing the long-standing pain points of the open-source product. The cloud-hosted Kafka is lower-cost, more elastic, and more reliable; users only need to focus on business development and do not need to deploy or operate the service themselves.
Introduction to the Kafka architecture: a typical Kafka cluster is divided into four parts. The Producer generates data and sends messages in push mode to the Kafka Broker of Message Queue for Apache Kafka; the messages can be page visits to a website, server logs, or system resource information such as CPU and memory metrics. The Kafka Broker is the server used to store messages, and it supports horizontal scaling: the more Kafka Broker nodes there are, the higher the throughput of the cluster. Within each topic, the broker organizes messages into partitions, and each partition has a leader and followers. The Consumer subscribes to and consumes data from the leader partition in pull mode, and an offset records the consumer's position within each partition. ZooKeeper manages the cluster configuration, elects the partition leader, and rebalances partition leadership when the Consumer Group membership changes.
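The partition/offset semantics described above can be illustrated with a minimal in-memory model (this is a toy sketch for intuition only, not the real Kafka client API):

```python
# Illustrative only: a minimal in-memory model of Kafka's partition/offset
# semantics. A partition is an append-only log; each message receives a
# monotonically increasing offset, and a consumer tracks its position by offset.

class TopicPartition:
    def __init__(self):
        self.log = []  # messages in append order; list index == offset

    def append(self, message):
        """Producer 'push': append a message and return its assigned offset."""
        self.log.append(message)
        return len(self.log) - 1

    def read_from(self, offset):
        """Consumer 'pull': read everything at or after the given offset."""
        return self.log[offset:]

# A producer pushes three messages into one partition.
p0 = TopicPartition()
for msg in ["page_view:/home", "page_view:/cart", "cpu:87%"]:
    p0.append(msg)

# A consumer that committed offset 1 resumes from there,
# while a new consumer starting at offset 0 reads the whole partition.
resumed = p0.read_from(1)
from_start = p0.read_from(0)
```

This is also the behavior behind the `beginOffset` special values discussed later: starting at offset 0 corresponds to consuming from the beginning, and starting at the last committed offset corresponds to resuming where the previous consumption stopped.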
Kafka message queue purchase and deployment: on the Message Queue for Apache Kafka product page, purchase an instance, selecting the billing method, region, instance type, disk, traffic, and message retention time according to your own circumstances. One important point is to select the matching region: if your MaxCompute project is in North China, try to choose North China as well. After activation, deployment is required: click Deploy and select the appropriate VPC and vSwitch.
When the deployment is complete, go to the Kafka Topic management page and click Create Topic to create your own topic. Name topics so that different lines of business (for example, finance versus e-commerce) can be distinguished. Then go to Consumer Group management and click Create Consumer Group to create the consumer group you need. The naming of the consumer group should also be standardized and, where possible, correspond to its topic.
Kafka whitelist configuration: after Kafka is installed and deployed, add the servers or products that need to access Kafka to the whitelist. The default access point shown in the figure below is the access endpoint.
2. Resource group introduction and its configuration
Background for custom resource groups: custom resource groups generally address network connectivity problems between an IDC and the cloud, since the local network differs from the cloud network. DataWorks can upload massive data to the cloud through its free transfer capability (the default task resource group), but the default resource group cannot meet requirements for high transfer speed, or reach data sources in complex network environments. In such cases, users can use a custom resource group to synchronize data to the cloud from a complex environment, solving the problem that the default DataWorks resource group cannot connect to the data source, or achieving a higher transmission speed. In short, the custom resource group mainly solves synchronization to the cloud in complex network environments, enabling data transmission between arbitrary network environments.
Configuring a custom resource group: this requires six steps. First, open the DataWorks console, open the workspace list, select the project space you need, and enter Data Integration; that is, confirm which project space your data integration belongs to. Then open the data source page and click Add Custom Resource Group (in the upper right corner of the page). Note that only the project administrator has permission to add a new custom resource group.
Third, make sure that Kafka and the custom resource group to be added belong to the same VPC; in this experiment an ECS instance sends messages to Kafka, so the two VPCs should be the same. Fourth, log in to the ECS instance (the machine serving as your custom resource group) and execute the command dmidecode | grep UUID to get the UUID of the ECS instance.
Fifth, fill in the server UUID obtained above along with the IP, CPU, and memory of the custom resource group machine. Finally, execute the relevant commands on the ECS instance; the Agent installation takes five steps, confirmed one by one. After step four, click Refresh to check whether the service is available. Once the resource group is added, run the connectivity test to verify that the addition succeeded.
Background of exclusive resource groups: some customers report insufficient resources when synchronizing from Kafka to MaxCompute; this can be solved by adding an exclusive resource group. In exclusive resource mode, the machine's physical resources (network, disk, CPU, memory, and so on) are completely dedicated, isolating resource usage not only between users but also between tasks in different workspaces. Exclusive resource groups also support flexible scale-out and scale-in, meeting the needs of resource exclusivity and flexible configuration. They can access VPC data sources in the same region as well as cross-region public-network RDS addresses.
Configuring an exclusive resource group: this mainly requires two steps. First, enter the resource list in the DataWorks console and click Add Exclusive Resource Group; there are exclusive resource groups for data integration and for scheduling, and here the exclusive data-integration resource group is selected. When purchasing, again choose the appropriate billing method, region, resources, memory, duration, and quantity.
After the purchase, bind the exclusive data-integration resource group to the VPC that Kafka belongs to: click VPC Binding and select the vSwitch (the availability zone is the most obvious distinguishing attribute) and security group corresponding to Kafka.
3. Synchronization process and matters needing attention
To synchronize Kafka to MaxCompute, you need to configure the relevant parameters and pay attention to the points below.
DataWorks data integration operation: enter the DataWorks interface, click create Business process, add a data synchronization node to the newly created business process, and then name it.
Open the data synchronization node, which has a Reader side and a Writer side; set the Reader data source to Kafka and the Writer data source to ODPS, then click Convert to Script Mode. Help on the Reader and Writer synchronization parameters can be opened here for easy reading and understanding.
Main parameters of Kafka Reader: the first is server, which is the default access point of Kafka mentioned above, in ip:port form; server is a required parameter. Next is topic, the Kafka topic carrying the source data once the Kafka deployment is complete; it is also required. The column parameter supports constant columns, data columns, and attribute columns. Constant and data columns are rarely needed: the complete synchronized message is generally stored in the value attribute column, and other information such as partition, offset, and timestamp can also be selected through attribute columns. Column is a required parameter.
keyType and valueType each support six data types; select the type matching the data being synchronized. Pay attention to the synchronization mode: whether to synchronize by message time or by consumption offset. Four parameters cover these ranges: beginDateTime, endDateTime, beginOffset, and endOffset. Choose one of beginDateTime and beginOffset as the start of consumption, and one of endDateTime and endOffset as the end. Note that synchronizing by date (beginDateTime and endDateTime) requires Kafka 0.10.2 or above. Also note that beginOffset has three special values: seekToBeginning, which consumes from the start of the partition; seekToLast, which consumes from the offset committed at the last consumption and therefore reads the data only once (whereas beginDateTime can read data repeatedly, subject to the message retention time); and seekToEnd, which starts from the end of the partition and therefore reads empty data.
skipExceedRecord is not very useful and does not need to be filled in. The partition parameter is unnecessary because the reader consumes all partitions of the topic together; it is not required. kafkaConfig allows other relevant Kafka configuration parameters to be passed as extensions; it is also not required.
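For reference, a script-mode Kafka Reader configuration covering the parameters above might be sketched as follows. All values here are placeholders (endpoint, topic, group.id, offsets), and the `__key__`/`__value__`-style attribute-column names should be verified against the current DataWorks documentation:

```json
{
    "parameter": {
        "server": "alikafka-xxx-vpc.alikafka.aliyuncs.com:9092",
        "topic": "testkafka",
        "column": ["__key__", "__value__", "__partition__", "__offset__", "__timestamp__"],
        "keyType": "ByteArray",
        "valueType": "ByteArray",
        "beginOffset": "seekToLast",
        "endOffset": "1000",
        "skipExceedRecord": "false",
        "kafkaConfig": {
            "group.id": "testkafka_consumer_group"
        }
    }
}
```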
Main parameters of MaxCompute Writer: datasource is the name of the data source; add an ODPS data source here. table is the name of the created target table to which the Kafka data is synchronized, with its fields defined correspondingly.
partition: if the table is partitioned, this must be configured down to the last partition level to determine the synchronization target; for a non-partitioned table it is left empty. column: map the fields one-to-one to the corresponding Kafka columns, since data can only be synchronized correctly when source and target fields correspond. truncate: controls whether synchronized data is written in append mode or overwrite mode; take care to avoid multiple DDL statements operating on one partition at the same time, or create partitions in advance before starting multiple concurrent jobs.
Putting it together, the synchronization script has a Kafka Reader side, a MaxCompute Writer side, and setting parameters. The Reader contains fields such as server, topic, column, keyType (e.g. ByteArray), beginOffset (e.g. seekToLast), endOffset, group.id, and kafkaConfig. The MaxCompute Writer side covers overwrite versus append, compression, viewing the source code, and the one-to-one mapping of the synchronized table and fields to the Kafka Reader side, with the value data being the most important to synchronize. The setting parameters are mainly errorLimit, the number of error records beyond which the job fails, and speed, which can throttle the transfer rate and concurrency.
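An abbreviated sketch of the Writer parameters and setting block might look like the following (the enclosing job structure is omitted; the data source name, partition expression, and column names are assumptions, while the table name testkafka3 matches the table queried later in this article):

```json
{
    "writer_parameter": {
        "datasource": "odps_first",
        "table": "testkafka3",
        "partition": "pt=${bizdate}",
        "column": ["msg_key", "msg_value", "msg_partition", "msg_offset", "msg_timestamp"],
        "truncate": true
    },
    "setting": {
        "errorLimit": { "record": "0" },
        "speed": { "throttle": false, "concurrent": 1 }
    }
}
```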
Writing producer code with the Kafka producer SDK: production data is ultimately sent to Kafka, and the user's production data can be inspected through the relevant code. The code covers reading the configuration information, the protocol, the serialization method, the request wait time, which topic to send to, and what kind of message to send, returning an acknowledgement after each message is sent.
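As a sketch of such a producer (using the kafka-python client rather than the article's exact SDK; the endpoint, topic, and record fields are placeholders):

```python
import json

# Producer configuration. The endpoint is a placeholder, not a value from
# the article; acks and request_timeout_ms correspond to the acknowledgement
# protocol and request wait time mentioned in the text.
PRODUCER_CONF = {
    "bootstrap_servers": "alikafka-xxx-vpc.alikafka.aliyuncs.com:9092",
    "acks": "all",                # wait for the broker to acknowledge
    "request_timeout_ms": 30000,  # request wait time
}
TOPIC = "testkafka"

def serialize(record):
    """Serialization method: encode a log record as UTF-8 JSON bytes."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def send_log(producer, record):
    """Send one serialized record; with kafka-python, producer.send returns
    a future whose .get() blocks until the broker acknowledges delivery."""
    return producer.send(TOPIC, value=serialize(record)).get(timeout=10)

# With a reachable broker you would run:
#   from kafka import KafkaProducer   # pip install kafka-python
#   producer = KafkaProducer(**PRODUCER_CONF)
#   send_log(producer, {"user": "u001", "action": "page_view"})
#   producer.flush()
```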
Running the packaged code on ECS (in the same availability zone as Kafka): schedule it with the crontab -e command, as shown in the figure below, to run at 17:00 every day. The figure below shows the message records after the logs are sent.
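The crontab entry might look like the following (the script and log paths are hypothetical):

```shell
# Open the crontab on the ECS instance with:
#   crontab -e
# then add an entry to run the producer daily at 17:00:
0 17 * * * /usr/bin/python /home/admin/kafka_producer.py >> /home/admin/producer.log 2>&1
```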
Creating the table on MaxCompute: go to the DataWorks business process page and create the target table, either using a DDL statement to create the synchronized table or creating fields for different tables based on your own business.
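A sketch of such a DDL statement is shown below. The table name testkafka3 matches the query used later in this article, but the column names are assumptions chosen to mirror the Kafka Reader's attribute columns:

```sql
-- Target table for the Kafka synchronization (column names are illustrative).
CREATE TABLE IF NOT EXISTS testkafka3 (
    msg_key       STRING,
    msg_value     STRING,
    msg_partition STRING,
    msg_offset    STRING,
    msg_timestamp STRING
) PARTITIONED BY (pt STRING);
```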
4. Development, testing and production deployment
Select a custom resource group (or exclusive data-integration resource group) for synchronization: as shown in the figure below, select Configure Task Resource Group in the upper right corner, choose the resource group according to your needs, and click Run. When execution completes, success is indicated and the number of synchronized data records is shown; the synchronization itself is then essentially complete.
Querying the synchronization results: in the DataWorks interface, run a query on a temporary node, such as select * from testkafka3;, to view the synchronized data. If the data is present, the test has succeeded.
Setting scheduling parameters: after developing the data synchronization in the business process, business processing is carried out on the relevant models, and finally SQL nodes and synchronization nodes are designed for deployment. As shown in the figure below, click Scheduling Configuration on the right and enter the scheduling time.
Submitting the business process nodes and packaging the release: open the business process, select the nodes to submit, and submit them. Nodes do not automatically enter the production environment after submission; enter the task release interface and add the nodes to be released for deployment.
Confirming the release: finally, on the Operation Center page, confirm that the release exists in the production environment. At this point, the process of synchronizing Kafka data to MaxCompute is complete. At the scheduled time, each node displays its log (also available in the upper right corner); check the logs to confirm normal operation, or whether follow-up operations or commands are needed.
This concludes the answer to the question of how to synchronize Kafka data to MaxCompute. I hope the above content is of some help; if you still have questions, follow the industry information channel for more related knowledge.
© 2024 shulou.com SLNews company. All rights reserved.