How do you quickly import Kafka data into Hadoop? Many newcomers are unclear about this, so to help solve the problem, this article explains the process in detail. Anyone with this need can follow along, and hopefully you will get something out of it.
Kafka is a distributed publish-subscribe system. Thanks to its strong distribution and performance characteristics, it has quickly become a key part of many data pipelines. It can serve many purposes, such as messaging, metrics collection, stream processing, and log aggregation. Another useful application of Kafka is importing data into Hadoop. The key reason for using Kafka is that it decouples data producers from consumers, allowing multiple independent producers (possibly written by different development teams) and multiple independent consumers (again, possibly written by different teams). Consumers can also be real-time/synchronous or batch/offline/asynchronous. That last property sets Kafka apart from other pub-sub tools such as RabbitMQ.
To use Kafka, there are some concepts to understand:
Topics - a topic is a feed of related messages.
Partitions - each topic is made up of one or more partitions, which are ordered message queues backed by log files.
Producers and consumers - producers write messages to partitions, and consumers read messages from them.
Brokers - a broker is a Kafka process that manages topics and partitions and serves producer and consumer requests.
Kafka does not guarantee "total" ordering for a topic, only that each of the partitions that make up the topic is ordered. Consumer applications can enforce "global" ordering across a topic themselves if needed.
Figure 5.14 shows the conceptual model of Kafka.
Figure 5.15 shows an example of how partitions can be distributed in a Kafka deployment.
To support fault tolerance, topics can be replicated, meaning each partition can have a configurable number of replicas on different hosts. This provides greater fault tolerance: the loss of a single server is not catastrophic for the data or for the availability of producers and consumers.
Kafka version 0.8 and Camus 0.8.x are used here.
Practice: use Camus to copy Avro data from Kafka to HDFS
This technique is useful when you are already streaming data into Kafka for other purposes and also want to land that data in HDFS.
Problem
You want to use Kafka as the data transfer mechanism to import data into HDFS.
Solution
Use Camus, a solution developed by LinkedIn, to copy data from Kafka to HDFS.
Discussion
Camus is an open source project developed by LinkedIn. Kafka is heavily deployed at LinkedIn, and Camus is used there to copy data from Kafka into HDFS.
Out of the box, Camus supports two data formats in Kafka: JSON and Avro. In this technique we will work with Avro data. Camus's built-in Avro support requires Kafka publishers to write Avro data in a proprietary way, so for this technique let's assume that the data in Kafka is serialized with vanilla Avro serialization.
Three parts are needed to make this technique work: first, write some Avro data into Kafka; then write a simple class to help Camus deserialize the Avro data; and finally run a Camus job to perform the data import.
To write Avro records to Kafka, as in the following code, you need to set up a Kafka producer by configuring the necessary Kafka properties, load some Avro records from a file, and write them out to Kafka:
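The original code listing is not reproduced in this excerpt. As a rough sketch of the idea (not the book's exact code), the following uses the modern Kafka producer API and Avro's GenericDatumWriter to serialize records to bytes and send them to a topic. The inline Stock schema, the topic name "test", and the localhost broker address are assumptions for illustration, and the book's version uses a generated (specific) Avro record class rather than a generic record.

import java.io.ByteArrayOutputStream;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroKafkaWriter {
  // Hypothetical schema for illustration; replace with your own record schema.
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Stock\",\"fields\":["
      + "{\"name\":\"symbol\",\"type\":\"string\"},"
      + "{\"name\":\"price\",\"type\":\"double\"}]}");

  public static void main(String[] args) throws Exception {
    // Configure the producer: broker list and byte-array serializers.
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");
    props.put("value.serializer",
        "org.apache.kafka.common.serialization.ByteArraySerializer");

    try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
      // Build a couple of sample Avro records (the original loads them from a file).
      for (int i = 0; i < 2; i++) {
        GenericRecord record = new GenericData.Record(SCHEMA);
        record.put("symbol", "GOOG");
        record.put("price", 500.0 + i);

        // Serialize the record to vanilla Avro binary.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
        encoder.flush();

        // Write the serialized bytes to the "test" topic.
        producer.send(new ProducerRecord<>("test", out.toByteArray()));
      }
      producer.flush();
    }
  }
}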
You can then load the sample data into a Kafka topic named test.
The Kafka console consumer can be used to verify that the data has been written to Kafka; it will dump the binary Avro data to the console.
Once the data is in Kafka, the next step is to write some Camus code so that these Avro records can be read into Camus.
Practice: writing a Camus decoder and schema registry
First, you need to understand three Camus concepts:
Decoder - the decoder's job is to convert the raw data pulled from Kafka into Camus's format.
Encoder - the encoder serializes decoded data into the format that will be stored in HDFS.
Schema registry - the schema registry provides schema information about the Avro data being encoded.
As mentioned earlier, Camus supports Avro data out of the box, but it requires Kafka producers to write data using the Camus KafkaAvroMessageEncoder class, which adds some proprietary data to the Avro-serialized binary data, presumably so that the Camus decoder can verify that it was written by that class.
In this example the data was serialized with vanilla Avro, so you need to write your own decoder. Fortunately, this is simple:
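The decoder listing is not included in this excerpt. The sketch below shows the general shape, assuming Camus's MessageDecoder base class takes the raw payload as a byte array (package names and exact signatures vary between Camus versions, so check your release); the heart of it is plain Avro deserialization into a generic record. The avro.schema.<topic> property used to locate the writer schema is hypothetical.

import java.io.IOException;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.io.DecoderFactory;

import com.linkedin.camus.coders.CamusWrapper;
import com.linkedin.camus.coders.MessageDecoder;

// Decodes vanilla Avro-serialized Kafka payloads into generic Avro records for Camus.
public class VanillaAvroMessageDecoder
    extends MessageDecoder<byte[], GenericData.Record> {

  // Writer schema for this topic; here it is assumed to be supplied via a
  // (hypothetical) "avro.schema.<topic>" property in the decoder configuration.
  private Schema schema;

  @Override
  public void init(Properties props, String topicName) {
    super.init(props, topicName);
    this.schema = new Schema.Parser().parse(props.getProperty("avro.schema." + topicName));
  }

  @Override
  public CamusWrapper<GenericData.Record> decode(byte[] payload) {
    try {
      // Deserialize the vanilla Avro binary payload into a generic record.
      GenericDatumReader<GenericData.Record> reader = new GenericDatumReader<>(schema);
      GenericData.Record record =
          reader.read(null, DecoderFactory.get().binaryDecoder(payload, null));
      return new CamusWrapper<GenericData.Record>(record);
    } catch (IOException e) {
      throw new RuntimeException("Unable to decode Avro payload", e);
    }
  }
}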
You may have noticed that we wrote a specific Avro record into Kafka but read it back in Camus as a generic Avro record, not a specific one. This is because the CamusWrapper class only supports generic Avro records. Otherwise, specific records would be easier to work with, since you could use the generated code and all the type safety that comes with it.
The CamusWrapper object is an envelope around the data extracted from Kafka. This class exists so that metadata such as timestamps, server names, and service details can be attached to each record. It is strongly recommended that every record have a meaningful timestamp associated with it (usually the time the record was created or generated). You can then use the CamusWrapper constructor that takes a timestamp as an argument:
public CamusWrapper(R record, long timestamp) { ... }
If no timestamp is set, Camus assigns a new timestamp at the time the wrapper is created. This timestamp and the other metadata are used by Camus when determining the HDFS location of the output records.
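Continuing the decoder sketch above, the wrapper can be stamped with the record's own creation time. The "timestamp" field (epoch milliseconds) is a hypothetical example that simply illustrates the constructor:

// Inside decode(), once the Avro record has been deserialized:
long timestamp = (Long) record.get("timestamp"); // hypothetical field holding epoch millis
return new CamusWrapper<GenericData.Record>(record, timestamp);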
Next, you need to write a schema registry so that the Camus Avro encoder knows the schema details of the Avro records being written to HDFS. When registering a schema, you also specify the name of the Kafka topic from which the Avro records are pulled:
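The registry listing is also missing from this excerpt, and the exact methods of Camus's SchemaRegistry abstraction differ between versions, so the following is only a conceptual sketch of the in-memory, topic-to-schema mapping such a registry provides. A real implementation would plug into the SchemaRegistry interface shipped with your Camus version.

import java.util.HashMap;
import java.util.Map;

import org.apache.avro.Schema;

// Conceptual in-memory registry mapping Kafka topic names to the Avro schema
// that Camus's Avro encoder should use when writing records for that topic.
public class MemoryTopicSchemaRegistry {

  private final Map<String, Schema> schemasByTopic = new HashMap<>();

  // Register the schema for records pulled from the given topic.
  public void register(String topic, Schema schema) {
    schemasByTopic.put(topic, schema);
  }

  // Look up the latest schema registered for a topic.
  public Schema getLatestSchemaByTopic(String topic) {
    Schema schema = schemasByTopic.get(topic);
    if (schema == null) {
      throw new IllegalArgumentException("No schema registered for topic " + topic);
    }
    return schema;
  }
}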
Run Camus
Camus runs as a MapReduce job on the Hadoop cluster into which you want to import the Kafka data. You need to provide a number of properties to Camus, which can be done on the command line or with a properties file; we will use a properties file:
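The properties file is not reproduced here. Purely as an illustration, a minimal file looks roughly like the following; the property names are recalled from Camus's example configuration and should be verified against the Camus version you use, and the class names and paths are placeholders. You would also add a property pointing Camus at your schema registry class (see your Camus version's example camus.properties for the exact key).

# Kafka brokers to pull from (placeholder host)
kafka.brokers=localhost:9092

# The decoder class written earlier (placeholder package name)
camus.message.decoder.class=com.example.VanillaAvroMessageDecoder

# Where imported data and Camus bookkeeping (offsets, execution history) live in HDFS
etl.destination.path=/tmp/camus/dest
etl.execution.base.path=/tmp/camus/work
etl.execution.history.path=/tmp/camus/history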
As you can see from the properties, there is no need to explicitly tell Camus which topics to import. Camus automatically talks to Kafka to discover the topics (and partitions) along with the current start and end offsets.
If you want precise control over which topics are imported, you can use kafka.whitelist.topics and kafka.blacklist.topics to whitelist (restrict) and blacklist (exclude) topics. Multiple topics can be specified using commas as delimiters, and regular expressions are supported, as shown in the following example, which matches the topic "topic1" or any topic that begins with "abc" followed by one or more digits. A blacklist value is specified using exactly the same syntax:
kafka.whitelist.topics=topic1,abc[0-9]+
Once the properties are all set, you can submit the Camus job to the cluster.
This causes the Avro data to land in HDFS. If you list the Camus output directory in HDFS, you will see that the first file contains the imported data; the remaining files are used by Camus to manage the import.
You can use the AvroDump utility to view the data files in HDFS.
So what happens when a Camus job runs? The Camus import process is performed as a MapReduce job, as shown in figure 5.16.
When a Camus task succeeds in MapReduce, the Camus OutputCommitter (a MapReduce construct that allows custom work to be performed when a task completes) atomically moves the task's data files to the destination directory. The OutputCommitter also creates offset files for all the partitions the task was working on. Other tasks in the same job may fail, but this does not affect the state of the successful tasks: their data and offset output remain, so a subsequent Camus execution will resume processing from the last known successful state.
Next, let's look at where Camus imports data and how to control behavior.
Data partitioning
Earlier, we saw that Camus imports Avro data from Kafka. Let's take a closer look at the HDFS path structure, shown in figure 5.17, and see what can be done to control the output location.
Figure 5.17 Dissecting the Camus output path of data exported to HDFS
The date/time portion of the path is determined by the timestamp extracted from the CamusWrapper. As discussed earlier, you can extract a timestamp from the Kafka record in your MessageDecoder and supply it to the CamusWrapper, which allows the data to be partitioned by a meaningful date rather than the default, which is simply the time at which the Kafka record is read in MapReduce.
Camus supports pluggable partitioners that allow you to control the part of the path shown in figure 5.18.
Figure 5.18 Camus partition path
The Camus Partitioner interface provides two methods that must be implemented.
For example, a custom partitioner can create paths that follow Hive's partition layout.
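The Partitioner interface itself is not shown in this excerpt, and its method signatures vary across Camus versions, so the following is only a conceptual sketch: a small helper that turns a record timestamp into the kind of Hive-style partition path a custom partitioner might return.

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Conceptual helper: converts a record timestamp (epoch millis) into a
// Hive-style partition path such as "date=2014-01-01/hour=05". A custom
// Camus partitioner could return a path like this so that the imported
// data can be registered as partitions of a Hive table.
public class HiveStylePartitionPaths {

  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("'date='yyyy-MM-dd'/hour='HH").withZone(ZoneOffset.UTC);

  public static String partitionPath(long timestampMillis) {
    return FORMAT.format(Instant.ofEpochMilli(timestampMillis));
  }

  public static void main(String[] args) {
    // Example: 2014-01-01T05:30:00Z -> date=2014-01-01/hour=05
    System.out.println(partitionPath(1388554200000L));
  }
}

A path in this form lets the imported data be added to a Hive table simply by registering each directory as a partition.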
Camus provides a complete solution for getting data from Kafka into HDFS, and it takes care of maintaining state and handling errors when problems arise. It can easily be automated by integrating it with Azkaban or Oozie, and because it organizes data in HDFS by message time, it supports simple time-based data management. It is worth mentioning that when it comes to ETL, its functionality holds up well against Flume.
Kafka also bundles a mechanism for importing data into HDFS: it includes a KafkaETLInputFormat input format class that can be used to pull data from Kafka in a MapReduce job. You have to write a MapReduce job to perform the import, but the advantage is that you can use the data directly in your MapReduce flow instead of using HDFS as intermediate storage.