How MapReduce Reads and Writes Serialized Data

2025-02-24 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

In this issue, the editor brings you a brief analysis of how MapReduce reads and writes serialized data. The article is rich in content and approaches the topic from a professional point of view; I hope you get something out of it.

MapReduce offers straightforward, well-documented support for simple data formats such as log files, but its use has evolved beyond log files to more complex data serialization formats such as text, XML, and JSON. The goal of this chapter is to show how to work with common data serialization formats, to examine more structured serialization formats, and to compare their suitability for MapReduce. The following sections introduce how MapReduce processes data stored in different formats (such as XML and JSON), paving the way for a later look at formats such as Avro and Parquet that are well suited to big data and Hadoop.

Data serialization: using text and other formats

XML and JSON are ubiquitous data serialization formats that work with most programming languages, and a variety of tools exist for marshalling, unmarshalling, and validating them. However, using XML and JSON in MapReduce poses two major challenges. First, MapReduce needs classes that can read and write a given serialization format; if you use a custom file format, there may be no class that supports it. Second, the power of MapReduce lies in its ability to read input data in parallel. This is critical when input files are large (hundreds of megabytes or more), so the classes that read a serialization format must be able to split large files into pieces that multiple tasks can read in parallel.

XML and JSON formats

Data serialization support in MapReduce is a property of its data input and output classes, so let's start with an overview of how MapReduce supports data input and output.

3.1 Understanding input and output in MapReduce

Your data may sit in XML files behind a number of FTP servers, in text log files on a central web server, or in Lucene indexes in HDFS. How does MapReduce read and write these different serialization structures across these multiple storage mechanisms?

Figure 3.1 Input and output actors in MapReduce

Figure 3.1 shows the flow of data through MapReduce and identifies the actor responsible for each part of the flow. On the input side, some work (creating the splits) is performed before the map phase, while other work happens as part of the map phase itself (reading the splits); all output work is performed in the reduce phase (writing the output).

Figure 3.2 shows the same flow for a map-only job, where the MapReduce framework still uses the OutputFormat and RecordWriter classes to write output directly to the data sink. Let's walk through the data flow and discuss the responsibilities of each actor; to better understand these concepts, we'll also look at the relevant code in the built-in TextInputFormat and TextOutputFormat classes, which read and write line-oriented text files.

3.1.1 Data input support

Data input in MapReduce involves two classes: InputFormat and RecordReader. MapReduce queries the InputFormat class to determine how the input data should be partitioned into splits for the map tasks, and the RecordReader performs the actual reading of data from the input.

INPUTFORMAT

Each job in MapReduce must define its input according to the contract specified in the InputFormat abstract class. An InputFormat implementer must do three things: describe the type information for the map input keys and values; specify how the input data should be partitioned; and indicate the RecordReader instance that should read the data from the source.

Figure 3.2 MapReduce input and output actors without a reducer

Figure 3.3 The annotated InputFormat class and its three rules

Arguably the most important of these is determining how to partition the input data. In MapReduce nomenclature, these partitions are called input splits. Input splits directly affect map-phase parallelism, because each split is handled by a single map task. An InputFormat that cannot create multiple input splits over a single data source (such as a file) will make the map phase slow, because the file will be processed sequentially.

The TextInputFormat class provides an implementation of the InputFormat class's createRecordReader method, but delegates the calculation of input splits to its parent class, FileInputFormat. The following code shows the relevant parts of the TextInputFormat class:

The FileInputFormat code that calculates the splits is slightly more complex; the following example shows a simplified form of the code to illustrate the main elements of the getSplits method:
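The original listing is not reproduced here. As a rough, self-contained sketch of the split arithmetic that FileInputFormat performs, assuming the standard formula (the split size is `max(minSize, min(maxSize, blockSize))`, and the last split may be up to 10% oversized, the so-called SPLIT_SLOP rule), one might write:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, self-contained sketch of the split arithmetic in
// FileInputFormat.getSplits(). The names and constants mirror Hadoop's
// implementation, but this is an illustration, not the real class.
class SplitSketch {

    // A split is just an (offset, length) pair into the file.
    static class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    // Allow the final split to be up to 10% larger than the split size
    // rather than creating a tiny trailing split.
    static final double SPLIT_SLOP = 1.1;

    // The split size is derived from the configured min/max sizes and
    // the file's HDFS block size.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static List<Split> getSplits(long fileLength, long blockSize,
                                 long minSize, long maxSize) {
        List<Split> splits = new ArrayList<>();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long bytesRemaining = fileLength;
        // Carve off full-sized splits while the remainder is big enough.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new Split(fileLength - bytesRemaining, splitSize));
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new Split(fileLength - bytesRemaining, bytesRemaining));
        }
        return splits;
    }
}
```

For example, a 250-byte file with a 100-byte split size yields three splits of 100, 100, and 50 bytes, while a 105-byte file yields a single split because 1.05 is within the slop factor.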

The following code shows how to specify the InputFormat for the MapReduce job:

job.setInputFormatClass(TextInputFormat.class);

RECORDREADER

The RecordReader class is created and used by MapReduce in the map task to read data from an input split and to provide each record to the mapper in key/value form. Typically a task is created for each input split, and each task has a single RecordReader that reads the data from that split.

Figure 3.4 The annotated RecordReader class and its abstract methods

As shown earlier, the TextInputFormat class creates a LineRecordReader to read records from the input split. LineRecordReader extends the RecordReader class directly and uses the LineReader class to read lines from the input split. LineRecordReader uses the byte offset in the file as the map key and the contents of the line as the map value. The following example shows a simplified version of LineRecordReader:
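The original listing is not reproduced here. As a self-contained analogue of what LineRecordReader produces (the real class streams an input split and handles split boundaries; this hypothetical sketch just scans an in-memory byte array), one might write:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Self-contained analogue of LineRecordReader's output: each record's
// key is the byte offset of the line's first byte, and the value is
// the line text. This is an illustration, not Hadoop's implementation.
class LineRecordSketch {

    static class Record {
        final long key;     // byte offset of the line within the input
        final String value; // line contents, without the trailing newline
        Record(long key, String value) { this.key = key; this.value = value; }
    }

    static List<Record> readLines(byte[] input) {
        List<Record> records = new ArrayList<>();
        int lineStart = 0;
        for (int pos = 0; pos <= input.length; pos++) {
            boolean atEnd = (pos == input.length);
            if (atEnd || input[pos] == '\n') {
                if (atEnd && pos == lineStart) break; // no trailing empty record
                records.add(new Record(lineStart,
                        new String(input, lineStart, pos - lineStart,
                                   StandardCharsets.UTF_8)));
                lineStart = pos + 1;
            }
        }
        return records;
    }
}
```

Given the input `"ab\ncde\n"`, this produces the key/value pairs (0, "ab") and (3, "cde"), matching the offset-as-key convention described above.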

Because the LineReader class is simple, we will skip this code. The next step is to see how MapReduce supports data output.

3.1.2 Data output support

MapReduce supports data output using a process similar to input. Two classes must exist: OutputFormat and RecordWriter. The OutputFormat performs some basic validation of the data sink's properties, and the RecordWriter writes each reducer output record to the data sink.

OUTPUTFORMAT

Much like the InputFormat class, the OutputFormat class (shown in figure 3.5) defines the contract an implementation must meet: validate the details of the job's output; provide a RecordWriter; and specify an output committer, which allows writes to be staged and then made permanent when a task completes.

Figure 3.5 The annotated OutputFormat class

Like TextInputFormat, TextOutputFormat extends a base class, FileOutputFormat, which handles the heavier lifting in the output flow, such as output committing. Next, let's look at how TextOutputFormat works. The following code shows how to specify the OutputFormat for a MapReduce job:

job.setOutputFormatClass(TextOutputFormat.class);

RECORDWRITER

The RecordWriter is used to write the reducer outputs to the destination data sink. It is a simple class, as figure 3.6 shows.

TextOutputFormat returns a LineRecordWriter object, an inner class of TextOutputFormat, to perform the actual writing to the file. The following example shows a simplified version of that class:
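The original listing is not reproduced here. As a self-contained analogue of LineRecordWriter's behavior (each record is written as the key, a separator, the value, and a newline, with a tab as the default separator), one might write the following sketch; the real class declares IOException and also handles null keys/values and Hadoop's Text type, which this illustration omits:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Self-contained analogue of TextOutputFormat.LineRecordWriter: writes
// key, separator (tab by default), value, newline for each record.
// This is an illustration, not Hadoop's implementation.
class LineRecordWriterSketch {
    private static final byte[] SEPARATOR = "\t".getBytes(StandardCharsets.UTF_8);
    private static final byte[] NEWLINE = "\n".getBytes(StandardCharsets.UTF_8);

    private final DataOutputStream out;

    LineRecordWriterSketch(DataOutputStream out) { this.out = out; }

    void write(String key, String value) {
        try {
            out.write(key.getBytes(StandardCharsets.UTF_8));
            out.write(SEPARATOR);
            out.write(value.getBytes(StandardCharsets.UTF_8));
            out.write(NEWLINE);
        } catch (IOException e) {
            // The real class declares IOException; wrapped here to keep
            // the sketch simple.
            throw new UncheckedIOException(e);
        }
    }
}
```

Writing the records ("apple", "3") and ("pear", "7") produces the two tab-separated lines "apple\t3" and "pear\t7", which is the familiar TextOutputFormat output layout.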

On the map side, the InputFormat determines how many map tasks are executed; on the reduce side, the number of tasks is based entirely on the mapred.reduce.tasks value set by the client (or, if it is not set, the value from mapred-site.xml, or from mapred-default.xml if the site file does not define it).
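For reference, the property looks like this in mapred-site.xml; the value 4 here is purely illustrative:

```xml
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>
```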

The above is a brief analysis of how MapReduce reads and writes serialized data. If you happen to have similar questions, the analysis above may help you understand them. If you want to learn more, you are welcome to follow the industry information channel.

