In the first section of this chapter we took a brief look at MapReduce data serialization and the problems MapReduce has with XML and JSON formats. As part of the larger "Hadoop from getting started to mastering" topic, this second section of Chapter 3 shows how to work with two common formats, XML and JSON, in MapReduce, and compares the data formats best suited to MapReduce big data processing.
What is the most appropriate data format for big data in MapReduce?
3.2.1 XML
Since its birth in 1998, XML has been used as a data format to represent data that can be read by both machines and humans. It has become a common language for data exchange between systems, is now adopted by many standards, such as SOAP and RSS, and is used as an open data format for products such as Microsoft Office.
MapReduce and XML
MapReduce bundles InputFormat classes for working with text, but it has no built-in support for XML, which means native MapReduce is unfriendly to XML. Processing a single XML file in parallel in MapReduce is tricky because XML does not contain synchronization markers in its data format.
Problem
You want to use large XML files in MapReduce and be able to split and process them in parallel.
Solution
Mahout's XmlInputFormat lets MapReduce process XML files stored in HDFS. It reads records delimited by specific XML start and end tags. This technique also explains how to emit XML as MapReduce output.
MapReduce does not include built-in support for XML, so we turn to another Apache project, Mahout, a machine learning system that provides an XmlInputFormat. To try out XmlInputFormat, you can write a MapReduce job that uses it to read property names and values from a Hadoop configuration file stored in HDFS.
The first step is to configure the job:
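A minimal driver sketch follows; it assumes Mahout's XmlInputFormat class and the xmlinput.start/xmlinput.end configuration keys described below (package and key names may vary across Mahout versions), along with the hypothetical XmlPropertyMapper class sketched later in this section:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.mahout.classifier.bayes.XmlInputFormat; // assumed package; differs across Mahout versions

public class XmlMapReduceDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Tell the input format which start and end tags delimit a record.
    conf.set("xmlinput.start", "<property>");
    conf.set("xmlinput.end", "</property>");

    Job job = Job.getInstance(conf, "xml-property-extract");
    job.setJarByClass(XmlMapReduceDriver.class);

    // Use Mahout's XML input format so the file is split on record boundaries.
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(XmlPropertyMapper.class); // hypothetical mapper, sketched below
    job.setNumReduceTasks(0);                    // map-only job for this example

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Running a job configured this way against a copy of core-site.xml in HDFS would emit one name/value pair per <property> element.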
Mahout's XML input format is rudimentary: you must specify the exact start and end XML tags it should search for, and it splits files (and extracts records) using the following approach (a simplified sketch of this extraction loop follows the list):
Files are split into sections along HDFS block boundaries for data locality.
Each map task operates on a specific input split; it seeks to the start of its split and then keeps reading through the file until it reaches the first xmlinput.start.
The content between xmlinput.start and xmlinput.end is repeatedly emitted as records until the reader passes the end of its input split.
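To make these steps concrete, here is a simplified, hypothetical sketch of that extraction loop; it is not Mahout's actual implementation, which also has to track split boundaries and handle compression:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Simplified sketch of the record-extraction loop: skip ahead to the next start
// tag, then buffer everything up to and including the matching end tag.
public class XmlRecordScanner {

  // Reads until the marker bytes are seen; optionally keeps the bytes read.
  private static boolean readUntilMatch(InputStream in, byte[] match,
      ByteArrayOutputStream buffer) throws IOException {
    int i = 0;
    int b;
    while ((b = in.read()) != -1) {
      if (buffer != null) {
        buffer.write(b);
      }
      if (b == match[i]) {
        i++;
        if (i >= match.length) {
          return true;   // full marker matched
        }
      } else {
        i = 0;           // mismatch: start matching the marker again
      }
    }
    return false;        // hit end of stream before seeing the marker
  }

  // Returns the next record between startTag and endTag, or null if none remains.
  public static String nextRecord(InputStream in, String startTag, String endTag)
      throws IOException {
    byte[] start = startTag.getBytes(StandardCharsets.UTF_8);
    byte[] end = endTag.getBytes(StandardCharsets.UTF_8);
    if (!readUntilMatch(in, start, null)) {
      return null;       // no more start tags in this stream
    }
    ByteArrayOutputStream record = new ByteArrayOutputStream();
    record.write(start, 0, start.length);
    if (!readUntilMatch(in, end, record)) {
      return null;       // incomplete record at the end of the stream
    }
    return record.toString(StandardCharsets.UTF_8.name());
  }
}
```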
Next, you need to write a mapper to consume Mahout's XML input format. The XML element is already supplied as a Text instance, so you need an XML parser to extract content from it.
Listing 3.1 Extracting content using Java's StAX parser
The mapper receives a Text instance containing a String representation of the data between the start and end tags. In this code we use Java's built-in Streaming API for XML (StAX) parser to extract the key and value of each property and emit them.
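A minimal sketch of such a mapper (the hypothetical XmlPropertyMapper referenced in the driver sketch above) might look like this, assuming Hadoop-style <property><name>...</name><value>...</value></property> records:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pulls the <name> and <value> elements out of each <property> record.
public class XmlPropertyMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String document = value.toString();
    try {
      XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(
          new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8)));
      String propertyName = "";
      String propertyValue = "";
      String currentElement = "";
      while (reader.hasNext()) {
        int event = reader.next();
        switch (event) {
          case XMLStreamConstants.START_ELEMENT:
            currentElement = reader.getLocalName();
            break;
          case XMLStreamConstants.CHARACTERS:
            if ("name".equals(currentElement)) {
              propertyName += reader.getText();
            } else if ("value".equals(currentElement)) {
              propertyValue += reader.getText();
            }
            break;
        }
      }
      reader.close();
      context.write(new Text(propertyName.trim()), new Text(propertyValue.trim()));
    } catch (Exception e) {
      throw new IOException("Failed to parse XML record", e);
    }
  }
}
```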
If you run the MapReduce job against Cloudera's core-site.xml and use the HDFS cat command to display the output, you will see each configuration property name and value written as a key/value pair.
This output shows that XML was successfully used as the input serialization format for MapReduce. Not only that, you can also support large XML files, because the input format supports splitting XML.
Write XML
Now that we can read XML, the next problem to solve is how to write it. In a reducer, callbacks occur before and after the main reduce method is invoked, and you can use them to emit the root start and end tags, as shown below.
Listing 3.2 A reducer that emits start and end tags
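A minimal sketch of such a reducer follows; the class name and record layout are illustrative, and it assumes the keys and values are the property names and values emitted by the mapper:

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Wraps all records in a single root element by emitting the opening tag in
// setup() and the closing tag in cleanup().
public class XmlOutputReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    context.write(new Text("<configuration>"), null);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      context.write(new Text("  <property><name>" + key + "</name><value>"
          + value + "</value></property>"), null);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    context.write(new Text("</configuration>"), null);
  }
}
```

With a text output format, a null value means only the key is written, so the resulting output file is a well-formed XML document.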
This logic could also be embedded in an OutputFormat.
Pig
If you want to work with XML in Pig, the Piggy Bank library (a user-contributed collection of Pig code) includes an XMLLoader. It works much like this technique, capturing everything between the start and end tags and supplying it as a single byte-array field in a Pig tuple.
Hive
There is currently no built-in way to work with XML in Hive; you must write a custom SerDe.
Summary
Mahout's XmlInputFormat helps you work with XML, but it is sensitive to exact string matching of the start and end element names. This approach is unusable if the element tags contain attributes with variable values, if you cannot control how the elements are generated, or if XML namespace qualifiers may be used.
If you can control the XML in the input, you can simplify this exercise by using a single XML element per line. This allows you to use a built-in MapReduce text-based input format (such as TextInputFormat), which treats each line as a record and splits accordingly.
Another option worth considering is a preprocessing step that converts the original XML to one XML element per line, or to an entirely different data format, such as SequenceFile or Avro, both of which solve the splitting problem.
Now that you know how to use XML, let's deal with another popular serialization format, JSON.
3.2.2 JSON
JSON shares the machine-readable and human-readable traits of XML and has existed since the beginning of the 21st century. It is simpler than XML, but it does not have the rich typing and validation capabilities of XML.
Imagine you have some code that downloads JSON data from a streaming REST service and writes a file to HDFS every hour. Because of the volume of data being downloaded, each file produced is several gigabytes in size.
Now you are asked to write a MapReduce job that takes these large JSON files as input. The question breaks into two parts: first, MapReduce has no InputFormat that works with JSON; second, how do you split JSON?
Figure 3.7 shows the problem of splitting JSON. Imagine that MapReduce creates a split as shown in the figure. The map task that operates on this input split must search within the split to determine the beginning of the next record. For file formats such as JSON and XML, knowing where the next record starts is challenging because they lack synchronization markers or any other way to identify the beginning of a record.
JSON is more difficult to split into segments than formats such as XML, because JSON does not have a token (such as the closing tag in XML) to indicate the beginning or end of a record.
Problem
You want to work with JSON input in MapReduce and ensure that input JSON files can be partitioned for concurrent reads.
Solution
Use Elephant Bird's LzoJsonInputFormat as the basis for creating your own input format class for working with JSON elements.
Figure 3.7 Example of the problem with JSON and multiple input splits
Discussion
Elephant Bird (https://github.com/kevinweil/elephant-bird) is an open source project that contains useful utilities for working with LZOP compression. It has an LzoJsonInputFormat that can read JSON, although it requires the input file to be LZOP-compressed. You can, however, use the Elephant Bird code as a template for your own JSON InputFormat, which does not have the LZOP compression requirement.
This solution assumes that each JSON record is on a separate line. The JSON input format class is simple and does nothing other than construct and return a JSON record reader, so we'll skip that code. The record reader emits LongWritable, MapWritable key/value pairs to the mapper, where the MapWritable is a map of JSON element names to their values.
Let's take a look at how the record reader works. It delegates to LineRecordReader, a built-in MapReduce reader that emits one record per line. To convert each line to a MapWritable, the reader uses the json-simple parser to parse the line into a JSON object, then iterates over the keys in the JSON object and puts them, along with their associated values, into a MapWritable. The mapper receives the JSON data as LongWritable, MapWritable pairs and can process it accordingly.
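As a hypothetical illustration of the two pieces just described (neither class is taken from Elephant Bird), the line-to-MapWritable conversion and a mapper that consumes the resulting pairs could look like this:

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Converts one line of JSON into a MapWritable of element names to values,
// mirroring what the record reader described above does with json-simple.
class JsonLineConverter {
  static MapWritable toMapWritable(String line) throws ParseException {
    JSONObject json = (JSONObject) new JSONParser().parse(line);
    MapWritable map = new MapWritable();
    for (Object key : json.keySet()) {
      Object value = json.get(key);
      map.put(new Text(key.toString()),
              new Text(value == null ? "" : value.toString()));
    }
    return map;
  }
}

// Receives one MapWritable per JSON record and emits every element name/value pair.
public class JsonMapper extends Mapper<LongWritable, MapWritable, Text, Text> {
  @Override
  protected void map(LongWritable key, MapWritable value, Context context)
      throws IOException, InterruptedException {
    for (Map.Entry<Writable, Writable> entry : value.entrySet()) {
      context.write(new Text(entry.getKey().toString()),
                    new Text(entry.getValue().toString()));
    }
  }
}
```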
The input used in this example is a JSON file with one JSON object per line, as this technique requires.
Now copy the JSON file to HDFS and run the MapReduce code; the job writes out each JSON key/value pair it finds.
Write JSON
As in Section 3.2.1, the approach used to write XML can also be used to write JSON.
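For example, a reducer in the spirit of Listing 3.2 could emit one JSON object per output line. The following is a hypothetical sketch using the json-simple library rather than code from the original text:

```java
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.simple.JSONObject;

// Emits one JSON object per output line, keeping the output line-delimited
// and therefore splittable by the technique described in this section.
public class JsonOutputReducer extends Reducer<Text, Text, Text, NullWritable> {
  @Override
  @SuppressWarnings("unchecked")
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      JSONObject json = new JSONObject();
      json.put("key", key.toString());
      json.put("value", value.toString());
      context.write(new Text(json.toJSONString()), NullWritable.get());
    }
  }
}
```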
Pig
Elephant Bird contains a JsonLoader and an LzoJsonLoader that you can use to work with JSON in Pig; these loaders work with line-based JSON. Each Pig tuple contains a chararray field for each JSON element in the line.
Hive
Hive contains a DelimitedJSONSerDe class that can serialize JSON, but unfortunately it cannot deserialize it, so you cannot use this SerDe to load data into Hive.
Summary
This solution assumes that the JSON input is structured with one JSON object per line. So how do you work with JSON objects that span multiple lines? A project on GitHub (https://github.com/alexholmes/json-mapreduce) can work with multiple input splits over a single JSON file; it searches for a specific JSON member and retrieves the containing object.
You can also check out the Google Code project called hive-json-serde, which supports both serialization and deserialization.
As you can see, working with XML and JSON in MapReduce is awkward and imposes strict requirements on how the data is laid out. MapReduce support for these two formats is also complex and error-prone, because neither is well suited to splitting. Clearly, you should look at alternative file formats that have built-in support for splittability.
The next step is to explore complex file formats that are more suitable for MapReduce, such as Avro and SequenceFile.
3.3 Big data serialization formats
Unstructured text formats work well with scalar or tabular data. Semi-structured text formats such as XML and JSON can model more complex data structures, including composite fields and hierarchical data. However, when dealing with large volumes of data, we need serialization formats with compact serialized forms that natively support partitioning and offer schema evolution.
In this section, we compare the serialization formats best suited to MapReduce big data processing, and in later sections look at how to use them with MapReduce.
3.3.1 Comparing SequenceFile, Protocol Buffers, Thrift, and Avro
As a rule of thumb, the following characteristics are important when choosing a data serialization format:
Code generation - Some serialization formats have code-generation libraries that produce rich objects, making it easier to interact with the data. The generated code also provides benefits such as type safety, ensuring that consumers and producers work with the correct data types.
Schema evolution - Data models evolve over time, and it is important that the data format support the need to modify them. Schema evolution allows you to add, modify, and in some cases delete attributes, while providing backward and forward compatibility for readers and writers.
Language support - You may need to access your data from multiple programming languages, so it is important that mainstream languages support the data format.
Data compression - Data compression is important because of the large volumes of data you will work with. The ideal data format compresses data internally on write and decompresses it on read. If the data format does not support compression, it is a big headache for programmers, because compression and decompression must then be managed as part of the data pipeline (as with text-based file formats).
Splittability - Newer data formats support multiple parallel readers that can read and process different chunks of a large file. It is important that the file format contain synchronization markers, so that a reader can seek to a random point in the file and scan forward to the beginning of the next record.
Support for MapReduce and the Hadoop ecosystem - The data format you select must support MapReduce and other key Hadoop ecosystem projects, such as Hive. Without this support, you will be responsible for writing the code that makes the file format work with these systems.
Table 3.1 compares the popular data serialization frameworks to see how they stack up against one another. The following discussion provides additional background on these technologies.
Table 3.1 Feature comparison of data serialization frameworks
Let's look at these formats in more detail.
SequenceFile
The SequenceFile format was created for use with MapReduce, Pig, and Hive, so it integrates well with all of those tools. Its main drawbacks are the lack of code generation and versioning support, as well as limited language support.
Protocol Buffers
Protocol Buffers has been used heavily by Google for interoperability; its strengths are its versioning support and compact binary format. Its downside is the lack of support in MapReduce (or in any third-party software) for reading files generated by Protocol Buffers serialization. However, Elephant Bird can work with Protocol Buffers-serialized data inside a container file format.
Thrift
Thrift is a data serialization and RPC framework developed internally at Facebook. It lacks MapReduce support for its native data serialization format, but it supports different wire-level data representations, including JSON and various binary encodings. Thrift also includes an RPC layer with various kinds of servers. This chapter ignores the RPC capabilities and focuses on data serialization.
Avro
The Avro format was created by Doug Cutting to help make up for the shortcomings of SequenceFile.
Parquet
Parquet is a columnar file format with rich Hadoop ecosystem support that works well with Avro, Protocol Buffers, Thrift, and others. Although Parquet is a column-oriented format, don't expect one data file per column: Parquet keeps all the data for a row within the same data file, to ensure that all the columns of a row are available when processing on the same node. Parquet sets the HDFS block size and the maximum data file size to 1 GB so that I/O and network transfer requests apply to large batches of data.
Based on the above evaluation criteria, Avro seems to be the most suitable data serialization framework in Hadoop. SequenceFile is not far behind because of its inherent compatibility with Hadoop (it is designed for Hadoop).
You can check out the jvm-serializers project on GitHub, which runs benchmarks comparing formats based on serialization and deserialization times. It includes benchmarks for Avro, Protocol Buffers, and Thrift, as well as many other frameworks.
After learning about the various data serialization frameworks, we will focus on these formats in the next few sections.