

How to process GBK encoded data and output GBK encoded data by mapreduce Program in Hadoop

2025-01-16 Update From: SLTechnology News & Howtos


Shulou(Shulou.com)05/31 Report--

This article explains how a MapReduce program in Hadoop can process GBK-encoded input data and produce GBK-encoded output.

Hadoop Chinese-encoding issues: a MapReduce program that processes GBK-encoded data and outputs GBK-encoded data

When Hadoop processes GBK text, the output comes out garbled. The cause is that Hadoop hard-codes UTF-8 wherever encoding is involved; input files in any other encoding (such as GBK) therefore turn into mojibake.

On the input side the fix is simple: wherever a Text value is read in the mapper or reducer, call transformTextToUTF8(text, "GBK") to transcode it, so that the rest of the job runs on UTF-8 data. Example code for handling GBK file input:

import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;

// Decode the raw bytes held by the Text with the given charset,
// then wrap the resulting Unicode String in a new (UTF-8) Text.
public static Text transformTextToUTF8(Text text, String encoding) {
    String value = null;
    try {
        value = new String(text.getBytes(), 0, text.getLength(), encoding);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
    return new Text(value);
}

The core line is: String line = new String(text.getBytes(), 0, text.getLength(), "GBK"); // text here is of type Text

Writing String line = value.toString(); produces garbled output, and the cause lies in the Writable type Text. It is tempting to assume that Text is to String what LongWritable is to Long, i.e. simply the Writable wrapper around String. But Text and String differ: Text is a Writable that stores its contents as UTF-8 bytes, while a Java String holds Unicode characters. Text.toString() therefore decodes the stored bytes as UTF-8, so GBK-encoded data read directly into a Text becomes garbled.

The correct approach is to take the byte array of the input Text value (value.getBytes()) and use the String constructor String(byte[] bytes, int offset, int length, String charsetName), which builds a new String by decoding the specified byte subarray with the specified charset.
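The difference between the two decodings can be demonstrated with plain JDK classes, no Hadoop required. The sketch below (class name is illustrative) encodes a Chinese string as GBK bytes, as they would appear in a GBK input file, then decodes them the wrong way (as UTF-8, which is effectively what Text.toString() does) and the right way (with the charset the file was actually written in):

```java
public class GbkDecodeDemo {
    public static void main(String[] args) throws Exception {
        String original = "中文测试";                 // sample Chinese text
        byte[] gbkBytes = original.getBytes("GBK");  // bytes as stored in a GBK file

        // Wrong: decoding GBK bytes as UTF-8 yields replacement characters.
        String wrong = new String(gbkBytes, "UTF-8");
        // Right: decode the byte subarray with the charset it was encoded in.
        String right = new String(gbkBytes, 0, gbkBytes.length, "GBK");

        System.out.println(original.equals(right)); // true
        System.out.println(original.equals(wrong)); // false
    }
}
```

Running this prints true for the GBK decoding and false for the UTF-8 decoding, which is exactly the garbling seen in the MapReduce job.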

If you need the map/reduce output in another encoding, you must implement an OutputFormat yourself and specify the encoding there, instead of using the default TextOutputFormat.
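A common approach (an assumption based on general practice, not spelled out in the original article) is to copy Hadoop's TextOutputFormat together with its inner LineRecordWriter and replace its hard-coded "UTF-8" with "GBK" when encoding keys, values, the separator, and the newline. The encoding step at the heart of such a writer can be sketched with plain JDK streams; the class and method names below are illustrative, not part of any Hadoop API:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Stdlib-only sketch of the encoding logic a GBK RecordWriter would use.
// A real implementation would live inside a FileOutputFormat subclass and
// write to the FSDataOutputStream returned by getDefaultWorkFile().
public class GbkLineWriterDemo {
    static void writeLine(DataOutputStream out, String key, String value) throws IOException {
        out.write(key.getBytes("GBK"));    // encode the key as GBK
        out.write("\t".getBytes("GBK"));   // key/value separator
        out.write(value.getBytes("GBK"));  // encode the value as GBK
        out.write("\n".getBytes("GBK"));   // record terminator
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeLine(out, "词频", "42");

        // Decoding the written bytes back with GBK recovers the line intact.
        String roundTrip = new String(buf.toByteArray(), "GBK");
        System.out.println(roundTrip.equals("词频\t42\n")); // true
    }
}
```

The key design point is that the String-to-bytes conversion happens exactly once, at write time, with the target charset; everything upstream in the job stays in Unicode/UTF-8.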

The above is how a MapReduce program in Hadoop processes GBK-encoded data and outputs GBK-encoded data. If you run into a similar problem, the analysis above should help you work through it.
