2025-02-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)05/31 Report--
This article explains how to diagnose and fix garbled Chinese text (mojibake) in HBase caused by inconsistent values of the JVM file.encoding property across a Hadoop cluster. It walks through the symptoms, the investigation, two fixes, and a closer look at what file.encoding actually controls.
1. The problem:
Recently, while writing Chinese text to HBase, I found that some of the Chinese retrieved from HBase was garbled while the rest was normal. Normally mojibake is all-or-nothing: either everything is garbled or nothing is. The Chinese text all came from a single configuration file on HDFS that was confirmed to be UTF-8 encoded, which rules out a bad source file. The MapReduce code contains no transcoding logic, which rules out a code bug. That leaves one possibility: the Hadoop cluster's environments are heterogeneous, i.e. the Linux and Java environment variables or configuration differ between nodes.
2. Investigation:
(1) Printed the Linux locale variables (echo $LANG, echo $LC_ALL, and so on) on every node in the cluster; they were consistent, ruling out the OS environment.
(2) That leaves the Java environment. Add the following to the code to record, for every row, the IP of the host the task ran on and the file.encoding of its child JVM, then check which machine produced the garbled rows and what that JVM's encoding was at the time:
java.net.InetAddress addr = java.net.InetAddress.getLocalHost();
put.add(Bytes.toBytes("cf"), Bytes.toBytes("ip"), Bytes.toBytes(addr.getHostAddress()));
put.add(Bytes.toBytes("cf"), Bytes.toBytes("ec"), Bytes.toBytes(System.getProperty("file.encoding")));
At the same time, print the Chinese field directly with System.out.println to determine whether it becomes garbled before or after being written to HBase.
After a test run, the IP/encoding pairs recorded in HBase showed no obvious pattern, but the System.out logs showed the text was already garbled before it reached HBase. The lack of pattern in HBase makes sense: records are shuffled and reduced after the map phase before being written, so rows from different hosts are interleaved. (Note: System.out itself has no notion of encoding; it just passes bytes through, much like cat, head, or more on Linux.)
Moving the IP and JVM-encoding logging into the map-phase output revealed the pattern: two machines in the cluster had a JVM encoding that was inconsistent with the rest and was not UTF-8.
So the cause is clear: because the JVM file.encoding property differed on two machines in the cluster, some of the Chinese output was garbled.
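The diagnostic above boils down to recording, per task, the host it ran on and that child JVM's default encoding. A minimal self-contained sketch of the two values being recorded (outside Hadoop, so HBase's Put and Bytes are not needed just to see them):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class JvmEncodingProbe {
    public static void main(String[] args) throws UnknownHostException {
        // The host this JVM is running on.
        String ip = InetAddress.getLocalHost().getHostAddress();
        // The JVM's default encoding; values that differ across nodes cause the mojibake.
        String encoding = System.getProperty("file.encoding");
        System.out.println(ip + " -> " + encoding);
    }
}
```

Run on every node (or emitted from every map task), lines like `10.0.0.5 -> UTF-8` versus `10.0.0.7 -> ISO-8859-1` immediately expose the inconsistent machines.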
3. Solution:
Knowing the cause, the question is how to fix it. The goal is to change the value of file.encoding.
(1) Permanent fix:
file.encoding is a JVM startup parameter and cannot be changed at runtime (think of it as a global, cached value; mutating it mid-run could break the whole program running in the JVM). You can either fix the system charset on the affected machines or add -Dfile.encoding=UTF-8 to the JVM startup options; calling System.setProperty("file.encoding", "ISO-8859-1") at runtime has no effect. So the permanent fix is: take the two machines offline, correct their encoding, bring them back online, and then manually run a data balance.
Alternatively, set the child environment when submitting the job: -Dmapred.child.env="LANG=en_US.UTF-8,LC_ALL=en_US.UTF-8"
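A hedged sketch of what such a job submission might look like (the jar, class, and paths are placeholders; mapred.child.java.opts and mapred.child.env are the old-style Hadoop property names from this era):

```shell
# Force the child JVM encoding regardless of each node's default locale.
hadoop jar my-job.jar com.example.MyJob \
  -D mapred.child.java.opts="-Dfile.encoding=UTF-8" \
  -D mapred.child.env="LANG=en_US.UTF-8,LC_ALL=en_US.UTF-8" \
  /input /output
```

Setting the encoding at submission time protects the job even if a misconfigured node slips back into the cluster later.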
(2) Temporary fix:
If you don't want that much disruption and just need a stopgap, that works too. In that case, bypass the JVM's default file.encoding in your own business code and specify the encoding explicitly:
BufferedReader in = new BufferedReader(new FileReader(path.toString()));
is replaced by:
BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(path.toString()), "utf-8"));
The first line is my original code that produced the mojibake: with no encoding specified, the JVM falls back to its own file.encoding, so the file is read incorrectly on some machines. The second line specifies the encoding explicitly, bypassing the JVM default entirely.
PS: FileReader provides no constructor that takes an encoding (one was only added much later, in Java 11), which is why it is replaced with InputStreamReader.
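A runnable sketch of the workaround: write a file containing Chinese as UTF-8 bytes, then read it back through InputStreamReader with an explicit charset, so the result is correct whatever the JVM's file.encoding happens to be.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitEncodingDemo {
    // Reads a file as UTF-8 regardless of the JVM's default encoding.
    static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("enc-demo", ".txt");
        Files.write(tmp, "中文测试".getBytes(StandardCharsets.UTF_8));
        System.out.println(readUtf8(tmp.toString())); // prints 中文测试 on any JVM
        Files.delete(tmp);
    }
}
```

Using StandardCharsets.UTF_8 instead of the string "utf-8" also avoids the checked UnsupportedEncodingException.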
(3) A remaining question:
Why was there never any mojibake before, yet reading this file produced it?
Because HBase's Bytes utility, the key/value pairs produced by the map side's FileInputFormat, and MapReduce's context.write all hardcode UTF-8 by default. They are independent of the JVM encoding, so they never hit this problem.
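In plain Java terms, a hardcoded charset behaves like HBase's Bytes.toBytes(String), which is essentially String.getBytes with UTF-8 fixed. A small sketch of why the hardcoded form is immune to file.encoding while the default form is not:

```java
import java.nio.charset.StandardCharsets;

public class HardcodedCharsetDemo {
    public static void main(String[] args) {
        String s = "中文";
        // Hardcoded charset: identical bytes on every JVM, whatever file.encoding is.
        byte[] fixed = s.getBytes(StandardCharsets.UTF_8);
        // Default charset: the bytes depend on the JVM's file.encoding.
        byte[] dflt = s.getBytes();
        System.out.println(fixed.length); // 6 (each CJK char is 3 bytes in UTF-8)
        // Round-trips losslessly because both directions pin the same charset.
        System.out.println(new String(fixed, StandardCharsets.UTF_8)); // 中文
    }
}
```

The bug in this article arose precisely because one code path (FileReader) left the charset to the JVM while every other path pinned UTF-8.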
4. A closer look at the JVM's -Dfile.encoding parameter
After all of the above, some readers may still wonder: what does this parameter actually do? Why have I never heard of it?
That's normal. I hadn't heard of it before either.
Tracing through the source:
In the src.zip of JDK 1.6.0_20, search for files containing the string file.encoding.
Four hits were found:
(a) Start with the highlight, the java.nio.charset.Charset class:
public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa = new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}
In Java, whenever a charset is not specified, e.g. new String(byte[] bytes), Charset.defaultCharset() is called. You can clearly see that defaultCharset is initialized only once. There is a small wrinkle: concurrent first calls may run the initialization more than once (the null check is outside the synchronized block), but later calls read the cached value via the lookup function, so it is not a real problem.
By the time we change file.encoding via System.getProperties, defaultCharset has already been initialized, so the initialization code never runs again.
defaultCharset is initialized during JVM startup and class loading, before main is ever called, and this method is used all over the place: String.getBytes, as well as InputStreamReader and InputStreamWriter, all call Charset.defaultCharset().
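The caching behavior described above can be demonstrated directly: changing the file.encoding system property after startup leaves the cached default charset untouched.

```java
import java.nio.charset.Charset;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // First call initializes (or returns) the cached default.
        Charset before = Charset.defaultCharset();
        // Attempt to change the encoding at runtime...
        System.setProperty("file.encoding", "ISO-8859-1");
        // ...but the cached default is unaffected.
        System.out.println(Charset.defaultCharset().equals(before)); // prints true
    }
}
```

This is exactly why the article's permanent fix requires setting -Dfile.encoding at JVM startup rather than from code.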
(b) The static initializer of java.net.URLEncoder, which affects java.net.URLEncoder.encode(String).
Note this one too; people have fallen into this pit before. Use the encode(String s, String enc) overload instead: it has no such hidden dependency on file.encoding, and you can sleep soundly.
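A quick illustration of the safe overload: with the charset passed explicitly, the output is identical on every machine, whereas the one-argument encode(String) (deprecated since Java 1.3) silently uses file.encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Explicit charset: same result on every JVM, whatever file.encoding is.
        System.out.println(URLEncoder.encode("中文", "UTF-8")); // %E4%B8%AD%E6%96%87
    }
}
```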
(c) The getMimeEncoding method of com.sun.org.apache.xml.internal.serializer.Encoding (line 209 onwards).
(d) Finally, the static initializer of the javax.print.DocFlavor class.
As you can see, the system property file.encoding affects:
1. Charset.defaultCharset() - the most critical encoding setting in a Java environment;
2. URLEncoder.encode(String) - the encoding use most commonly encountered in web environments;
3. com.sun.org.apache.xml.internal.serializer.Encoding - affects reading XML files that declare no encoding;
4. javax.print.DocFlavor - affects the encoding used for printing.
This concludes the study of the HBase mojibake problem caused by the JVM file.encoding property. Theory matched with practice makes things stick, so go and try it out.