How to Convert String Encoding in Java

2025-04-28 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains how to convert string encodings in Java: how strings are represented internally, where garbled text comes from, and how to use getBytes and new String correctly when transcoding.

In Java, a String always holds Unicode text, exposed as a sequence of UTF-16 code units (char values), no matter which encoding the text originally came from. Take String s = "Hello!"; as an example. If the source file is GBK-encoded and the default encoding of the operating system (Windows here) is also GBK, the compiler (javac) decodes the source bytes into characters using GBK at compile time and stores the string in the class file in an encoding-independent Unicode form. When the string is printed, the JVM encodes the Unicode characters according to the platform's default encoding, GBK in this case, and the console displays the resulting GBK bytes.

When the source file is UTF-8 and the platform default is not, we need to tell the compiler the encoding of the source code: javac -encoding UTF-8. The compiler then decodes the source as UTF-8 and again stores the string as Unicode. Whatever the encoding of the source file, the same string literal yields exactly the same Unicode characters in the compiled program. When displayed, the string is still encoded to GBK for output, because that step depends on the OS environment rather than on the source file.
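The point that one Unicode string has different byte forms can be checked with a minimal sketch like the one below. It is only an illustration: the class name CharsetBytesDemo and the helper hex are made up for this example, the text "\u4F60\u597D" (the two Chinese characters for "hello") is written with Unicode escapes so the demo does not depend on the source file's own encoding, and it assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetBytesDemo {
    // Formats a byte array as space-separated upper-case hex, e.g. "C4 E3".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes: the literal is the same whatever encoding this file is saved in.
        String s = "\u4F60\u597D";
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK

        System.out.println(s.length());                              // 2 (two UTF-16 code units)
        System.out.println(hex(s.getBytes(gbk)));                    // C4 E3 BA C3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // E4 BD A0 E5 A5 BD
    }
}
```

The string itself is the same in both calls; only the charset passed to getBytes decides which bytes come out.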

How does garbled text arise? In essence, it is caused by a mismatch between the encoding in which the bytes of a string were produced and the encoding used to decode them when reading.

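For example: String s = "Hello!";

System.out.println(new String(s.getBytes(), "UTF-8")); // wrong when the default charset is GBK: getBytes() encodes as GBK, but the bytes are then decoded as UTF-8. (The damage shows only for non-ASCII text such as Chinese; ASCII characters have identical bytes in GBK and UTF-8.)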

getBytes() encodes the Unicode string into a byte array using the platform's default charset, which in this example is GBK, so it returns the GBK bytes of s. (Since JDK 18 the default charset is UTF-8 unless configured otherwise, so passing the charset explicitly is safer.) The charset argument in new String(bytes, charset) specifies how the bytes are decoded; specifying UTF-8 means the contents of bytes are interpreted as UTF-8.

The following two calls give correct results, because the encoding used to produce the bytes matches the encoding used to decode them (the first one assumes the default charset is GBK):

System.out.println(new String(s.getBytes(), "GBK"));

System.out.println(new String(s.getBytes("UTF-8"), "UTF-8"));
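The difference between a mismatched and a matched decode can be sketched as follows. This is an illustrative example, not code from the original article: the class DecodeMismatchDemo and the method encodeThenDecode are hypothetical names, charsets are passed explicitly so the result does not depend on the platform default, and GBK is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatchDemo {
    static final Charset GBK = Charset.forName("GBK"); // assumption: GBK is available in this JDK

    // Encodes text with one charset and decodes the resulting bytes with another.
    static String encodeThenDecode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String s = "\u4F60\u597D"; // two Chinese characters, written as Unicode escapes

        // Mismatch: GBK bytes read as UTF-8 are malformed and become U+FFFD replacement characters.
        String garbled = encodeThenDecode(s, GBK, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(s));          // false
        System.out.println(garbled.contains("\uFFFD")); // true

        // Match: the same charset on both sides gives back the original text.
        System.out.println(encodeThenDecode(s, GBK, GBK).equals(s)); // true
        System.out.println(
            encodeThenDecode(s, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(s)); // true

        // Pure ASCII survives the mismatch, which is why the bug often goes unnoticed.
        System.out.println(
            encodeThenDecode("Hello!", GBK, StandardCharsets.UTF_8).equals("Hello!")); // true
    }
}
```

Once malformed bytes have been replaced with U+FFFD, the original bytes cannot be recovered from the resulting string.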

So, how should getBytes and new String() be used for transcoding? A wrong method circulates on the Internet for GBK -> UTF-8: new String(s.getBytes("GBK"), "UTF-8"). This is completely wrong: the bytes are produced as GBK but decoded as UTF-8, so any non-ASCII text comes out garbled. But then why does new String(s.getBytes("iso-8859-1"), "GBK") work under Tomcat?

The answer is that Tomcat (in older versions and default configurations) decodes request data as ISO-8859-1. If the client sent GBK bytes, Tomcat turns them into a String by reading them as ISO-8859-1, which garbles any Chinese text, so that step has to be undone. ISO-8859-1 is a single-byte encoding in which every byte value maps to exactly one character, so decoding with it loses no information: each original byte simply becomes one character. Calling s.getBytes("iso-8859-1") therefore returns exactly the original GBK bytes, and new String(s.getBytes("iso-8859-1"), "GBK") decodes them correctly. It works only because of this property of ISO-8859-1; it is a repair for one specific mis-decoding, not a general transcoding method.
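The ISO-8859-1 round trip can be simulated without a servlet container. The sketch below is an assumption-laden illustration: Latin1RepairDemo, misdecodeAsLatin1 and repair are hypothetical names, and the "container" is reduced to a single new String call that decodes GBK bytes with the wrong charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RepairDemo {
    static final Charset GBK = Charset.forName("GBK"); // assumption: GBK is available in this JDK

    // Simulates a container that reads GBK request bytes as ISO-8859-1.
    static String misdecodeAsLatin1(byte[] gbkBytes) {
        return new String(gbkBytes, StandardCharsets.ISO_8859_1);
    }

    // Undoes the wrong decode: recover the raw bytes, then decode them as GBK.
    static String repair(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    public static void main(String[] args) {
        String original = "\u4F60\u597D";           // what the client meant to send
        byte[] wire = original.getBytes(GBK);        // bytes on the wire: C4 E3 BA C3

        String misdecoded = misdecodeAsLatin1(wire); // one char per byte, garbled but lossless
        System.out.println(misdecoded.length());     // 4
        System.out.println(repair(misdecoded).equals(original)); // true
    }
}
```

Each of the four wire bytes becomes one char in the range U+0000 to U+00FF, which is why getBytes with ISO-8859-1 can hand the bytes back unchanged.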

How to convert GBK to UTF-8 correctly (in fact, Unicode to UTF-8)

String gbkStr = "Hello!"; // the source file is GBK-encoded, or the text was read from a GBK file; either way it is already Unicode once it is a String

// use getBytes to encode the Unicode string into a UTF-8 byte array

byte[] utf8Bytes = gbkStr.getBytes("UTF-8");

// then decode the byte array with UTF-8 into a new string

String utf8Str = new String(utf8Bytes, "UTF-8");

The simplified version is as follows:

public String unicodeToUtf8(String s) throws UnsupportedEncodingException {

return new String(s.getBytes("utf-8"), "utf-8");

}

The principle for UTF-8 to GBK is the same:

return new String(s.getBytes("GBK"), "GBK");

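In fact, the core work is done by getBytes(charset): a String is always Unicode, so the GBK or UTF-8 form exists only in the byte array that getBytes returns. Decoding those bytes again with the same charset just gives back the original string, provided the text is representable in that charset.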

JDK description of getBytes: Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.
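Since the encoded form lives in the byte array, converting GBK data to UTF-8 data comes down to decoding with the source charset and calling getBytes with the target charset. A minimal sketch, assuming the input really is valid in the source charset (TranscodeDemo and transcode are hypothetical names, and GBK is assumed to be available in the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodeDemo {
    // Decodes the input bytes with the source charset, then re-encodes them with the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3}; // GBK bytes of U+4F60 U+597D

        byte[] utf8Bytes = transcode(gbkBytes, gbk, StandardCharsets.UTF_8);
        System.out.println(utf8Bytes.length); // 6

        // Converting back gives the original GBK bytes again.
        byte[] back = transcode(utf8Bytes, StandardCharsets.UTF_8, gbk);
        System.out.println(Arrays.equals(back, gbkBytes)); // true
    }
}
```

Characters that the target charset cannot represent are replaced by getBytes with that charset's default replacement, so a conversion toward a smaller charset can lose information.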

In addition, for reading and writing files:

OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream("D:\\file.txt"), "UTF-8");

new InputStreamReader(stream, charset)

These classes make it easy to read and write files in a specified encoding.
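As a rough sketch of how these two classes fit together, the example below writes text to a temporary file in one charset, reads it back, and saves it again in another. All names here (FileTranscodeDemo, writeText, readText, convertViaTempFiles) are hypothetical, temporary files stand in for real paths, and GBK is assumed to be available in the JDK.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileTranscodeDemo {
    // Writes text to a file, encoding the characters with the given charset.
    static void writeText(Path file, String text, Charset charset) {
        try (Writer w = new OutputStreamWriter(new FileOutputStream(file.toFile()), charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a whole file, decoding its bytes with the given charset.
    static String readText(Path file, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new FileInputStream(file.toFile()), charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    // Saves text in one charset, re-saves it in another, and returns the bytes of the second file.
    static byte[] convertViaTempFiles(String text, Charset from, Charset to) {
        try {
            Path source = Files.createTempFile("demo-source", ".txt");
            Path target = Files.createTempFile("demo-target", ".txt");
            try {
                writeText(source, text, from);
                writeText(target, readText(source, from), to);
                return Files.readAllBytes(target);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = convertViaTempFiles(
            "\u4F60\u597D", Charset.forName("GBK"), StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 6
    }
}
```

The reader must be given the charset the file was actually written in; the writer is free to use any charset that can represent the text.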
