What are the knowledge points of Java string encoding

2025-01-29 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article introduces the main knowledge points of Java string encoding, walked through with practical examples. The operations shown are simple, fast, and practical; I hope the article helps you solve real problems.

1. Why encode at all

Have you ever wondered why we need to encode at all? Can't we do without it? To answer that, we have to go back to how a computer represents the symbols we humans understand, that is, the languages we use. Humanity has so many languages, and so many symbols representing them, that they cannot all fit in the computer's basic storage unit, the byte, so they must be split up or translated in some way before the computer can work with them. You can think of the only language the computer "understands" as English: to use any other language on a computer, it must first be translated into English, and that translation process is encoding. So it is easy to see that as long as non-English-speaking countries use computers, they have to encode. That may seem overbearing, but it is the status quo. By analogy, if one day the smallest unit of information stored in a computer were the Chinese character, Chinese would no longer have an encoding problem.

So in general, the reasons for coding can be summarized as follows:

The smallest unit of information stored in a computer is one byte, that is, 8 bits, so the range of values it can represent is 0 to 255.

Human languages have far too many symbols to represent in one byte. To resolve this mismatch, a larger data type, char, is needed, and encoding translates from char to byte.
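A minimal standalone sketch (the class name is mine, not from the article) makes the size mismatch concrete: the code point of a Chinese character is far above 255, so it cannot fit in one byte, but once an encoding such as UTF-8 is agreed upon, it becomes a well-defined sequence of bytes.

```java
import java.nio.charset.StandardCharsets;

public class WhyEncode {
    // Code point of the first character: for a Chinese character this
    // is far beyond the 0-255 range that a single byte can hold.
    static int codePointOf(String s) {
        return s.codePointAt(0);
    }

    // Number of bytes needed once an encoding (here UTF-8) is agreed upon.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println("code point of 中: " + codePointOf("中")); // 20013, > 255
        System.out.println("UTF-8 bytes: " + utf8Length("中"));       // 3
    }
}
```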

2. How to translate

Given that different languages need to communicate and translation is necessary, how do we translate? Computing offers many translation schemes, such as ASCII, ISO-8859-1, GB2312, GBK, UTF-8, UTF-16, and so on. They can all be regarded as dictionaries: they define the conversion rules, and by following those rules the computer can represent our characters correctly. Several of these formats can represent a Chinese character, such as GB2312, GBK, UTF-8, and UTF-16, so which should we choose for storing Chinese text? That depends on other factors: is storage space more important, or encoding efficiency? Choose the format accordingly. Below is a brief introduction to each of these encoding formats.

ASCII code

Anyone who has studied computing knows ASCII: 128 characters in total (0 to 127), represented by the lower 7 bits of one byte. Characters 0 to 31 are control characters such as line feed, carriage return, and delete; 32 to 126 are printable characters that can be typed on a keyboard and displayed.

Among them, 48 to 57 are the ten Arabic numerals 0 to 9;

65 to 90 are the 26 uppercase letters;

97 to 122 are the 26 lowercase letters.
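The ranges above can be checked directly in Java, since a char promotes to its numeric code value (a small standalone sketch; the helper names are mine):

```java
public class AsciiRanges {
    static boolean isDigit(char c) { return c >= 48 && c <= 57;  } // '0'..'9'
    static boolean isUpper(char c) { return c >= 65 && c <= 90;  } // 'A'..'Z'
    static boolean isLower(char c) { return c >= 97 && c <= 122; } // 'a'..'z'

    public static void main(String[] args) {
        System.out.println((int) '0'); // 48
        System.out.println((int) 'A'); // 65
        System.out.println((int) 'a'); // 97
    }
}
```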

ISO-8859-1

Obviously 128 characters are not enough, so on the basis of ASCII, ISO developed a series of standards to extend it: ISO-8859-1 through ISO-8859-15. Of these, ISO-8859-1 covers most Western European language characters and is the most widely used. ISO-8859-1 is still a single-byte encoding, representing a total of 256 characters.

GB2312

Its full name is "Basic Set of the Chinese Character Coded Character Set for Information Interchange". It is a double-byte encoding whose overall range is A1 to F7: A1 to A9 is the symbol area, containing a total of 682 symbols, and B0 to F7 is the Chinese character area, containing 6,763 Chinese characters.

GBK

The full name is "Chinese Internal Code Extension Specification", a newer Chinese character internal-code specification drawn up by the State Bureau of Technical Supervision for Windows 95. It appeared in order to extend GB2312 and add more Chinese characters. Its code range is 8140 to FEFE (excluding xx7F), a total of 23,940 code points, which can represent 21,003 Chinese characters. Its encoding is compatible with GB2312: Chinese characters encoded with GB2312 can be decoded with GBK without producing garbled text.
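The compatibility claim can be verified with a short standalone sketch (the class name is mine; it assumes the JDK's extended charsets include GB2312 and GBK, as standard OpenJDK/Oracle builds do): encode with GB2312, decode with GBK, and the text survives intact.

```java
import java.nio.charset.Charset;

public class GbkCompat {
    // Encode with GB2312, decode with GBK: since GBK is a superset of
    // GB2312, the round trip loses nothing.
    static String gb2312ToGbk(String s) {
        byte[] bytes = s.getBytes(Charset.forName("GB2312"));
        return new String(bytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        System.out.println(gb2312ToGbk("中文")); // 中文 — no mojibake
    }
}
```

The reverse direction does not hold in general, because GBK contains characters that GB2312 cannot encode.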

GB18030

The full name is "Chinese Character Coded Character Set for Information Interchange", a mandatory standard in China. It may use single-byte, double-byte, or four-byte codes, and its encoding is compatible with GB2312. Although it is a national standard, it is not widely used in practical application systems.

UTF-16

Speaking of UTF, we have to mention Unicode (Universal Code). ISO set out to create a new super-language dictionary through which all the world's languages could be translated into one another. You can imagine how complex such a dictionary is; the detailed Unicode specification can be found in the corresponding documents. Unicode is the foundation of Java and XML. Below we describe in detail how Unicode is stored in a computer.

UTF-16 specifically defines how Unicode characters are accessed in a computer. UTF-16 uses two bytes to represent the Unicode transformation format. Within the Basic Multilingual Plane this is a fixed-length representation: any character is represented with two bytes, and two bytes are 16 bits, hence the name UTF-16. Representing characters with UTF-16 is very convenient: every two bytes is one character, which greatly simplifies string operations. This is an important reason why Java uses UTF-16 as its in-memory character storage format.
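One caveat worth showing in a standalone sketch: the "one char per character" convenience holds only inside the BMP. A supplementary character (here U+1F600, an emoji, written as an explicit surrogate pair) occupies two Java chars even though it is one code point.

```java
public class Utf16InMemory {
    public static void main(String[] args) {
        String bmp = "君";              // BMP character: one char
        String supp = "\uD83D\uDE00";  // U+1F600, outside the BMP: surrogate pair

        System.out.println(bmp.length());                           // 1
        System.out.println(supp.length());                          // 2 chars...
        System.out.println(supp.codePointCount(0, supp.length()));  // ...1 code point
    }
}
```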

UTF-8

UTF-16 uniformly uses two bytes per character, which is simple and convenient in representation, but it also has its downside: a large number of characters that could be represented with one byte now take twice the storage. At a time when network bandwidth is still very limited, this inflates network traffic unnecessarily. UTF-8 instead uses a variable-length technique in which each code region has a different code length: different types of characters are made up of 1 to 6 bytes (in modern practice, at most 4).

UTF-8 has the following coding rules:

1. If the highest bit (bit 8) of a byte is 0, it is an ASCII character (00-7F). It follows that all ASCII text is already valid UTF-8.

2. If a byte begins with 11, the number of consecutive leading 1s indicates how many bytes the character occupies. For example, 110xxxxx is the first byte of a two-byte UTF-8 character.

3. If a byte starts with 10, it is not a first byte; scan backwards until you reach the first byte of the current character.
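The three rules above can be sketched as a byte classifier (a standalone illustration; the class and method names are mine). Encoding "a君" in UTF-8 yields one ASCII byte followed by a three-byte sequence, and each byte's high bits identify its role:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Classify a UTF-8 byte by its high bits, per the rules above.
    static String kind(byte b) {
        int v = b & 0xff;
        if (v < 0x80)           return "ascii";         // 0xxxxxxx
        if ((v & 0xc0) == 0x80) return "continuation";  // 10xxxxxx
        if ((v & 0xe0) == 0xc0) return "lead-2";        // 110xxxxx
        if ((v & 0xf0) == 0xe0) return "lead-3";        // 1110xxxx
        return "lead-4";                                // 11110xxx
    }

    public static void main(String[] args) {
        byte[] b = "a君".getBytes(StandardCharsets.UTF_8); // 1 + 3 bytes
        for (byte x : b) {
            System.out.println(kind(x));
        }
        // ascii, lead-3, continuation, continuation
    }
}
```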

3. Scenarios in Java that require encoding

Several common encoding formats were described above; next we look at how Java supports encoding and where encoding is needed.

3.1 Encoding in I/O operations

We know that encoding usually arises in character-to-byte or byte-to-character conversion, and the scenarios that need such conversion mainly involve I/O, which includes disk I/O and network I/O. Network I/O will be introduced later, mainly using Web applications as the example.

In Java I/O, the Reader class is the parent class for reading characters, while InputStream is the parent class for reading bytes; InputStreamReader is the bridge that converts bytes to characters. It is responsible for byte-to-character conversion during I/O, and the actual decoding from bytes to characters is implemented by StreamDecoder. During decoding, StreamDecoder must be given the Charset encoding format by the user. Notably, if you do not specify a Charset, the default character set of the local environment is used; in a Chinese environment, for example, GBK encoding will be used.

Writing is similar: the parent class for characters is Writer and the parent class for bytes is OutputStream, and OutputStreamWriter converts characters to bytes.

Similarly, the StreamEncoder class is responsible for encoding characters into bytes, and the encoding format and default encoding rules are consistent with decoding.

For example, the following code implements the read and write function of the file:

String file = "c:/stream.txt";
String charset = "UTF-8";
// write characters out as a byte stream
FileOutputStream outputStream = new FileOutputStream(file);
OutputStreamWriter writer = new OutputStreamWriter(outputStream, charset);
try {
    writer.write("this is the Chinese text to be saved");
} finally {
    writer.close();
}
// read bytes back in as characters
FileInputStream inputStream = new FileInputStream(file);
InputStreamReader reader = new InputStreamReader(inputStream, charset);
StringBuffer buffer = new StringBuffer();
char[] buf = new char[64];
int count = 0;
try {
    while ((count = reader.read(buf)) != -1) {
        buffer.append(buf, 0, count);
    }
} finally {
    reader.close();
}

In our applications, as long as we specify one unified Charset for both encoding and decoding, there is generally no garbling problem. If an application neglects to specify the character encoding, the operating system's default encoding is used in a Chinese environment; when both encoding and decoding happen in that environment it usually works, but relying on the OS default is still strongly discouraged, because it ties your application's encoding format to the runtime environment, and garbled text is likely to appear across environments.
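As a standalone illustration of that risk (the class name is mine), encoding with one charset and decoding with another produces mojibake: the text does not survive the round trip.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Encode with UTF-8 but decode with ISO-8859-1: each UTF-8 byte
    // becomes one Latin-1 character, and the Chinese text is destroyed.
    static String badRoundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1); // wrong decoder
    }

    public static void main(String[] args) {
        System.out.println(badRoundTrip("君山")); // six Latin-1 characters of garbage
    }
}
```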

3.2 Encoding in memory operations

In Java development, besides the encoding involved in I/O, the most common conversion is between characters and bytes in memory. Java uses String to represent strings, and the String class provides methods to convert a string to bytes as well as constructors that build a string from bytes. For example:

String s = "this is a Chinese string";
byte[] b = s.getBytes("UTF-8");
String n = new String(b, "UTF-8");

Another option is the obsolete ByteToCharConverter and CharToByteConverter classes, which provide convertAll methods to convert byte[] and char[] respectively, as the following code shows:

ByteToCharConverter charConverter = ByteToCharConverter.getConverter("UTF-8");
char[] c = charConverter.convertAll(byteArray);
CharToByteConverter byteConverter = CharToByteConverter.getConverter("UTF-8");
byte[] b = byteConverter.convertAll(c);

These two classes have been replaced by the Charset class. Charset provides encode, for encoding char[] to byte[], and decode, for decoding byte[] to char[], as the following code shows:

Charset charset = Charset.forName("UTF-8");
ByteBuffer byteBuffer = charset.encode(string);
CharBuffer charBuffer = charset.decode(byteBuffer);

Encoding and decoding are done in one class, and the character set is set once via forName, which makes it easier to keep the encoding format unified; this is more convenient than the ByteToCharConverter and CharToByteConverter classes.
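A standalone round-trip sketch (class and method names are mine) shows the benefit of the single-Charset style: because the same object handles both directions, the formats cannot drift apart.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetRoundTrip {
    // One Charset object handles both encode and decode, so the
    // encoding format is unified by construction.
    static String roundTrip(String s, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer bytes = cs.encode(s);
        CharBuffer chars = cs.decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("I am 君山", "UTF-8")); // I am 君山
    }
}
```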

Java also has the ByteBuffer class, which provides a "soft" conversion between char and byte. This conversion involves no encoding or decoding: the 16-bit char is simply split into two 8-bit byte representations. The actual values are not modified, only the data type is converted, as in the following code:

ByteBuffer heapByteBuffer = ByteBuffer.allocate(1024);
ByteBuffer byteBuffer = heapByteBuffer.putChar(c);

The above covers conversion between characters and bytes; as long as we set one unified format for encoding and decoding, there will be no problem.
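The "soft" split can be observed directly in a standalone sketch (the class name is mine): putChar writes the 16-bit char as two bytes, high byte first, since ByteBuffer defaults to big-endian order.

```java
import java.nio.ByteBuffer;

public class CharSplit {
    // putChar writes the 16-bit char value as two bytes (big-endian
    // by default): no encoding table is consulted.
    static byte[] split(char c) {
        ByteBuffer buf = ByteBuffer.allocate(2);
        buf.putChar(c);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] b = split('君'); // '君' is U+541B
        System.out.printf("%02x %02x%n", b[0], b[1]); // 54 1b
    }
}
```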

4. How to encode and decode in Java

We have introduced several common encoding formats; here we use practical examples to show how to encode and decode in Java. Taking the string "I am 君山" as the example, we encode it with the ISO-8859-1, GB2312, GBK, UTF-16, and UTF-8 formats.

public static void encode() {
    String name = "I am 君山";
    toHex(name.toCharArray());
    try {
        byte[] iso8859 = name.getBytes("ISO-8859-1");
        toHex(iso8859);
        byte[] gb2312 = name.getBytes("GB2312");
        toHex(gb2312);
        byte[] gbk = name.getBytes("GBK");
        toHex(gbk);
        byte[] utf16 = name.getBytes("UTF-16");
        toHex(utf16);
        byte[] utf8 = name.getBytes("UTF-8");
        toHex(utf8);
    } catch (UnsupportedEncodingException e) {
        e.printStackTrace();
    }
}

We encode the name string into a byte array in each of the formats above and then print it in hexadecimal. Let's look at how Java performs the encoding.

First, Charset.forName(charsetName) obtains the Charset object for the given charset name; a CharsetEncoder is then created from the Charset, and CharsetEncoder.encode is called to encode the string. Each encoding type corresponds to its own class, and the actual encoding work is done in those classes.

For example, the char array of the string "I am 君山" is 49 20 61 6d 20 541b 5c71, which will be converted into the corresponding bytes by each encoding format.
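Those char values can be verified with a standalone sketch (the class name is mine): printing each char of the string in hexadecimal reproduces the sequence above.

```java
public class CharHex {
    public static void main(String[] args) {
        char[] chars = "I am 君山".toCharArray();
        for (char c : chars) {
            System.out.printf("%x ", (int) c);
        }
        // prints: 49 20 61 6d 20 541b 5c71
    }
}
```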

4.1 Encoding according to ISO-8859-1

Encoding the string "I am 君山" with ISO-8859-1 converts the 7 chars into 7 bytes. ISO-8859-1 is a single-byte encoding, and each of the Chinese characters "君山" is converted into the byte value 3f, which is the character "?". This is why Chinese text so often turns into "?": it is usually the result of mistakenly using ISO-8859-1. Chinese characters lose information after ISO-8859-1 encoding; this is commonly called a "black hole" that swallows unknown characters. Since the default character set of many basic Java frameworks and systems is ISO-8859-1, garbled text appears easily. We will analyze later how the different forms of garbling arise.
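The "?" replacement can be checked directly in a standalone sketch (the class name is mine): String.getBytes substitutes 0x3f for each character that ISO-8859-1 cannot represent, so the loss is irreversible.

```java
import java.nio.charset.StandardCharsets;

public class Latin1BlackHole {
    // Characters outside ISO-8859-1 collapse irreversibly to '?' (0x3f).
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] b = encode("I am 君山");
        System.out.println(b.length);          // 7: one byte per char
        System.out.println(b[5] + " " + b[6]); // 63 63, i.e. 0x3f '?'
        System.out.println(new String(b, StandardCharsets.ISO_8859_1)); // I am ??
    }
}
```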

4.2 Encoding according to GB2312

The string "I am 君山" is encoded with GB2312 as follows.
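A small standalone check (the class name is mine; it assumes the JDK ships the GB2312 charset, as standard builds do) confirms the single-byte/double-byte split through byte lengths:

```java
import java.nio.charset.Charset;

public class Gb2312Width {
    // Byte length of a string under GB2312: ASCII characters take one
    // byte, Chinese characters take two.
    static int byteLen(String s) {
        return s.getBytes(Charset.forName("GB2312")).length;
    }

    public static void main(String[] args) {
        System.out.println(byteLen("a"));         // 1: single-byte
        System.out.println(byteLen("君"));        // 2: double-byte
        System.out.println(byteLen("I am 君山")); // 9: 5 ASCII + 2 x 2
    }
}
```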

The Charset corresponding to GB2312 is sun.nio.cs.ext.EUC_CN, and the corresponding encoder class is sun.nio.cs.ext.DoubleByte.

The GB2312 character set has a char-to-byte code table. Encoding a character means looking it up in this table to find its corresponding bytes, which are then assembled into a byte array. The lookup rule is as follows:

c2b[c2bIndex[char >> 8] + (char & 0xff)]

If the value found is greater than 0xff, the character is double-byte: the high 8 bits become the first byte and the low 8 bits the second byte. Otherwise it is single-byte, as shown in the following code:

if (bb > 0xff) {    // DoubleByte
    if (dl - dp < 2)
        return CoderResult.OVERFLOW;
    da[dp++] = (byte) (bb >> 8);
    da[dp++] = (byte) bb;
} else {            // SingleByte
    if (dl - dp < 1)
        return CoderResult.OVERFLOW;
    da[dp++] = (byte) bb;
}

The first 5 characters are still 5 bytes after encoding, while the Chinese characters are encoded as double bytes. As introduced earlier, GB2312 supports only 6,763 Chinese characters, so not every Chinese character can be encoded with GB2312.

4.3 Encoding according to GBK

The string "I am 君山" is encoded with GBK.

You may have noticed that the result is identical to the GB2312 result. Indeed, GBK encodes this string exactly as GB2312 does, from which we can conclude that GBK is compatible with GB2312, and their encoding algorithms are the same. What differs is the length of their code tables: GBK's table contains more Chinese characters. So any text encoded with GB2312 can be decoded with GBK, but not the other way around.

4.4 Encoding according to UTF-16

The string "I am 君山" is encoded with UTF-16.

UTF-16 doubles the size of the char array: characters in the single-byte range get a high byte of 0 and become two bytes, and Chinese characters also become two bytes. From the UTF-16 rules, encoding merely splits each character into its high and low bytes. Its strengths are very high encoding efficiency and simple rules. But different processors handle 2-byte values differently, big-endian (high byte first) or little-endian (low byte first), so when encoding a string you must state which order is used, which is why two leading bytes hold the BYTE_ORDER_MARK value. UTF-16 represents UCS-2, the Unicode transformation format, with a fixed 16 bits (2 bytes), and reaches characters outside the BMP through surrogate pairs.

4.5 Encoding according to UTF-8

The string "I am 君山" is encoded with UTF-8; the result follows.

Although UTF-16 encodes efficiently, it also doubles the characters in the single-byte range, which silently wastes storage. Moreover, UTF-16 encodes positionally and cannot validate the code value of an individual character: if one character's code value is corrupted, all subsequent values are affected. UTF-8 has none of these problems: characters in the single-byte range are still represented with one byte, and a Chinese character takes three bytes. Its encoding rules are as follows:

private CoderResult encodeArrayLoop(CharBuffer src, ByteBuffer dst) {
    char[] sa = src.array();
    int sp = src.arrayOffset() + src.position();
    int sl = src.arrayOffset() + src.limit();
    byte[] da = dst.array();
    int dp = dst.arrayOffset() + dst.position();
    int dl = dst.arrayOffset() + dst.limit();
    int dlASCII = dp + Math.min(sl - sp, dl - dp);
    // ASCII only loop
    while (dp < dlASCII && sa[sp] < '\u0080')
        da[dp++] = (byte) sa[sp++];
    while (sp < sl) {
        char c = sa[sp];
        if (c < 0x80) {
            // Have at most seven bits
            if (dp >= dl)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) c;
        } else if (c < 0x800) {
            // 2 bytes, 11 bits
            if (dl - dp < 2)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xc0 | (c >> 6));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        } else if (Character.isSurrogate(c)) {
            // Have a surrogate pair
            if (sgp == null)
                sgp = new Surrogate.Parser();
            int uc = sgp.parse(c, sa, sp, sl);
            if (uc < 0) {
                updatePositions(src, sp, dst, dp);
                return sgp.error();
            }
            if (dl - dp < 4)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xf0 | (uc >> 18));
            da[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
            da[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
            da[dp++] = (byte) (0x80 | (uc & 0x3f));
            sp++;  // 2 chars
        } else {
            // 3 bytes, 16 bits
            if (dl - dp < 3)
                return overflow(src, sp, dst, dp);
            da[dp++] = (byte) (0xe0 | (c >> 12));
            da[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
            da[dp++] = (byte) (0x80 | (c & 0x3f));
        }
        sp++;
    }
    updatePositions(src, sp, dst, dp);
    return CoderResult.UNDERFLOW;
}

UTF-8 differs from GBK and GB2312 in that it needs no code-table lookup, so its encoding is more efficient, which makes UTF-8 an attractive choice for storing Chinese characters.

5. Comparison of several coding formats

All four of the last encoding formats can handle Chinese characters. GB2312 and GBK have similar rules, but GBK covers a wider range and can handle all Chinese characters, so between GB2312 and GBK, choose GBK. UTF-16 and UTF-8 both encode Unicode, but their rules differ considerably. Relatively speaking, UTF-16 is the most efficient to encode: characters convert to bytes easily, and it is better for string manipulation, so it is well suited to use between local disk and memory, where characters and bytes must be switched quickly; Java's in-memory encoding, for example, is UTF-16. But it is not suitable for network transmission, where the byte stream is easily damaged, and once damaged it is hard to recover. By comparison, UTF-8 is better for network transmission: ASCII characters are stored in a single byte, damage to one character does not affect the others, and its encoding efficiency sits between GBK and UTF-16. UTF-8 thus strikes a balance between encoding efficiency and encoding safety, and it is an ideal Chinese encoding.
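The space trade-off above can be measured in a standalone sketch (the class name is mine): the same string occupies different numbers of bytes under each format, with Java's "UTF-16" output including a 2-byte byte-order mark.

```java
import java.nio.charset.Charset;

public class EncodingSizes {
    static int size(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String s = "I am 君山";
        System.out.println(size(s, "GBK"));    // 9: 5 ASCII + 2 x 2
        System.out.println(size(s, "UTF-8"));  // 11: 5 ASCII + 2 x 3
        System.out.println(size(s, "UTF-16")); // 16: 2-byte BOM + 7 x 2
    }
}
```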

This is the end of the introduction to the knowledge points of Java string encoding. Thank you for reading; if you want to learn more, you can follow the industry information channel.
