An Example-Based Analysis of Encoding Concepts such as ANSI, Unicode, BMP, and UTF in Java


This article uses examples to explain encoding concepts in Java such as ANSI, Unicode, BMP, and UTF. Interested readers should find it a useful reference, and I hope you gain a lot from it.

I. Preface

Ever since I started writing Java code, I have run into countless garbled-text and transcoding problems: garbled text when reading a file into a String, garbled HTTP request parameters in a Servlet, garbled data returned by JDBC queries, and so on. These problems are so common that a quick search usually solves them, so I never developed a deep understanding of what was actually going on.

That changed two days ago, when a classmate asked me about the encoding of a Java source file (the problem analyzed in the last example below). That question pulled a whole series of others along with it, and we discussed them while digging through references until late at night, when we finally found the key clue in a blog post that resolved all our doubts and explained everything we had not understood before. So I decided to use this post to record my understanding of these encoding problems and the results of my experiments.

II. Summary of concepts

In the early days, the Internet was not yet widespread and computers were used only to process local data, so many countries and regions designed encoding schemes for their own languages. These are collectively referred to as ANSI encodings (because they are all extensions of ANSI ASCII). However, they were designed independently, with no prior agreement on how to remain compatible with one another, which planted the seeds of encoding conflicts. For example, the GB2312 encoding used in mainland China conflicts with the Big5 encoding used in Taiwan: the same two bytes represent different characters in the two schemes. With the rise of the Internet, a document often contains several languages, and a computer cannot display it correctly because it has no way of knowing which encoding a given pair of bytes belongs to.

Such problems were common around the world, so the demand grew to define a universal character set and give every character in the world a unique number.

Thus Unicode was born. It assigns a unique number to every character in the world, and because a character can be identified uniquely, fonts only need to be designed against Unicode. However, the Unicode standard defines a character set but does not mandate a storage scheme: it only defines abstract numbers (code points) and the characters they correspond to, without specifying how a sequence of Unicode numbers is stored. The schemes that actually specify storage are UTF-8, UTF-16, UTF-32, and so on; encodings whose names start with UTF can be converted to and from Unicode values (code points) directly by calculation. As the names imply, UTF-8 uses 8-bit units as its basic encoding unit and is a variable-length encoding that uses 1 to 6 bytes per character (because of the limits of the Unicode range, the actual maximum is 4 bytes). UTF-16 uses 16-bit units and is also variable length: either 2 or 4 bytes per character. UTF-32 is fixed length, always using 4 bytes to store one Unicode number.
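To get a quick feel for the size differences, here is a minimal sketch of mine (the class name UtfLengthDemo is only for illustration, and it assumes the JDK ships the optional UTF-32 charset, as mainstream JDKs do):

import java.io.UnsupportedEncodingException;

public class UtfLengthDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "好"; // U+597D, a BMP character
        System.out.println(s.getBytes("UTF-8").length);    // 3 bytes (variable length)
        System.out.println(s.getBytes("UTF-16BE").length); // 2 bytes (one 16-bit code unit)
        System.out.println(s.getBytes("UTF-32BE").length); // 4 bytes (always fixed)
    }
}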

I actually used to have a misconception about Unicode: I thought the largest Unicode value was 0xFFFF, that is, that it could represent at most 2^16 characters. After a careful read of Wikipedia, I learned that this was only true of the early UCS-2 scheme. UCS-2 always uses two bytes to encode a character, so it can only encode the BMP (Basic Multilingual Plane, the range 0x0000-0xFFFF, which contains the most commonly used characters in the world). To encode characters whose Unicode value is greater than 0xFFFF, UCS-2 was extended into UTF-16, which is variable length: within the BMP, UTF-16 is identical to UCS-2, while outside the BMP it uses 4 bytes per character.

To simplify the discussion below, let's first define the concept of a code unit (CodeUnit): the basic building block of an encoding is called its code unit. For example, the code unit of UTF-8 is 1 byte, and the code unit of UTF-16 is 2 bytes. It sounds awkward to define, but it is easy to understand.

For compatibility across languages and platforms, a Java String stores the Unicode values of its characters (encoded in UTF-16). Java originally used the UCS-2 scheme, but it later turned out that the BMP does not hold enough characters; for reasons of memory consumption and compatibility, Java did not move up to UCS-4 (that is, UTF-32, a fixed 4-byte encoding) but adopted the UTF-16 described above, with the char type serving as its code unit. This causes some trouble. If all characters are within the BMP there is no problem, but if there are characters outside the BMP, one code unit no longer corresponds to one character: the length method returns the number of code units, not the number of characters, and charAt naturally returns a code unit rather than a character, so traversal becomes awkward. Java does provide newer methods such as codePointAt for this, but they are still inconvenient and do not allow random access.
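A small sketch makes the code-unit problem concrete, using the non-BMP character U+1D11E (the class name CodeUnitDemo is only for illustration):

public class CodeUnitDemo {
    public static void main(String[] args) {
        // U+1D11E lies outside the BMP, so UTF-16 stores it as a surrogate pair of two char code units
        String s = new String(Character.toChars(0x1D11E));
        System.out.println(s.length());                      // 2 code units
        System.out.println(s.codePointCount(0, s.length())); // 1 actual character (code point)
        System.out.printf("%x%n", (int) s.charAt(0));        // d834, half of a surrogate pair
        System.out.printf("%x%n", s.codePointAt(0));         // 1d11e, the real code point
    }
}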

In addition, I found that the Java compiler does not accept \u literals for Unicode values greater than 0xFFFF, so if you cannot type a non-BMP character but you know its Unicode value, you have to use a clumsy workaround to get it into a String: manually compute the character's UTF-16 encoding (four bytes) and write the first two bytes and the last two bytes as separate \u escapes, then assign that to the String. Sample code is shown below.

public static void main(String[] args) {
    // String str = "𝄞"; // we want to assign this character, but suppose the input method cannot type it
    // we do know its Unicode value is 0x1D11E
    // String str = "\u1D11E"; // this is not recognized by the compiler
    // so we compute its UTF-16 encoding by hand: D834 DD1E
    String str = "\uD834\uDD1E";
    System.out.println(str); // the character is output successfully
}
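As an aside, the runtime can also build the surrogate pair for you, so the manual calculation can be skipped; a couple of equivalent one-liners (a sketch, using standard library methods):

String str = new String(Character.toChars(0x1D11E));                    // builds the surrogate pair for us
String str2 = new StringBuilder().appendCodePoint(0x1D11E).toString();  // same result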

The Notepad that comes with Windows can save a file as "Unicode", which actually means UTF-16. As mentioned above, the characters people mainly use are within the BMP, and within the BMP the UTF-16 encoding of each character equals its Unicode value, which is probably why Microsoft calls it Unicode. For example, I typed the two characters "好a" in Notepad, saved them with the "Unicode big endian" (high byte first) encoding, and opened the file with WinHex. The first two bytes of the file are the Byte Order Mark (BOM): FE FF indicates big-endian byte order. Then 597D is the Unicode value of "好" and 0061 is the Unicode value of "a".
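A rough way to reproduce this byte layout in Java (the class name BomDemo is just for illustration; it relies on the common JDK behavior that the "UTF-16" charset encodes big-endian with a leading BOM):

import java.io.UnsupportedEncodingException;

public class BomDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] withBom = "好a".getBytes("UTF-16"); // big-endian output preceded by a BOM
        for (byte b : withBom) System.out.printf("%02X ", b);
        System.out.println(); // FE FF 59 7D 00 61
    }
}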

Unicode alone does not immediately solve the problem. First, there is a huge amount of data in the world in non-Unicode encodings, and we cannot simply discard it. Second, Unicode encodings often take more space than ANSI encodings, so from the standpoint of saving resources, ANSI encodings are still necessary. A conversion mechanism is therefore needed so that ANSI-encoded data can be converted to Unicode for uniform processing, and Unicode can be converted back to an ANSI encoding when a platform requires it.

The conversion itself is straightforward. Compatible encodings such as the UTF family or ISO-8859-1 can be converted directly by calculation from the Unicode value (in practice a lookup table may still be used), while the legacy ANSI encodings can only be converted by table lookup. Microsoft calls such a mapping table a CodePage (code page) and classifies and numbers them by encoding: for example, the familiar cp936 is the code page for GBK, and cp65001 is the code page for UTF-8. Microsoft's website publishes the GBK-to-Unicode mapping table, and there is of course a reverse Unicode-to-GBK table as well.
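In Java these tables are exposed as Charset objects; a quick sketch for looking them up (the class name CharsetNamesDemo is mine, and the exact alias sets, for example whether CP936 appears, vary between JDK builds):

import java.nio.charset.Charset;

public class CharsetNamesDemo {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        System.out.println(gbk.name() + " aliases: " + gbk.aliases());
        System.out.println(Charset.isSupported("Big5"));   // true on typical JDKs
        System.out.println(Charset.forName("UTF-8").name());
    }
}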

With code pages, all kinds of conversions become easy. To convert from GBK to UTF-8, for example, you first split the data into characters according to GBK's encoding rules, look up each character's bytes in the GBK code page to get its Unicode value, and then look up that Unicode value in the UTF-8 code page (or compute it directly) to get the corresponding UTF-8 bytes. The reverse direction works the same way. Note: UTF-8 is a standard encoding of Unicode and its code page covers every Unicode value, so converting from any encoding to UTF-8 and back again loses nothing. From this we can draw a conclusion: the crucial step in any transcoding job is converting correctly to Unicode, so choosing the right character set (code page) is the key.
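The lossless round trip through UTF-8 is easy to check with a sketch like this (the class name RoundTripDemo is only for illustration):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class RoundTripDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3}; // "你好" in GBK
        // GBK -> Unicode -> UTF-8
        byte[] utf8Data = new String(gbkData, "GBK").getBytes("UTF-8");
        // UTF-8 -> Unicode -> GBK: nothing is lost on the way back
        byte[] back = new String(utf8Data, "UTF-8").getBytes("GBK");
        System.out.println(Arrays.equals(gbkData, back)); // true
    }
}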

Once I understood the nature of lossy transcoding, I suddenly understood why the JSP framework decodes HTTP request parameters with ISO-8859-1, which forces us to write a statement like the following whenever we receive Chinese parameters:

String param = new String(s.getBytes("ISO-8859-1"), "UTF-8");

The JSP framework receives the parameter as a raw binary byte stream. It does not know (or does not care) what encoding the bytes are in, so it does not know which code page to use to convert them to Unicode. It therefore chooses a scheme that never loses data: it assumes the data is ISO-8859-1 and looks up the ISO-8859-1 code page to get a Unicode sequence. Because ISO-8859-1 encodes by single bytes and, unlike ASCII, assigns a character to every value from 0 to 255, any byte can be found in its code page, and converting the Unicode back to the original bytes loses nothing. This way, European and American programmers who need not consider other languages can use the String decoded by the JSP framework directly, while supporting other languages only requires converting back to the original byte stream and decoding it with the actual code page.
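The never-lossy property of ISO-8859-1 can be checked over every possible byte value with a small sketch (the class name Latin1RoundTrip is just for illustration):

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public class Latin1RoundTrip {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // every byte 0x00-0xFF maps to some code point in ISO-8859-1,
        // so decoding and re-encoding always restores the original bytes
        byte[] data = new byte[256];
        for (int i = 0; i < 256; i++) data[i] = (byte) i;
        byte[] back = new String(data, "ISO-8859-1").getBytes("ISO-8859-1");
        System.out.println(Arrays.equals(data, back)); // true
    }
}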

That concludes the explanation of Unicode and character-encoding concepts; next, let's get a feel for them through some Java examples.

III. Case analysis

1. Converting to Unicode -- the String constructor

The String constructors convert data in various encodings into a Unicode sequence (stored internally as UTF-16). The following test code demonstrates them. No non-BMP characters are involved in this example, so the codePointAt family of methods is not needed.

import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        // GBK-encoded bytes for "你好"
        byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3};
        // BIG5-encoded bytes for the same text
        byte[] big5Data = {(byte) 0xa7, (byte) 0x41, (byte) 0xa6, (byte) 0x6e};
        // construct Strings, decoding each to Unicode
        String strFromGBK = new String(gbkData, "GBK");
        String strFromBig5 = new String(big5Data, "BIG5");
        // print the Unicode sequences
        showUnicode(strFromGBK);
        showUnicode(strFromBig5);
    }

    public static void showUnicode(String str) {
        for (int i = 0; i < str.length(); i++) {
            System.out.printf("\\u%x", (int) str.charAt(i));
        }
        System.out.println();
    }
}

The running result shows that both strings decode to the same Unicode sequence: \u4f60\u597d.

As you can see, once the String holds the Unicode values, converting it to any other encoding (with getBytes) is easy!

3. Using Unicode as a bridge for encoding conversion

With the two pieces above, implementing an encoding conversion is very easy: just use them together. First, new String(...) converts the original encoded data into a Unicode sequence, then getBytes converts it to the target encoding, and you're done.

For example, a very simple GBK-to-Big5 conversion looks like this:

public static void main(String[] args) throws UnsupportedEncodingException {
    // assume this byte stream was read from a file (GBK encoded)
    byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3};
    // convert to Unicode
    String tmp = new String(gbkData, "GBK");
    // convert from Unicode to Big5
    byte[] big5Data = tmp.getBytes("Big5");
    // follow-up work ...
}

4. The encoding-loss problem

As explained above, the JSP framework decodes with the ISO-8859-1 character set. Let's first simulate the restore process with an example. The code is as follows:

import java.io.UnsupportedEncodingException;

public class Test {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // the JSP framework receives 6 bytes of data
        byte[] data = {(byte) 0xe4, (byte) 0xbd, (byte) 0xa0, (byte) 0xe5, (byte) 0xa5, (byte) 0xbd};
        // print the raw data
        showBytes(data);
        // the JSP framework assumes it is ISO-8859-1 encoded and builds a String object
        String tmp = new String(data, "ISO-8859-1");
        /* the JSP framework's work ends here */
        // when the developer prints it, it shows 6 European characters instead of the expected "你好"
        System.out.println("result of ISO decoding: " + tmp);
        // so the first step is to recover the original 6 bytes (look up the ISO-8859-1 code page)
        byte[] utfData = tmp.getBytes("ISO-8859-1");
        // print the restored data
        showBytes(utfData);
        // the developer knows the data is UTF-8 encoded, so rebuild the String with the UTF-8 code page
        String result = new String(utfData, "UTF-8");
        // print again: correct!
        System.out.println("result of UTF-8 decoding: " + result);
    }

    public static void showBytes(byte[] data) {
        for (byte b : data)
            System.out.printf("0x%x ", b);
        System.out.println();
    }
}

In the running result, the first output is wrong: the decoding rule was wrong, the wrong code page was consulted, and the wrong Unicode values were obtained. But we then find that looking those wrong Unicode values back up in the ISO-8859-1 code page restores the original data perfectly.

This brings us to the final example, the source-file encoding problem mentioned in the preface: a source file saved as UTF-8 contains a Chinese string literal, and it is compiled without telling the compiler the file's encoding. With the single character "中" the compilation fails, but that is not the point. The point is that if you replace "中" with "中国", the compilation succeeds and the program prints strange characters. More generally, the compilation fails when the number of Chinese characters is odd and succeeds when it is even. Why is that? Let's analyze it in detail.

Because a Java String internally uses Unicode, the compiler transcodes string literals at compile time, converting them from the source file's encoding to Unicode (according to Wikipedia, the class file actually stores them in a slightly modified form of UTF-8). We did not pass an encoding parameter when compiling, so the compiler decodes the source with the platform default, which here is GBK. Anyone with a little knowledge of UTF-8 and GBK knows that UTF-8 generally encodes a Chinese character in 3 bytes, while GBK needs only 2. This explains why the parity of the number of characters matters: 2 characters occupy 6 bytes in UTF-8, which GBK can decode into exactly 3 characters, but 1 character occupies 3 bytes, leaving one extra byte that cannot be mapped (it shows up as a question mark).
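The parity effect can be imitated at run time with a rough sketch (it only mimics what the compiler does; the demo file itself must of course be compiled with the correct encoding for its own literals, and the class name ParityDemo is mine):

import java.io.UnsupportedEncodingException;

public class ParityDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // 2 Chinese characters -> 6 UTF-8 bytes -> exactly 3 characters when misread as GBK
        byte[] two = "中国".getBytes("UTF-8");
        System.out.println(two.length + " -> " + new String(two, "GBK"));
        // 1 Chinese character -> 3 UTF-8 bytes -> 1 character plus an unmappable leftover byte
        byte[] one = "中".getBytes("UTF-8");
        System.out.println(one.length + " -> " + new String(one, "GBK"));
    }
}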

To be more specific, the UTF-8 encoding of "中国" in the source file is e4 b8 ad e5 9b bd. The compiler decodes these bytes as GBK, looking up cp936 for each of the three byte pairs and obtaining three Unicode values: 6d93, e15e, and 6d57, which correspond to the three strange characters in the output. After compilation these three Unicode values are stored in the .class file in the UTF-8-like encoding; at run time the JVM holds them as Unicode, but when they are finally printed they are encoded once more before being passed to the terminal, this time using the encoding of the system locale. So if you change the terminal's encoding setting, the output will still be garbled. Moreover, e15e has no character assigned to it in the Unicode standard, so it displays differently with different fonts and on different platforms.

As you can imagine, in the reverse situation, where the source file is saved in GBK but the compiler is told it is UTF-8, compilation is basically impossible no matter how many Chinese characters you type, because UTF-8 is highly structured and an arbitrary combination of bytes will rarely conform to its rules.
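A strict decoder makes this visible: if malformed input is reported instead of silently replaced, the GBK bytes of "你好" are rejected outright when treated as UTF-8. A sketch using the java.nio charset API (the class name StrictUtf8Demo is mine):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Demo {
    public static void main(String[] args) {
        byte[] gbkData = {(byte) 0xc4, (byte) 0xe3, (byte) 0xba, (byte) 0xc3}; // "你好" in GBK
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(gbkData));
            System.out.println("decoded as UTF-8 (unexpected)");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: not valid UTF-8"); // this branch is taken
        }
    }
}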

Of course, the most direct way to get the compiler to convert the literals to Unicode correctly is to tell it honestly what encoding the source file uses.
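For javac that means the -encoding option, for example:

javac -encoding UTF-8 Test.java

Build tools generally expose an equivalent setting (for instance Maven's project.build.sourceEncoding property).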

Thank you for reading this article carefully. I hope this example-based analysis of encoding concepts such as ANSI, Unicode, BMP, and UTF in Java has been helpful to you.
