
What is the mechanism of saving strings in Java String?

2025-01-16 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how Java's String stores its characters internally, and what that means for length(), toCharArray(), codePointCount() and getBytes().

Is String really immutable?

Since Java 9, a String stores its Unicode text as a byte array, encoded either as Latin1 (when every character is no greater than 0xFF) or as UTF-16:

private final byte[] value;

Generally speaking, immutable means that the bytes behind this final field are not modified after the String is initialized. String operations never modify the original array; they create a new copy instead.
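The copy-on-operation behavior is easy to observe with ordinary String methods. The following is a minimal sketch; the class name ImmutableDemo and the helper derive are made up for illustration and do not come from the article.

```java
import java.util.Locale;

public class ImmutableDemo {
    // Applies several "modifying" operations and returns their results,
    // followed by the original string, which none of them changed.
    public static String[] derive(String original) {
        return new String[] {
            original.toUpperCase(Locale.ROOT), // new String
            original.replace('a', 'X'),        // new String
            original.concat("def"),            // new String
            original                           // still the same content
        };
    }

    public static void main(String[] args) {
        for (String s : derive("abc")) {
            System.out.println(s);
        }
        // Prints: ABC, Xbc, abcdef, abc (one per line)
    }
}
```

Each call returns a new String object, and the original "abc" is printed unchanged at the end.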

However, the array elements themselves can still be modified in principle. For example, the following code uses reflection to change the string constant "abc" into "Abc":

import java.lang.reflect.Field;

public static void main(String[] args) {
    setFirstValueToA("abc");
    // The literal "abc" is interned, so this copy shares the modified array.
    String replaced = new String("abc");
    System.out.println(replaced); // Abc
}

private static void setFirstValueToA(String str) {
    Class<String> stringClass = String.class;
    try {
        Field value = stringClass.getDeclaredField("value");
        // On JDK 16 and later this call fails unless the JVM is started with
        // --add-opens java.base/java.lang=ALL-UNNAMED
        value.setAccessible(true);
        byte[] bytes = (byte[]) value.get(str);
        bytes[0] = 0x41; // 'A'
    } catch (NoSuchFieldException | IllegalAccessException e) {
        e.printStackTrace();
    }
}

How a string is saved as a byte array

Test several string arrays with the following code:

Public static void main (String [] args) {printString ("abc"); printString ("Chinese"); printString ("abc Chinese"); printString ("abc");} private static void printString (String str) {System.out.println ("= >" + str); / / return the UTF-16 char [] size System.out.println ("length:" + str.length ()) / / Use default Encoding (UTF-8) System.out.println ("getBytes:" + str.getBytes () .length); / / Convert UTF-16 char [] to char System.out.println ("codePointCount:" + str.codePointCount (0, str.length ()); / / Get the UTF-16 char [] System.out.println ("toCharArray:" + str.toCharArray (). Length) / / The UTF-16 char [] to bytes System.out.println ("internal value:" + getStringInternalValueLength (str));}

The results are as follows:
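The same measurements can be reproduced without reflection. The sketch below is an illustration rather than the article's code: the class StringSizes and the helper internalLength are hypothetical, and internalLength only models the storage rule described in the next section (one byte per char when every char fits in Latin1, otherwise two bytes per char). It assumes compact strings are enabled, which is the JDK default since Java 9. Unicode escapes stand in for the Chinese characters U+4E2D U+6587 and for U+1F60A.

```java
import java.nio.charset.StandardCharsets;

public class StringSizes {
    // Hypothetical model of the internal byte[] length under compact strings.
    public static int internalLength(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return 2 * s.length(); // UTF-16: two bytes per char
            }
        }
        return s.length();             // Latin1: one byte per char
    }

    public static String describe(String s) {
        return "length=" + s.length()
                + " getBytes=" + s.getBytes(StandardCharsets.UTF_8).length
                + " codePointCount=" + s.codePointCount(0, s.length())
                + " toCharArray=" + s.toCharArray().length
                + " internal=" + internalLength(s);
    }

    public static void main(String[] args) {
        // length=3 getBytes=3 codePointCount=3 toCharArray=3 internal=3
        System.out.println(describe("abc"));
        // length=2 getBytes=6 codePointCount=2 toCharArray=2 internal=4
        System.out.println(describe("\u4E2D\u6587"));
        // length=5 getBytes=9 codePointCount=5 toCharArray=5 internal=10
        System.out.println(describe("abc\u4E2D\u6587"));
        // length=5 getBytes=7 codePointCount=4 toCharArray=5 internal=10
        System.out.println(describe("abc\uD83D\uDE0A"));
    }
}
```

Only the last string shows a difference between length and codePointCount, because it is the only one that contains a surrogate pair.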

Internal value

First, here is how the length of String's value field is determined:

When every character is no greater than 0xFF, the Latin1 character encoding is used to save the Unicode code points, so each character takes one byte. "abc" is an example.

Otherwise, the whole string is saved in the UTF-16 character encoding, so each code point takes 2 or 4 bytes.

Unicode is a coded character set that maps the characters of almost all human writing systems to code points. A code point is usually written as U+ followed by a hexadecimal integer (U+XXXX), and the range is U+0000 to U+10FFFF. Code points are an abstract numbering of characters; to be stored, they must be turned into byte arrays. The different ways of doing this are character encodings, such as UTF-8, and UTF-16, which Java String uses internally.

UTF-16 encodes Unicode code points as a sequence of 16-bit code units. A code point in U+0000 to U+FFFF is saved directly in 2 bytes (in either big-endian or little-endian byte order). A code point in U+10000 to U+10FFFF is converted into a pair of code units (a surrogate pair) in the range U+D800 to U+DFFF, and each of the two is then saved according to the previous rule. This range was chosen because Unicode reserves it and assigns no characters to it, so surrogates can never be confused with ordinary 2-byte characters.
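The surrogate arithmetic can be checked by hand. This is a minimal sketch: toSurrogates is a hypothetical helper written for this illustration (not a JDK method), and its result is compared against the JDK's own Character.toChars.

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Splits a supplementary code point (U+10000..U+10FFFF) into its
    // UTF-16 high and low surrogates.
    public static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;              // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F60A);
        // Prints: U+D83D U+DE0A
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // The JDK conversion yields the same two code units. Prints: true
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F60A)));
    }
}
```

The high surrogate always falls in U+D800 to U+DBFF and the low surrogate in U+DC00 to U+DFFF, which is how a decoder tells the two halves apart.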

The Unicode code point of these two Chinese characters is U+4E2d and Ubun6587, which is larger than 0xFF, so the length of the saved byte is 4; there are characters that do not meet the conditions in "abc Chinese", so they are all saved with UTF-16, so they are all 2 byte, so the length is 10.

The Unicode code point of "😊" is U+1F60A. According to the UTF-16 rules, a code point in U+10000 to U+10FFFF must be converted to a surrogate pair before it is saved as bytes, here U+D83D and U+DE0A. The string "abc😊" therefore has 5 code units, and its byte length is 10.

toCharArray()

A char in Java is 2 bytes, which is just enough to represent one Unicode code point in the range U+0000 to U+FFFF.

With Latin1 encoding, each byte of the array is widened to one char whose high byte is 0. With UTF-16 encoding, every two bytes form one char, so the char array is the sequence of UTF-16 code units after surrogate-pair conversion, in which surrogates lie in the range U+D800 to U+DFFF.

Latin1 is encoded when "abc", so the size of the char array is equal to the size of the bytes array; when "abc Chinese", it is UTF-16 encoded, so the size of the char array is equal to half the size of the bytes array.
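The relationship between the two array sizes can be sketched outside of String. The helpers fromLatin1 and fromUtf16BE below are hypothetical and only imitate the conversion; big-endian byte order is an assumption made for the illustration, and the byte order the JDK actually uses internally may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class InflateDemo {
    // Latin1: one byte becomes one char with a zero high byte.
    public static char[] fromLatin1(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xFF);
        }
        return chars;
    }

    // UTF-16 (big-endian here): two bytes become one char.
    public static char[] fromUtf16BE(byte[] bytes) {
        char[] chars = new char[bytes.length / 2];
        for (int i = 0; i < chars.length; i++) {
            int hi = bytes[2 * i] & 0xFF;
            int lo = bytes[2 * i + 1] & 0xFF;
            chars[i] = (char) ((hi << 8) | lo);
        }
        return chars;
    }

    public static void main(String[] args) {
        String latin = "abc";
        byte[] l = latin.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(l.length);                                          // 3
        System.out.println(Arrays.equals(fromLatin1(l), latin.toCharArray())); // true

        String mixed = "abc\u4E2D\u6587"; // "abc" plus U+4E2D U+6587
        byte[] u = mixed.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(u.length);                                           // 10
        System.out.println(Arrays.equals(fromUtf16BE(u), mixed.toCharArray())); // true
    }
}
```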

codePointCount()

The array returned by toCharArray contains the surrogate pairs produced by the conversion, so its length may be greater than the number of characters. codePointCount removes the effect of surrogate pairs and returns the number of code points: the two consecutive code units of a surrogate pair are counted only once.
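A short sketch of the difference, using only standard String and Character methods. The class name CodePointDemo and the helper codePointsOf are made up for illustration; the test string is "abc" followed by U+1F60A, written with Unicode escapes.

```java
public class CodePointDemo {
    // Returns the code points of a string, one int per character,
    // with each surrogate pair merged into a single value.
    public static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "abc\uD83D\uDE0A";
        System.out.println(s.toCharArray().length);                 // 5
        System.out.println(s.codePointCount(0, s.length()));        // 4
        // Index 3 holds only the first half of the pair.
        System.out.println(Character.isHighSurrogate(s.charAt(3))); // true
        // codePointAt reads both halves and returns the full code point.
        System.out.println(Integer.toHexString(s.codePointAt(3)));  // 1f60a
        System.out.println(codePointsOf(s).length);                 // 4
    }
}
```

Code that indexes by char (charAt, substring) can split a surrogate pair in the middle, so iterating with codePoints() is the safer choice when the text may contain characters outside U+0000 to U+FFFF.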

String.length()

This method returns the length of the toCharArray array, so it is also affected by surrogate pairs and may be greater than the number of characters.

str.getBytes().length

Internally, a String holds a byte array in Latin1 or UTF-16 encoding. getBytes re-encodes that content into the encoding you specify. With no argument it uses the default charset, which is UTF-8 on most systems and on every platform since JDK 18 unless overridden. The internal bytes are therefore converted into UTF-8 bytes, and each Unicode code point is 1 to 4 bytes long after UTF-8 encoding.
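The 1 to 4 byte rule follows directly from the code point ranges that UTF-8 defines. The sketch below is an illustration only: bytesForCodePoint and utf8Length are hypothetical helpers, and the result matches getBytes only for well-formed strings (an unpaired surrogate is replaced during encoding and would give a different count).

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Number of bytes UTF-8 needs for one Unicode code point.
    public static int bytesForCodePoint(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF
        return 4;                           // U+10000..U+10FFFF
    }

    // Sums the per-code-point sizes of a well-formed string.
    public static int utf8Length(String s) {
        return s.codePoints().map(Utf8LengthDemo::bytesForCodePoint).sum();
    }

    public static void main(String[] args) {
        String[] samples = {"abc", "\u00E9", "\u4E2D", "\uD83D\uDE0A"};
        for (String s : samples) {
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.println(utf8Length(s) + " " + actual);
        }
        // Prints: 3 3, 2 2, 3 3, 4 4 (one pair per line)
    }
}
```

Passing the charset explicitly, as in getBytes(StandardCharsets.UTF_8), avoids depending on the platform default.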

import static java.nio.charset.StandardCharsets.UTF_8;

System.out.println("abc".getBytes(UTF_8).length); // 3
System.out.println("中".getBytes(UTF_8).length);  // 3
System.out.println("文".getBytes(UTF_8).length);  // 3
System.out.println("😊".getBytes(UTF_8).length);  // 4
