How to improve the performance of Java string Encoding and Decoding

2025-07-13 Update. From: SLTechnology News&Howtos, Shulou (Shulou.com), Development. 05/31 Report.

This article summarizes techniques for improving the performance of Java string encoding and decoding: the common encodings, the cost of transcoding, how String is implemented in different JDK versions, and ways to construct strings and read their contents without extra copies.

1. Common string encoding

Common string encodings are:

LATIN1 (ISO-8859-1): a fixed one-byte encoding covering U+0000 to U+00FF. ASCII is its lower half.

UTF-8: a variable-length encoding that uses 1 to 4 bytes per code point. Most Chinese characters need 3 bytes, so Chinese text is usually larger in UTF-8 than in the alternatives GBK/GB2312/GB18030.

UTF-16: each code unit is 2 bytes. Characters in the Basic Multilingual Plane take one code unit and all other characters take a surrogate pair (4 bytes). Its fixed-width predecessor is UCS-2 (2-byte Universal Character Set). Depending on byte order, UTF-16 comes in two forms, UTF-16BE and UTF-16LE, and UTF-16 without a byte order mark is taken to be UTF-16BE. A char in the Java language is one UTF-16 code unit, and on little-endian hardware a char[] is laid out in memory as UTF-16LE.

GB18030: a variable-length encoding that uses 1, 2, or 4 bytes per character. It plays a role similar to UTF-8, but common Chinese characters need only 2 bytes, so Chinese text is more compact. The disadvantage is that it is not widely used internationally.

For ease of computation, strings in memory usually use fixed-width code units: char in Java and char in .NET are both UTF-16 code units. Early Windows NT likewise supported only 16-bit Unicode.
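
The size differences described above can be checked with the JDK's standard charsets. The following is a minimal sketch; the class name EncodedSizes and its helpers are hypothetical and only illustrate the byte counts. GBK and GB18030 are left out because they are not among the charsets every Java runtime is required to provide.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies when encoded as UTF-16LE (no byte order mark). */
    static int utf16le(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    /** Number of bytes when encoded as ISO-8859-1; unmappable characters become one '?' byte. */
    static int latin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        // ASCII letter, Latin-1 letter, a Chinese character, and a supplementary character
        String[] samples = {"A", "\u00e9", "\u4e2d", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + " utf8=" + utf8(s) + " utf16le=" + utf16le(s) + " latin1=" + latin1(s));
        }
    }
}
```

The samples are written as Unicode escapes so that the result does not depend on the source file encoding.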

2. Transcoding performance

The conversion between UTF-16 and UTF-8 is complex and usually has poor performance.

The following is an implementation of UTF-16 to UTF-8 conversion. The algorithm branches on every character, so it performs poorly, and it does not lend itself to optimization with the Vector API.

    static int encodeUTF8(char[] utf16, int off, int len, byte[] dest, int dp) {
        int sl = off + len, last_offset = sl - 1;
        while (off < sl) {
            char c = utf16[off++];
            if (c < 0x80) {
                // 1 byte, at most 7 bits
                dest[dp++] = (byte) c;
            } else if (c < 0x800) {
                // 2 bytes, 11 bits
                dest[dp++] = (byte) (0xc0 | (c >> 6));
                dest[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (c >= '\uD800' && c < '\uE000') {
                // surrogate range
                int uc;
                if (c < '\uDC00') {
                    // high surrogate: a low surrogate must follow
                    if (off > last_offset) {
                        dest[dp++] = (byte) '?';
                        return dp;
                    }
                    char d = utf16[off];
                    if (d >= '\uDC00' && d < '\uE000') {
                        uc = Character.toCodePoint(c, d);
                    } else {
                        // unpaired high surrogate: '?' is an assumed replacement, as String.getBytes does
                        dest[dp++] = (byte) '?';
                        continue;
                    }
                } else {
                    // unpaired low surrogate: same assumed replacement
                    dest[dp++] = (byte) '?';
                    continue;
                }
                // 4 bytes, 21 bits
                dest[dp++] = (byte) (0xf0 | (uc >> 18));
                dest[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                dest[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                dest[dp++] = (byte) (0x80 | (uc & 0x3f));
                off++; // consumed 2 UTF-16 code units
            } else {
                // 3 bytes, 16 bits
                dest[dp++] = (byte) (0xe0 | (c >> 12));
                dest[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dest[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }
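
The surrogate branch is the least obvious part of the conversion, so it may help to see it in isolation. The sketch below (the class SurrogateToUtf8 is a hypothetical name, not part of the article's code) combines a surrogate pair into a code point with Character.toCodePoint and emits the four UTF-8 bytes, then compares the result with what String.getBytes produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SurrogateToUtf8 {
    /** Encodes one supplementary code point, given as a surrogate pair, into 4 UTF-8 bytes. */
    static byte[] encodePair(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        int cp = Character.toCodePoint(high, low); // value in the range 0x10000..0x10FFFF
        return new byte[] {
            (byte) (0xf0 | (cp >> 18)),          // leading byte carries the top 3 bits
            (byte) (0x80 | ((cp >> 12) & 0x3f)), // three continuation bytes, 6 bits each
            (byte) (0x80 | ((cp >> 6) & 0x3f)),
            (byte) (0x80 | (cp & 0x3f))
        };
    }

    public static void main(String[] args) {
        byte[] mine = encodePair('\uD83D', '\uDE00');
        byte[] jdk = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(mine, jdk));
    }
}
```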

On little-endian platforms a Java char[] is laid out in memory as UTF-16LE, so a char[] can be converted to a UTF-16LE byte[] with a single bulk copy using the sun.misc.Unsafe#copyMemory method. For example:

    static int writeUtf16LE(char[] chars, int off, int len, byte[] dest, int dp) {
        UNSAFE.copyMemory(
                chars, CHAR_ARRAY_BASE_OFFSET + off * 2L,
                dest, BYTE_ARRAY_BASE_OFFSET + dp,
                len * 2L);
        return dp + len * 2;
    }

3. Java String encoding

Different JDK versions implement String differently, which leads to different performance characteristics. A char is always a UTF-16 code unit, but since JDK 9 a String may store its contents internally as LATIN1.

3.1. String implementation in JDK 6 and earlier

    static class String {
        final char[] value;
        final int offset;
        final int count;
    }

In JDK 6 and earlier, the String returned by String.substring shares the char[] value of the original String. A small substring therefore keeps the whole original char[] reachable, so the GC cannot reclaim it. As a result, many libraries avoided substring on JDK 6 and below.
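
A widely used workaround on those older JDKs was to wrap the result in new String(...), because the String(String) constructor there copied only the characters actually in use. The helper below is a minimal sketch of that idiom; the class and method names are hypothetical, and on JDK 7 and later the extra copy is unnecessary.

```java
class SubstringCopy {
    /**
     * Returns a substring that does not pin the original string's backing array.
     * On JDK 6 and earlier the String(String) constructor trimmed the shared char[];
     * on later JDKs substring already copies, so this only adds a redundant allocation.
     */
    static String detachedSubstring(String s, int begin, int end) {
        return new String(s.substring(begin, end));
    }

    public static void main(String[] args) {
        String big = "hello world, imagine several megabytes of text here";
        System.out.println(detachedSubstring(big, 0, 5));
    }
}
```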

3.2. String implementation in JDK 7/8

    static class String {
        final char[] value;
    }

From JDK 7 on, String no longer has the offset and count fields, and value.length takes the place of count. This avoids the problem of a substring pinning a large char[] and makes String easier to optimize, so String operations on JDK 7/8 perform much better than on Java 6.

3.3. String implementation in JDK 9/10/11

    static class String {
        final byte coder;
        final byte[] value;

        static final byte LATIN1 = 0;
        static final byte UTF16 = 1;
    }

From JDK 9 on, the type of value changes from char[] to byte[] and a coder field is added. If every character fits in one byte (which includes all ASCII text), value is stored as LATIN1; if any character does not, the whole string is stored as UTF16. This mixed encoding makes English-heavy workloads use less memory. The disadvantage is that some String APIs may be slower on Java 9 than on JDK 8. Constructing a string from a char[] is the main example: the input has to be compressed into a LATIN1 byte[], which can be about 10% slower in some scenarios.
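
The storage rule can be expressed with public APIs only. This sketch (CompactCheck is a hypothetical name) tests whether a string would fit the one-byte representation and estimates the size of the backing array, assuming compact strings are enabled, which is the JDK 9+ default.

```java
class CompactCheck {
    /** True if every char fits in one byte (0x00..0xFF), so the string can be stored as LATIN1. */
    static boolean fitsLatin1(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    /** Approximate length in bytes of the backing array on JDK 9+ with compact strings enabled. */
    static int payloadBytes(CharSequence s) {
        return fitsLatin1(s) ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));        // one byte per char
        System.out.println(payloadBytes("\u4e2d\u6587")); // two bytes per char
    }
}
```

A single character above 0xFF switches the whole string to two bytes per char, which is why mostly-English text with occasional CJK characters does not benefit.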

4. Constructing strings quickly

To keep strings immutable, the constructors copy their input. To reduce the cost of constructing a string, this copy has to be avoided.

For example, here is one of the String constructors in JDK 8:

    public final class String {
        public String(char value[]) {
            this.value = Arrays.copyOf(value, value.length);
        }
    }
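
The effect of that copy is easy to observe: changing the source array after construction does not change the string. A minimal sketch, with a hypothetical class name:

```java
class DefensiveCopyDemo {
    /** Builds a String from the array, then overwrites the array to show the String kept its own copy. */
    static String buildThenMutate(char[] chars) {
        String s = new String(chars); // public constructor: copies the array
        chars[0] = 'X';               // mutate the caller's array afterwards
        return s;                     // still holds the original characters
    }

    public static void main(String[] args) {
        System.out.println(buildThenMutate(new char[] {'a', 'b', 'c'}));
    }
}
```

Skipping the copy is only safe when the caller never touches the array again, which is why the non-copying paths discussed in this section are kept out of the public API.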

JDK 8 also has a constructor that does not copy, but it is not public. Calling it requires a trick: binding it reflectively through MethodHandles.Lookup and LambdaMetafactory. Code for this technique appears later in the article.

    public final class String {
        String(char[] value, boolean share) {
            // assert share : "unshared not supported";
            this.value = value;
        }
    }

There are three ways to construct strings quickly:

(1) Bind the non-public constructor reflectively with MethodHandles.Lookup & LambdaMetafactory

(2) Use the related methods of JavaLangAccess

(3) Construct the String directly with Unsafe

Methods 1 and 2 perform almost identically and method 3 is slightly slower, but all three are much faster than calling new String directly. The JMH results on JDK 8 are as follows:

    Benchmark                           Mode  Cnt       Score       Error   Units
    StringCreateBenchmark.invoke       thrpt    5  784869.350 ±  1936.754  ops/ms
    StringCreateBenchmark.langAccess   thrpt    5  784029.186 ±  2734.300  ops/ms
    StringCreateBenchmark.unsafe       thrpt    5  761176.319 ± 11914.549  ops/ms
    StringCreateBenchmark.newString    thrpt    5  140883.533 ±  2217.773  ops/ms

On JDK 9 and later, direct construction gives even better results when all characters are ASCII.

4.1 Quick string construction with MethodHandles.Lookup & LambdaMetafactory

4.1.1 Quick string construction in JDK 8

    public static BiFunction<char[], Boolean, String> getStringCreatorJDK8() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);

        // the package-private constructor String(char[] value, boolean share)
        MethodHandle handle = caller.findConstructor(
                String.class,
                MethodType.methodType(void.class, char[].class, boolean.class));

        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "apply",
                MethodType.methodType(BiFunction.class),
                handle.type().generic(),
                handle,
                handle.type());
        return (BiFunction<char[], Boolean, String>) callSite.getTarget().invokeExact();
    }
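
The binding step itself can be tried without touching JDK internals. The sketch below applies the same LambdaMetafactory.metafactory call to a public method, String.length, using an ordinary MethodHandles.lookup(), so it needs no special access or JVM flags. The class LambdaBindingDemo and its LENGTH field are hypothetical names chosen for this illustration.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.ToIntFunction;

class LambdaBindingDemo {
    /** String.length bound to a functional interface through LambdaMetafactory. */
    static final ToIntFunction<String> LENGTH = bindLength();

    @SuppressWarnings("unchecked")
    private static ToIntFunction<String> bindLength() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle handle = lookup.findVirtual(
                    String.class, "length", MethodType.methodType(int.class));
            CallSite site = LambdaMetafactory.metafactory(
                    lookup,
                    "applyAsInt",                                   // interface method to implement
                    MethodType.methodType(ToIntFunction.class),     // factory type: () -> ToIntFunction
                    MethodType.methodType(int.class, Object.class), // erased signature of applyAsInt
                    handle,                                         // implementation
                    handle.type());                                 // specialized type: (String) -> int
            return (ToIntFunction<String>) site.getTarget().invokeExact();
        } catch (Throwable e) {
            throw new IllegalStateException("could not bind String.length", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(LENGTH.applyAsInt("hello"));
    }
}
```

The generated object is an ordinary class implementing the interface, so the JIT can inline through it much like a hand-written lambda, which is the reason this approach is faster than Method.invoke or Constructor.newInstance.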

4.1.2 Quick string construction in JDK 11

The listing below binds String.coder() in the same way. The usage snippet after it relies on JDKUtils.getStringCreatorJDK11(), whose listing is not shown here.

    public static ToIntFunction<String> getStringCode11() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);

        MethodHandle handle = caller.findVirtual(
                String.class, "coder", MethodType.methodType(byte.class));

        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "applyAsInt",
                MethodType.methodType(ToIntFunction.class),
                MethodType.methodType(int.class, Object.class),
                handle,
                handle.type());
        return (ToIntFunction<String>) callSite.getTarget().invokeExact();
    }

    if (JDKUtils.JVM_VERSION == 11) {
        Function<byte[], String> stringCreator = JDKUtils.getStringCreatorJDK11();
        byte[] bytes = new byte[] {'a', 'b', 'c'};
        String apply = stringCreator.apply(bytes);
        assertEquals("abc", apply);
    }
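
The snippets in this article branch on JDKUtils.JVM_VERSION, whose definition is not shown. One plausible way to derive such a value, offered here only as an assumption about how it could be done, is to parse the java.specification.version system property, which reads "1.8" on Java 8 and a plain number such as "11" or "17" on later releases. The class name JvmVersion is hypothetical.

```java
class JvmVersion {
    /** Major Java version of the running JVM, for example 8, 11 or 17. */
    static final int JVM_VERSION = parseMajor(System.getProperty("java.specification.version"));

    /** Parses a java.specification.version value such as "1.8", "11" or "17". */
    static int parseMajor(String spec) {
        if (spec.startsWith("1.")) {
            spec = spec.substring(2); // legacy "1.x" scheme used up to Java 8
        }
        int dot = spec.indexOf('.');
        if (dot >= 0) {
            spec = spec.substring(0, dot);
        }
        return Integer.parseInt(spec);
    }

    public static void main(String[] args) {
        System.out.println(JVM_VERSION);
    }
}
```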

4.1.3 Quick string construction in JDK 17

In JDK 17, MethodHandles.Lookup protects its lookupClass and allowedModes fields with Reflection.registerFieldsToFilter, so the tricks circulating online that modify allowedModes no longer work.

In JDK 17, the JVM has to be started with the following option before MethodHandles can be used this way:

    --add-opens java.base/java.lang.invoke=ALL-UNNAMED

    public static BiFunction<byte[], Charset, String> getStringCreatorJDK17() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                null,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);

        MethodHandle handle = caller.findStatic(
                String.class,
                "newStringNoRepl1",
                MethodType.methodType(String.class, byte[].class, Charset.class));

        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "apply",
                MethodType.methodType(BiFunction.class),
                handle.type().generic(),
                handle,
                handle.type());
        return (BiFunction<byte[], Charset, String>) callSite.getTarget().invokeExact();
    }

    if (JDKUtils.JVM_VERSION == 17) {
        BiFunction<byte[], Charset, String> stringCreator = JDKUtils.getStringCreatorJDK17();
        byte[] bytes = new byte[] {'a', 'b', 'c'};
        String apply = stringCreator.apply(bytes, StandardCharsets.US_ASCII);
        assertEquals("abc", apply);
    }

4.2 Quick construction based on JavaLangAccess

The JavaLangAccess instance provided by SharedSecrets can also construct a string without copying, but it is awkward to use: the API differs between JDK 8, 11 and 17, so a single code base cannot easily support all of them. It is therefore not recommended.

    JavaLangAccess javaLangAccess = SharedSecrets.getJavaLangAccess();
    javaLangAccess.newStringNoRepl(b, StandardCharsets.US_ASCII);

4.3 Quick construction based on Unsafe

    public static final Unsafe UNSAFE;

    static {
        Unsafe unsafe = null;
        try {
            Field theUnsafeField = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafeField.setAccessible(true);
            unsafe = (Unsafe) theUnsafeField.get(null);
        } catch (Throwable ignored) {
        }
        UNSAFE = unsafe;
    }

    // JDK 8: allocate a String without running a constructor, then set its value field
    Object str = UNSAFE.allocateInstance(String.class);
    UNSAFE.putObject(str, valueOffset, chars);

Note: from JDK 9 on the implementation differs, because the coder field has to be set as well:

    Object str = UNSAFE.allocateInstance(String.class);
    UNSAFE.putByte(str, coderOffset, (byte) 0); // 0 == LATIN1
    UNSAFE.putObject(str, valueOffset, (byte[]) bytes);

4.4 Techniques for quickly building strings

The following method formats a date into a string by writing the digits directly, and it performs very well.

    public String formatYYYYMMDD(Calendar calendar) throws Throwable {
        int year = calendar.get(Calendar.YEAR);
        int month = calendar.get(Calendar.MONTH) + 1;
        int dayOfMonth = calendar.get(Calendar.DAY_OF_MONTH);

        byte y0 = (byte) (year / 1000 + '0');
        byte y1 = (byte) ((year / 100) % 10 + '0');
        byte y2 = (byte) ((year / 10) % 10 + '0');
        byte y3 = (byte) (year % 10 + '0');
        byte m0 = (byte) (month / 10 + '0');
        byte m1 = (byte) (month % 10 + '0');
        byte d0 = (byte) (dayOfMonth / 10 + '0');
        byte d1 = (byte) (dayOfMonth % 10 + '0');

        if (JDKUtils.JVM_VERSION >= 9) {
            byte[] bytes = new byte[] {y0, y1, y2, y3, m0, m1, d0, d1};
            if (JDKUtils.JVM_VERSION == 17) {
                return JDKUtils.getStringCreatorJDK17().apply(bytes, StandardCharsets.US_ASCII);
            }
            // The listing breaks off here in the source; the public constructor is an assumed fallback
            return new String(bytes, StandardCharsets.US_ASCII);
        }
        // Assumed fallback for JDK 8 (this part of the listing is also missing from the source)
        return new String(new char[] {
                (char) y0, (char) y1, (char) y2, (char) y3,
                (char) m0, (char) m1, (char) d0, (char) d1});
    }

5. Traversing strings quickly

In JDK 8, String.charAt is a bounds check followed by an array read:

    public final class String {
        private final char value[];

        public char charAt(int index) {
            if ((index < 0) || (index >= value.length)) {
                throw new StringIndexOutOfBoundsException(index);
            }
            return value[index];
        }
    }

From JDK 9 on, charAt is more expensive because it also has to branch on the coder:

    public final class String {
        private final byte[] value;
        private final byte coder;

        public char charAt(int index) {
            if (isLatin1()) {
                return StringLatin1.charAt(value, index);
            } else {
                return StringUTF16.charAt(value, index);
            }
        }
    }

5.1 Obtaining String.value

There are two ways to obtain String.value:

(1) Use Field reflection

(2) Use Unsafe

JMH results comparing Unsafe and Field reflection on JDK 8:

    Benchmark                          Mode  Cnt        Score       Error   Units
    StringGetValueBenchmark.reflect   thrpt    5   438374.685 ±  1032.028  ops/ms
    StringGetValueBenchmark.unsafe    thrpt    5  1302654.150 ± 59169.706  ops/ms

5.1.1 Using reflection to get String.value

    static Field valueField;

    static {
        try {
            valueField = String.class.getDeclaredField("value");
            valueField.setAccessible(true);
        } catch (NoSuchFieldException ignored) {
        }
    }

    // JDK 8: value is a char[]
    char[] chars = (char[]) valueField.get(str);
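
Because the type of the value field and the access rules differ across JDK versions, code that relies on this trick usually needs a fallback. The sketch below is one way to write it under stated assumptions: it returns the backing array only when reflection succeeds and the field really is a char[] (the JDK 8 layout), and otherwise falls back to the public toCharArray(). The class name StringChars is hypothetical.

```java
import java.lang.reflect.Field;

class StringChars {
    /**
     * Returns the characters of s. On JDK 8 this is the live backing array (do not modify it);
     * on newer JDKs, or when access is denied, it is a fresh copy from toCharArray().
     */
    static char[] charsOf(String s) {
        try {
            Field f = String.class.getDeclaredField("value");
            f.setAccessible(true); // may be refused on recent JDKs unless java.lang is opened
            Object v = f.get(s);
            if (v instanceof char[]) {
                return (char[]) v; // JDK 8 layout
            }
        } catch (ReflectiveOperationException | RuntimeException e) {
            // field missing or access denied: use the public API below
        }
        return s.toCharArray();
    }

    public static void main(String[] args) {
        System.out.println(charsOf("abc").length);
    }
}
```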

5.1.2 Using Unsafe to get String.value

On JDK 8:

    static long valueFieldOffset;

    static {
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueFieldOffset = UNSAFE.objectFieldOffset(valueField);
        } catch (NoSuchFieldException ignored) {
        }
    }

    char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);

On JDK 9 and later, where value is a byte[] and the coder must be read as well:

    static long valueFieldOffset;
    static long coderFieldOffset;

    static {
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueFieldOffset = UNSAFE.objectFieldOffset(valueField);

            Field coderField = String.class.getDeclaredField("coder");
            coderFieldOffset = UNSAFE.objectFieldOffset(coderField);
        } catch (NoSuchFieldException ignored) {
        }
    }

    byte coder = UNSAFE.getByte(str, coderFieldOffset);
    byte[] bytes = (byte[]) UNSAFE.getObject(str, valueFieldOffset);

6. Faster encodeUTF8 method

Once String.value can be read directly, it can be encoded to UTF-8 in place, which performs much better than String.getBytes(StandardCharsets.UTF_8).

6.1 High-performance encodeUTF8 for JDK 8

    public static int encodeUTF8(char[] src, int offset, int len, byte[] dst, int dp) {
        int sl = offset + len;
        int dlASCII = dp + Math.min(len, dst.length);

        // ASCII-only optimized loop
        while (dp < dlASCII && src[offset] < '\u0080') {
            dst[dp++] = (byte) src[offset++];
        }

        while (offset < sl) {
            char c = src[offset++];
            if (c < 0x80) {
                // 1 byte, at most 7 bits
                dst[dp++] = (byte) c;
            } else if (c < 0x800) {
                // 2 bytes, 11 bits
                dst[dp++] = (byte) (0xc0 | (c >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { // Character.isSurrogate(c), but that is 1.7+
                final int uc;
                int ip = offset - 1;
                if (c >= '\uD800' && c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                    if (sl - ip < 2) {
                        uc = -1;
                    } else {
                        char d = src[ip + 1];
                        if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                            uc = Character.toCodePoint(c, d);
                        } else {
                            uc = -1; // unpaired high surrogate (assumed handling)
                        }
                    }
                } else {
                    uc = -1; // unpaired low surrogate (assumed handling)
                }

                if (uc < 0) {
                    dst[dp++] = (byte) '?';
                } else {
                    // 4 bytes, 21 bits
                    dst[dp++] = (byte) (0xf0 | (uc >> 18));
                    dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                    dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                    offset++; // consumed 2 chars
                }
            } else {
                // 3 bytes, 16 bits
                dst[dp++] = (byte) (0xe0 | (c >> 12));
                dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }

An example of using the encodeUTF8 method:

    char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);
    // ensureCapacity(chars.length * 3)
    byte[] bytes = ...;
    int bytesLength = IOUtils.encodeUTF8(chars, 0, chars.length, bytes, bytesOffset);

This way the encodeUTF8 operation involves no extra array copy, which improves performance.
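
The chars.length * 3 capacity in the example is the worst case: a char in the BMP needs at most 3 bytes, and a surrogate pair (2 chars) needs 4. Where an exact size is preferred, it can be computed in one pass. The following is a minimal sketch using only public APIs; Utf8Sizing is a hypothetical name, and unpaired surrogates are counted as one byte because String.getBytes replaces them with '?'.

```java
import java.nio.charset.StandardCharsets;

class Utf8Sizing {
    /** Exact number of UTF-8 bytes for s, counting each unpaired surrogate as one '?' byte. */
    static int utf8Length(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                n += 1;
            } else if (c < 0x800) {
                n += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                n += 4; // one supplementary code point
                i++;    // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                n += 1; // unpaired surrogate, encoded as '?'
            } else {
                n += 3;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u4e2d\u6587 \uD83D\uDE00";
        System.out.println(utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The extra pass costs time, so for short strings the simple worst-case bound with a reusable buffer is often the better trade-off.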

6.1.1 Performance test comparison

Test code

    public class EncodeUTF8Benchmark {
        static String STR = "01234567890ABCDEFGHIJKLMNOPQRSTUVZZZZabcdefghijklmnopqrstuvwzyz1234567890";
        static byte[] out;
        static long valueFieldOffset;

        static {
            out = new byte[STR.length() * 3];
            try {
                Field valueField = String.class.getDeclaredField("value");
                valueFieldOffset = UnsafeUtils.UNSAFE.objectFieldOffset(valueField);
            } catch (NoSuchFieldException e) {
                e.printStackTrace();
            }
        }

        @Benchmark
        public void unsafeEncodeUTF8() throws Exception {
            char[] chars = (char[]) UnsafeUtils.UNSAFE.getObject(STR, valueFieldOffset);
            int len = IOUtils.encodeUTF8(chars, 0, chars.length, out, 0);
        }

        @Benchmark
        public void getBytesUTF8() throws Exception {
            byte[] bytes = STR.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(bytes, 0, out, 0, bytes.length);
        }

        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include(EncodeUTF8Benchmark.class.getName())
                    .mode(Mode.Throughput)
                    .timeUnit(TimeUnit.MILLISECONDS)
                    .forks(1)
                    .build();
            new Runner(options).run();
        }
    }

Test result

    Benchmark                              Mode  Cnt      Score      Error   Units
    EncodeUTF8Benchmark.getBytesUTF8      thrpt    5  20690.960 ± 5431.442  ops/ms
    EncodeUTF8Benchmark.unsafeEncodeUTF8  thrpt    5  34508.606 ±   55.510  ops/ms

In this run, calling encodeUTF8 directly on the array obtained through Unsafe reaches about 1.67 times the throughput of the getBytesUTF8 variant, so its per-call cost is roughly 60% of the getBytesUTF8 cost (20690.960 / 34508.606 ≈ 0.60).

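6.2 High-performance encodeUTF8 for JDK 9/11/17

Here src is the UTF16-coded value array of a String, with the low byte of each char first (little-endian layout), and offset and len are in bytes.

    public static int encodeUTF8(byte[] src, int offset, int len, byte[] dst, int dp) {
        int sl = offset + len;
        while (offset < sl) {
            byte b0 = src[offset++]; // low byte
            byte b1 = src[offset++]; // high byte

            if (b1 == 0 && b0 >= 0) {
                // ASCII
                dst[dp++] = b0;
            } else {
                char c = (char) ((b0 & 0xff) | ((b1 & 0xff) << 8));
                if (c < 0x800) {
                    // 2 bytes, 11 bits
                    dst[dp++] = (byte) (0xc0 | (c >> 6));
                    dst[dp++] = (byte) (0x80 | (c & 0x3f));
                } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { // Character.isSurrogate(c), but that is 1.7+
                    final int uc;
                    if (c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                        if (sl - offset < 2) {
                            uc = -1;
                        } else {
                            b0 = src[offset];
                            b1 = src[offset + 1];
                            char d = (char) ((b0 & 0xff) | ((b1 & 0xff) << 8));
                            if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                                uc = Character.toCodePoint(c, d);
                            } else {
                                uc = -1; // unpaired high surrogate (assumed handling)
                            }
                        }
                    } else {
                        uc = -1; // unpaired low surrogate (assumed handling)
                    }

                    if (uc < 0) {
                        dst[dp++] = (byte) '?';
                    } else {
                        // 4 bytes, 21 bits
                        dst[dp++] = (byte) (0xf0 | (uc >> 18));
                        dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                        dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                        dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                        offset += 2; // consumed the low surrogate (2 bytes)
                    }
                } else {
                    // 3 bytes, 16 bits
                    dst[dp++] = (byte) (0xe0 | (c >> 12));
                    dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (c & 0x3f));
                }
            }
        }
        return dp;
    }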
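
The byte-pair handling at the top of that loop can be tried without Unsafe by using String.getBytes(StandardCharsets.UTF_16LE) as a stand-in for the internal UTF16 value array. This is an assumption made for the sake of a portable example: the real internal array uses the platform's byte order, which is little-endian on common hardware. The class name Utf16LeBytes is hypothetical.

```java
import java.nio.charset.StandardCharsets;

class Utf16LeBytes {
    /** Reads the UTF-16 code unit stored little-endian at byte index i. */
    static char charAt(byte[] utf16le, int i) {
        return (char) ((utf16le[i] & 0xff) | ((utf16le[i + 1] & 0xff) << 8));
    }

    /** True if the code unit at byte index i is ASCII: high byte zero and low byte below 0x80. */
    static boolean isAsciiUnit(byte[] utf16le, int i) {
        return utf16le[i + 1] == 0 && utf16le[i] >= 0;
    }

    public static void main(String[] args) {
        String s = "A\u00e9\u4e2d";
        byte[] bytes = s.getBytes(StandardCharsets.UTF_16LE);
        for (int i = 0; i < bytes.length; i += 2) {
            System.out.println(Integer.toHexString(charAt(bytes, i)) + " ascii=" + isAsciiUnit(bytes, i));
        }
    }
}
```

Note that the sign test on the low byte matters: a Latin-1 character such as U+00E9 has a zero high byte but is not ASCII and still needs two UTF-8 bytes.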

An example of using the encodeUTF8 method:

    byte coder = UNSAFE.getByte(str, coderFieldOffset);
    byte[] value = (byte[]) UNSAFE.getObject(str, valueFieldOffset);

    if (coder == 0) {
        // LATIN1: ASCII content can be copied with arraycopy
    } else {
        // UTF16: ensureCapacity(value.length / 2 * 3)
        byte[] bytes = ...;
        int bytesLength = IOUtils.encodeUTF8(value, 0, value.length, bytes, bytesOffset);
    }

This way the encodeUTF8 operation involves no extra array copy, which improves performance.
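
One caveat for the coder == 0 branch: a LATIN1 value array may contain bytes in the range 0x80 to 0xFF, and those need two bytes in UTF-8, so a plain arraycopy is only correct when the content is pure ASCII. The sketch below shows the general Latin-1 to UTF-8 step; it is an illustration with a hypothetical class name, not code from the article.

```java
import java.util.Arrays;

class Latin1ToUtf8 {
    /** Encodes ISO-8859-1 bytes as UTF-8. */
    static byte[] encode(byte[] latin1) {
        byte[] dst = new byte[latin1.length * 2]; // worst case: every byte is 0x80 or above
        int dp = 0;
        for (byte b : latin1) {
            if (b >= 0) {
                dst[dp++] = b; // ASCII: copied unchanged
            } else {
                int c = b & 0xff; // 0x80..0xFF becomes a 2-byte sequence
                dst[dp++] = (byte) (0xc0 | (c >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return Arrays.copyOf(dst, dp);
    }

    public static void main(String[] args) {
        byte[] utf8 = encode(new byte[] {'c', 'a', 'f', (byte) 0xe9});
        System.out.println(utf8.length);
    }
}
```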
