Shulou (Shulou.com), SLTechnology News&Howtos, Development. Report 05/31; updated 2025-02-20.
This article summarizes how to improve the performance of Java string encoding and decoding. It covers common string encodings, how String is implemented in different JDK versions, and techniques for constructing strings and encoding them to UTF-8 without extra copies.
1. Common string encodings
Common string encodings are:
LATIN1: a single-byte encoding, also known as ISO-8859-1. It covers ASCII plus the Western European characters up to U+00FF, so it cannot represent Chinese.
UTF-8: a variable-length encoding in which a code point takes 1 to 4 bytes. Common Chinese characters take 3 bytes, so Chinese text usually needs more space in UTF-8 than in the alternatives GBK/GB2312/GB18030.
UTF-16: a code point takes 2 bytes, or 4 bytes (a surrogate pair) outside the Basic Multilingual Plane. Its fixed-width predecessor is UCS-2 (2-byte Universal Character Set). Depending on byte order, UTF-16 comes in two forms, UTF-16BE and UTF-16LE; without a byte order mark, plain UTF-16 is read as UTF-16BE. A Java char is one UTF-16 code unit, stored in memory in the platform's byte order, which is UTF-16LE on little-endian hardware such as x86.
GB18030: a variable-length encoding in which a character takes 1, 2, or 4 bytes. Like UTF-8 it is ASCII-compatible, but common Chinese characters need only 2 bytes, so it is more compact for Chinese text. The drawback is that it is rarely used outside China.
For ease of computation, in-memory strings usually use fixed-width code units: char in Java and char in .NET are both UTF-16 code units. Early Windows NT likewise used 16-bit characters internally (UCS-2, later UTF-16).
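The size differences described above can be checked with a small sketch. It uses only the charsets in java.nio.charset.StandardCharsets; the class and method names (EncodingSizeSketch, utf8Length, utf16Length) are made up for illustration. GB18030 is left out because its availability depends on the JDK build.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper, not part of the original article.
final class EncodingSizeSketch {
    /** Number of bytes the text occupies in UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the text occupies in UTF-16LE (no byte order mark). */
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String chinese = "\u4e2d\u6587";   // two common Chinese characters
        String emoji = "\uD83D\uDE00";     // U+1F600: one code point, two chars
        System.out.println(utf8Length(ascii) + " / " + utf16Length(ascii));     // 3 / 6
        System.out.println(utf8Length(chinese) + " / " + utf16Length(chinese)); // 6 / 4
        System.out.println(utf8Length(emoji) + " / " + utf16Length(emoji));     // 4 / 4
    }
}
```

ASCII text is half the size in UTF-8, Chinese text is smaller in UTF-16, and supplementary characters take 4 bytes in both.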
2. Transcoding performance
Converting between UTF-16 and UTF-8 is comparatively complex and usually slow.
The following is an implementation that converts UTF-16 to UTF-8. The algorithm branches on every character, so it performs poorly and cannot easily be optimized with the Vector API.
    static int encodeUTF8(char[] utf16, int off, int len, byte[] dest, int dp) {
        int sl = off + len, last_offset = sl - 1;
        while (off < sl) {
            char c = utf16[off++];
            if (c < 0x80) {
                // Have at most seven bits
                dest[dp++] = (byte) c;
            } else if (c < 0x800) {
                // 2 dest, 11 bits
                dest[dp++] = (byte) (0xc0 | (c >> 6));
                dest[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (c >= '\uD800' && c < '\uE000') {
                // Surrogate. The source text is garbled in this branch, so the pair
                // handling is a reconstruction that writes '?' for malformed input.
                if (c >= '\uDC00') {
                    dest[dp++] = (byte) '?'; // lone low surrogate
                    continue;
                }
                if (off > last_offset) {
                    dest[dp++] = (byte) '?'; // high surrogate at end of input
                    return dp;
                }
                char d = utf16[off];
                if (d < '\uDC00' || d >= '\uE000') {
                    dest[dp++] = (byte) '?'; // high surrogate without a low surrogate
                    continue;
                }
                int uc = Character.toCodePoint(c, d);
                // 4 dest, 21 bits
                dest[dp++] = (byte) (0xf0 | (uc >> 18));
                dest[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                dest[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                dest[dp++] = (byte) (0x80 | (uc & 0x3f));
                off++; // 2 utf16
            } else {
                // 3 dest, 16 bits
                dest[dp++] = (byte) (0xe0 | (c >> 12));
                dest[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dest[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }
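The surrogate branch is the least obvious part of the conversion, so here is a minimal worked sketch of its arithmetic. It combines a surrogate pair into a code point by hand, writes the 4-byte UTF-8 form, and compares the result with what the JDK produces. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch of the surrogate-pair arithmetic; names are hypothetical.
final class SurrogatePairSketch {
    /** Combines a high and a low surrogate into a code point by hand. */
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    /** Encodes a supplementary code point (U+10000 to U+10FFFF) as 4 UTF-8 bytes. */
    static byte[] encode4(int cp) {
        return new byte[] {
            (byte) (0xf0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3f)),
            (byte) (0x80 | ((cp >> 6) & 0x3f)),
            (byte) (0x80 | (cp & 0x3f))
        };
    }

    public static void main(String[] args) {
        char high = 0xD83D, low = 0xDE00; // the pair for U+1F600
        int cp = toCodePoint(high, low);
        System.out.println(Integer.toHexString(cp)); // 1f600
        byte[] expected = new String(new char[] {high, low}).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(encode4(cp), expected)); // true
    }
}
```

The hand-written formula gives the same value as Character.toCodePoint, which the JDK provides for this purpose.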
Because a Java char is stored as UTF-16LE on little-endian platforms, converting a char[] to a UTF-16LE-encoded byte[] can be done with a fast bulk copy through sun.misc.Unsafe#copyMemory. For example:
    static int writeUtf16LE(char[] chars, int off, int len, byte[] dest, int dp) {
        UNSAFE.copyMemory(
                chars, CHAR_ARRAY_BASE_OFFSET + off * 2L,
                dest, BYTE_ARRAY_BASE_OFFSET + dp,
                len * 2L);
        dp += len * 2;
        return dp;
    }
3. Java String encoding
String is implemented differently in different JDK versions, which leads to different performance. A char is always a UTF-16 code unit, but from JDK 9 on a String can store its content internally as LATIN1.
3.1. String implementation in JDK 6 and earlier
    static class String {
        final char[] value;
        final int offset;
        final int count;
    }
In Java 6 and earlier, the String returned by String.substring shared the char[] value of the original String. A small substring could therefore keep a large char[] reachable and prevent the GC from reclaiming it. As a result, many libraries avoided substring on JDK 6 and below.
3.2. String implementation in JDK 7/8
    static class String {
        final char[] value;
    }
From JDK 7 on, String no longer has the offset and count fields; value.length takes the place of count. This avoids the problem of substring retaining a large char[] and makes String easier to optimize, so String operations in JDK 7/8 perform much better than in Java 6.
3.3. String implementation in JDK 9/10/11
    static class String {
        final byte coder;
        final byte[] value;

        static final byte LATIN1 = 0;
        static final byte UTF16 = 1;
    }
From JDK 9 on, the type of value changes from char[] to byte[] and a coder field is added. If every character fits in one byte (which includes all ASCII text), value is encoded as LATIN1; if any character does not, it is encoded as UTF16. This mixed encoding makes English text take less memory. The drawback is that the String API in Java 9 can be slower than in JDK 8. In particular, constructing a String from a char[] has to try compressing it into a LATIN1 byte[], which can be about 10% slower in some scenarios.
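The coder decision can be modelled with a short sketch. This is a simplified illustration of the rule, not the JDK's own code: the class CompactStringSketch and its methods are hypothetical names, and the real implementation combines the scan and the copy.

```java
// Simplified model of the LATIN1/UTF16 choice; names are hypothetical.
final class CompactStringSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    /** Returns LATIN1 if every char fits in one byte, otherwise UTF16. */
    static byte chooseCoder(char[] chars) {
        for (char c : chars) {
            if (c > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    /** Packs each char into one byte; valid only when chooseCoder returned LATIN1. */
    static byte[] compress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            out[i] = (byte) chars[i];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chooseCoder("hello".toCharArray()));       // 0
        System.out.println(chooseCoder("caf\u00e9".toCharArray()));   // 0
        System.out.println(chooseCoder("\u4e2d\u6587".toCharArray())); // 1
    }
}
```

The extra scan over the char[] is where the construction overhead mentioned above comes from.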
4. Constructing strings quickly
To keep String immutable, its constructors copy the input array. To reduce the cost of constructing a string, avoid that copy.
For example, this is one of the String constructors in JDK 8:
    public final class String {
        public String(char value[]) {
            this.value = Arrays.copyOf(value, value.length);
        }
    }
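The effect of that copy is easy to observe with the public constructor. In the minimal sketch below (class and method names are illustrative), changing the source array after construction leaves the string untouched, which is exactly the guarantee the copy pays for.

```java
// Illustrative demonstration of the defensive copy in new String(char[]).
final class DefensiveCopySketch {
    /** Builds a string, then mutates the source array, and returns the string. */
    static String buildThenMutate(char[] chars) {
        String s = new String(chars); // copies chars into a private array
        chars[0] = 'x';               // does not affect s
        return s;
    }

    public static void main(String[] args) {
        char[] chars = {'a', 'b', 'c'};
        System.out.println(buildThenMutate(chars)); // abc
        System.out.println(new String(chars));      // xbc
    }
}
```

A non-copying construction gives up this guarantee, so the caller must never modify the array afterwards.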
JDK 8 also has a constructor that does not copy, but it is not public. Calling it requires a trick: binding it through MethodHandles.Lookup and LambdaMetafactory. Code for this technique appears later in the article.
    public final class String {
        String(char[] value, boolean share) {
            // assert share : "unshared not supported";
            this.value = value;
        }
    }
There are three ways to construct strings quickly:
(1) Bind the constructor with MethodHandles.Lookup and LambdaMetafactory
(2) Use the related methods of JavaLangAccess
(3) Construct directly with Unsafe
Methods 1 and 2 perform almost the same and method 3 is slightly slower, but all three are much faster than a plain new String. JMH results on JDK 8:
Benchmark Mode Cnt Score Error Units
StringCreateBenchmark.invoke thrpt 5 784869.350 ±1936.754 ops/ms
StringCreateBenchmark.langAccess thrpt 5 784029.186 ±2734.300 ops/ms
StringCreateBenchmark.unsafe thrpt 5 761176.319 ±11914.549 ops/ms
StringCreateBenchmark.newString thrpt 5 140883.533 ±2217.773 ops/ms
On JDK 9 and later, constructing the string directly from a byte[] gives even better results when all characters are ASCII.
4.1 Fast string construction based on MethodHandles.Lookup and LambdaMetafactory
4.1.1 Fast string construction in JDK 8
    public static BiFunction<char[], Boolean, String> getStringCreatorJDK8() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);
        MethodHandle handle = caller.findConstructor(
                String.class,
                MethodType.methodType(void.class, char[].class, boolean.class));
        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "apply",
                MethodType.methodType(BiFunction.class),
                handle.type().generic(),
                handle,
                handle.type());
        return (BiFunction<char[], Boolean, String>) callSite.getTarget().invokeExact();
    }
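The binding step itself can be tried without any access tricks. The sketch below applies the same LambdaMetafactory.metafactory call to a public method, String.length, using an ordinary MethodHandles.lookup(). It is only meant to show how a method handle is turned into a functional interface; it does not reach the non-public constructor. The class name LambdaBindingSketch and the method bindLength are hypothetical.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.ToIntFunction;

// Illustrative sketch: binds the public String.length() to a ToIntFunction.
final class LambdaBindingSketch {
    @SuppressWarnings("unchecked")
    static ToIntFunction<String> bindLength() {
        try {
            MethodHandles.Lookup lookup = MethodHandles.lookup();
            MethodHandle handle = lookup.findVirtual(
                    String.class, "length", MethodType.methodType(int.class));
            CallSite callSite = LambdaMetafactory.metafactory(
                    lookup,
                    "applyAsInt",                                   // interface method name
                    MethodType.methodType(ToIntFunction.class),     // factory type
                    MethodType.methodType(int.class, Object.class), // erased interface signature
                    handle,                                         // implementation
                    handle.type());                                 // specialized signature
            return (ToIntFunction<String>) callSite.getTarget().invokeExact();
        } catch (Throwable t) {
            throw new IllegalStateException("binding failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println(bindLength().applyAsInt("hello")); // 5
    }
}
```

The returned object is a generated class that calls the target directly, which is why this kind of binding can be faster than a call through Method.invoke.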
4.1.2 Fast string construction in JDK 11
    public static ToIntFunction<String> getStringCode11() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);
        MethodHandle handle = caller.findVirtual(
                String.class, "coder", MethodType.methodType(byte.class));
        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "applyAsInt",
                MethodType.methodType(ToIntFunction.class),
                MethodType.methodType(int.class, Object.class),
                handle,
                handle.type());
        return (ToIntFunction<String>) callSite.getTarget().invokeExact();
    }

    if (JDKUtils.JVM_VERSION == 11) {
        Function<byte[], String> stringCreator = JDKUtils.getStringCreatorJDK11();
        byte[] bytes = new byte[] {'a', 'b', 'c'};
        String apply = stringCreator.apply(bytes);
        assertEquals("abc", apply);
    }
4.1.3 Fast string construction in JDK 17
In JDK 17, MethodHandles.Lookup uses Reflection.registerFieldsToFilter to protect lookupClass and allowedModes, so the approach found online of modifying allowedModes no longer works.
In JDK 17, using MethodHandles this way requires a JVM startup parameter:
    --add-opens java.base/java.lang.invoke=ALL-UNNAMED

    public static BiFunction<byte[], Charset, String> getStringCreatorJDK17() throws Throwable {
        Constructor<MethodHandles.Lookup> constructor =
                MethodHandles.Lookup.class.getDeclaredConstructor(Class.class, Class.class, int.class);
        constructor.setAccessible(true);
        MethodHandles.Lookup lookup = constructor.newInstance(
                String.class,
                null,
                -1 // Lookup.TRUSTED
        );
        MethodHandles.Lookup caller = lookup.in(String.class);
        MethodHandle handle = caller.findStatic(
                String.class,
                "newStringNoRepl1",
                MethodType.methodType(String.class, byte[].class, Charset.class));
        CallSite callSite = LambdaMetafactory.metafactory(
                caller,
                "apply",
                MethodType.methodType(BiFunction.class),
                handle.type().generic(),
                handle,
                handle.type());
        return (BiFunction<byte[], Charset, String>) callSite.getTarget().invokeExact();
    }

    if (JDKUtils.JVM_VERSION == 17) {
        BiFunction<byte[], Charset, String> stringCreator = JDKUtils.getStringCreatorJDK17();
        byte[] bytes = new byte[] {'a', 'b', 'c'};
        String apply = stringCreator.apply(bytes, StandardCharsets.US_ASCII);
        assertEquals("abc", apply);
    }
4.2 Fast construction based on JavaLangAccess
JavaLangAccess, obtained from SharedSecrets, can also construct a string without copying, but it is more troublesome. Its API differs across JDK 8/11/17, so one code base cannot easily support all of them. This approach is not recommended.
    JavaLangAccess javaLangAccess = SharedSecrets.getJavaLangAccess();
    javaLangAccess.newStringNoRepl(b, StandardCharsets.US_ASCII);
4.3 Fast string construction based on Unsafe
    public static final Unsafe UNSAFE;
    static {
        Unsafe unsafe = null;
        try {
            Field theUnsafeField = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafeField.setAccessible(true);
            unsafe = (Unsafe) theUnsafeField.get(null);
        } catch (Throwable ignored) {}
        UNSAFE = unsafe;
    }

    // valueOffset is the offset of the String.value field, from UNSAFE.objectFieldOffset
    Object str = UNSAFE.allocateInstance(String.class);
    UNSAFE.putObject(str, valueOffset, chars);
Note: from JDK 9 on the implementation differs, because the coder field has to be set as well:
    Object str = UNSAFE.allocateInstance(String.class);
    UNSAFE.putByte(str, coderOffset, (byte) 0);
    UNSAFE.putObject(str, valueOffset, (byte[]) bytes);
4.4 Techniques for building strings quickly
The following method formats a date as a yyyyMMdd string and performs very well.
    public String formatYYYYMMDD(Calendar calendar) throws Throwable {
        int year = calendar.get(Calendar.YEAR);
        int month = calendar.get(Calendar.MONTH) + 1;
        int dayOfMonth = calendar.get(Calendar.DAY_OF_MONTH);

        byte y0 = (byte) (year / 1000 + '0');
        byte y1 = (byte) ((year / 100) % 10 + '0');
        byte y2 = (byte) ((year / 10) % 10 + '0');
        byte y3 = (byte) (year % 10 + '0');
        byte m0 = (byte) (month / 10 + '0');
        byte m1 = (byte) (month % 10 + '0');
        byte d0 = (byte) (dayOfMonth / 10 + '0');
        byte d1 = (byte) (dayOfMonth % 10 + '0');

        if (JDKUtils.JVM_VERSION >= 9) {
            byte[] bytes = new byte[] {y0, y1, y2, y3, m0, m1, d0, d1};
            if (JDKUtils.JVM_VERSION == 17) {
                return JDKUtils.getStringCreatorJDK17().apply(bytes, StandardCharsets.US_ASCII);
            }
            // ... the branches for other JDK versions are missing from the source text
        }
        // ...
    }
In JDK 8, String.charAt is a bounds check followed by an array read:
    public final class String {
        private final char value[];

        public char charAt(int index) {
            if ((index < 0) || (index >= value.length)) {
                throw new StringIndexOutOfBoundsException(index);
            }
            return value[index];
        }
    }
From JDK 9 on, charAt is more expensive because it has to branch on the coder:
    public final class String {
        private final byte[] value;
        private final byte coder;

        public char charAt(int index) {
            if (isLatin1()) {
                return StringLatin1.charAt(value, index);
            } else {
                return StringUTF16.charAt(value, index);
            }
        }
    }
5.1 Methods of obtaining String.value
The methods to obtain String.value are as follows:
Use Field reflection
Use Unsafe
JMH results on JDK 8 comparing Unsafe with Field reflection:
Benchmark Mode Cnt Score Error Units
StringGetValueBenchmark.reflect thrpt 5 438374.685 ±1032.028 ops/ms
StringGetValueBenchmark.unsafe thrpt 5 1302654.150 ±59169.706 ops/ms
5.1.1 Using reflection to get String.value
    static Field valueField;
    static {
        try {
            valueField = String.class.getDeclaredField("value");
            valueField.setAccessible(true);
        } catch (NoSuchFieldException ignored) {}
    }

    char[] chars = (char[]) valueField.get(str);
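The snippet above assumes JDK 8, where value is a char[] and setAccessible succeeds. As a hedged sketch of how the same idea might be written defensively: on JDK 9 and later value is a byte[], and on recent JDKs setAccessible throws unless java.base/java.lang is opened with --add-opens. The class StringValueAccessSketch and its methods are hypothetical; falling back to toCharArray is an assumption about what a caller would want.

```java
import java.lang.reflect.Field;

// Illustrative sketch: reflective access to String.value with a safe fallback.
final class StringValueAccessSketch {
    /** Returns the backing array (char[] or byte[]), or null if access is denied. */
    static Object tryGetValue(String s) {
        try {
            Field f = String.class.getDeclaredField("value");
            f.setAccessible(true); // may throw on JDK 16+ without --add-opens
            return f.get(s);
        } catch (ReflectiveOperationException | RuntimeException e) {
            return null;
        }
    }

    /** Returns the chars of s, sharing the backing array only when it is a char[]. */
    static char[] chars(String s) {
        Object value = tryGetValue(s);
        if (value instanceof char[]) {
            return (char[]) value; // JDK 8: no copy
        }
        return s.toCharArray();    // otherwise: copying fallback
    }

    public static void main(String[] args) {
        System.out.println(new String(chars("abc"))); // abc
    }
}
```

The fallback keeps the code correct everywhere, at the cost of losing the zero-copy benefit where access is denied.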
5.1.2 Using Unsafe to get String.value
On JDK 8:
    static long valueFieldOffset;
    static {
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueFieldOffset = UNSAFE.objectFieldOffset(valueField);
        } catch (NoSuchFieldException ignored) {}
    }

    char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);
On JDK 9 and later, where value is a byte[] and the coder field is needed as well:
    static long valueFieldOffset;
    static long coderFieldOffset;
    static {
        try {
            Field valueField = String.class.getDeclaredField("value");
            valueFieldOffset = UNSAFE.objectFieldOffset(valueField);
            Field coderField = String.class.getDeclaredField("coder");
            coderFieldOffset = UNSAFE.objectFieldOffset(coderField);
        } catch (NoSuchFieldException ignored) {}
    }

    byte coder = UNSAFE.getByte(str, coderFieldOffset);
    byte[] bytes = (byte[]) UNSAFE.getObject(str, valueFieldOffset);
6. Faster encodeUTF8 method
When String.value is directly accessible, you can run encodeUTF8 on it directly, which performs much better than String.getBytes(StandardCharsets.UTF_8).
6.1 High-performance encodeUTF8 on JDK 8
    public static int encodeUTF8(char[] src, int offset, int len, byte[] dst, int dp) {
        int sl = offset + len;
        int dlASCII = dp + Math.min(len, dst.length);

        // ASCII only optimized loop
        while (dp < dlASCII && src[offset] < '\u0080') {
            dst[dp++] = (byte) src[offset++];
        }

        while (offset < sl) {
            char c = src[offset++];
            if (c < 0x80) {
                // Have at most seven bits
                dst[dp++] = (byte) c;
            } else if (c < 0x800) {
                // 2 bytes, 11 bits
                dst[dp++] = (byte) (0xc0 | (c >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { // Character.isSurrogate(c)
                // The source text is garbled inside this branch; the pair handling is a reconstruction.
                final int uc;
                int ip = offset - 1;
                if (c >= '\uD800' && c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                    if (sl - ip < 2) {
                        uc = -1;
                    } else {
                        char d = src[ip + 1];
                        if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                            uc = Character.toCodePoint(c, d);
                        } else {
                            uc = -1;
                        }
                    }
                } else {
                    uc = -1; // lone low surrogate
                }
                if (uc < 0) {
                    dst[dp++] = (byte) '?';
                } else {
                    // 4 bytes, 21 bits
                    dst[dp++] = (byte) (0xf0 | (uc >> 18));
                    dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                    dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                    offset++; // 2 chars
                }
            } else {
                // 3 bytes, 16 bits
                dst[dp++] = (byte) (0xe0 | (c >> 12));
                dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }
An example of using the encodeUTF8 method:
    char[] chars = (char[]) UNSAFE.getObject(str, valueFieldOffset);
    // ensureCapacity(chars.length * 3)
    byte[] bytes = ...; //
    int bytesLength = IOUtils.encodeUTF8(chars, 0, chars.length, bytes, bytesOffset);
This way encodeUTF8 writes straight into the destination buffer with no extra arraycopy, which improves performance.
6.1.1 Performance comparison
Test code:
    public class EncodeUTF8Benchmark {
        static String STR = "01234567890ABCDEFGHIJKLMNOPQRSTUVZZZZabcdefghijklmnopqrstuvwzyz1234567890";
        static byte[] out;
        static long valueFieldOffset;

        static {
            out = new byte[STR.length() * 3];
            try {
                Field valueField = String.class.getDeclaredField("value");
                valueFieldOffset = UnsafeUtils.UNSAFE.objectFieldOffset(valueField);
            } catch (NoSuchFieldException e) {
                e.printStackTrace();
            }
        }

        @Benchmark
        public void unsafeEncodeUTF8() throws Exception {
            char[] chars = (char[]) UnsafeUtils.UNSAFE.getObject(STR, valueFieldOffset);
            int len = IOUtils.encodeUTF8(chars, 0, chars.length, out, 0);
        }

        @Benchmark
        public void getBytesUTF8() throws Exception {
            byte[] bytes = STR.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(bytes, 0, out, 0, bytes.length);
        }

        public static void main(String[] args) throws RunnerException {
            Options options = new OptionsBuilder()
                    .include(EncodeUTF8Benchmark.class.getName())
                    .mode(Mode.Throughput)
                    .timeUnit(TimeUnit.MILLISECONDS)
                    .forks(1)
                    .build();
            new Runner(options).run();
        }
    }
Test result
EncodeUTF8Benchmark.getBytesUTF8 thrpt 5 20690.960 ±5431.442 ops/ms
EncodeUTF8Benchmark.unsafeEncodeUTF8 thrpt 5 34508.606 ±55.510 ops/ms
The results show that calling encodeUTF8 directly on the array obtained through Unsafe takes roughly 60% of the time of getBytesUTF8 (20690.960 / 34508.606 ≈ 0.60).
6.2 High-performance encodeUTF8 on JDK 9/11/17
    public static int encodeUTF8(byte[] src, int offset, int len, byte[] dst, int dp) {
        int sl = offset + len;
        while (offset < sl) {
            byte b0 = src[offset++];
            byte b1 = src[offset++];

            if (b1 == 0 && b0 >= 0) {
                dst[dp++] = b0;
            } else {
                char c = (char) ((b0 & 0xff) | ((b1 & 0xff) << 8));
                if (c < 0x800) {
                    // 2 bytes, 11 bits
                    dst[dp++] = (byte) (0xc0 | (c >> 6));
                    dst[dp++] = (byte) (0x80 | (c & 0x3f));
                } else if (c >= '\uD800' && c < ('\uDFFF' + 1)) { // Character.isSurrogate(c)
                    // The source text is garbled inside this branch; the pair handling is a reconstruction.
                    final int uc;
                    if (c >= '\uD800' && c < ('\uDBFF' + 1)) { // Character.isHighSurrogate(c)
                        if (sl - offset < 2) {
                            uc = -1;
                        } else {
                            b0 = src[offset];
                            b1 = src[offset + 1];
                            char d = (char) ((b0 & 0xff) | ((b1 & 0xff) << 8));
                            if (d >= '\uDC00' && d < ('\uDFFF' + 1)) { // Character.isLowSurrogate(d)
                                uc = Character.toCodePoint(c, d);
                            } else {
                                uc = -1;
                            }
                        }
                    } else {
                        uc = -1; // lone low surrogate
                    }
                    if (uc < 0) {
                        dst[dp++] = (byte) '?';
                    } else {
                        // 4 bytes, 21 bits
                        dst[dp++] = (byte) (0xf0 | (uc >> 18));
                        dst[dp++] = (byte) (0x80 | ((uc >> 12) & 0x3f));
                        dst[dp++] = (byte) (0x80 | ((uc >> 6) & 0x3f));
                        dst[dp++] = (byte) (0x80 | (uc & 0x3f));
                        offset += 2; // 2 chars: skip both bytes of the low surrogate
                    }
                } else {
                    // 3 bytes, 16 bits
                    dst[dp++] = (byte) (0xe0 | (c >> 12));
                    dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (c & 0x3f));
                }
            }
        }
        return dp;
    }
An example of using the encodeUTF8 method:
    byte coder = UNSAFE.getByte(str, coderFieldOffset);
    byte[] value = (byte[]) UNSAFE.getObject(str, valueFieldOffset);
    if (coder == 0) {
        // ascii arraycopy (a plain copy is valid only if every LATIN1 byte is >= 0)
    } else {
        // ensureCapacity((value.length / 2) * 3)
        byte[] bytes = ...; //
        int bytesLength = IOUtils.encodeUTF8(value, 0, value.length, bytes, bytesOffset);
    }
This way encodeUTF8 writes straight into the destination buffer with no extra arraycopy, which improves performance.
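The byte-pair loop can be exercised without Unsafe. The sketch below is a compact variant of the same idea, not the article's exact code: it obtains the UTF-16LE bytes with String.getBytes (which copies, so it shows the logic but not the zero-copy benefit), sizes the output at 3 bytes per char, and compares the result with the JDK's own UTF-8 encoder. The class Utf16BytesToUtf8Sketch and its method toUtf8 are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch: UTF-16LE bytes to UTF-8, checked against the JDK encoder.
final class Utf16BytesToUtf8Sketch {
    /** Encodes UTF-16LE bytes as UTF-8 into dst starting at dp; returns the new dp. */
    static int encodeUTF8(byte[] src, int offset, int len, byte[] dst, int dp) {
        int end = offset + len;
        while (offset + 1 < end) {
            char c = (char) ((src[offset] & 0xff) | ((src[offset + 1] & 0xff) << 8));
            offset += 2;
            if (c < 0x80) {
                dst[dp++] = (byte) c;
            } else if (c < 0x800) {
                dst[dp++] = (byte) (0xc0 | (c >> 6));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            } else if (Character.isHighSurrogate(c) && offset + 1 < end) {
                char d = (char) ((src[offset] & 0xff) | ((src[offset + 1] & 0xff) << 8));
                if (Character.isLowSurrogate(d)) {
                    int cp = Character.toCodePoint(c, d);
                    dst[dp++] = (byte) (0xf0 | (cp >> 18));
                    dst[dp++] = (byte) (0x80 | ((cp >> 12) & 0x3f));
                    dst[dp++] = (byte) (0x80 | ((cp >> 6) & 0x3f));
                    dst[dp++] = (byte) (0x80 | (cp & 0x3f));
                    offset += 2; // consume the low surrogate
                } else {
                    dst[dp++] = (byte) '?'; // unpaired high surrogate
                }
            } else if (Character.isSurrogate(c)) {
                dst[dp++] = (byte) '?'; // lone or trailing surrogate
            } else {
                dst[dp++] = (byte) (0xe0 | (c >> 12));
                dst[dp++] = (byte) (0x80 | ((c >> 6) & 0x3f));
                dst[dp++] = (byte) (0x80 | (c & 0x3f));
            }
        }
        return dp;
    }

    /** Convenience wrapper: worst case is 3 UTF-8 bytes per UTF-16 code unit. */
    static byte[] toUtf8(String s) {
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16LE);
        byte[] buf = new byte[(utf16.length / 2) * 3];
        int n = encodeUTF8(utf16, 0, utf16.length, buf, 0);
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        String s = "abc\u00e9\u4e2d\uD83D\uDE00";
        System.out.println(Arrays.equals(toUtf8(s), s.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

A surrogate pair uses two code units but only 4 output bytes, so the 3-bytes-per-char bound is always enough.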
That concludes this article on how to improve the performance of Java string encoding and decoding. I hope the content shared here is helpful to you.