Analysis of String and Encoding Examples in Java

2025-12-29 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article looks at strings and character encoding in Java through a series of examples. These pitfalls come up often in real projects, so each section below shows how to handle one of them.

Creating a string from incomplete characters in a variable-length encoding

In Java, a String has traditionally been backed by a char[] encoded in UTF-16.

Note that since JDK 9 the underlying storage of String is a byte[], holding either Latin-1 or UTF-16 data.

StringBuilder and StringBuffer switched to the same byte[] representation in JDK 9. The public API of all three classes still works in UTF-16 char units.

So whenever we read, write, or build strings with InputStreamReader, OutputStreamWriter, or the String class, a conversion between UTF-16 and some other encoding is involved.

Let's take a look at the problems that can come up when converting from UTF-8 to UTF-16.

First, compare the two encodings: UTF-8 uses 1 to 4 bytes to represent a character, while UTF-16 uses either 2 or 4 bytes.
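To make those byte counts concrete, here is a minimal sketch that encodes a few sample characters both ways. The class and method names (EncodedLengthDemo, utf8Length, utf16Length) are made up for this illustration, and UTF_16BE is used so that no byte-order mark is counted.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengthDemo {
    /** Number of bytes needed to encode s in UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes needed to encode s in UTF-16 (big-endian, no byte-order mark). */
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Samples are written as escapes so the source file's own encoding does not matter:
        // U+0041, U+00E9, U+4E2D, and U+1F600 (a surrogate pair).
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8, and 2, 2, 2, and 4 bytes in UTF-16.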

What might be the problem with the conversion?

    public String readByteWrong(InputStream inputStream) throws IOException {
        byte[] data = new byte[1024];
        int offset = 0;
        int bytesRead = 0;
        String str = "";
        while ((bytesRead = inputStream.read(data, offset, data.length - offset)) != -1) {
            // Wrong: a chunk may end in the middle of a multi-byte character.
            str += new String(data, offset, bytesRead, "UTF-8");
            offset += bytesRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return str;
    }

The code above reads bytes from the stream and converts each chunk to a String as soon as it is read. UTF-8 is a variable-length encoding, so if a read happens to end partway through a multi-byte character, the resulting String is wrong.

We need to do the following:

    public String readByteCorrect(InputStream inputStream) throws IOException {
        // The reader decodes UTF-8 bytes into chars for us.
        Reader r = new InputStreamReader(inputStream, "UTF-8");
        char[] data = new char[1024];
        int offset = 0;
        int charRead = 0;
        String str = "";
        while ((charRead = r.read(data, offset, data.length - offset)) != -1) {
            str += new String(data, offset, charRead);
            offset += charRead;
            if (offset >= data.length) {
                throw new IOException("Too much input");
            }
        }
        return str;
    }

Here we use an InputStreamReader. The reader automatically converts the bytes it reads into chars, that is, from UTF-8 to UTF-16, and it holds back any incomplete byte sequence until the remaining bytes arrive.

So the problem does not occur.
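The difference is easy to reproduce with a tiny input. The sketch below is only an illustration under assumed names (PartialUtf8Demo, decodePerChunk, decodeWithReader): it decodes the same UTF-8 bytes once chunk by chunk, which splits a three-byte character, and once through an InputStreamReader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class PartialUtf8Demo {
    /** Decodes every chunk on its own, the way readByteWrong does. */
    static String decodePerChunk(byte[] bytes, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int start = 0; start < bytes.length; start += chunkSize) {
            int len = Math.min(chunkSize, bytes.length - start);
            sb.append(new String(bytes, start, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    /** Lets an InputStreamReader do the decoding, even with a very small char buffer. */
    static String decodeWithReader(byte[] bytes, int bufferSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buf = new char[bufferSize];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the demo short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "a\u4E2Db";                            // U+4E2D takes 3 bytes in UTF-8
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8); // 1 + 3 + 1 bytes
        System.out.println(original.equals(decodePerChunk(utf8, 2)));   // false
        System.out.println(original.equals(decodeWithReader(utf8, 1))); // true
    }
}
```

With a chunk size of 2, the three-byte character is cut apart and each fragment is replaced by U+FFFD, the Unicode replacement character. The reader returns the original text regardless of buffer size.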

A char cannot represent every Unicode character

Because a char holds a UTF-16 code unit, a single char can directly represent the characters in the ranges U+0000 to U+D7FF and U+E000 to U+FFFF.

Characters from U+10000 to U+10FFFF, however, are represented by two chars (a surrogate pair): the first in the range 0xD800-0xDBFF and the second in the range 0xDC00-0xDFFF.

In that case only the two chars together are meaningful. Either char on its own means nothing.
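As a quick illustration, the sketch below builds a String from one code point above U+FFFF and inspects its chars. The class name SurrogatePairDemo and the helper fromCodePoint are made up for this example, and U+1D538 is just one convenient sample character.

```java
class SurrogatePairDemo {
    /** Builds the String for a single code point; it holds one or two chars. */
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromCodePoint(0x1D538); // MATHEMATICAL DOUBLE-STRUCK CAPITAL A
        System.out.println(s.length());                             // 2 (chars)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (code point)
        System.out.println(Integer.toHexString(s.charAt(0)));       // d835
        System.out.println(Integer.toHexString(s.charAt(1)));       // dd38
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

Note that length() counts chars, not characters, so it returns 2 for this one-character string.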

Consider the substring method below, which is meant to find the position of the first non-letter in the input string and return the substring starting there.

    public static String subStringWrong(String string) {
        char ch;
        int i;
        for (i = 0; i < string.length(); i += 1) {
            ch = string.charAt(i);
            if (!Character.isLetter(ch)) {
                break;
            }
        }
        return string.substring(i);
    }

The example above examines the chars of the string one at a time. For a character in the range U+10000 to U+10FFFF, each half of the surrogate pair is tested separately, so the method wrongly concludes that the character is not a letter.

We can modify it as follows:

    public static String subStringCorrect(String string) {
        int ch;
        int i;
        for (i = 0; i < string.length(); i += Character.charCount(ch)) {
            ch = string.codePointAt(i);
            if (!Character.isLetter(ch)) {
                break;
            }
        }
        return string.substring(i);
    }

We use the codePointAt method of String to get the Unicode code point at the current position, and then pass that code point to isLetter. Character.charCount tells us how many chars to advance for that code point.
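To see the two approaches disagree, here is a small sketch that returns the index of the first non-letter both ways. The names FirstNonLetterDemo, byChar, and byCodePoint are hypothetical, and the test string embeds U+1D538, a letter outside the range a single char can hold.

```java
class FirstNonLetterDemo {
    /** Index of the first non-letter, testing one char at a time. */
    static int byChar(String s) {
        int i;
        for (i = 0; i < s.length(); i++) {
            if (!Character.isLetter(s.charAt(i))) {
                break;
            }
        }
        return i;
    }

    /** Index of the first non-letter, testing one code point at a time. */
    static int byCodePoint(String s) {
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            if (!Character.isLetter(cp)) {
                break;
            }
            i += Character.charCount(cp); // 1 for most characters, 2 for a surrogate pair
        }
        return i;
    }

    public static void main(String[] args) {
        // 'a', U+1D538 (two chars), 'b', '1'
        String s = "a" + new String(Character.toChars(0x1D538)) + "b1";
        System.out.println(byChar(s));      // 1: stops at the high surrogate
        System.out.println(byCodePoint(s)); // 4: stops at the digit
    }
}
```

Both methods return the length of the string when every character is a letter.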

Note the use of Locale

To support internationalization, Java introduces the concept of Locale, and a Locale can cause unexpected changes when strings are converted.

Consider the following example:

    public void toUpperCaseWrong(String input) {
        if (input.toUpperCase().equals("JOKER")) {
            System.out.println("match!");
        }
    }

We expect English casing rules here, but if the system's default Locale is set to another language, input.toUpperCase() may return a different result.

Fortunately, toUpperCase has an overload that takes a Locale, so we can modify the code as follows:

    public void toUpperCaseRight(String input) {
        if (input.toUpperCase(Locale.ENGLISH).equals("JOKER")) {
            System.out.println("match!");
        }
    }
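A well-known case where the locale changes the result is Turkish, in which the lowercase letter i is upper-cased to a dotted capital I (U+0130). The sketch below uses a different sample word, "title", because it contains that letter. The class and method names are made up for this illustration.

```java
import java.util.Locale;

class LocaleCaseDemo {
    /** Upper-cases input under the given locale and compares it with "TITLE". */
    static boolean matchesTitle(String input, Locale locale) {
        return input.toUpperCase(locale).equals("TITLE");
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println(matchesTitle("title", Locale.ENGLISH)); // true
        System.out.println(matchesTitle("title", turkish));        // false
        // Under Turkish rules the second character becomes U+0130 instead of 'I'.
        System.out.println((int) "title".toUpperCase(turkish).charAt(1)); // 304
    }
}
```

Passing the locale explicitly keeps the comparison independent of whatever default locale the machine happens to use.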

DateFormat has the same problem:

    public void getDateInstanceWrong(Date date) {
        String myString = DateFormat.getDateInstance().format(date);
    }

    public void getDateInstanceRight(Date date) {
        String myString = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.US).format(date);
    }

When comparing strings, we must take into account the impact of Locale.

Encoding format in file reading and writing

When we use InputStream and OutputStream to copy data between files, no transcoding is involved, because the data is handled as raw bytes.

But if we use a Reader and a Writer to operate on a file, we need to consider the file's encoding.

If the file is encoded in UTF-8 and we read it as UTF-16, the text will be garbled.

Consider the following example:

    public void fileOperationWrong(String inputFile, String outputFile) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(inputFile));
        PrintWriter writer = new PrintWriter(new FileWriter(outputFile));
        int line = 0;
        while (reader.ready()) {
            line++;
            writer.println(line + ":" + reader.readLine());
        }
        reader.close();
        writer.close();
    }

We want to read the source file and write it to a new file with line numbers added. However, FileReader and FileWriter use the default charset, which is not necessarily the file's actual encoding, so the result may be corrupted.

The above code can be modified like this:

    BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(inputFile), Charset.forName("UTF-8")));
    PrintWriter writer = new PrintWriter(
            new OutputStreamWriter(new FileOutputStream(outputFile), Charset.forName("UTF-8")));

Explicitly specifying the encoding ensures that the operation is correct.
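For completeness, here is one possible self-contained version of the whole method. It is a sketch rather than the article's own code: it swaps in java.nio.file.Files and try-with-resources, and the names NumberLinesDemo and numberLines are made up for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class NumberLinesDemo {
    /** Copies input to output, prefixing each line with its number; both files are UTF-8. */
    static void numberLines(Path input, Path output) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String text;
            int line = 0;
            while ((text = reader.readLine()) != null) {
                line++;
                writer.write(line + ":" + text);
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("numberlines-in", ".txt");
        Path out = Files.createTempFile("numberlines-out", ".txt");
        // Two lines of non-ASCII text, written explicitly as UTF-8.
        Files.write(in, "h\u00E9llo\n\u4E2D\u6587\n".getBytes(StandardCharsets.UTF_8));
        numberLines(in, out);
        for (String l : Files.readAllLines(out, StandardCharsets.UTF_8)) {
            System.out.println(l);
        }
    }
}
```

One difference worth knowing: a reader from Files.newBufferedReader throws MalformedInputException on bytes that are invalid in the given charset, whereas an InputStreamReader constructed with a charset silently substitutes the replacement character.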

Do not encode non-character data into strings

We often need to encode binary data as a String, for example to store it in a database.

Binary data is expressed in bytes, but as the discussion above shows, not every byte sequence can be represented as characters. Converting bytes that do not represent characters into a String can cause problems.

Look at the following example:

    public void convertBigIntegerWrong() {
        BigInteger x = new BigInteger("1234567891011");
        System.out.println(x);
        byte[] byteArray = x.toByteArray();
        String s = new String(byteArray);
        byteArray = s.getBytes();
        x = new BigInteger(byteArray);
        System.out.println(x);
    }

In the example above, we convert the BigInteger to a byte array (in big-endian order), then convert the byte array to a String. Finally, we convert the String back to a BigInteger.

Let's take a look at the results:

    1234567891011
    80908592843917379

The conversion did not succeed: the second value differs from the first.
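The reason becomes visible when we print the bytes before and after the round trip. The sketch below assumes UTF-8 as the charset and names it explicitly; the original example relies on the platform default instead. The class name LossyRoundTripDemo and its helpers are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class LossyRoundTripDemo {
    /** Decodes bytes as UTF-8 and encodes the resulting String back to UTF-8. */
    static byte[] roundTripUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        byte[] before = new BigInteger("1234567891011").toByteArray();
        byte[] after = roundTripUtf8(before);
        System.out.println(hex(before));           // 01 1F 71 FB 08 43
        System.out.println(hex(after));            // 01 1F 71 EF BF BD 08 43
        System.out.println(new BigInteger(after)); // 80908592843917379
    }
}
```

The byte 0xFB is not valid UTF-8, so decoding replaces it with U+FFFD, which is encoded back as the three bytes EF BF BD. The array is now longer and represents a different number.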

The String constructor does accept a character encoding as its second argument, and Java supports encodings such as US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16. With UTF-16, big-endian byte order is assumed when the data has no byte-order mark. Passing an encoding, however, does not turn arbitrary binary data into text.

How to modify the above example?

    public void convertBigIntegerRight() {
        BigInteger x = new BigInteger("1234567891011");
        String s = x.toString(); // converted into a storable string
        byte[] byteArray = s.getBytes();
        String ns = new String(byteArray);
        x = new BigInteger(ns);
        System.out.println(x);
    }

We first convert the BigInteger to a printable string with the toString method, and only then convert that string to and from bytes.

We can also use Base64 to encode the byte array without losing any information, as shown below:

    public void convertBigIntegerWithBase64() {
        BigInteger x = new BigInteger("1234567891011");
        byte[] byteArray = x.toByteArray();
        String s = Base64.getEncoder().encodeToString(byteArray);
        byteArray = Base64.getDecoder().decode(s);
        x = new BigInteger(byteArray);
        System.out.println(x);
    }
