How many characters are there in a string?

2025-04-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how many characters a Java string really contains. The discussion is fairly detailed, and I hope it is useful to anyone interested.

According to the Java documentation, characters in Java are represented in UTF-16. A char has a minimum value of '\u0000' (0) and a maximum value of '\uffff' (65535); that is, a char occupies 2 bytes. Does that mean Java can represent at most 65536 characters?

char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65535 inclusive).

From The Java™ Tutorials

First of all, let's look at an example:

What do you think is the output of running this program?

Output result:

We know that if String.getBytes() is called without a charset, Java encodes the string with the platform's default charset. On my macOS machine the default encoding is UTF-8 (the locale command shows the operating system's encoding), so on my machine String.getBytes() returns a UTF-8 encoded byte array.
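To make the dependence on the charset concrete, here is a minimal sketch that encodes one string explicitly in two charsets. The class name GetBytesDemo, the helper byteLength, and the sample string are assumptions made for illustration; they are not from the original program.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class GetBytesDemo {
    // Number of bytes the string occupies when encoded in the given charset.
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String s = "\u4E2D\u6587"; // two BMP Chinese characters (assumed sample)

        // The no-argument overload depends on the platform default charset.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default bytes:   " + s.getBytes().length);

        // Passing the charset explicitly gives the same result on every machine.
        System.out.println("UTF-8 bytes:     " + byteLength(s, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes:  " + byteLength(s, StandardCharsets.UTF_16BE));
    }
}
```

Passing an explicit charset, as byteLength does, avoids results that change from one machine to another.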

String.length() returns the number of UTF-16 code units in the string.

String.toCharArray() returns a char array with one element per code unit.

Each string we set contains two Unicode characters, and the output is as follows:

Ordinary Chinese characters: the length of the string is 2, each Chinese character takes three bytes in UTF-8, and the length of the char array is 2, which all looks as expected.

Emoji characters: we set two emoji, a male and a female avatar. The length of the resulting string is 4, the UTF-8 encoding takes 8 bytes, and the length of the char array is 4.

Obscure Chinese characters: we set two Chinese characters, one of which is a rare character. The length of the resulting string is 3, the UTF-8 encoding takes 7 bytes, and the length of the char array is 3.
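The three cases above can be reproduced with a sketch like the following. The specific sample characters (U+4E2D, U+6587, U+1F468, U+1F469, and U+20000) and the names StringLengthDemo and measure are assumptions chosen to match the reported numbers; they are not necessarily the ones used in the original program.

```java
import java.nio.charset.StandardCharsets;

final class StringLengthDemo {
    // Returns {String.length(), UTF-8 byte count, char array length}.
    static int[] measure(String s) {
        return new int[] {
            s.length(),
            s.getBytes(StandardCharsets.UTF_8).length,
            s.toCharArray().length
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "\u4E2D\u6587",             // two ordinary Chinese characters
            "\uD83D\uDC68\uD83D\uDC69", // two emoji: U+1F468 and U+1F469
            "\u4E2D\uD840\uDC00"        // U+4E2D plus the rare character U+20000
        };
        for (String s : samples) {
            int[] m = measure(s);
            System.out.println("length=" + m[0] + " utf8Bytes=" + m[1] + " chars=" + m[2]);
        }
    }
}
```

With these samples the three lines report 2/6/2, 4/8/4, and 3/7/3.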

The number of characters in the string is not quite what we expected. Each string contains only two Unicode characters, but the reported length is sometimes 2, sometimes 3, and sometimes 4. Why?

It starts with the history of Java.

Java originally designed char (and its wrapper class Character) to represent a Unicode character in two bytes. That was not a problem at first, because early Unicode had relatively few characters. Versions before JDK 1.1 used Unicode 1.1.5, JDK 1.1 supported Unicode 2.0, JDK 1.1.7 supported Unicode 2.1, Java SE 1.4 supported Unicode 3.0, and Java SE 5.0 began to support Unicode 4.0.

Up to Unicode 3.0, representing a Unicode character in two bytes worked: Unicode 3.0 defined 49259 characters, and two bytes can represent 65536 values, so there was enough room for every Unicode 3.0 character.

But with Unicode 4.0 (in fact, starting with Unicode 3.1) the character set grew past that limit, to 96447 characters in Unicode 4.0. Unicode 11.0 already contains 137374 characters.

In Unicode, each character corresponds to a code point (an integer), written as U+ followed by a hexadecimal number. The code space is divided into 17 planes (numbered 0 to 16) of 65536 code points each: the Basic Multilingual Plane and 16 supplementary planes. The Basic Multilingual Plane (BMP), also known as plane 0, holds the most widely used characters.
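Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. A minimal sketch, with the class name PlaneDemo and the helpers planeOf and format as illustrative assumptions:

```java
final class PlaneDemo {
    // Plane number (0 to 16) of a code point: each plane spans 0x10000 code points.
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        return codePoint >>> 16;
    }

    // Conventional U+XXXX notation, padded to at least four hex digits.
    static String format(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x4E2D, 0x1F468, 0x20000, 0x10FFFF};
        for (int cp : samples) {
            System.out.println(format(cp) + " is in plane " + planeOf(cp));
        }
    }
}
```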

As a result, Java's two-byte char can no longer hold every Unicode 4.0 character, and a supplementary character may need 4 bytes. So a char no longer represents a character (a code point), but a code unit.

Code point: a numeric representation of a character. A character set can generally be represented by one or more two-dimensional tables of rows and columns. Each intersection of a row and a column is called a code point, and each code point is assigned a unique number, called the code point value or code point number. Except for noncharacter code points and reserved code points in some special areas (such as the surrogate area and the private use area), each code point corresponds to exactly one character. Unicode code points range from U+0000 to U+10FFFF.

Code unit: the smallest bit combination that can represent a unit of encoded text. For UTF-8, the code unit is 8 bits long; for UTF-16, it is 16 bits long. To put it another way, the smallest unit of UTF-8 is one byte, and the smallest unit of UTF-16 is two bytes.

Java represents characters as UTF-16 code units, and String.length() returns the number of code units, not the number of Unicode characters. For code points in the BMP, String.length() matches our usual notion of the character count. Each supplementary character counts as two, so String.length() can be up to twice the number of characters we perceive.

You may ask: a UTF-16-encoded supplementary character takes 4 bytes, so could its first two bytes collide with a BMP character, leaving the program unable to tell whether it is looking at a supplementary character or a BMP character?

Fortunately, the BMP code points from U+D800 to U+DFFF are permanently reserved and are not mapped to any Unicode character, so UTF-16 uses this reserved 0xD800-0xDFFF block to encode the code points of the supplementary planes.

In UTF-16, the supplementary code points run from U+10000 to U+10FFFF. Subtracting 0x10000 leaves a value from 0 to 0xFFFFF, which needs 20 bits. The first 16-bit code unit (two bytes, called the high or leading surrogate) holds the upper 10 of those 20 bits, and the second code unit (called the low or trailing surrogate) holds the lower 10 bits. High surrogates range from 0xD800 to 0xDBFF, and low surrogates from 0xDC00 to 0xDFFF.
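The arithmetic described above can be sketched by hand and checked against the JDK's own Character.highSurrogate and Character.lowSurrogate. The class name SurrogateDemo and the helper toSurrogates are assumptions made for illustration.

```java
final class SurrogateDemo {
    // Splits a supplementary code point into its {high, low} surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;              // 20-bit value, 0 to 0xFFFFF
        char high = (char) (0xD800 + (offset >>> 10)); // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // lower 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int cp = 0x1F468; // assumed sample code point
        char[] pair = toSurrogates(cp);
        System.out.printf("manual: %04X %04X%n", (int) pair[0], (int) pair[1]);
        // The JDK computes the same pair.
        System.out.printf("JDK:    %04X %04X%n",
                (int) Character.highSurrogate(cp), (int) Character.lowSurrogate(cp));
    }
}
```

For U+1F468 both lines show D83D DC68.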

You can see that the high and low surrogate ranges both fall on code points that are not mapped to characters in the BMP, so they cannot collide with BMP characters, and the two ranges do not overlap each other. So given any two-byte code unit, we can tell directly whether it is a BMP character, the high surrogate of a supplementary character, or the low surrogate.
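That three-way decision can be sketched with the standard Character.isHighSurrogate and Character.isLowSurrogate checks. The class name CodeUnitKind and the method classify are illustrative assumptions.

```java
final class CodeUnitKind {
    // Classifies a single UTF-16 code unit by the range it falls into.
    static String classify(char c) {
        if (Character.isHighSurrogate(c)) {
            return "high surrogate"; // 0xD800 to 0xDBFF
        }
        if (Character.isLowSurrogate(c)) {
            return "low surrogate";  // 0xDC00 to 0xDFFF
        }
        return "BMP character";
    }

    public static void main(String[] args) {
        String s = "\u4E2D\uD840\uDC00"; // U+4E2D followed by U+20000 (assumed sample)
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("%04X: %s%n", (int) s.charAt(i), classify(s.charAt(i)));
        }
    }
}
```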

Some overseas users put emoji in their nicknames, and some systems then fail to display them correctly. This is because those systems naively treat each char as a character, so when they truncate a name for display they may cut in the middle of a surrogate pair instead of at a code point boundary.

When we take part of a string, for example with String.substring, we can fall into the same trap, especially with frequently used emoji.
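One way to avoid cutting a surrogate pair in half is to compute the cut position in code points with String.offsetByCodePoints. The following is a sketch of that approach, not the only possible one; the class name SafeTruncate and the method truncate are assumptions.

```java
final class SafeTruncate {
    // Keeps at most maxCodePoints code points and never splits a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must not be negative");
        }
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s; // already short enough
        }
        // Translate a code point count into a code unit index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String name = "\uD83D\uDC68\uD83D\uDC69"; // two emoji (assumed sample)

        // Naive cut: one code unit, which is a lone high surrogate.
        String broken = name.substring(0, 1);
        System.out.println(Character.isHighSurrogate(broken.charAt(0)));

        // Code point aware cut: one complete emoji, two code units.
        System.out.println(truncate(name, 1).length());
    }
}
```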

Since Java 1.5, java.lang.String has provided code point methods for reading complete Unicode characters and counting them:

    public int codePointAt(int index)
    public int codePointBefore(int index)
    public int codePointCount(int beginIndex, int endIndex)

Note that the indexes in these methods are code unit (char) indexes, not code point indexes.
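A short sketch of how these methods behave on a string that contains a supplementary character. The sample string and the names CodePointMethodsDemo and codePointsOf are assumptions; String.codePoints() requires Java 8 or later.

```java
final class CodePointMethodsDemo {
    // All code points of the string, in order.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "\u4E2D\uD840\uDC00"; // U+4E2D followed by U+20000

        System.out.println(s.length());                      // code units: 3
        System.out.println(s.codePointCount(0, s.length())); // code points: 2

        // Index 1 is the high surrogate, so the whole pair is decoded.
        System.out.printf("U+%X%n", s.codePointAt(1));       // U+20000
        // Index 2 is the low surrogate on its own, so only that unit is returned.
        System.out.printf("U+%X%n", s.codePointAt(2));       // U+DC00
        // The code point ending just before index 3 is the full pair again.
        System.out.printf("U+%X%n", s.codePointBefore(3));   // U+20000

        for (int cp : codePointsOf(s)) {
            System.out.printf("U+%X%n", cp);
        }
    }
}
```

Because the index arguments count code units, calling codePointAt in the middle of a pair returns the lone surrogate rather than throwing.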

That is all on how many characters a string contains. I hope it helps, and if you found the article useful, feel free to share it.

© 2024 shulou.com SLNews company. All rights reserved.