How to understand Unicode and JavaScript

2025-07-11 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

Many newcomers are unsure how Unicode and JavaScript fit together. This article summarizes the causes of the confusion and the ways to resolve it, and I hope it helps you solve the problem.

Last month I gave a talk that covered the Unicode character set and the JavaScript language's support for it in detail. What follows is the text of that talk.

I. What is Unicode?

Unicode comes from a very simple idea: include all the characters in the world in one set. As long as a computer supports this character set, it can display every character, and there will be no more garbled text.

It starts at 0 and assigns a number to each symbol, which is called a "code point". For example, the symbol for code point 0 is null (meaning that all binary bits are 0).

U+0000 = null

In the above formula, U+ indicates that the hexadecimal number immediately following is the code point of Unicode.

At present, the latest version of Unicode is version 7.0, with a total of 109449 symbols, of which 74500 are Chinese, Japanese and Korean characters. Roughly speaking, more than 2/3 of the symbols in existence come from East Asian scripts. For example, the code point for the Chinese character 好 ("good") is 597D in hexadecimal.

U+597D = 好
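
As a small illustration of the code point idea, the sketch below converts between a character and its U+ notation. It assumes an engine with ES2015 `String.fromCodePoint` and `codePointAt` and ES2017 `padStart`; the helper name `toUPlus` is made up for this example and is not a built-in.

```javascript
// Format a code point in the U+XXXX notation used in the text.
// `toUPlus` is a hypothetical helper name chosen for this sketch.
function toUPlus(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Build the character from its code point, then read the code point back.
const hao = String.fromCodePoint(0x597D);
console.log(hao, toUPlus(hao.codePointAt(0)));
console.log(toUPlus(0));
```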

With so many symbols, Unicode was not defined all at once but in zones. Each zone, called a plane, holds 65536 (2^16) characters. There are currently 17 planes, so the whole Unicode code space holds 17 x 65536 = 1,114,112 code points, which fits within 21 bits (2^21).

The first 65536 character positions are called the basic plane (abbreviated BMP). Its code points range from 0 to 2^16 - 1, or in hexadecimal from U+0000 to U+FFFF. All of the most common characters sit on this plane, which was also the first plane that Unicode defined and published.

The remaining characters are placed in the auxiliary planes, also called supplementary planes (abbreviated SMP), with code points ranging from U+010000 to U+10FFFF.
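
The plane layout described above can be sketched in a few lines. This is only an illustration: `planeOf` is a hypothetical helper name, and the range check simply reflects the U+0000 to U+10FFFF limits given in the text.

```javascript
// Return the plane index (0 to 16) of a Unicode code point.
// `planeOf` is a hypothetical helper used only in this sketch.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  // Each plane holds 0x10000 (65536) code points.
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x597D));   // a basic plane character
console.log(planeOf(0x1D306));  // a character in the first auxiliary plane
console.log(planeOf(0x10FFFF)); // the last code point of the last plane
```

Plane 0 is the basic plane; planes 1 to 16 are the auxiliary planes.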

II. UTF-32 and UTF-8

Unicode only specifies the code point of each character. Which sequence of bytes represents that code point is a matter for the encoding method.

The most intuitive encoding represents every code point with four bytes whose content corresponds directly to the code point. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.

U+0000 = 0x0000 0000
U+597D = 0x0000 597D
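
The one-to-one mapping can be sketched as follows. The helper names `encodeUtf32BE` and `toHexBytes` are invented for this example, and big-endian byte order is an assumption, since the text does not say which byte order it has in mind.

```javascript
// Encode one code point as four bytes, most significant byte first.
// `encodeUtf32BE` is a hypothetical helper; big-endian order is assumed.
function encodeUtf32BE(codePoint) {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setUint32(0, codePoint, false); // false = big-endian
  return Array.from(new Uint8Array(buffer));
}

// Render bytes as two-digit uppercase hex, separated by spaces.
function toHexBytes(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

console.log(toHexBytes(encodeUtf32BE(0x0000))); // 00 00 00 00
console.log(toHexBytes(encodeUtf32BE(0x597D))); // 00 00 59 7D
```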

The advantage of UTF-32 is that the conversion rule is simple and intuitive, and lookups are efficient. The disadvantage is wasted space: the same English text is four times larger than in ASCII. This disadvantage is so serious that almost nobody uses the encoding in practice, and the HTML 5 standard explicitly states that web pages must not be encoded as UTF-32.

What people really need is a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding in which a character takes from 1 to 4 bytes. The more commonly used a character is, the fewer bytes it takes. The first 128 characters are represented by a single byte each, exactly the same as ASCII.

Code point range       Bytes
0x0000 - 0x007F        1
0x0080 - 0x07FF        2
0x0800 - 0xFFFF        3
0x010000 - 0x10FFFF    4
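
The ranges listed above translate directly into a lookup, sketched below. `utf8ByteLength` is a hypothetical helper name; the cross-check assumes a runtime that provides the standard `TextEncoder` global (modern browsers and recent Node.js versions do).

```javascript
// Number of bytes UTF-8 needs for one code point, following the ranges above.
// `utf8ByteLength` is a hypothetical helper used only in this sketch.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4; // 0x010000 to 0x10FFFF
}

// Cross-check against the platform's own UTF-8 encoder.
const utf8Encoder = new TextEncoder();
for (const cp of [0x41, 0xE9, 0x597D, 0x1D306]) {
  const actual = utf8Encoder.encode(String.fromCodePoint(cp)).length;
  console.log(cp.toString(16), utf8ByteLength(cp), actual);
}
```

For each sample code point the two numbers printed should agree.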

Because of this space-saving property, UTF-8 has become the most common encoding for web pages on the Internet. However, it has little to do with today's topic, so I won't go any further. For the specific transcoding method, you can refer to the Character Encoding Notes I wrote many years ago.

III. A brief introduction to UTF-16

UTF-16 sits between UTF-32 and UTF-8 and combines features of both fixed-length and variable-length encodings.

Its encoding rule is simple: characters in the basic plane occupy 2 bytes and characters in the auxiliary planes occupy 4 bytes. In other words, the encoded length in UTF-16 is either 2 bytes (U+0000 to U+FFFF) or 4 bytes (U+010000 to U+10FFFF).
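
The length rule can be sketched as a one-line function. `utf16ByteLength` is a hypothetical name; the comparison relies on the fact that a JavaScript string's `length` counts 2-byte units, and it assumes ES2015 `String.fromCodePoint` is available.

```javascript
// Bytes UTF-16 needs for one code point: 2 in the basic plane, 4 otherwise.
// `utf16ByteLength` is a hypothetical helper used only in this sketch.
function utf16ByteLength(codePoint) {
  return codePoint <= 0xFFFF ? 2 : 4;
}

// A string's length counts 2-byte units, so length * 2 gives the byte count.
for (const cp of [0x597D, 0x1D306]) {
  const units = String.fromCodePoint(cp).length;
  console.log(cp.toString(16), utf16ByteLength(cp), units * 2);
}
```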

So there is a question, when we encounter two bytes, how can we tell whether it is a character or whether it needs to be read together with the other two bytes?

Cleverly (I don't know whether it was designed this way on purpose), the range from U+D800 to U+DFFF in the basic plane is an empty segment: these code points do not correspond to any character. This empty segment can therefore be used to map the characters of the auxiliary planes.

Specifically, the auxiliary planes contain 2^20 character positions, so at least 20 binary bits are needed to address them. UTF-16 splits those 20 bits into two halves. The first 10 bits are mapped into U+D800 to U+DBFF and called the high part (H); the last 10 bits are mapped into U+DC00 to U+DFFF and called the low part (L). This means that one auxiliary plane character is represented as two basic plane code points.

So, when we encounter two bytes and find that its code point is between U+D800 and U+DBFF, we can conclude that the code point of the next two bytes should be between U+DC00 and U+DFFF, and these four bytes must be read together.
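
The detection rule and the 10-bit split can be sketched like this. The three function names are invented for this example; the arithmetic only restates the ranges given above (high part in U+D800 to U+DBFF, low part in U+DC00 to U+DFFF).

```javascript
// Classify a 16-bit UTF-16 unit. These helper names are hypothetical.
function isHighPart(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}
function isLowPart(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Combine a high/low pair into one auxiliary plane code point:
// the high unit carries the top 10 bits, the low unit the bottom 10 bits.
function combinePair(high, low) {
  if (!isHighPart(high) || !isLowPart(low)) {
    throw new RangeError('not a valid high/low pair');
  }
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

// The smallest and largest pairs map to the ends of the auxiliary range.
console.log(combinePair(0xD800, 0xDC00).toString(16)); // 10000
console.log(combinePair(0xDBFF, 0xDFFF).toString(16)); // 10ffff
```

The two boundary cases show that the 1024 x 1024 possible pairs cover exactly U+010000 to U+10FFFF.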

IV. The transcoding formula of UTF-16

When a Unicode code point is converted to UTF-16, it is first distinguished whether it is a basic plane character or an auxiliary plane character. In the case of the former, the code point is directly converted to the corresponding hexadecimal form with a length of two bytes.

U+597D = 0x597D

If it is an auxiliary plane character, the transcoding formula is given in Unicode version 3.0.

H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is an auxiliary plane character with code point U+1D306, and converting it to UTF-16 works out as follows.

H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06

Therefore, the UTF-16 encoding of the character 𝌆 is 0xD834 DF06, with a length of four bytes.
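
The H and L formulas above can be wrapped in a function, as in the following sketch. `toHighLow` is a hypothetical name, and the final comparison assumes ES2015 `String.fromCodePoint` is available.

```javascript
// Split an auxiliary plane code point into its UTF-16 high and low units,
// using the H and L formulas from the text. `toHighLow` is a made-up name.
function toHighLow(c) {
  if (c < 0x10000 || c > 0x10FFFF) {
    throw new RangeError('expected a code point from U+010000 to U+10FFFF');
  }
  const high = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  const low = (c - 0x10000) % 0x400 + 0xDC00;
  return [high, low];
}

const [highUnit, lowUnit] = toHighLow(0x1D306);
console.log(highUnit.toString(16), lowUnit.toString(16)); // d834 df06

// The engine's own conversion agrees with the manual calculation.
console.log(String.fromCodePoint(0x1D306) === String.fromCharCode(highUnit, lowUnit)); // true
```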

V. Which encoding does JavaScript use?

The JavaScript language uses the Unicode character set, but only one encoding method is supported.

This encoding is neither UTF-16 nor UTF-8, let alone UTF-32. JavaScript does not use any of the above coding methods.

JavaScript uses UCS-2!

VI. UCS-2 coding

Why does UCS-2 suddenly appear out of nowhere? Explaining that requires a little history.

In the era before the emergence of the Internet, there were two teams that wanted to develop a unified character set. One is the Unicode team established in 1989, and the other is the earlier UCS team established in 1988. When they discovered each other's existence, they quickly agreed that there was no need for two unified character sets in the world.

In October 1991, the two teams decided to merge their character sets. In other words, from then on only one character set, Unicode, would be released; the previously released character set would be revised, and the code points of UCS would be exactly the same as those of Unicode.

The actual situation at that time was that the development of UCS was faster than that of Unicode, and as early as 1990, it published a set of coding methods, UCS-2, which used 2 bytes to represent characters that already had code points. At that time, there was only one plane, that is, the basic plane, so 2 bytes was enough. The UTF-16 code was not released until July 1996, and it was clearly declared to be a superset of UCS-2, that is, basic plane characters follow UCS-2 coding, and auxiliary plane characters define a 4-byte representation.

The relationship between the two is simply that UTF-16 has replaced UCS-2, or UCS-2 has been integrated into UTF-16. So, now there is only UTF-16, not UCS-2.

VII. The background of the birth of JavaScript

So why did JavaScript not choose the more advanced UTF-16, and use the obsolete UCS-2 instead?

The answer is simple: it is not that it didn't want to, but that it couldn't. When the JavaScript language appeared, the UTF-16 encoding did not yet exist.

In May 1995, Brendan Eich designed the JavaScript language in 10 days; in October, the first interpreter engine came out; and in November of the following year, Netscape formally submitted the language standard to ECMA (see the birth of JavaScript for details). Comparing this with the release date of UTF-16 (July 1996), you can see that Netscape had no other choice at the time: only the UCS-2 encoding was available!

VIII. Limitations of JavaScript character functions

Because JavaScript can only handle UCS-2 encoding, all characters in this language are 2 bytes. If they are 4 bytes, they will be treated as two double-byte characters. JavaScript's character functions are affected by this and cannot return correct results.

Take the character 𝌆 as an example, whose UTF-16 encoding is the 4 bytes 0xD834 DF06. The problem is that UCS-2 does not recognize 4-byte encodings, so JavaScript treats it as two separate characters, U+D834 and U+DF06. As mentioned earlier, these two code points are empty, so JavaScript regards it as a string of two empty characters!

In other words, JavaScript considers the length of the character 𝌆 to be 2, returns an empty character as its first character, and reports 0xD834 as the code point of that first character. None of these results is correct!
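
The behaviour just described can be reproduced with a short sketch. The variable name `tetragram` is chosen for this example, and the character is written with `\u` escapes so that the result does not depend on how the source file is saved.

```javascript
// U+1D306 written as its two UTF-16 units (0xD834 followed by 0xDF06).
const tetragram = '\uD834\uDF06';

console.log(tetragram.length);                     // 2
console.log(tetragram.charCodeAt(0).toString(16)); // d834
console.log(tetragram.charCodeAt(1).toString(16)); // df06

// charAt(0) yields only the first half, not the whole character.
console.log(tetragram.charAt(0) === tetragram);    // false
```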

To solve this problem, we must make a judgment on the code point, and then adjust it manually. Here is the correct way to traverse the string.

while (++index < length) {
  // ...
  if (charCode >= 0xD800 && charCode <= 0xDBFF) {
    // ... (high part found: read the next two bytes together with it)
  }
}
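
Since the loop above is only a fragment, here is one way it might be filled out into a complete function. This is a sketch under assumptions, not the original code: the function name `getSymbols`, the `output` array, and the extra check that the following unit really lies in U+DC00 to U+DFFF are all choices made for this example.

```javascript
// Split a string into whole characters, keeping high/low pairs together.
// `getSymbols` is a hypothetical name for this sketch.
function getSymbols(string) {
  const output = [];
  const length = string.length;
  let index = -1;
  while (++index < length) {
    const character = string.charAt(index);
    const charCode = character.charCodeAt(0);
    // charCodeAt returns NaN past the end, so both comparisons are false there.
    const nextCode = string.charCodeAt(index + 1);
    if (charCode >= 0xD800 && charCode <= 0xDBFF &&
        nextCode >= 0xDC00 && nextCode <= 0xDFFF) {
      // High part followed by a low part: emit both units as one character.
      output.push(character + string.charAt(++index));
    } else {
      output.push(character);
    }
  }
  return output;
}

const symbols = getSymbols('a\uD834\uDF06b');
console.log(symbols.length); // 3
```

Checking the following unit as well means an unpaired high part at the end of a string is passed through on its own rather than consuming whatever comes next.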
