What are the key points of Unicode and UTF-8 encoding?

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

Today I would like to share the key points of Unicode and UTF-8 encoding. Many people are still not very familiar with this topic, so this article is offered for your reference. I hope you get something out of it.

ASCII code

What is an ASCII code?

ASCII (American Standard Code for Information Interchange) is a character encoding system based on the Latin alphabet. It defines a table that maps numbers to common characters.

What characters does the ASCII code contain?

It includes the letters "A-Z" (both uppercase and lowercase), the digits "0-9", and some common symbols.

What are the limitations of ASCII codes?

ASCII was originally designed for American English. It defines only 128 codes and cannot represent the characters of other languages. To represent those, you need Unicode.
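As a rough illustration of the 128-code limit, the sketch below uses Python's built-in `ord`, `chr`, and `str.encode`. The helper name `is_ascii` is made up for this example.

```python
# ASCII maps the numbers 0-127 to characters.
assert ord("A") == 65
assert chr(97) == "a"


def is_ascii(text: str) -> bool:
    """Return True if every character has a code below 128 (hypothetical helper)."""
    return all(ord(ch) < 128 for ch in text)


print(is_ascii("Hello, 2025!"))  # True
print(is_ascii("你好"))           # False

# Encoding non-ASCII text with the ASCII codec raises an error.
try:
    "你好".encode("ascii")
except UnicodeEncodeError as exc:
    print("cannot encode as ASCII:", exc.reason)
```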

Unicode

What is Unicode?

Unicode was created to record all of the world's writing systems in a uniform way, giving each character a unique number.

Unicode is closely tied to the UCS (Universal Coded Character Set) defined by ISO/IEC 10646, and the two share the same code points. It is a character set that organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way. Unicode 11.0 contains 137,439 characters, and later versions have added more.

Unicode is so large that covering it with a fixed width would take 4 bytes per character, but text does not have to be stored that way. Many characters, especially those with low code points, can be stored in one or two bytes to save space. This is where the encodings of Unicode come in.

How many ways can Unicode be implemented?

Unicode itself is only a character set: each character is assigned a number. How those numbers are stored in a computer, whether always as 4 bytes or as a variable 1 to 4 bytes, is the job of a character encoding.

So asking how many ways Unicode can be implemented is really asking how many encodings Unicode has.

The commonly used Unicode encodings are UTF-8, UCS-2, and UTF-16. UTF-32 also deserves a mention, although it is less common.
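To make the difference between these encodings concrete, here is a small sketch that encodes the same characters with Python's built-in codecs and compares byte lengths. The function name `encoded_sizes` is an illustrative choice, and the big-endian codec variants are used only so that no byte order mark is added. Python has no built-in UCS-2 codec, so it is left out here.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Byte length of one character under each encoding (no byte order mark)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# One sample character for each UTF-8 length: 1, 2, 3, and 4 bytes.
for ch in ["X", "é", "中", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The character is the same in every case. Only the number of bytes used to store its code point changes.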

What is the architecture of Unicode?

Since Unicode holds so many characters, it needs rules for organizing them. Written in hexadecimal, what range does it cover? Is it a single run from low to high? In other words, what is the architecture of Unicode?

Unicode currently defines its code point range as 0hex to 10FFFFhex. This range is divided into 17 sections (planes) of 65,536 code points each, for 1,114,112 code points in total, far more than the 137,439 characters assigned as of Unicode 11.0.

The section from 0hex to FFFFhex is called the Basic Multilingual Plane (BMP). Characters in it are written as U+ followed by four hexadecimal digits. For example, the Unicode code point for the character X is U+0058.

Characters beyond the BMP, that is, in the 16 sections covering 10000hex to 10FFFFhex, need 5 to 6 hexadecimal digits, such as U+E0001 and U+10FFFD.
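The section layout described above can be checked with a few lines of Python. This is only a sketch: `plane_of` and `u_notation` are hypothetical helper names, and the section number is simply the code point divided by 10000hex.

```python
def plane_of(ch: str) -> int:
    """Section (plane) number 0-16: each one holds 0x10000 code points."""
    return ord(ch) >> 16


def u_notation(ch: str) -> str:
    """Format a character as U+XXXX, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


# 17 sections of 65,536 code points cover 0 through 10FFFF.
assert 0x10FFFF + 1 == 17 * 0x10000 == 1114112

print(u_notation("X"), plane_of("X"))                    # U+0058 0
print(u_notation("\U000E0001"), plane_of("\U000E0001"))  # U+E0001 14
print(u_notation("\U0010FFFD"), plane_of("\U0010FFFD"))  # U+10FFFD 16
```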

UTF-8 coding

UTF-8 is the most widely used Unicode encoding on the Internet and now accounts for 92% of web pages. Note the distinction: Unicode is a character set, while UTF-8 is one encoding of it.

It is a variable-length encoding with a length ranging from 1 byte to 4 bytes.

It is fully compatible with ASCII. ASCII consists of 128 characters, and the first 128 characters in Unicode correspond to them one to one, so UTF-8 encodes each of them as the same single byte that ASCII uses.
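The variable-length scheme can be sketched by hand. The function below is a simplified illustration, not a replacement for Python's built-in codec. It follows the standard UTF-8 bit layout, in which the leading byte signals the length and every continuation byte starts with the bits 10. The name `utf8_encode` is made up for this example.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative sketch)."""
    if cp < 0:
        raise ValueError("code point must not be negative")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (identical to ASCII)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:   # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point is above U+10FFFF")


# The sketch agrees with the built-in codec for each byte length.
for ch in ["X", "é", "中", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")

# ASCII compatibility: pure ASCII text has the same bytes in both encodings.
assert "Hello".encode("ascii") == "Hello".encode("utf-8")
```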

UCS-2 coding

UCS-2 uses exactly two bytes (16 bits) per character, which means it can represent at most 65,536 characters, and only those in the BMP.

The number of Unicode characters has long exceeded what UCS-2 can hold, so although a lot of software still uses UCS-2, it is obsolete.

Because so much software still relied on UCS-2, a new encoding, UTF-16, was created to represent characters in the planes outside the BMP.
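Python has no built-in UCS-2 codec, so the following is a hand-written sketch of the idea. It assumes big-endian byte order, and `ucs2_encode` is a hypothetical name. It shows where the fixed two-byte format runs out.

```python
import struct


def ucs2_encode(text: str) -> bytes:
    """Encode text as big-endian UCS-2; fail on characters outside the BMP."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        out += struct.pack(">H", cp)  # one unsigned 16-bit value per character
    return bytes(out)


print(ucs2_encode("X中").hex())  # 00584e2d

try:
    ucs2_encode("\U0001F600")
except ValueError as exc:
    print("not representable:", exc)
```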

UTF-16 coding

UTF-16 was created to solve the limitations of UCS-2, and it is an extension of UCS-2.

In the Basic Multilingual Plane, it is exactly the same as UCS-2: each character is represented by two bytes.

Code points from U+010000 to U+10FFFF are represented by 4 bytes.
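The article does not spell out how the 4-byte form works. In standard UTF-16 it is a surrogate pair: 10000hex is subtracted from the code point, and the remaining 20 bits are split across two 16-bit units in the ranges D800-DBFF and DC00-DFFF. A minimal sketch, with `utf16_units` as an illustrative name:

```python
import struct


def utf16_units(cp: int) -> list[int]:
    """Return the 16-bit code units that UTF-16 uses for one code point."""
    if cp < 0x10000:
        return [cp]                   # BMP: one unit, same as UCS-2
    v = cp - 0x10000                  # 20 bits remain
    high = 0xD800 | (v >> 10)         # top 10 bits
    low = 0xDC00 | (v & 0x3FF)        # bottom 10 bits
    return [high, low]


units = utf16_units(0x1F600)
print([f"{u:04X}" for u in units])    # ['D83D', 'DE00']

# Packed big-endian, the units match Python's built-in codec.
packed = b"".join(struct.pack(">H", u) for u in units)
assert packed == "\U0001F600".encode("utf-16-be")
```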

UTF-16 has a very small share compared to UTF-8, accounting for only 0.01% of web pages. It is mainly used on Windows and is rarely used on Unix/Linux and macOS.

UTF-32 coding

UTF-32 represents every Unicode character with exactly 4 bytes. It takes up much more space than the other encodings, which is why it is rarely used.
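The fixed width is what UTF-32 buys with the extra space: the character at position i always starts at byte 4 × i. The sketch below illustrates this with Python's built-in `utf-32-be` codec (big-endian, no byte order mark). The helper name `utf32_code_point_at` is hypothetical.

```python
import struct


def utf32_code_point_at(data: bytes, index: int) -> int:
    """Read the code point at a character index from big-endian UTF-32."""
    (cp,) = struct.unpack_from(">I", data, 4 * index)
    return cp


text = "aé中\U0001F600"
data = text.encode("utf-32-be")

print(len(data))                              # 16: four characters, 4 bytes each
print(len(text.encode("utf-8")))              # 10: 1 + 2 + 3 + 4 bytes
print(hex(utf32_code_point_at(data, 3)))      # 0x1f600
```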
