
What is the difference between GBK and UTF-8

Updated 2025-01-16. Source: SLTechnology News&Howtos (shulou), Internet Technology


What is the difference between GBK and UTF-8? This article analyzes the question in detail and walks through the common pitfalls, in the hope of giving readers who face this problem a simple and workable answer.

The difference between GBK and UTF-8:

GBK is an extension of the national standard GB2312 and remains compatible with it. It was designed specifically for encoding Chinese text. Chinese characters take two bytes, while ASCII characters such as English letters take one byte. It is mainly used for Simplified Chinese.

UTF-8 is a variable-length, multi-byte encoding designed to handle characters from all languages. It uses 8 bits (one byte) for English letters and 24 bits (three bytes) for most Chinese characters.
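As a quick check of these sizes, the following Python sketch encodes one ASCII letter and one Chinese character in both encodings. The sample characters and the variable names are chosen for illustration only.

```python
# Compare how many bytes one character occupies in GBK and in UTF-8.
# The sample characters are illustrative assumptions.
samples = ["A", "中"]

sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in ("gbk", "utf-8")}
    for ch in samples
}

for ch, by_encoding in sizes.items():
    print(ch, by_encoding)
```

The ASCII letter occupies a single byte in both encodings; only the Chinese character differs (two bytes in GBK, three in UTF-8).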

If users outside China access a GBK-based system, they need to install Chinese language pack support to display the text correctly.

GBK covers a large set of Chinese characters, including many rare ones that GB2312 lacks; UTF-8 can represent every character in Unicode, which covers the writing systems of the whole world.

If you mainly develop Chinese-language programs and your customers are mainly Chinese users, GBK is a reasonable choice: a Chinese character takes three bytes in UTF-8 but only two in GBK, so GBK saves space.

Risks of migrating from GBK to UTF-8:

The system originally used GB2312 and ran into garbled text for rare Chinese characters (mostly in customer names, since people often pick obscure characters for an auspicious given name). It was then switched to GBK, which is fully compatible with GB2312 and covers those characters.
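The rare-character problem can be reproduced with a small Python sketch. The helper name `can_encode` is hypothetical, and "镕" is used here only as one example of a name character that lies outside GB2312.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be encoded."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# "镕" is an illustrative rare character sometimes found in names.
rare = "镕"
for encoding in ("gb2312", "gbk", "utf-8"):
    print(encoding, can_encode(rare, encoding))
```

A check like this can be run over existing customer names to find data that a narrower character set would corrupt.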

If the system were migrated to UTF-8, every part of the business would need regression testing for Chinese text and rare characters, so the testing cost is high.

We therefore still recommend that customers install a Chinese language pack. So far, overseas customers have chosen to do so.

GBK, GB2312, and similar encodings cannot be converted to or from UTF-8 directly; the conversion must go through Unicode:

UTF-8 represents an English character with 1 byte and a Chinese character with 3 bytes, while GBK represents a Chinese character with 2 bytes. The two encodings have no direct byte-level mapping, so converting between them requires decoding to Unicode code points as an intermediate step.

GBK, GB2312 -> Unicode -> UTF-8

UTF-8 -> Unicode -> GBK, GB2312
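In Python, a `str` holds Unicode code points, so the two conversion paths above can be sketched as a decode followed by an encode. The function names `gbk_to_utf8` and `utf8_to_gbk` are hypothetical helpers, not part of any library.

```python
def gbk_to_utf8(data: bytes) -> bytes:
    """GBK bytes -> Unicode string -> UTF-8 bytes."""
    text = data.decode("gbk")
    return text.encode("utf-8")


def utf8_to_gbk(data: bytes) -> bytes:
    """UTF-8 bytes -> Unicode string -> GBK bytes."""
    text = data.decode("utf-8")
    return text.encode("gbk")


gbk_bytes = b"\xb7\xae"               # "樊" in GBK
utf8_bytes = gbk_to_utf8(gbk_bytes)   # "樊" in UTF-8: e6 a8 8a
print(utf8_bytes.hex(" "))
print(utf8_to_gbk(utf8_bytes).hex(" "))
```

Both helpers raise an exception on invalid input instead of silently substituting characters, which is usually the safer default for a data migration.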

Unicode (the universal character code) assigns a code point to every character in the world. Many of those characters are rarely used, and storing every character at a fixed, long width wastes memory and slows down database reads and writes (this is one reason databases such as Oracle have traditionally offered single-byte character sets such as ASCII). UTF-8 (Unicode Transformation Format, 8-bit) was proposed to address this.

UTF-8 is variable-length: a character takes 1 to 4 bytes (the original design allowed up to 6, but the current standard limits it to 4). The 8 in UTF-8 means the code unit is 8 bits, so a character is at least 8 bits long.

UTF-16 is variable-length: a character takes 2 or 4 bytes. The 16 in UTF-16 means the code unit is 16 bits, so a character is at least 16 bits long.

UTF-32 is fixed-length: every character takes exactly 4 bytes. The 32 in UTF-32 means each character is 32 bits long.
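The lengths of the three formats can be compared with a short Python sketch. The helper name `encoded_lengths` and the sample characters are illustrative; the little-endian codec variants are used so that no byte order mark is added to the count.

```python
def encoded_lengths(ch: str) -> dict:
    """Return the byte length of one character in each UTF format."""
    # The "-le" variants avoid a leading byte order mark.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# An ASCII letter, a common Chinese character, and an emoji
# that lies outside the Basic Multilingual Plane.
for ch in ("A", "樊", "😀"):
    print(repr(ch), encoded_lengths(ch))
```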

Problems caused by decoding with the wrong character set:

UTF-8 string -> byte stream -> decode as GBK (garbled) -> encode back to byte stream -> decode as UTF-8 (still garbled)

It is expected that a byte stream encoded in UTF-8 turns into garbled text when decoded as GBK. But converting that garbled string back to a byte stream and decoding it as UTF-8 still produces garbled text. This is puzzling at first: using the wrong character set inevitably produces garbage, but if the bytes themselves were unchanged, re-encoding and then decoding with the correct character set should restore the original string. In practice, a string produced by decoding with the wrong character set often cannot be restored.

The root cause of the problem

The root cause of the problem is a change in byte length.

UTF-8 uses one byte for English characters and three bytes for Chinese characters, while GBK uses one byte for English characters and two bytes for Chinese characters.

When a GBK or UTF-8 decoder encounters bytes it cannot parse, it substitutes a special replacement character. The original byte information is lost at that point and cannot be recovered.

Analysis of incorrect conversions

UTF-8 (three bytes) → GBK (two bytes)

Suppose a UTF-8-encoded byte stream is decoded as GBK. Two consecutive bytes greater than 127 are treated as one GBK-encoded character. If only a single byte greater than 127 is available, it cannot be parsed and an error occurs. The offending byte is then replaced with the character '?', whose ASCII code is 63 (0x3F).

Take the character "樊" (Fan) as an example. UTF-8 represents it with three bytes: [11100110, 10101000, 10001010] ([e6, a8, 8a]). When decoding with GBK, a first byte greater than 127 means two bytes are parsed together as one GBK character, so the first two bytes, e6 a8, are parsed as a single (wrong) GBK character. The third byte cannot be parsed on its own, so it is replaced with '?'. The final result is one wrong character followed by '?'.

As you can see, the information in the last byte is lost: 8a has become 3f. Even if the result is converted back to a byte stream, it can no longer be decoded correctly as UTF-8.
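The lossy round trip can be reproduced with a minimal Python sketch. It assumes the `errors="replace"` handler, which mimics the substitution behavior described here. In CPython the decoder first inserts U+FFFD into the string, and the '?' byte (3f) only appears when that string is encoded back to GBK; other runtimes may differ in detail.

```python
original = "樊"
utf8_bytes = original.encode("utf-8")  # e6 a8 8a

# Decode with the wrong character set; the unpaired third
# byte cannot be parsed and is substituted.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Encode the garbled string back to bytes, then decode as UTF-8.
back = garbled.encode("gbk", errors="replace")
restored = back.decode("utf-8", errors="replace")

print(utf8_bytes.hex(" "))   # bytes before the round trip
print(back.hex(" "))         # bytes after the round trip
print(restored == original)  # whether the text survived
```

Comparing the two hex dumps shows which byte was overwritten by the substitution.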

GBK (two bytes) → UTF-8 (three bytes)

For a string of GBK-encoded byte streams, UTF-8 decoding is used. UTF-8 has strict requirements for the format of bytes, and when parsing a character fails, use 'encoding' (UTF-8 is encoded as EF BF BD) instead.

Continue to take "Fan" as an example, its GBK bytecode is [10110111, 10101110] ([B7, AE]). When using UTF-8 decoding, according to the rule, before a byte starting with 10 is required, there must be a byte identifying the length of a character, so neither byte can be parsed. The last string is "". As you can see, all the byte information is lost, so you can no longer parse the string using GBK.
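A short Python sketch of this direction, again assuming the `errors="replace"` handler to imitate the substitution (the variable names are illustrative):

```python
gbk_bytes = "樊".encode("gbk")  # b7 ae

# Both bytes look like UTF-8 continuation bytes with no lead
# byte, so each one is replaced with U+FFFD.
decoded = gbk_bytes.decode("utf-8", errors="replace")

# Re-encoding yields only the bytes of the replacement
# character; the original b7 ae is gone.
reencoded = decoded.encode("utf-8")

print([f"U+{ord(c):04X}" for c in decoded])
print(reencoded.hex(" "))
```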

Note that UTF-8 is replaced with characters and is in characters. For example, [11100110, 10101000, 01000001] uses UTF-8 decoding, and the result is "A", not "A". Depending on the format of the first byte, UTF-8 expects to convert three bytes into one character. But the last byte does not meet the requirements, so the first two bytes are replaced by a sink. Instead of every byte being replaced by a sink.
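The per-character replacement can be checked in Python as follows. This sketch reflects the behavior of CPython's UTF-8 decoder with `errors="replace"`; other decoders may group invalid bytes differently.

```python
# A three-byte lead byte, one valid continuation byte,
# then the ASCII letter "A" where a continuation was expected.
data = bytes([0b11100110, 0b10101000, 0b01000001])

decoded = data.decode("utf-8", errors="replace")

# List the code points to make the replacement count visible.
print([f"U+{ord(c):04X}" for c in decoded])
```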

The following Python example shows the byte lengths for the character "学":

    text = '学'
    print(text)                        # the character itself
    print(len(text))                   # length in characters
    print(text.encode())               # default encoding is UTF-8
    print(text.encode('GBK'))
    print(text.encode('UTF-8'))
    print(len(text.encode('GBK')))     # GBK byte count
    print(len(text.encode('UTF-8')))   # UTF-8 byte count

Output:

    学
    1
    b'\xe5\xad\xa6'
    b'\xd1\xa7'
    b'\xe5\xad\xa6'
    2
    3

That covers the difference between GBK and UTF-8. I hope the above content helps to some extent.
