What is the difference between GBK, UTF8, GB2312 and UTF-8

2025-01-15 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article mainly introduces "what is the difference between GBK, UTF8, GB2312 and UTF-8". In daily work, many people have doubts about the differences between GBK, UTF8, GB2312 and UTF-8. The editor has consulted a variety of materials and put together a simple, easy-to-follow explanation, in the hope that it helps answer the question. Now, please follow the editor and study it!

UTF-8 (Unicode Transformation Format, 8-bit) may carry a BOM but usually does not. It is a multi-byte encoding designed for international text: an English character takes 8 bits (one byte) and a Chinese character takes 24 bits (three bytes). UTF-8 covers the characters needed by every country in the world; it is an international encoding with very broad applicability. UTF-8-encoded text can be displayed by any browser that supports the UTF-8 character set. For example, if a page is UTF-8 encoded, its Chinese text can be displayed even on an English-language IE, without installing IE's Chinese language support pack.

GBK is the national standard GB2312 after extension, and it remains compatible with GB2312. In GBK, Chinese characters are encoded with two bytes; to distinguish them from single-byte ASCII characters, the high bit of those bytes is set to 1. GBK includes all Chinese characters and is a Chinese national encoding. It is less universal than UTF-8, but UTF-8-encoded Chinese text takes up more space in a database than GBK.

GBK, GB2312 and the like can only be converted to and from UTF-8 by going through Unicode:

GBK, GB2312 → Unicode → UTF-8

UTF-8 → Unicode → GBK, GB2312
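
As a concrete illustration of this conversion path, here is a minimal sketch using Python's built-in codecs; the sample string is arbitrary:

```python
# A minimal sketch of the GBK -> Unicode -> UTF-8 path described above,
# using Python's built-in codecs; the sample string is arbitrary.
gbk_bytes = "汉字".encode("gbk")            # GBK byte stream: b'\xba\xba\xd7\xd6'
unicode_text = gbk_bytes.decode("gbk")      # step 1: GBK -> Unicode (a Python str)
utf8_bytes = unicode_text.encode("utf-8")   # step 2: Unicode -> UTF-8

print(gbk_bytes.hex().upper())   # BABAD7D6
print(utf8_bytes.hex().upper())  # E6B189E5AD97
```

Reversing the two calls gives the UTF-8 → Unicode → GBK direction.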

In terms of functionality, to put it simply:

1. GBK here generally refers to GB2312-style encoding, which supports only simplified Chinese characters.

2. UTF usually refers to UTF-8, which supports simplified Chinese, traditional Chinese, English, Japanese, Korean and other languages (it covers more scripts).

3. Both UTF-8 and GB2312 are in common use; which one to choose depends on your own needs.

The details are as follows:

For a site or forum where English text predominates over Chinese, UTF-8 saves space. However, many forum plug-ins only support GBK.

A more detailed explanation of the differences between the encodings

To put it simply, Unicode, GBK and Big5 are character code values, while UTF-8, UTF-16 and the like are ways of representing those values. The three code systems are mutually incompatible: the same Chinese character has completely different values under each of them. For example, the Unicode value of "Han" differs from its GBK value; say, hypothetically, the Unicode value were a040 and the GBK value b030, then UTF-8 would be one particular way of expressing that Unicode value. UTF-8 is defined purely in terms of Unicode, so to convert GBK to UTF-8 you must first convert to Unicode, and then from Unicode to UTF-8.

Let us talk about Unicode encoding, and briefly explain terms such as UCS, UTF, BMP and BOM.

Question 1:

Using "Save As" in Windows Notepad, you can convert between the GBK, Unicode, Unicode big endian and UTF-8 encodings. For the same txt file, how does Windows recognize which encoding it is in?

I noticed long ago that txt files encoded as Unicode, Unicode big endian and UTF-8 begin with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8), respectively. But what standard are these markers based on?

Question 2:

Recently I came across ConvertUTF.c on the Internet, which implements conversion among UTF-32, UTF-16 and UTF-8. I already knew about encodings such as Unicode (UCS-2), GBK and UTF-8, but this file confused me a little: I could not work out the relationship between UTF-16 and UCS-2.

After looking up the relevant material I finally got these questions straight, and along the way learned some of the finer points of Unicode. I wrote this up to share with friends who have had similar questions. The article tries to be easy to follow, but readers are expected to know what a byte is and what hexadecimal is.

0. Big endian and little endian

Big endian and little endian are different ways for a CPU to handle values that span multiple bytes. For example, the Unicode code of "Han" is 6C49. When writing it to a file, should 6C be written first or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.
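
A small sketch (using Python's struct module) makes the two byte orders visible for the code point 6C49:

```python
# Byte order illustrated with the code point U+6C49 ("Han"):
# big endian writes the high byte 6C first, little endian writes 49 first.
import struct

code_point = 0x6C49
print(struct.pack(">H", code_point).hex())  # '6c49' -> big endian
print(struct.pack("<H", code_point).hex())  # '496c' -> little endian
```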

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput broke out over whether eggs should be eaten by cracking the big end or the little end; because of this there were six rebellions, one emperor lost his life and another lost his throne.

We generally translate endian as "byte order", and call big endian and little endian the "big-end" and "little-end" byte orders.

1. Character encoding and internal code, with a note on Chinese character encoding

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for simplified Chinese and Big5 for traditional Chinese.

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The Chinese character area's internal codes run from B0 to F7 in the high byte and from A1 to FE in the low byte, occupying 72 × 94 = 6768 code points, of which five (D7FA-D7FE) are vacant.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21003 characters.

From ASCII and GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in each of them, and later standards simply support more characters. In these encodings, English and Chinese can be handled uniformly. The way to recognize a Chinese-character code is that the highest bit of the high byte is not 0. By the usual naming convention among programmers, GB2312 and GBK both belong to the double-byte character sets (DBCS).

GB18030 (2000) is the formal national standard that replaces GBK 1.0. It contains 27484 Chinese characters, as well as Tibetan, Mongolian, Uyghur and other major minority scripts. In terms of Chinese characters, GB18030 adds the 6582 characters of CJK Extension A (Unicode 0x3400-0x4DB5) to the 20902 characters of GB13000.1, for a total of 27484 characters.

CJK stands for China, Japan and Korea. To save code points, Unicode encodes the Han characters shared by Chinese, Japanese and Korean in a unified way. GB13000.1 is the Chinese version of ISO/IEC 10646-1, equivalent to Unicode 1.1.

GB18030 uses a mix of single-byte, double-byte and four-byte codes. Its single-byte and double-byte parts are fully compatible with GBK. The four-byte part covers, among other things, the 6582 characters of CJK Extension A. For example, UCS 0x3400 is encoded in GB18030 as 8139EF30, and 0x3401 as 8139EF31.

Microsoft has released a GB18030 upgrade package, but it only provides a new font, NSimSun-18030, that covers the 6582 characters of CJK Extension A; it does not change the internal code. The internal code of Windows is still GBK.

Here are some details:

The original GB2312 standard is actually a table of location codes (区位码); to convert a location code to the internal code, add A0 to the high byte and to the low byte.
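
As a quick sketch of this rule (the concrete 1601 → B0A1 example is revisited in Appendix 1 below):

```python
# Location code -> internal code, following the "add A0 to each byte" rule:
# zone 16, position 1 (location code 1601) becomes the internal code B0 A1.
def location_to_internal(zone: int, position: int) -> bytes:
    return bytes([zone + 0xA0, position + 0xA0])

internal = location_to_internal(16, 1)
print(internal.hex().upper())        # B0A1
print(internal.decode("gb2312"))     # 啊
```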

For any character encoding, the order of the encoding units is specified by the encoding scheme and has nothing to do with endianness. For example, the encoding unit of GBK is the byte, and a Chinese character is represented with two bytes. The order of those two bytes is fixed and is not affected by the CPU byte order. The encoding unit of UTF-16 is the word (two bytes); the order between words is specified by the encoding scheme, but the byte order within a word is affected by endianness. UTF-16 is introduced further below.

The highest bit of both bytes of a GB2312 character is 1. But there are only 128 × 128 = 16384 code points that satisfy this condition. Therefore the highest bit of the low byte in GBK and GB18030 is not necessarily 1. This does not affect the parsing of a DBCS character stream: when reading a DBCS stream, as soon as you encounter a byte whose high bit is 1, you can treat the next two bytes as one double-byte code, regardless of what the high bit of the low byte is.
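
That parsing rule can be sketched as a small loop; the following is an illustrative walker for a GBK-style DBCS byte stream, not production code:

```python
# Illustrative DBCS (GBK-style) stream walker, following the rule above:
# a byte with its high bit set starts a two-byte code; otherwise it is a
# single-byte (ASCII) character.
def split_dbcs(stream: bytes):
    units, i = [], 0
    while i < len(stream):
        if stream[i] & 0x80:                 # high bit is 1: double-byte code
            units.append(stream[i:i + 2])
            i += 2
        else:                                # high bit is 0: single byte
            units.append(stream[i:i + 1])
            i += 1
    return units

print(split_dbcs(b"a\xba\xbab"))             # [b'a', b'\xba\xba', b'b']
```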

2. Unicode, UCS and UTF

As mentioned earlier, the encodings from ASCII and GB2312 through GBK to GB18030 are backward compatible. Unicode is only compatible with ASCII (more precisely, with ISO 8859-1) and is not compatible with the GB encodings. For example, the Unicode code of the character "Han" (汉) is 6C49, while its GB code is BABA.
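
These two values can be checked directly, for example in a Python interpreter (a quick sketch):

```python
# Checking the example above: the Unicode code point of "汉" ("Han") is 6C49,
# while its GB (GBK) encoding is the two bytes BA BA.
print(hex(ord("汉")))             # 0x6c49
print("汉".encode("gbk").hex())   # baba
```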

Unicode is also a character encoding method, but it was designed by international organizations and can accommodate all of the world's languages and scripts. The formal name of the Unicode scheme is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/), there were historically two organizations that independently tried to design a universal character set: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, while the Unicode Consortium developed the Unicode project.

Around 1991, both sides realized that the world did not need two incompatible character sets, so they merged their respective efforts and began working together on a single encoding table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and the same code points as ISO 10646-1.

The two projects still exist today and publish their standards independently. The latest version from the Unicode Consortium is Unicode 4.1.0 (2005); the latest ISO standard is ISO 10646:2003.

UCS only specifies how characters are encoded; it does not specify how the codes are transmitted and stored. For example, the UCS code of "Han" is 6C49. I could transmit and store it as the four ASCII digits "6C49", or as its UTF-8 encoding, the three consecutive bytes E6 B1 89; the key is that both sides of the communication agree on the scheme. UTF-8, UTF-7 and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with ASCII. UTF stands for "UCS Transformation Format".

The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings clearly, crisply and with the usual RFC rigor. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs that the IETF maintains are the foundation of every standard on the Internet.

2.1 Internal code and code page

The Windows kernel now supports the Unicode character set, so all the languages of the world can be supported at the kernel level. However, because a great many existing programs and documents use encodings for specific languages, such as GBK, Windows cannot drop support for the existing encodings and switch entirely to Unicode.

Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.

Microsoft has also defined a code page for GB18030: CP54936. However, because GB18030 includes some four-byte codes, while Windows code pages only support single-byte and double-byte encodings, this code page cannot really be used.

3. UCS-2, UCS-4 and BMP

UCS comes in two flavors: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes each character in two bytes and UCS-4 in four bytes (in practice only 31 bits are used; the highest bit must be 0). Let us do some simple arithmetic:

UCS-2 has 2^16 = 65536 code points, and UCS-4 has 2^31 = 2147483648 code points.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose most significant bit is 0). Each group is divided into 256 planes according to the next byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Of course, cells in the same row differ only in the last byte; the rest is identical.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, in UCS-4, the code points whose two high bytes are 0 form the BMP.

Remove the two leading zero bytes from the BMP portion of UCS-4 and you get UCS-2; add two zero bytes in front of a UCS-2 code and you get the corresponding BMP code in UCS-4. At the time of writing, the UCS specification had not assigned any characters outside the BMP.
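
A short sketch of that relationship, writing the code point U+6C49 as big-endian byte strings:

```python
# BMP relationship sketched with byte strings: a UCS-2 code with two leading
# zero bytes prepended is the same code point written as UCS-4.
import struct

code_point = 0x6C49
ucs2 = struct.pack(">H", code_point)   # 2 bytes: 6C 49
ucs4 = struct.pack(">I", code_point)   # 4 bytes: 00 00 6C 49

assert ucs4 == b"\x00\x00" + ucs2      # add two zero bytes: UCS-2 -> UCS-4 (BMP)
assert ucs4[2:] == ucs2                # drop two zero bytes: UCS-4 (BMP) -> UCS-2
```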

4. UTF coding

UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 Encoding (hexadecimal) UTF-8 byte stream (binary)

0000-007F 0xxxxxxx

0080-07FF 110xxxxx 10xxxxxx

0800-FFFF 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "Han" is 6C49. Since 6C49 falls in the range 0800-FFFF, the three-byte template is used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001; filling these bits into the x positions of the template yields 11100110 10110001 10001001, that is, E6 B1 89.
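
The same bit-filling can be written out as a small function; this is a sketch limited to the three-byte row of the table above:

```python
# A sketch of the three-byte UTF-8 template (1110xxxx 10xxxxxx 10xxxxxx)
# for code points in the 0800-FFFF range, such as U+6C49.
def utf8_three_bytes(cp: int) -> bytes:
    assert 0x0800 <= cp <= 0xFFFF
    return bytes([
        0xE0 | (cp >> 12),           # 1110xxxx: top 4 bits of the code point
        0x80 | ((cp >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),          # 10xxxxxx: low 6 bits
    ])

print(utf8_three_bytes(0x6C49).hex().upper())   # E6B189
print("汉".encode("utf-8").hex().upper())       # E6B189 (matches)
```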

Readers can use Notepad to test whether our calculation is correct. Note that UltraEdit automatically converts UTF-8 files to UTF-16 when opening them, which can be confusing; you can turn this off in its configuration. A better tool is Hex Workshop.

UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 code is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, a separate algorithm is defined. Since the UCS-2 codes actually in use, that is, the BMP of UCS-4, are necessarily below 0x10000, UTF-16 and UCS-2 can for now be regarded as basically the same thing. However, UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so the question of byte order has to be considered.

5. Byte order and BOM of UTF

UTF-8 uses the byte as its encoding unit and has no byte-order problem. UTF-16 uses two bytes as its unit, so before interpreting UTF-16 text you must know the byte order of each unit. For example, the Unicode code of "Kui" (奎) is 594E and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream 59 4E, is it "奎" or "乙"?
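
Both readings of that byte stream can be produced directly; a quick sketch using Python's UTF-16 codecs:

```python
# The two readings of the byte stream 59 4E mentioned above:
# big endian gives U+594E ("奎"), little endian gives U+4E59 ("乙").
stream = b"\x59\x4e"
print(stream.decode("utf-16-be"))   # 奎
print(stream.decode("utf-16-le"))   # 乙
```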

The method recommended in the Unicode specification for marking byte order is the BOM. BOM here is not the "Bill of Material" BOM, but the Byte Order Mark. The BOM is a clever little idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", encoded as FEFF. FFFE, on the other hand, is a code point that does not exist in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting a byte stream.

If the receiver receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced earlier). So if a receiver gets a byte stream beginning with EF BB BF, it knows the stream is UTF-8 encoded.
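
Putting the three signatures together, a receiver can sniff a file's encoding roughly as follows (a minimal sketch; the file name is hypothetical and only the BOMs mentioned in this article are checked):

```python
# A minimal BOM sniffer covering only the three signatures discussed above.
BOMS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff",     "UTF-16 big endian"),
    (b"\xff\xfe",     "UTF-16 little endian"),
]

def sniff_bom(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(3)
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return "no BOM (e.g. ANSI/GBK, or BOM-less UTF-8)"

# print(sniff_bom("example.txt"))   # 'example.txt' is a hypothetical file
```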

Windows uses the BOM to mark the encoding of text files.

6. Further reference materials

The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two articles that looked good, but since I had already found the answers to my questions, I did not read them:

"Understanding Unicode A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)

"Character set encoding basics Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)

I have written a software package that converts between UTF-8, UCS-2 and GBK, including versions that use the Windows API and versions that do not. If I have time later, I will tidy it up and put it on my personal home page.

I started writing this article after deciding to sort all of these questions out for myself. I thought it would only take a moment. Unexpectedly, working out the wording and verifying the details took a long time: I wrote from 1:30 to 9:00 in the afternoon. I hope some readers will benefit from it.

Appendix 1: more about the location code, GB2312, internal code and code page

Some friends still had questions about this sentence in the article:

"The original GB2312 standard is actually a table of location codes; to convert a location code to the internal code, add A0 to the high byte and to the low byte."

Let me explain in detail:

"The original GB2312 standard" refers to the 1980 national standard GB 2312-80, "Chinese Character Coded Character Set for Information Interchange — Basic Set". This standard encodes Chinese characters and Chinese symbols with two numbers: the first is called the "zone" (区) and the second the "position" (位), hence the name location code (区位码). Zones 1-9 contain Chinese symbols, zones 16-55 first-level Chinese characters, and zones 56-87 second-level Chinese characters. Windows still ships a location-code input method: entering 1601 produces "啊" (a). (This input method can also recognize hexadecimal GB2312 codes alongside decimal location codes, so entering B0A1 likewise produces "啊".)

The internal code refers to the character encoding used inside the operating system. Early operating systems' internal codes were language-dependent. Windows is now Unicode internally and uses code pages to adapt to the various languages, so the concept of an "internal code" has become rather blurred. Microsoft usually refers to the encoding specified by the default code page as the internal code.

The term "internal code" has no formal definition, and "code page" is simply what Microsoft calls it. For our purposes, we only need to know what these things are; there is no need to dig too deeply into the terminology.

A so-called code page is the character encoding for a particular language. For example, GBK's code page is CP936, Big5's code page is CP950, and GB2312's code page is CP20936.

The notion of a default code page comes from Windows: it is the encoding used by default to interpret characters. For example, Windows Notepad opens a text file containing the byte stream BA BA D7 D6. How should Windows interpret it?

Should it interpret the bytes as Unicode, as GBK, as Big5, or as ISO 8859-1? If it interprets them as GBK, it gets the two characters "汉字" ("Chinese characters"). Under the other interpretations, the corresponding characters may not exist, or the wrong characters may be produced, where "wrong" means not what the author of the text intended.

The answer is that Windows interprets the byte stream in a text file according to the current default code page, which can be set through the Regional Options in Control Panel. The ANSI option in Notepad's Save dialog means saving with the encoding of the default code page.
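
To see how much the choice of code page matters, here is a small sketch that decodes those same four bytes under several encodings (the Big5 reading is shown with replacement characters in case some byte pairs are unmapped):

```python
# The same byte stream read under different code pages gives different results.
raw = bytes([0xBA, 0xBA, 0xD7, 0xD6])
print(raw.decode("gbk"))                        # 汉字 (the intended text)
print(raw.decode("latin-1"))                    # ºº×Ö (ISO 8859-1 reading: mojibake)
print(raw.decode("big5", errors="replace"))     # different characters, or U+FFFD if unmapped
```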

The internal code of Windows is Unicode, and Windows can technically support multiple code pages at the same time. As long as a file declares what encoding it uses, and the user has installed the corresponding code page, Windows can display it correctly, just as you can specify a charset in an HTML file.

Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the document. If such an author uses characters between 0x80 and 0xFF, and a Chinese Windows machine interprets the page with its default GBK code page, garbled text appears. To fix this, just add a charset declaration to the HTML file, for example a meta tag such as:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

As long as the code page the original author actually used is compatible with ISO 8859-1, the page will no longer appear garbled.

At this point, the study of "what is the difference between GBK, UTF8, GB2312 and UTF-8" is complete. I hope it has resolved your doubts. Combining theory with practice is the best way to learn, so go and try it out! If you want to keep learning more related knowledge, please continue to follow this site; the editor will keep working to bring you more practical articles!
