Example Analysis of C# String Encoding

2025-07-04 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

This article walks through a sample analysis of C# string encoding. Many developers are hazy on the details, so it is shared here for reference. Let's take a look.

1. ASCII code

We know that inside a computer all information is ultimately represented as binary. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, one byte can represent 256 different states, from 00000000 to 11111111, and each state can correspond to one symbol, for 256 symbols in total.

In the 1960s, the United States formulated a character encoding that standardized the relationship between English characters and binary values. It is called ASCII and is still in use today. ASCII defines 128 characters in total; for example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters, codes 0 to 31 and 127) occupy only the low 7 bits of a byte; the high bit is always 0.

In C#, if you want to see the ASCII code of a letter, you can use the Encoding class, which represents a character encoding, as follows:

// requires: using System.Text;
string s = "a";
byte[] ascii = Encoding.ASCII.GetBytes(s);

In the debugger we can see that ascii holds the single value 97, which means that the ASCII code of a is 97 (binary 01100001).
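As a minimal sketch of the same check without a debugger, the program below prints the byte values and their 8-bit binary form. The class and method names (AsciiDemo, ToAscii, ToBinary) are made up for illustration and are not part of the original article.

```csharp
using System;
using System.Text;

static class AsciiDemo
{
    // Returns the ASCII byte values of a string.
    // Encoding.ASCII substitutes '?' (63) for any character outside 0-127.
    public static byte[] ToAscii(string s) => Encoding.ASCII.GetBytes(s);

    // Formats one byte as an 8-digit binary string.
    public static string ToBinary(byte b) => Convert.ToString(b, 2).PadLeft(8, '0');

    static void Main()
    {
        foreach (byte b in ToAscii("a A"))
            Console.WriteLine($"{b} = {ToBinary(b)}");
        // 97 = 01100001
        // 32 = 00100000
        // 65 = 01000001

        // "\u00E9" is é, which ASCII cannot represent.
        Console.WriteLine(ToAscii("\u00E9")[0]); // 63
    }
}
```

The last line hints at the limitation discussed in the next section: a character outside the 128 ASCII codes is silently replaced by a question mark.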

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough to represent other languages. In French, for example, letters that carry an accent mark cannot be represented in ASCII. As a result, some European countries decided to use the unused high bit of the byte to encode new symbols. For example, the French letter é is encoded as 130 (binary 10000010). In this way, the encodings used by these European countries can represent up to 256 symbols.

However, this created a new problem. Different countries have different letters, so even though they all use 256 symbols, the codes stand for different letters. For example, 130 stands for é in the French encoding, the letter gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, the symbols for 0-127 are identical; only the range 128-255 differs.

The scripts of Asian countries use far more symbols; there are roughly 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used for a single symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols. In C#, to see the GB2312 code of a Chinese character, you can use the following code:

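string s = "梁";
// On .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) first.
System.Text.Encoding GB2312 = System.Text.Encoding.GetEncoding("GB2312");
byte[] gb = GB2312.GetBytes(s);

gb contains two bytes: 193 and 186 (binary 11000001 10111010).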

3. Unicode

As mentioned above, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong one, the text comes out garbled. Why are e-mails so often garbled? Because the sender and the recipient use different encodings.

Now imagine a single encoding that includes every symbol in the world and gives each one a unique code: the garbling problem would disappear. This is Unicode and, as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large set; at its current scale it can hold more than 1 million symbols, each with its own distinct code. In C#, to see the Unicode encoding of a Chinese character, you can use the following code:

string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);

unicode contains two bytes: 129 and 104 (binary 10000001 01101000, that is, hexadecimal 81 and 68).

4. The problem with Unicode

It is important to note that Unicode is only a character set: it specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 梁 is hexadecimal 6881 (binary 110100010000001, 15 bits), which means that representing this symbol takes at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.

This raises two serious problems. The first is: how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would be preceded by two or three bytes of zeros, a huge waste of storage that would make text files two to three times larger, which is unacceptable.

The results were: 1) several ways of storing Unicode appeared, that is, there are several different binary formats that can represent Unicode; 2) Unicode was not widely adopted for a long time, until the Internet emerged.
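The storage concern can be made concrete with a small sketch that counts the bytes each of .NET's built-in Unicode encodings needs for an English letter and for 梁. The WidthDemo class and its Size helper are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

static class WidthDemo
{
    // Number of bytes the given encoding needs for the text (no byte order mark included).
    public static int Size(Encoding e, string text) => e.GetByteCount(text);

    static void Main()
    {
        // "\u6881" is the character 梁 used throughout the article.
        foreach (string text in new[] { "A", "\u6881" })
        {
            Console.WriteLine(
                $"UTF-8: {Size(Encoding.UTF8, text)}, " +
                $"UTF-16: {Size(Encoding.Unicode, text)}, " +
                $"UTF-32: {Size(Encoding.UTF32, text)}");
        }
        // UTF-8: 1, UTF-16: 2, UTF-32: 4
        // UTF-8: 3, UTF-16: 2, UTF-32: 4
    }
}
```

A fixed four-byte format spends three extra bytes on every English letter, which is exactly the waste described above; a variable-length format avoids it.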

5. UTF-8

As the Internet spread, the demand for a unified encoding grew strong. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, which are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways of implementing Unicode.

One of the biggest features of UTF-8 is that it is a variable length coding method. It can use 1 to 4 bytes to represent a symbol, and the byte length varies according to different symbols.

UTF-8's encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the Unicode code of the symbol.

Conversion between Unicode and UTF-8

Unicode code range (hexadecimal) | UTF-8 byte stream (binary)
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5-byte and 6-byte rows come from the original UTF-8 definition. The current standard (RFC 3629) limits UTF-8 to at most 4 bytes, covering codes up to U+10FFFF.

Here is an example.

We use the following code:

string s = "梁";
byte[] unicode = Encoding.Unicode.GetBytes(s);
byte[] utf8 = Encoding.UTF8.GetBytes(s);

In the debugger, you can see that unicode holds 129, 104 and utf8 holds 230, 162, 129.

The bytes in unicode are stored low byte first: 104 is hexadecimal 68 and 129 is hexadecimal 81, so the Unicode code of 梁 is hexadecimal 6881, binary 110100010000001. From the table above, 6881 falls in the third row (0800-FFFF), so the UTF-8 encoding of 梁 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of 梁, fill in the x positions of the format from back to front, padding any leftover positions with 0. This gives the UTF-8 encoding of 梁 as "11100110 10100010 10000001"; converted byte by byte to decimal, that is 230, 162, 129, exactly the values seen in utf8 in the debugger.
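The bit-filling procedure can be sketched in code. The following is an illustrative hand-written encoder for the 1 to 4 byte rows of the table, compared against Encoding.UTF8; the Utf8Manual class and its Encode method are assumptions made for this example, and the sketch does no validation (for instance, it does not reject surrogate values).

```csharp
using System;
using System.Text;

static class Utf8Manual
{
    // Encodes one code point (0 to 0x10FFFF) by hand, following the 1 to 4 byte table rows.
    // Each continuation byte is 10xxxxxx: the marker 0x80 plus six payload bits.
    public static byte[] Encode(int cp)
    {
        if (cp < 0x80)      // 0xxxxxxx
            return new[] { (byte)cp };
        if (cp < 0x800)     // 110xxxxx 10xxxxxx
            return new[] { (byte)(0xC0 | (cp >> 6)), (byte)(0x80 | (cp & 0x3F)) };
        if (cp < 0x10000)   // 1110xxxx 10xxxxxx 10xxxxxx
            return new[]
            {
                (byte)(0xE0 | (cp >> 12)),
                (byte)(0x80 | ((cp >> 6) & 0x3F)),
                (byte)(0x80 | (cp & 0x3F))
            };
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new[]
        {
            (byte)(0xF0 | (cp >> 18)),
            (byte)(0x80 | ((cp >> 12) & 0x3F)),
            (byte)(0x80 | ((cp >> 6) & 0x3F)),
            (byte)(0x80 | (cp & 0x3F))
        };
    }

    static void Main()
    {
        byte[] manual = Encode(0x6881);                     // code of 梁
        byte[] library = Encoding.UTF8.GetBytes("\u6881");  // same character via .NET
        Console.WriteLine(string.Join(" ", manual));   // 230 162 129
        Console.WriteLine(string.Join(" ", library));  // 230 162 129
    }
}
```

Shifting right by 6 or 12 and masking with 0x3F is the same "fill from back to front" step done six bits at a time.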

6. Converting UTF-8 to GB2312 in C#

All strings in memory in .NET are Unicode, so a test program for this conversion is awkward to write as a console application. You can write your own based on the following code:

string UTF8ToGb2312(string str)
{
    string gb2312info = string.Empty;
    Encoding utf8 = Encoding.UTF8;
    Encoding gb2312 = Encoding.GetEncoding("gb2312");
    // Encode the string as UTF-8 bytes, then transcode those bytes to GB2312.
    byte[] unicodeBytes = utf8.GetBytes(str);
    byte[] asciiBytes = Encoding.Convert(utf8, gb2312, unicodeBytes);
    // Decode the GB2312 bytes back into a .NET string.
    char[] asciiChars = new char[gb2312.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
    gb2312.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
    gb2312info = new string(asciiChars);
    return gb2312info;
}
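Because a .NET string is always Unicode in memory, the conversion is easiest to observe at the byte level. The sketch below is one possible byte-to-byte variant, not the article's method: the Gb2312Demo class and Utf8ToGb2312 name are hypothetical, and it assumes a runtime where CodePagesEncodingProvider is available (it ships with .NET Core 3.0 and later; on .NET Framework the "gb2312" encoding is already built in).

```csharp
using System;
using System.Text;

static class Gb2312Demo
{
    // Transcodes UTF-8 bytes into GB2312 bytes.
    public static byte[] Utf8ToGb2312(byte[] utf8Bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page encodings
        // must be registered before GetEncoding("gb2312") can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("gb2312");
        return Encoding.Convert(Encoding.UTF8, gb2312, utf8Bytes);
    }

    static void Main()
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u6881"); // 梁 as UTF-8: 230 162 129
        byte[] gbBytes = Utf8ToGb2312(utf8Bytes);
        Console.WriteLine(string.Join(" ", gbBytes)); // 193 186
    }
}
```

The output matches the two GB2312 bytes shown for 梁 in section 2, while the UTF-8 input is the three-byte sequence derived in section 5.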

7. Advantages of UTF-8

UTF-8 is a universal encoding that covers the world's languages. If you visit a gb2312-encoded website on an operating system set up for another language, you may need to download a language pack, so for the sake of a website's portability UTF-8 is the better choice. By comparison, though, gb2312 produces less data than UTF-8 for Chinese text: two bytes per Chinese character instead of three.

8. Garbled text

If you have a string in memory, in a file, or in an e-mail, you need to know which encoding it uses; otherwise it cannot be interpreted or displayed to the user correctly. If a character has no equivalent in the encoding you are trying to use, a small question mark or a box is usually displayed instead. All strings in memory in .NET are Unicode, while ASP.NET programs use UTF-8 by default. When some strings come out garbled, the first thing to check is whether they are being interpreted with the wrong encoding.
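Both failure modes are easy to reproduce. The following sketch writes text with one encoding and reads the bytes back with another; the MojibakeDemo class and Reinterpret helper are hypothetical names, and ISO-8859-1 is chosen only as a convenient example of a mismatched reader.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Encodes text with one encoding and decodes the resulting bytes with another.
    public static string Reinterpret(string text, Encoding writer, Encoding reader)
        => reader.GetString(writer.GetBytes(text));

    static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string original = "\u6881"; // 梁

        // Same encoding on both sides: the text survives.
        Console.WriteLine(Reinterpret(original, Encoding.UTF8, Encoding.UTF8) == original); // True

        // Written as UTF-8 but read as ISO-8859-1: each of the three bytes
        // becomes its own unrelated character.
        Console.WriteLine(Reinterpret(original, Encoding.UTF8, latin1).Length); // 3

        // No equivalent in the target encoding: ASCII substitutes a question mark.
        Console.WriteLine(Reinterpret(original, Encoding.ASCII, Encoding.ASCII)); // ?
    }
}
```

In the second case the bytes are intact and the text can be recovered by decoding with the right encoding; in the third case the original character is lost for good.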
