2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
Today I will walk through how to use Python to sort out character encoding problems, a topic many people are not very familiar with. To make it easier to follow, I have summarized the key points below. I hope you take something useful away from this article.
Unicode, the universal code
To put an end to every language solving its encoding problem in its own way, people proposed the Unicode scheme, which is simple and blunt: assign one unified code to every character of every language in the world. Its two original schemes, UCS-2 and UCS-4, use two and four bytes per character, giving 2^16 and nominally 2^32 possible values. Today the Unicode standard limits code points to the range 0x0000 to 0x10FFFF (1,114,112 positions), which should still be enough space until aliens visit Earth.
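As a quick check of those sizes, here is a minimal sketch for Python 3 (3.3 or later is assumed, where `sys.maxunicode` is always 0x10FFFF). The variable names are illustrative only.

```python
import sys

# Number of distinct values that fit in a 2-byte and a 4-byte unit.
ucs2_space = 2 ** 16
four_byte_space = 2 ** 32

# The Unicode standard itself stops at U+10FFFF.
unicode_code_points = sys.maxunicode + 1

print(ucs2_space, four_byte_space, unicode_code_points)

# chr() rejects any integer beyond the Unicode range.
try:
    chr(sys.maxunicode + 1)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(out_of_range_rejected)
```

So although a four-byte unit could hold far more values, Python only accepts code points up to 0x10FFFF.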
Let's see what the Unicode code point (code point) of a few characters looks like:
Ls = 'abAB Gong ★☆' print ([ord (l) for l in ls])
Result: [97, 98, 65, 66, 24041, 9733, 9734]. The Unicode code points of the letters abAB are the same as their ASCII codes, so the two schemes are compatible for ASCII letters. The code point of the Chinese character 巩, however, is 24041 (0x5de9), which differs from its GB series code 47534 (0xb9ae). Unicode and the GB series are therefore not fully compatible: they agree only in the ASCII range.
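The comparison above can be reproduced with a short sketch. It assumes the `gb2312` codec that ships with the Python standard library as a stand-in for "the GB series"; the variable names are made up for illustration.

```python
ch = '巩'

unicode_code_point = ord(ch)                # position in the Unicode table
gb_bytes = ch.encode('gb2312')              # the same character under GB2312
gb_code = int.from_bytes(gb_bytes, 'big')   # read the two bytes as one number

print(hex(unicode_code_point), hex(gb_code))

# ASCII letters keep the same single-byte value under GB2312.
ascii_same = all(c.encode('gb2312') == bytes([ord(c)]) for c in 'abAB')
print(ascii_same)
```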
Once every country uses Unicode, the problems of extension and garbled text disappear: every character in every human language has one unified code point, and every numeric code we exchange corresponds to exactly one character. The chr function in Python returns the character for a given Unicode code point.
>>> print([chr(i) for i in [123, 957, 24041]])
['{', 'ν', '巩']
So can we use the powerful Unicode to code?
> ls = 'abAB ★☆' > ls.encode ('Unicode') Traceback (most recent call last): File ", line 1, in LookupError: unknown encoding: Unicode
Unknown encoding: Unicode! That is because there is no encoding called Unicode. Unicode is only a code point table: it establishes the mapping between characters and integers. How an integer code point is stored as bytes, whether the high or low byte comes first, and whether any special marker is added are not decided by Unicode itself. Those details are left to the concrete encodings: UTF-32, UTF-16 and UTF-8.
UTF-32 four-byte unit
UTF-32, as its name implies, is an encoding scheme that uses 32 bits, or four bytes, to store a character.
> 'aAgong' .encode ('utf-32LE') bachela\ x00\ x00\ x00A\ x00\ x00\ xe9\ x5d\ x00\ x00'
As you can see, every character is stored in four bytes: whatever the Unicode code point does not need is padded with \x00. The method is simple and clear, because the code point is written out directly without any conversion. But the large number of \x00 bytes wastes a lot of space. Is there a way to avoid this waste? Could we squeeze a character into two bytes?
UTF-16 is a two-byte unit
When using UTF-16 to encode.
> 'aAgong' .encode ('utf-16LE') bachela\ x00A\ x00\ xe9\ x5d'
Two bytes are sufficient for most Unicode code points, and if not, the system is automatically represented by four bits. This is the implementation of the system, we do not need to care. The byte sequences and characters encoded by UTF-16 can still correspond one by one. UTF-16 actually has two encoding methods, which are UTF-16LE in the above example and UTF-16BE in the following example. Test:
> 'aAgong' .encode ('utf-16BE') b'\ x00a\ x00A\ x5d\ xe9'
The two results are the same except that the high and low bytes of each unit are swapped. The LE and BE suffixes stand for little endian and big endian. They specify whether the most significant byte (MSB) of each unit is stored last (LE) or first (BE).
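The byte swap can be reproduced without any codec by using `int.to_bytes`, as in this minimal sketch (variable names are illustrative):

```python
code_point = ord('巩')   # 0x5de9

little = code_point.to_bytes(2, 'little')   # least significant byte first
big = code_point.to_bytes(2, 'big')         # most significant byte first
print(little.hex(), big.hex())

# For a code point that fits in one unit, this is all the UTF-16 codecs do.
matches_le = little == '巩'.encode('utf-16le')
matches_be = big == '巩'.encode('utf-16be')
print(matches_le, matches_be)
```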
The names come from Gulliver's Travels, where the people of Lilliput argue over whether an egg should be opened at the big end or the small end. The dispute splits them into two opposing camps, the Big-Endians and the Little-Endians, who go to war over it many times.
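Returning to the four-byte case of UTF-16 mentioned earlier: the sketch below shows one way the split into two units can be computed by hand. The function name `utf16_surrogates` is hypothetical, and the sample character U+1F600 does not appear in the original article; it is chosen only because it lies above 0xFFFF.

```python
def utf16_surrogates(code_point):
    """Split a code point above 0xFFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above 0xFFFF need surrogates')
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the first unit
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the second unit
    return high, low


ch = '\U0001F600'
high, low = utf16_surrogates(ord(ch))
manual = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')

print(hex(high), hex(low))
print(manual == ch.encode('utf-16be'), len(manual))
```

The hand-built bytes match what the `utf-16be` codec produces, so a single character occupies four bytes here.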
So two bytes is the limit of Unicode coding?
UTF-8 variable length byte coding
Can you store text with a variable number of bytes? If you store English text, you can use only one byte for each character; if you have Chinese characters, you can expand them. In this way, storage space is further saved. The answer is yes, this is variable length coded UTF-8.
> 'aAgong' .encode ('utf-8') b'aA\ xe5\ xb7\ xa9'
This is the shortest byte sequence so far, because a and A take one byte each. Note that in UTF-32 and UTF-16 the character 巩 is stored as its code point value 0x5de9 (with the two bytes swapped in the little-endian variants), but in UTF-8 the byte sequence becomes 0xe5b7a9. UTF-8 does not simply write the Unicode code point into the byte sequence; it applies a transformation. This transformation guarantees one-byte storage for English letters and multi-byte storage for larger code points such as Chinese characters. So how does the conversion work?
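To see the variable length directly, the following sketch counts the bytes each sample character needs under the three encodings. The Greek letter and the U+1F600 character are extra samples picked for illustration; they are not from the original article.

```python
# 'a', Greek small nu (U+03BD), the Chinese character 巩, and U+1F600.
samples = ['a', '\u03bd', '巩', '\U0001F600']
codecs_to_try = ['utf-8', 'utf-16le', 'utf-32le']

byte_counts = {
    ch: [len(ch.encode(name)) for name in codecs_to_try]
    for ch in samples
}

for ch, counts in byte_counts.items():
    # Columns: UTF-8, UTF-16LE, UTF-32LE
    print(f'U+{ord(ch):04X}', counts)
```

UTF-8 grows from one to four bytes as the code point grows, UTF-16 uses two or four, and UTF-32 always uses four.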
UTF-8 encoding rules
This part is quite detailed and can be skipped. UTF-8 is a variable-length encoding. To tell how long each character is, the byte sequence has to follow special templates. In the templates below, x marks a bit taken from the code point. UTF-8 follows these rules:
Code points between 0x00 and 0x7f are compatible with ASCII and are stored directly in a single byte. The template is 0xxxxxxx.
Between 0x80 and 0x7ff, two bytes are used. The template is 110xxxxx 10xxxxxx.
Between 0x800 and 0xffff, three bytes are used. The template is 1110xxxx 10xxxxxx 10xxxxxx.
Between 0x10000 and 0x1fffff, four bytes are used. The template is 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. (Unicode itself ends at 0x10ffff.)
Take the Chinese character 汉 as an example. Its Unicode code point is 0x6c49, which is 0110 1100 0100 1001 in binary. It falls in the range of the third rule, so it needs three bytes. Write out the template, 1110xxxx 10xxxxxx 10xxxxxx, and fill in the 16 bits from left to right: 11100110 10110001 10001001, that is, 0xe6b189. With the transcoding details covered, let's go back and compare the encoded lengths of the three UTF encodings.
Text length under the three UTF encodings
With these three encodings, the same text produces files of different lengths because their compression ratios differ. The following program compares the encoded length of an English string and a Chinese string (a 16-character line of verse, punctuation included) under each encoding:
es = 'abcdefghij'
cs = '莫愁前路无知己，天下谁人不识君。'
codes = ['utf-32le', 'utf-16le', 'utf-8']
print([len(es.encode(code)) for code in codes])
print([len(cs.encode(code)) for code in codes])
The output is [40, 20, 10] and [64, 32, 48]. For English text, UTF-8 is more compact than both UTF-16 and UTF-32; for Chinese text, UTF-16 is the most compact. This is because UTF-16 stores most Chinese characters in two bytes, while UTF-8 needs three. In everyday use, UTF-8 is the most widely adopted because it offers the best compatibility.
So far we have moved from ASCII to the GB series and then to Unicode with its UTF encodings, arriving at a character encoding system that covers every script, avoids garbled text and stores text compactly. Is that all we need? No. We have only encoded the text itself; nothing records which encoding was used. When we send a file, the recipient does not know which encoding to open it with unless we tell them. That problem is left for the next article.
Unicode unifies the characters of all the world's languages. Among its encoding forms:
UTF-32 is simple but wasteful: every character takes four bytes.
UTF-16 saves space by storing most characters in two bytes, and four bytes for the rest.
UTF-8 stores ASCII characters directly in one byte and other characters in two to four bytes, striking a balance between efficiency and space.
© 2024 shulou.com SLNews company. All rights reserved.