What Are Character Sets and Character Encodings in Python

2025-02-24 Update From: SLTechnology News&Howtos, Shulou (Shulou.com) 05/31

This article explains what character sets and character encodings are in Python and works through them with concrete examples.

First, the basic unit of computer storage is the byte, which consists of 8 bits. English has only 52 letters counting upper and lower case, plus digits and a handful of symbols; the total stays well under 256, so a single byte is enough to represent any of them. As computers spread, more and more non-English characters had to be supported, and one byte was no longer enough. The workaround was an indirect one: characters that do not fit in one byte are represented with several bytes.
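
The arithmetic above can be checked directly in Python. This is only a small sketch; the sample characters and the choice of gbk are assumptions made for illustration.

```python
import string

# 26 lowercase + 26 uppercase English letters.
letter_count = len(string.ascii_letters)
print(letter_count)            # 52

# One byte is 8 bits, so it can hold 2**8 distinct values: 0 through 255.
print(2 ** 8)                  # 256
largest = bytes([255])         # 255 still fits in a single byte
try:
    bytes([256])               # 256 does not fit
except ValueError as e:
    overflow_error = str(e)
    print(overflow_error)

# An English letter fits in one byte; a Chinese character needs more than one.
print(len("A".encode("gbk")))   # 1
print(len("中".encode("gbk")))  # 2
```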

But this approach has two problems:

Each country defined its own character encoding, so mixing languages is not supported. For example, a Chinese encoding does not cover all of Japanese, and text read with the wrong country's encoding comes out garbled.

There is no single unified standard. Chinese alone has GB2312, GBK, GB18030, and others.
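
Both problems can be observed with Python's built-in codecs. The sketch below uses a hypothetical helper, `can_encode`, and sample characters chosen for illustration (a simplified Chinese character, a traditional one, a Korean syllable, and an emoji); none of them come from the original article.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

samples = ["中", "龍", "한", "😀"]
for ch in samples:
    # Check the same character against three Chinese standards.
    support = {enc: can_encode(ch, enc) for enc in ("gb2312", "gbk", "gb18030")}
    print(ch, support)
```

The three Chinese standards accept different sets of characters, and the older ones reject text from other languages outright.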

Before going further, let's clarify a few concepts.

Character sets and character encodings

Many readers are probably unsure of the difference between the two, so let's first explain what a character set and a character encoding actually are.

Character set: the collection of all characters a system supports. ASCII, GB2312, Big5, and Unicode are all character sets; they differ in how many characters they contain. For example, the ASCII character set contains no Chinese characters, while Unicode aims to cover the characters of every writing system in the world.
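
Python's `str` type is built on the Unicode character set, so the position of a character in that set (its code point) can be inspected with the built-in `ord` and `chr`. A short illustration, with sample characters chosen here as an assumption:

```python
# ord() gives a character's Unicode code point; chr() goes the other way.
print(ord("A"))          # 65
print(ord("你"))         # 20320
print(hex(ord("你")))    # 0x4f60
print(chr(0x4F60))       # 你

# The ASCII character set covers only code points 0 through 127.
print("A".isascii())     # True
print("你".isascii())    # False
ascii_size = sum(1 for i in range(256) if chr(i).isascii())
print(ascii_size)        # 128
```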

Character encoding: the rule that converts each character into one or more numbers the computer can store, so a character encoding maintains the correspondence between characters and numbers. There are many encodings, such as ascii, gbk, and utf-8. Different encodings map the same character to different numbers, and they also differ in which characters they can convert at all. The ASCII encoding, for example, can only convert ASCII characters.
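
To make the difference concrete, the following sketch encodes one character with several encodings and prints the resulting numbers in hexadecimal. The character and the list of encodings are chosen here for illustration.

```python
ch = "你"

# The same character maps to different numbers under different encodings.
encoded = {enc: ch.encode(enc).hex() for enc in ("gbk", "utf-8", "utf-16-be")}
for enc, hex_bytes in encoded.items():
    print(enc, hex_bytes)
# gbk c4e3
# utf-8 e4bda0
# utf-16-be 4f60

# The ascii encoding cannot convert a non-ASCII character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as e:
    ascii_ok = False
    print(e)
```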

ASCII is a special case because it is both a character set and a character encoding. In addition, gbk, utf-8, and other ASCII-compatible encodings all map ASCII characters to the same numbers as ASCII itself. Encodings that are not ASCII-compatible, such as UTF-16, are the exception.
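
A quick way to see this is to encode the same ASCII-only string with several codecs and compare the results. The list of encodings below is an illustrative assumption, not something taken from the article.

```python
text = "abc"

# ASCII-compatible encodings all produce the same bytes for ASCII text.
compatible = ("ascii", "latin-1", "gbk", "big5", "shift_jis", "utf-8")
results = {enc: text.encode(enc) for enc in compatible}
for enc, data in results.items():
    print(enc, data)                 # every line shows b'abc'

# UTF-16 is not ASCII-compatible: each character takes two bytes.
utf16 = text.encode("utf-16-le")
print(utf16)                         # b'a\x00b\x00c\x00'
```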

Converting each character of a string to its corresponding number or numbers yields a byte sequence (a bytes object). Because the byte is the basic unit of both storage and network communication, a string must be stored or transmitted as a byte sequence.

A string and a byte sequence can therefore be converted into each other. Calling encode on a string with a specified encoding produces the byte sequence, that is, each character is converted to its corresponding number or numbers; calling decode on the byte sequence with the same encoding recovers the string, that is, each character is looked up from its number or numbers.
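
As a minimal sketch of this round trip (the sample text is an assumption chosen for illustration):

```python
text = "Python 字符编码"

round_trips = {}
for enc in ("gbk", "utf-8"):
    data = text.encode(enc)        # str -> bytes
    restored = data.decode(enc)    # bytes -> str, using the same encoding
    round_trips[enc] = (data, restored)
    print(enc, type(data).__name__, len(data), restored == text)
```

The byte sequences differ between the two encodings, but decoding each one with the encoding that produced it returns the original string.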

For example, when we write a piece of text and want to store it, we must first encode it: only after each character has been converted into the one or more numbers the system accepts can it be stored.

    s = "你好"
    # Encoding turns the string into a sequence of numbers (bytes)
    print(s.encode("gbk"))  # b'\xc4\xe3\xba\xc3'

Suppose the text contains only "你好" and is stored using gbk. It must then also be decoded with gbk when it is read back; otherwise decoding either raises an error or produces garbled text, because a different encoding maps characters to different numbers.
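
Both failure modes can be reproduced. The sketch below uses a hypothetical helper, `try_decode`, and adds latin-1 (not mentioned in the article) as an example of an encoding that accepts any byte and therefore garbles silently instead of failing.

```python
def try_decode(data: bytes, encoding: str) -> str:
    """Decode `data`, returning a short marker string instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as e:
        return f"<error: {e.reason}>"

stored = "你好".encode("gbk")           # b'\xc4\xe3\xba\xc3'

print(try_decode(stored, "gbk"))       # 你好
print(try_decode(stored, "utf-8"))     # <error: invalid continuation byte>
print(try_decode(stored, "latin-1"))   # ÄãºÃ
```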

For example, each country has its own character encoding. A file written on a computer in Japan is likely to appear garbled when opened on a computer in China: the two systems assume different encodings, the mapping between characters and numbers differs, and decoding with the wrong encoding is bound to go wrong.
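
This scenario can be simulated by passing the `encoding` argument to `open`. The file name, the sample text, and the choice of shift_jis for the "Japanese" machine and gbk for the "Chinese" one are all assumptions made for this sketch.

```python
import os
import tempfile

text = "こんにちは"  # hypothetical file content

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")  # hypothetical file name

    # "Computer in Japan": the file is written as shift_jis.
    with open(path, "w", encoding="shift_jis") as f:
        f.write(text)

    # Reading with the encoding it was written in recovers the text.
    with open(path, encoding="shift_jis") as f:
        right = f.read()

    # "Computer in China": the same bytes are interpreted as gbk.
    # errors="replace" keeps this demo from raising on undecodable bytes.
    with open(path, encoding="gbk", errors="replace") as f:
        wrong = f.read()

print(right == text)   # True
print(wrong == text)   # False
```

Stating the encoding explicitly on both the writing and the reading side avoids depending on whatever default each machine happens to use.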

As noted above, though, ASCII characters map to the same numbers under all of these encodings, so the choice among them has no effect on ASCII-only text.

    s = "abc"
    print(s.encode("gbk"))                  # b'abc'
    print(s.encode("gbk").decode("utf-8"))  # abc

    # But if the text is not ASCII:
    try:
        s = "你好"
        s.encode("gbk").decode("utf-8")
    except UnicodeError as e:
        # An error is raised: the bytes cannot be decoded
        print(e)  # 'utf-8' codec can't decode byte 0xc4 in position 0: invalid continuation byte

Recall that a bytes object can be created with a literal such as b"abc", but b"汉" is not allowed. The character 汉 is not an ASCII character, so different encodings map it to different numbers, and Python has no way of knowing which encoding we intend. Python therefore rejects the literal, and the encoding has to be specified explicitly with "汉".encode(...).
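
The rejection happens at compile time, so it can be demonstrated by compiling the literal from a string. The character 汉 used below is an assumption chosen for illustration, and escaping the bytes by hand is shown only as one possible alternative to calling encode.

```python
# A bytes literal containing a non-ASCII character does not even compile.
try:
    compile('b"汉"', "<demo>", "eval")
    literal_ok = True
except SyntaxError as e:
    literal_ok = False
    print(e.msg)

# An ASCII-only bytes literal compiles and evaluates normally.
ascii_literal = eval(compile('b"abc"', "<demo>", "eval"))

# Stating the encoding explicitly works ...
utf8_bytes = "汉".encode("utf-8")
gbk_bytes = "汉".encode("gbk")
print(utf8_bytes, gbk_bytes)

# ... and so does writing the already-encoded bytes as escapes.
escaped = b"\xe6\xb1\x89"
print(escaped == utf8_bytes)   # True
```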

For ASCII characters the resulting number is the same under any of these encodings, so Python allows literals such as b"abc". We also saw that a Chinese character becomes several numbers after encoding, each occupying one byte, so different characters can take up different numbers of bytes.
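
Iterating over a bytes object yields those one-byte numbers, and `len` shows how many bytes each character occupies. The sample characters below are assumptions chosen to cover one- to four-byte cases in utf-8.

```python
# Each element of a bytes object is an integer from 0 to 255.
numbers = list("你好".encode("gbk"))
print(numbers)                 # [196, 227, 186, 195]

# Different characters need different numbers of bytes in utf-8.
sizes = {ch: len(ch.encode("utf-8")) for ch in ("a", "\u00e9", "汉", "😀")}
for ch, size in sizes.items():
    print(repr(ch), size)      # 1, 2, 3 and 4 bytes respectively

# The same character can also have different sizes in different encodings.
print(len("汉".encode("gbk")), len("汉".encode("utf-8")))   # 2 3
```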
