Python strings and character encoding in detail

2025-01-18 Update From: SLTechnology News&Howtos

Shulou (Shulou.com) 06/02

This article explains how Python strings are composed and how character encoding works.

Bytes and characters

All data a computer stores, whether text, pictures, video, audio, or software, is ultimately a sequence of bytes made up of 0s and 1s, where one byte equals 8 bits.

A character is a symbol: a Chinese character, an English letter, a digit, or a punctuation mark can each be called a character.

Bytes are convenient for storage and network transmission, while characters are meant for display and are easy to read. For example, the character "p" is stored on disk as the binary data 01110000, which occupies one byte.
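As a minimal sketch of the relationship between a character and its byte (written for Python 3; the variable names are only illustrative), the following shows that "p" maps to the single byte 01110000:

```python
# Illustrative sketch (Python 3): one ASCII character maps to one byte.
char = "p"

# ord() gives the character's code point; for "p" this is 112.
code_point = ord(char)

# Encode the character into the raw byte that would be written to disk.
raw = char.encode("ascii")

# Render that byte as 8 binary digits.
bits = format(raw[0], "08b")

print(code_point, raw, bits)  # 112 b'p' 01110000
```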

Encoding and decoding

The text we open in an editor, and every character we see in it, is ultimately saved on disk as a sequence of binary bytes. Converting characters to bytes is called encoding, and the reverse is called decoding; the two form a reversible process. Encoding serves storage and transmission, while decoding serves display and reading.

For example, the character "p" is encoded and saved to disk as the binary byte 01110000, occupying one byte. The character "禅" (Zen) may be stored as "11100111 10100110 10000101", occupying 3 bytes. Why "may"? That is explained later.
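The round trip described here can be sketched as follows. This is an illustrative Python 3 example, and it assumes UTF-8 as the encoding, which is the one that yields the three bytes quoted above:

```python
# Illustrative sketch (Python 3): encode turns text into bytes, decode reverses it.
text = "\u7985"  # the character 禅 ("Zen"), code point U+7985

encoded = text.encode("utf-8")                       # text -> bytes
bits = " ".join(format(b, "08b") for b in encoded)   # each byte as 8 binary digits
decoded = encoded.decode("utf-8")                    # bytes -> text

print(len(encoded), bits)   # 3 11100111 10100110 10000101
print(decoded == text)      # True
```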

Why is encoding in Python so painful? The developers, of course, are not to blame.

The first reason is that Python 2 uses ASCII as its default encoding, and ASCII cannot represent Chinese. So why not UTF-8? Because Guido van Rossum, the father of Python, wrote the first line of Python code in the winter of 1989 and officially released the first version in February 1991, while Unicode was released in October 1991. In other words, UTF-8 did not yet exist when Python was created.

Python 2 also splits strings into two types, unicode and str, which leaves developers confused. That is the second reason. Python 3 completely reworked strings and kept only one type, which is a topic for later.

str and unicode

Python 2 divides strings into two types, unicode and str. A str is essentially a sequence of binary bytes. For example, the str value "禅" (Zen) is displayed as the hexadecimal escape \xec\xf8, which corresponds to the binary byte sequence '11101100 11111000'.

The unicode value u"禅" corresponds to the Unicode symbol u'\u7985'.
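The distinction between the two types can be sketched in Python 3, under the assumption that Python 2's unicode corresponds to Python 3's str and Python 2's str corresponds to Python 3's bytes. The choice of GBK below is also an assumption, made only to show a two-byte result:

```python
# Illustrative sketch (Python 3). Assumed mapping to the Python 2 names above:
#   Python 2 unicode -> Python 3 str   (text)
#   Python 2 str     -> Python 3 bytes (raw byte sequence)
text = "\u7985"   # the character 禅, written by its code point

gbk_bytes = text.encode("gbk")      # byte form under GBK (assumed encoding)
utf8_bytes = text.encode("utf-8")   # byte form under UTF-8

print(type(text).__name__, hex(ord(text)))         # str 0x7985
print(type(gbk_bytes).__name__, len(gbk_bytes))    # bytes 2
print(utf8_bytes)                                  # b'\xe7\xa6\x85'
```

The text value is a single symbol identified by its code point, while each byte value is only one possible stored form of that symbol.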

To save unicode symbols to a file or send them over the network, they must first be encoded into the str type. Python therefore provides an encode method to convert unicode to str, and a decode method for the reverse direction.

Many beginners cannot remember whether converting between str and unicode calls for encode or decode. It helps to remember that a str is essentially a string of binary data, while a unicode value is made of characters (symbols). Encoding is the process of converting characters into binary data, so going from unicode to str uses encode, and going from str to unicode uses decode.

In other words: "Encoding always takes a Unicode string and returns a bytes sequence, and decoding always takes a bytes sequence and returns a Unicode string."

Now that we have a clear conversion relationship between str and unicode, let's see when UnicodeEncodeError and UnicodeDecodeError errors occur.

Error log

UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-7: ordinal not in range(128)

Why is there UnicodeEncodeError?

When the write method is called, Python first checks the type of the string. If it is a str, Python writes it to the file directly without any encoding, because a str is already a sequence of binary bytes.

If the string is of type unicode, Python first calls encode to convert it to a binary str before saving it to the file, and that encode call uses Python's default ASCII encoding.

However, the ASCII character set contains only 128 characters (Latin letters, digits, punctuation, and control characters) and no Chinese characters, hence the error "'ascii' codec can't encode characters". To use encode correctly, you must specify a character encoding that covers Chinese characters, such as UTF-8 or GBK. So to write a unicode string to a file correctly, encode it to UTF-8 or GBK beforehand. There are other ways to write a unicode string to a file correctly, but the principle is the same and they are not covered here. The same principle applies to writing strings to a database or sending them over the network.

UnicodeDecodeError

When the byte sequence '\xe7\xa6\x85', produced by UTF-8 encoding, is decoded into a unicode string with GBK, a UnicodeDecodeError occurs. For Chinese characters, GBK uses two bytes while UTF-8 uses three, so decoding with GBK leaves one extra byte that cannot be parsed. The key to avoiding UnicodeDecodeError is to decode with the same encoding that was used to encode.

This also answers the question raised earlier about the character "禅" (Zen): it may occupy 3 bytes or 2 bytes when saved to a file, depending on the encoding passed to encode.
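The mismatch can be reproduced with a short Python 3 sketch (variable names are illustrative only). It decodes the UTF-8 bytes with the wrong codec, then with the matching one, and compares the byte lengths under both encodings:

```python
# Illustrative sketch (Python 3): decoding must use the codec that did the encoding.
utf8_bytes = b"\xe7\xa6\x85"     # 禅 encoded with UTF-8 (3 bytes)

try:
    utf8_bytes.decode("gbk")     # wrong codec: GBK reads byte pairs, one byte is left over
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

text = utf8_bytes.decode("utf-8")   # matching codec succeeds
gbk_bytes = text.encode("gbk")      # the same character under GBK

print(type(decode_error).__name__)        # UnicodeDecodeError
print(len(utf8_bytes), len(gbk_bytes))    # 3 2
print(gbk_bytes.decode("gbk") == text)    # True
```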

When a str value, call it y, is joined with a unicode value x using the + operator, Python implicitly decodes the bytes of y into the same unicode type as x. That implicit conversion uses the default ASCII encoding, and ASCII does not cover Chinese, so an error is raised when y contains Chinese characters. The correct approach is to decode y explicitly with UTF-8 or GBK.
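Python 3 refuses to concatenate text and bytes at all, so the sketch below writes out the ASCII decode that Python 2 performed implicitly. The values of x and y are hypothetical, and UTF-8 is assumed as the encoding of y:

```python
# Illustrative sketch (Python 3). Python 3 does not mix str and bytes,
# so the implicit ASCII decode of Python 2 is written out explicitly here.
x = "Python"                           # text (Python 2: unicode)
y = "\u4e4b\u7985".encode("utf-8")     # raw bytes for "之禅" (Python 2: str)

try:
    x + y.decode("ascii")              # what Python 2 attempted implicitly
    implicit_error = None
except UnicodeDecodeError as exc:
    implicit_error = exc

combined = x + y.decode("utf-8")       # explicit decode with the matching codec

print(type(implicit_error).__name__)       # UnicodeDecodeError
print(combined == "Python\u4e4b\u7985")    # True
```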

