What Is Python Character Encoding?

This article takes a detailed look at character encoding in Python: how encodings evolved from ASCII through MBCS to Unicode, how Python 2.x distinguishes str from unicode, and which habits keep encoding bugs away. Follow the ideas step by step and the topic should stop being mysterious.
1. Introduction to character encodings

1.1. ASCII
ASCII (American Standard Code for Information Interchange) is a single-byte encoding. In the early days of computing there was only English, and a single byte can distinguish 256 different values, enough for all English characters and many control codes. ASCII, however, uses only the lower half of them (values below \x80), and that unused upper half is the foundation on which MBCS encodings could be built.
1.2. MBCS
Soon, however, other languages entered the computer world, and single-byte ASCII could no longer meet the demand. Each language then developed its own encoding. Because a single byte can represent too few characters, and because compatibility with ASCII was required, these encodings use multiple bytes per character: GBxxx, BIGxxx, and so on. Their rule is: if a byte is below \x80, it still represents an ASCII character; if it is \x80 or above, it forms one character together with the following byte (two bytes in total), after which decoding skips past that byte and continues.
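That rule is easy to see from Python 2 itself; a minimal sketch, assuming a standard CPython 2 build (which ships the GBK codec):

# coding: UTF-8
s = u'A汉B'.encode('GBK')   # mix ASCII letters with one CJK character
print repr(s)               # 'A\xba\xbaB' -- 'A' and 'B' stand alone, 汉 takes two bytes
print len(s)                # 4 -- four bytes for three characters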
At this point IBM invented a concept called the Code Page, which gathered these encodings up and assigned each a page number. GBK is page 936, that is, CP936, so CP936 can also be used as a name for GBK.
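A quick check of that naming in Python 2 (GBK and CP936 should resolve to the same codec):

# coding: UTF-8
print u'汉'.encode('CP936') == u'汉'.encode('GBK')   # True -- one codec, two names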
MBCS (Multi-Byte Character Set) is the collective name for these encodings. So far all of them happen to use at most two bytes, so it is sometimes also called DBCS (Double-Byte Character Set). It must be clear that MBCS is not one specific encoding: on Windows, MBCS refers to different encodings depending on the region you set, while on Linux the name MBCS cannot be used at all. On Windows you won't actually see the letters MBCS either, because Microsoft uses the grander-sounding name ANSI instead; the ANSI in Notepad's Save As dialog is MBCS. In the default locale of simplified-Chinese Windows, it means GBK.
1.3. Unicode
Later, some people decided that having so many encodings made the world too complicated and too painful, so they sat down together and came up with an idea: represent the characters of all languages with one and the same character set. That is Unicode.
The original Unicode standard, UCS-2, uses two bytes to represent a character, which is why you often hear that Unicode represents characters with two bytes. A while later, though, 256×256 code points turned out to be too few, so the UCS-4 standard appeared, which uses 4 bytes per character; still, UCS-2 remains the most widely used.
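You can probe which flavor your own Python 2 interpreter was built with:

import sys
print sys.maxunicode   # 65535 on "narrow" (UCS-2) builds, 1114111 on "wide" (UCS-4) builds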
UCS (Unicode Character Set) is only a table mapping characters to code points; for example, the code point of the character 汉 ("Han") is 6C49. How characters are actually transmitted and stored is the job of UTF (UCS Transformation Format).
At first this was very simple: just store the UCS code point directly. That is UTF-16; for example, 汉 is stored directly as \x6C\x49 (UTF-16-BE) or byte-reversed as \x49\x6C (UTF-16-LE). But Americans felt they were getting a bad deal: English letters used to need only one byte each, and now every one costs two, doubling the space consumption. So UTF-8 came out of nowhere.
UTF-8 is an awkward, variable-length encoding that stays compatible with ASCII: ASCII characters are still represented with a single byte. The savings, however, have to be clawed back somewhere else: you have probably heard that Chinese characters take 3 bytes in UTF-8, and characters that need 4 bytes fare even worse. (How UCS-2 maps to UTF-8 is left for you to look up.)
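A short Python 2 illustration of those sizes, using str.encode('hex') to show the raw bytes:

# coding: UTF-8
u = u'汉'                                  # code point U+6C49
print u.encode('UTF-16-BE').encode('hex')  # 6c49 -- the bare code point, big-endian
print u.encode('UTF-16-LE').encode('hex')  # 496c -- the same two bytes reversed
print u.encode('UTF-8').encode('hex')      # e6b189 -- three bytes in UTF-8
print 'ABC'.encode('hex')                  # 414243 -- ASCII text stays one byte each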
Another thing worth mentioning is the BOM (Byte Order Mark). When we save a file, the encoding used is not stored with it, so when opening the file we must somehow remember which encoding it was saved in and open it with that encoding, which causes a lot of trouble. (Think Notepad doesn't let you choose an encoding when opening a file? Open Notepad first and then use File -> Open, and you'll see the option.) UTF therefore introduces a BOM to announce its own encoding: if the first bytes read are one of the marks below, the text that follows is decoded with the corresponding encoding:
BOM_UTF8      '\xef\xbb\xbf'
BOM_UTF16_LE  '\xff\xfe'
BOM_UTF16_BE  '\xfe\xff'
Not every editor writes a BOM, and a file without one is still Unicode and can still be read; it just means that, exactly as with MBCS encodings, you have to specify the encoding yourself, otherwise decoding will fail.
You may have heard that UTF-8 does not need a BOM. That isn't quite true; it is just that most editors treat UTF-8 as the default encoding when no BOM is present. Even Notepad, which defaults to ANSI (MBCS) when saving, first tries UTF-8 when reading a file and sticks with UTF-8 if decoding succeeds. This awkward habit creates a bug: create a new text file, type "姹塧" (whose GBK bytes, \xe6\xb1\x89\x61, happen to be valid UTF-8 for "汉a"), save it as ANSI (MBCS), and reopen it: it displays as "汉a". You might as well try it. :)
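A minimal BOM-sniffing sketch in Python 2, assuming a hypothetical file test.txt; the codecs module exposes the marker bytes listed above as constants:

import codecs

def guess_by_bom(data):
    # return the encoding announced by a leading BOM, if any
    for bom, name in [(codecs.BOM_UTF8, 'UTF-8'),
                      (codecs.BOM_UTF16_LE, 'UTF-16-LE'),
                      (codecs.BOM_UTF16_BE, 'UTF-16-BE')]:
        if data.startswith(bom):
            return name
    return None   # no BOM: the encoding has to be known some other way

data = open('test.txt', 'rb').read()   # 'test.txt' is a placeholder file
print guess_by_bom(data)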
2. Encoding problems in Python 2.x

2.1. str and unicode
Both str and unicode are subclasses of basestring. Strictly speaking, str is actually a byte string: a sequence of bytes produced by encoding a unicode. Calling len() on the UTF-8-encoded str '汉' returns 3, because the UTF-8-encoded '汉' == '\xE6\xB1\x89'.

unicode is the true string, obtained by decoding the byte string str with the correct character encoding, and len(u'汉') == 1.
Let's look at the instance methods encode() and decode(). Once the difference between str and unicode is clear, the two are no longer confusing:
# coding: UTF-8

u = u'汉'
print repr(u)   # u'\u6c49'
s = u.encode('UTF-8')
print repr(s)   # '\xe6\xb1\x89'
u2 = s.decode('UTF-8')
print repr(u2)  # u'\u6c49'

# decoding a unicode is an error
# s2 = u.decode('UTF-8')
# likewise, encoding a str is also an error
# u2 = s.encode('UTF-8')
It is important to note that although calling encode() on a str is wrong, Python does not necessarily raise an exception: for pure-ASCII content it silently returns another str with the same content but a different id, because it first decodes the str with the default ASCII codec and then encodes the result (non-ASCII bytes therefore raise a UnicodeDecodeError). The same applies when decode() is called on a unicode. Why encode() and decode() were put on basestring instead of on unicode and str respectively is beyond me, but since that is how it is, be careful not to trip over it.
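A small Python 2 sketch of that pitfall:

# coding: UTF-8
s = 'abc'
s2 = s.encode('UTF-8')          # "works": the str decodes silently as ASCII first
print s2 == s, id(s2) == id(s)  # True False -- same content, different object
try:
    '\xe6\xb1\x89'.encode('UTF-8')   # non-ASCII bytes can't pass the implicit ASCII decode
except UnicodeDecodeError as e:
    print e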
2.2. Character encoding declaration
If a source code file contains non-ASCII characters, you need to declare a character encoding at the top of the file, like this:
# -*- coding: UTF-8 -*-
In fact Python only looks for #, coding, and the encoding name; the other characters are purely decorative. Python also understands many character encodings, along with many case-insensitive aliases: UTF-8, for instance, can be written as u8. See the list of standard encodings in the codecs documentation.
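A quick way to see the aliases resolve, via the codec registry:

import codecs
print codecs.lookup('u8').name     # utf-8
print codecs.lookup('UTF8').name   # utf-8 -- names are case-insensitive aliases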
Also note that the declared encoding must match the encoding the file is actually saved in, or you are very likely to get a parsing exception. Today's IDEs usually handle this automatically, re-saving the file in whatever encoding the declaration names, but with plain text editors you need to be careful. :)
2.3. Read and write files
When a file is opened with the built-in open(), read() returns a str, which you must decode() with the correct encoding. For write(), if the argument is a unicode, you need to encode() it with the encoding you want to write; if it is a str in some other encoding, you first decode() it with that str's own encoding into a unicode, and then encode() with the target encoding. If you pass a unicode directly to write(), Python will implicitly encode it first (with the interpreter's default encoding, which is not necessarily what you expect) before writing, so it is safer to encode explicitly.
# coding: UTF-8

f = open('test.txt')
s = f.read()
f.close()
print type(s)   # <type 'str'>

# known to be GBK-encoded, decode it into unicode
u = s.decode('GBK')

f = open('test.txt', 'w')
# encode into a UTF-8 str
s = u.encode('UTF-8')
f.write(s)
f.close()
In addition, the codecs module provides an open() method that takes an encoding and opens the file with it; reads from a file opened this way return unicode. For writes, if the argument is a unicode, it is encoded with the encoding given to open() and then written; if it is a str, it is first implicitly decoded into unicode and then handled as above (the implicit decode uses the interpreter's default encoding, so this only works when that matches the str's actual encoding). Compared with the built-in open(), this method makes encoding problems much less likely.
# coding: GBK

import codecs

f = codecs.open('test.txt', encoding='UTF-8')
u = f.read()
f.close()
print type(u)   # <type 'unicode'>

f = codecs.open('test.txt', 'a', encoding='UTF-8')
# writing a unicode works directly
f.write(u)

# writing a str triggers the automatic decode-then-encode described above
s = '汉'            # GBK-encoded str
print repr(s)       # '\xba\xba'
# the GBK-encoded str is decoded to unicode, then encoded to UTF-8 and written
f.write(s)
f.close()

2.4. Encoding-related methods
The sys and locale modules provide some ways to get the default encodings in the current environment.
import sys
import locale

def p(f):
    print '%s.%s(): %s' % (f.__module__, f.__name__, f())

# returns the default character encoding used by the current system
p(sys.getdefaultencoding)
# returns the encoding used to convert Unicode file names into system file names
p(sys.getfilesystemencoding)
# gets the default locale and returns a tuple (language, encoding)
p(locale.getdefaultlocale)
# returns the text-data encoding set by the user
# the documentation notes that this function only returns a guess
p(locale.getpreferredencoding)

# \xba\xba is the GBK encoding of 汉
# mbcs is not recommended; it is used here only to demonstrate why it shouldn't be
print r"'\xba\xba'.decode('mbcs'):", repr('\xba\xba'.decode('mbcs'))

# Results on the author's Windows (locale set to Chinese (Simplified, China)):
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

3. Recommendations

3.1. Character encoding declaration
Use a character encoding declaration, and make every source file in the same project use the same one.

This must be done.
3.2. Abandon str; use unicode throughout

Typing the u before the opening quotation mark really is hard to get used to at first, and you will often forget and have to go back to add it, but doing so eliminates 90% of encoding problems. If your encoding problems are not severe, you may skip this recommendation.
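As an aside not made in the original recommendation: on Python 2.6+ you can avoid most of the forgotten-u problem with a __future__ import, sketched below.

# coding: UTF-8
from __future__ import unicode_literals   # must be the first statement after comments

s = '汉'        # a unicode object now, even without the u prefix
print type(s)   # <type 'unicode'>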
3.3. Replace the built-in open () with codecs.open ().
If your encoding problems are not severe, you may skip this recommendation.
3.4. Character encodings that absolutely must be avoided: MBCS/DBCS and UTF-16
By MBCS I don't mean you can't use GBK and the like; I mean you must not use the encoding that Python calls 'MBCS', unless the program is never meant to be ported.

In Python, the encodings 'MBCS' and 'DBCS' are synonyms that refer to whatever encoding MBCS denotes in the current Windows environment. Python on Linux has no such encoding, so an exception is guaranteed the moment the code is migrated there! Moreover, which encoding MBCS refers to changes whenever the Windows region setting changes.
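The Linux half of that claim is easy to demonstrate; a minimal Python 2 probe:

import codecs
try:
    codecs.lookup('mbcs')
    print 'mbcs exists (you are on Windows)'
except LookupError as e:
    print e   # unknown encoding: mbcs  (on Linux and friends)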
Running the code from section 2.4 with different regions set:
# Chinese (Simplified, China)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp936')
# locale.getpreferredencoding(): cp936
# '\xba\xba'.decode('mbcs'): u'\u6c49'

# English (USA)
# sys.getdefaultencoding(): UTF-8
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# German (Germany)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp1252')
# locale.getpreferredencoding(): cp1252
# '\xba\xba'.decode('mbcs'): u'\xba\xba'

# Japanese (Japan)
# sys.getdefaultencoding(): gbk
# sys.getfilesystemencoding(): mbcs
# locale.getdefaultlocale(): ('zh_CN', 'cp932')
# locale.getpreferredencoding(): cp932
# '\xba\xba'.decode('mbcs'): u'\uff7a\uff7a'
As you can see, once the region changes, decoding with mbcs produces wrong results. So when we mean GBK, we should write 'GBK' explicitly, not 'MBCS'.
The same goes for UTF-16: although 'UTF-16' is a synonym for 'UTF-16-LE' on most operating systems, writing 'UTF-16-LE' directly costs only three extra characters, and should 'UTF-16' ever mean 'UTF-16-BE' on some operating system the results would be wrong. In practice UTF-16 is used quite rarely, but when you do use it, be careful.
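One more concrete reason to spell out the endianness, sketched below for Python 2 (the first output assumes a little-endian build): the bare 'UTF-16' codec also prepends a BOM when encoding, unlike the explicit forms.

# coding: UTF-8
u = u'汉'
print u.encode('UTF-16').encode('hex')     # fffe496c -- BOM \xff\xfe, then LE bytes
print u.encode('UTF-16-LE').encode('hex')  # 496c -- no BOM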