What are the knowledge points of Python character coding

Updated 2025-04-11. Source: SLTechnology News&Howtos (Shulou.com).

This article summarizes the main points of character encoding in Python: Unicode and the GB encodings, detecting an encoding, converting between encodings, and the encodings used by the command line, by source files and by the interpreter. The code samples use Python 2 syntax (print statements, the unicode type, urllib2).

1. Character encoding

[the so-called Unicode]

Unicode is an abstract character set: it assigns every character a number (its code point) but does not define how that number is stored as bytes. In other words, it is only an internal representation and cannot be saved directly, so a storage form (an encoding) such as UTF-8 or UTF-16 must be chosen. In theory, Unicode can accommodate every language and character in the world. (Other encoding formats are not covered in detail here.)
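As a rough illustration of the difference between a code point and its stored bytes, the following sketch prints the code point of one character and its bytes under three encodings. It is written for Python 3 (an assumption; the rest of this article uses Python 2 syntax), and the variable names are only illustrative.

```python
# Python 3 sketch: one code point, several byte representations.
ch = "中"

code_point = ord(ch)                   # the abstract Unicode number
utf8_bytes = ch.encode("utf-8")        # three bytes
utf16_bytes = ch.encode("utf-16-be")   # two bytes, big-endian, no BOM
gbk_bytes = ch.encode("gbk")           # two bytes

print(hex(code_point))   # 0x4e2d
print(utf8_bytes)        # b'\xe4\xb8\xad'
print(utf16_bytes)       # b'N-'  (bytes 0x4e 0x2d)
print(gbk_bytes)         # b'\xd6\xd0'
```

The same character therefore has one code point but a different byte sequence in each storage form.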

[the so-called GB encodings]

GB stands for Guobiao ("national standard"), the national standard of the People's Republic of China. The GB encodings target Chinese characters and include GB2312 (GB2312-80), GBK and GB18030, in order of increasing coverage; each is basically backward compatible with the previous one. You will also often meet an encoding called CP936, which can be treated as roughly equivalent to GBK.
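The increasing coverage can be checked with a small sketch. This is Python 3 code under stated assumptions: can_encode is a hypothetical helper, and the three sample characters were picked for illustration only.

```python
import codecs

common = "中"             # present in GB2312, GBK and GB18030
gbk_only = "镕"           # present in GBK and GB18030, absent from GB2312
emoji = "\U0001F600"      # of the three, only GB18030 can encode it


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for name in ("gb2312", "gbk", "gb18030"):
    print(name, [can_encode(t, name) for t in (common, gbk_only, emoji)])

# Python treats cp936 as an alias of its gbk codec.
print(codecs.lookup("cp936").name)
```

A character that encodes under a smaller member of the family produces the same bytes under the larger ones, which is what "backward compatible" means here.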

[detecting the encoding]

1. Use isinstance(s, str) to check whether s is an ordinary string. In Python 2, str is a byte string, so text encoded as ASCII, UTF-8, UTF-16, GB2312, GBK and so on is all of type str.

Use isinstance(s, unicode) to check whether s is a Unicode string, that is, an object of type unicode.

2. Use type() or .__class__

Assuming the source file's encoding is declared correctly:

For example, with stra = "中", type(stra) is str.

For example, with strb = u"中", type(strb) is unicode.

    tmp_str = 'tmp_str'
    print tmp_str.__class__       # <type 'str'>
    print type(tmp_str)           # <type 'str'>
    print type(tmp_str).__name__  # str

    tmp_str = u'tmp_str'
    print tmp_str.__class__       # <type 'unicode'>
    print type(tmp_str)           # <type 'unicode'>
    print type(tmp_str).__name__  # unicode
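For readers on Python 3, where the two types are named differently (str is text and bytes is encoded data), the same check might look like the sketch below. The kind function is a hypothetical helper introduced only for this example.

```python
def kind(value):
    """Classify a value as decoded text or encoded bytes (Python 3 types)."""
    if isinstance(value, str):
        return "text"      # corresponds to unicode in Python 2
    if isinstance(value, (bytes, bytearray)):
        return "bytes"     # corresponds to str in Python 2
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(kind("中"))                   # text
print(kind("中".encode("utf-8")))   # bytes
```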

3. The most reliable way is to detect the encoding with chardet, especially for web-related work such as crawling HTML pages. The charset declared in a page is only a label: it is sometimes wrong, and some Chinese characters in the content may fall outside the declared encoding. In those cases detection with chardet is the most convenient and accurate option.

(1) Installation: download chardet, place the extracted chardet folder in the \Lib\site-packages directory of the Python installation, and use import chardet in your program. (chardet can also be installed with pip install chardet.)

(2) Method 1: detect the encoding from the entire content

    import urllib2
    import chardet

    res = urllib2.urlopen('http://www.jb51.net')
    res_cont = res.read()   # read the whole response body
    res.close()
    print chardet.detect(res_cont)
    # {'confidence': 0.99, 'encoding': 'utf-8'}

The detect function returns a dictionary containing two key-value pairs: the confidence of the detection and the detected encoding.
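When chardet is not available, a much cruder fallback is to try a short list of candidate encodings in order. The sketch below is not how chardet works and is far less accurate; guess_encoding and its default candidate list are hypothetical names chosen for this example, and the code targets Python 3.

```python
def guess_encoding(data, candidates=("utf-8", "gbk")):
    """Return the first candidate that decodes data without error, else None.

    Order matters: permissive codecs accept many byte sequences, so the
    strictest candidate should come first.
    """
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


print(guess_encoding("中文".encode("utf-8")))   # utf-8
print(guess_encoding("中文".encode("gbk")))     # gbk
```

Unlike chardet, this gives no confidence value, so it is only reasonable when the set of possible encodings is small and known in advance.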

(3) Method 2: detect the encoding from only part of the content, which is faster

    import urllib2
    from chardet.universaldetector import UniversalDetector

    res = urllib2.urlopen('http://www.jb51.net')
    detector = UniversalDetector()
    for line in res.readlines():
        # feed lines until the detector reaches its confidence threshold
        detector.feed(line)
        if detector.done:
            break
    detector.close()
    res.close()
    print detector.result
    # {'confidence': 0.99, 'encoding': 'utf-8'}

[transcoding]

1. To convert from a specific encoding (ASCII, ISO-8859-1, UTF-8, UTF-16, GBK, GB2312, etc.) to unicode, use unicode(s, charset) or s.decode(charset), where charset is the encoding of s. (Note that calling decode() on an object that is already unicode can fail: Python 2 first encodes it implicitly with the default encoding, normally ascii, which raises an error for non-ASCII characters.)

    # convert any string to unicode
    def to_unicode(s, encoding):
        if isinstance(s, unicode):
            return s          # already unicode, nothing to decode
        return unicode(s, encoding)

Note: decode() raises an error if it meets illegal byte sequences, such as the nonstandard full-width space \xa3\xa0 or \xa4\x57 (the real full-width space is \xa1\xa1).

Solution: use the 'ignore' mode, that is, stra.decode('...', 'ignore').encode('utf-8').

Explanation: the prototype of decode is decode([encoding], [errors='strict']); the second parameter controls the error-handling strategy.

The default, strict, raises an exception when illegal data is encountered. With ignore, the illegal data is skipped. With replace, it is replaced by a marker (U+FFFD when decoding, ? when encoding). xmlcharrefreplace substitutes an XML character reference and is available only when encoding.
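The four strategies can be compared side by side. This is a Python 3 sketch using UTF-8 and ASCII rather than GBK, purely because the invalid input is easy to construct; the variable names are illustrative.

```python
# A valid UTF-8 character followed by 0xff, a byte that never occurs in UTF-8.
bad_utf8 = "中".encode("utf-8") + b"\xff"

try:
    bad_utf8.decode("utf-8")             # errors='strict' is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = bad_utf8.decode("utf-8", "ignore")     # bad byte dropped
replaced = bad_utf8.decode("utf-8", "replace")   # bad byte becomes U+FFFD

# When encoding, 'replace' emits '?' and 'xmlcharrefreplace' emits &#NNNN;
ascii_q = "中".encode("ascii", "replace")
ascii_xml = "中".encode("ascii", "xmlcharrefreplace")

print(strict_failed)    # True
print(repr(ignored))    # '中'
print(repr(replaced))   # '中\ufffd'
print(ascii_q)          # b'?'
print(ascii_xml)        # b'&#20013;'
```

Note that ignore silently loses data, so replace is usually the safer choice when the result will be shown to a person.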

2. To convert from unicode to a specific encoding, use s.encode(charset), where s is a unicode object and charset is the target encoding. (Note that calling encode() on a non-unicode str can fail: Python 2 first decodes it implicitly with the default encoding, which raises an error for non-ASCII bytes.)

3. Naturally, to convert from one specific encoding to another, first decode to unicode and then encode to the target encoding.
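The decode-then-encode route from point 3 might be wrapped as follows. This is a minimal Python 3 sketch; transcode is a hypothetical helper name, not a standard library function.

```python
def transcode(data, src, dst, errors="strict"):
    """Convert bytes from encoding src to encoding dst via Unicode text."""
    text = data.decode(src, errors)    # specific encoding -> Unicode
    return text.encode(dst, errors)    # Unicode -> target encoding


gbk_data = "中文".encode("gbk")
utf8_data = transcode(gbk_data, "gbk", "utf-8")

print(len(gbk_data), len(utf8_data))   # 4 6
print(utf8_data.decode("utf-8"))       # 中文
```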

[Python command-line encoding (system encoding)]

Use Python's built-in locale module to detect the command line's default encoding (that is, the system encoding) and to set it:

    import locale

    # get the default locale and its encoding
    print locale.getdefaultlocale()   # ('zh_CN', 'cp936')

    # set the locale
    locale.setlocale(locale.LC_ALL, locale='zh_CN.GB2312')
    print locale.getlocale()          # ('zh_CN', 'gb2312')

This shows that the system's internal encoding is cp936, which is approximately GBK. In fact, the Chinese editions of Windows XP and Windows 7 both use cp936 (GBK) internally.
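On Python 3, one way to query the same information is locale.getpreferredencoding(). The sketch below only reads the value; what it prints depends entirely on the machine it runs on, so no output is shown.

```python
import codecs
import locale

# Encoding the platform prefers for text data; False means "do not call
# setlocale() as a side effect".
preferred = locale.getpreferredencoding(False)

# Normalize the platform's spelling to Python's canonical codec name
# (for example, cp936 maps to gbk).
canonical = codecs.lookup(preferred).name

print(preferred, canonical)
```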

[encoding in Python code]

1. If a string literal in Python code has no prefix, its encoding is that of the source file itself. For example, the string str = '中文' is UTF-8 encoded in a UTF-8 source file and GB2312 encoded in a GB2312 source file. So how do you know the encoding of the source file?

(1) Declare it yourself: add "# -*- coding: utf-8 -*-" to the top of the source file to declare that the file is UTF-8 encoded. Unprefixed string literals are then UTF-8 encoded.

(2) When no encoding is declared, the file is created with the default encoding (nominally ASCII as far as Python is concerned; on Chinese Windows the file is actually saved as cp936 (GBK)). Python's default encoding can be read and set with sys.getdefaultencoding() and sys.setdefaultencoding('...').

    import sys
    reload(sys)   # restores sys.setdefaultencoding(), which site.py removes
    print sys.getdefaultencoding()   # ascii
    sys.setdefaultencoding('utf-8')
    print sys.getdefaultencoding()   # utf-8

Combining (1) and (2) in an experiment: when the source file encoding is declared as utf-8, Notepad++ shows the file as "UTF-8 without BOM"; when no encoding is declared, Notepad++ shows it as ANSI (the default encoding used when saving).

(3) How do you permanently set Python's default encoding to utf-8? There are two ways:

Method 1: edit site.py and modify the setencoding() function to force utf-8.

Method 2: add a file named sitecustomize.py to the \Lib\site-packages directory under the installation directory and call sys.setdefaultencoding() there.

sitecustomize.py is imported by site.py during startup. site.py deletes sys.setdefaultencoding() only at its very end, so the function is still available while sitecustomize.py runs.

2. If a string literal in Python code carries an encoding prefix, for example str = u'中文', the string is unicode (that is, Python's internal encoding).

(1) There is a common pitfall here. Suppose a .py file contains the following code:

    stra = u"中"
    print stra.encode("gbk")

Since stra is already unicode, encoding it directly to GBK should be fine, right? Yet actual execution may report "UnicodeEncodeError: 'gbk' codec can't encode character u'\xd6' in position 0: illegal multibyte sequence".

The reason is that when the interpreter loads a source file, it first checks the header for an encoding declaration (for example, # coding: gbk). If a declaration is found, unicode literals in the file are decoded with that encoding: here the bytes d6 d0 in the GBK-encoded file are decoded to the character 中 (U+4E2D), and stra.encode('gbk') then succeeds and yields d6 d0 again. If there is no declaration, that decoding step does not happen: each byte is taken as a code point of its own, so stra becomes u'\xd6\xd0', and stra.encode('gbk') fails because U+00D6 is outside the range GBK can encode. (Python 2.5 and later go further and refuse to load a source file that contains non-ASCII bytes without a declaration, raising a SyntaxError.)

(2) To avoid this kind of error, it is best to declare the encoding at the top of the source file, or to call setdefaultencoding() wherever the problem occurs.

(3) In general, unicode is the internal encoding of the Python interpreter. When source files are imported and executed, the interpreter first decodes strings to unicode using the encoding you specify and only then operates on them. It is therefore best to use unicode for string manipulation, regular expressions, file reading and writing, and so on.
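The failure described in item (1) can be reproduced without touching a source file by decoding the same two bytes in two ways. This is a Python 3 sketch; decoding as latin-1 is used here as a stand-in for "each byte becomes its own code point", which is an assumption about how the undeclared case behaves.

```python
raw = "中".encode("gbk")         # the bytes a GBK-saved file holds: d6 d0

right = raw.decode("gbk")        # what a correct declaration produces
wrong = raw.decode("latin-1")    # one code point per byte: U+00D6, U+00D0

print(right.encode("gbk") == raw)        # True: round trip succeeds
print([hex(ord(c)) for c in wrong])      # ['0xd6', '0xd0']

try:
    wrong.encode("gbk")
    encode_failed = False
except UnicodeEncodeError as exc:
    encode_failed = True
    print(exc.encoding, exc.start)       # gbk 0
```

The error is reported at position 0 because U+00D6 has no GBK representation, which matches the message quoted in item (1).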

[other encodings in Python]

File system encoding: sys.getfilesystemencoding()

Terminal input encoding: sys.stdin.encoding

Terminal output encoding: sys.stdout.encoding
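These values can be collected in one place for debugging. The following is a Python 3 sketch; the encodings dictionary is an illustrative name, and getattr is used because stdin or stdout may be replaced by objects without an encoding attribute (for example, when output is captured).

```python
import sys

encodings = {
    "default": sys.getdefaultencoding(),        # always 'utf-8' on Python 3
    "filesystem": sys.getfilesystemencoding(),
    "stdin": getattr(sys.stdin, "encoding", None),
    "stdout": getattr(sys.stdout, "encoding", None),
}

for name, value in encodings.items():
    print(name, value)
```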
