How to analyze Python unicode encoding problems

2025-07-13 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

Many people with little experience are at a loss when it comes to analyzing Python unicode encoding problems. This article summarizes the causes of the problem and how to solve it, in the hope that it helps you resolve it.

This problem was solved in Python 3.0; the discussion below applies to Python 2.

Here is a good article for understanding the issue:

Why do you get the error "UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)"? This article looks into that question.

Python represents strings internally as unicode, so encoding conversions normally use unicode as the intermediate form: first decode the string from its current encoding into unicode, then encode it from unicode into the target encoding.

decode converts a string in some other encoding into unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 into unicode.

encode converts unicode into a string in some other encoding. For example, str2.encode('gb2312') converts the unicode string str2 into gb2312.

Therefore, when transcoding, first work out which encoding the string str is in, then decode it into unicode, and then encode it into the target encoding.
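A minimal sketch of the decode-then-encode route described above. It is written for Python 3, where the byte string type is `bytes` and the Unicode type is `str` (in Python 2 these are `str` and `unicode`). The helper name `transcode` is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Re-encode a byte string by going through Unicode text."""
    text = data.decode(source_encoding)   # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes


gb_bytes = "中文".encode("gb2312")         # bytes as they would sit in a gb2312 file
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(utf8_bytes)
```

The source encoding has to be known up front; the helper cannot discover it from the bytes.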

The default encoding of a string literal in source code is the same as the encoding of the source file itself.

For example: s = '中文'

If this line is in a utf8 file, the string is utf8-encoded; if it is in a gb2312 file, it is gb2312-encoded. To transcode it, first convert it to unicode with decode, then convert it to the other encoding with encode. When no encoding is specified, a source file is typically created in the system default encoding.

If the string is defined like this: s = u'中文'

then the string is unicode, Python's internal encoding, regardless of the encoding of the source file. To transcode it, just call encode directly with the target encoding.

If a string is already unicode, decoding it raises an error, so you usually need to check whether it is unicode first:

isinstance(s, unicode)  # checks whether s is unicode

Likewise, calling encode on a str that is not unicode raises an error.
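The type check above can be wrapped in a small helper that decodes only when it is handed bytes. This is a sketch for Python 3, where the check is `isinstance(value, str)` rather than `isinstance(s, unicode)`; the name `to_text` and the utf-8 default are assumptions made for illustration.

```python
def to_text(value, encoding="utf-8"):
    """Return Unicode text, decoding only when given a byte string."""
    if isinstance(value, str):      # already Unicode text (Python 2: unicode)
        return value
    return value.decode(encoding)   # byte string (Python 2: str)


print(to_text("中文"))
print(to_text(b"\xd6\xd0\xce\xc4", "gb2312"))
```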

How do you get the system default encoding?

#!/usr/bin/env python
# coding=utf-8
import sys
print sys.getdefaultencoding()

On English Windows XP, this program prints: ascii
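For comparison, the same query under Python 3, shown as a short sketch. There the default encoding is utf-8, and the encoding that print uses for standard output is reported separately by `sys.stdout.encoding`, whose value depends on the environment.

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)        # Python 3 reports utf-8 here
print(sys.stdout.encoding)     # encoding used when print writes text to stdout
```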

In some IDEs, string output is always garbled or even raises an error. The cause is that the IDE's output console cannot display the string's encoding, not the program itself.

For example, run the following code in UliPad:

s = u"中文"
print s

It reports: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). This is because UliPad's console output window on English Windows XP outputs text as ascii (the default encoding of the English system is ascii), while the string in the code above is unicode, so the output fails.

Change the last line to: print s.encode('gb2312')

and the characters "中文" are output correctly.

If the last line is changed to: print s.encode('utf8')

the output is: \xe4\xb8\xad\xe6\x96\x87, which is what results when the console output window displays a utf8-encoded string as ascii.
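The error and the two encodings can be reproduced without UliPad by encoding explicitly. The sketch below is written for Python 3 and only imitates the implicit ascii step that the console performs; it is not a reconstruction of the IDE's behavior.

```python
s = "中文"

try:
    s.encode("ascii")              # what an ascii-only console effectively attempts
except UnicodeEncodeError as exc:
    message = str(exc)
    print(message)

print(s.encode("gb2312"))          # b'\xd6\xd0\xce\xc4'
print(s.encode("utf-8"))           # b'\xe4\xb8\xad\xe6\x96\x87'
```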

unicode(str, 'gb2312') is the same as str.decode('gb2312'): both convert the gb2312-encoded str into unicode.

Use str.__class__ to check whether a string is of type str or unicode.

groups.google.com/group/python-cn/browse_thread/thread/be4e4e0d4c3272dd

Python is a language in which encoding problems come up easily, so I am writing down the following based on my own understanding.

= First, a few concepts to understand =

* byte: a unit of computer data, 8 binary bits. It can represent an unsigned integer from 0 to 255. Below, "byte stream" means a sequence of bytes.

* character: an English character such as "abc", or the Chinese characters for "you, me and him". A character by itself says nothing about how it is stored in a computer. Below, we avoid the word "string" and use "text" to mean a sequence of characters.

* encode (verb): convert "text" into a "byte stream" according to some rule (the rule is called an encoding (noun)). (In Python: unicode becomes str.)

* decode (verb): convert a "byte stream" into "text" according to some rule. (In Python: str becomes unicode.)

** In fact, anything represented in a computer has to be encoded. For example, a video is encoded when it is saved to a file and must be decoded before it can be watched.

* unicode: defines the correspondence between a "character" and a "number", but does not specify how the "number" is stored in the computer. (Just as in C an integer can be an int or a short, unicode does not say whether an int or a short is used to represent a "character".)

* utf8: an implementation of unicode. It uses the "character" to "number" mapping defined by unicode and additionally specifies how the number is stored in the computer. utf16 and the others are likewise implementations of unicode.

* gbk: an "encoding" like utf8. But instead of the "character" to "number" mapping defined by unicode, it uses a different mapping of its own. It also defines how the number is stored in the computer.
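The distinction between the "number" and its stored bytes can be seen directly. The following sketch (Python 3) prints the number unicode assigns to one character and the bytes that three encodings produce for it; the choice of character is only an illustration.

```python
ch = "中"
code_point = ord(ch)                 # the number unicode assigns to the character
print(hex(code_point))               # 0x4e2d

# The same character stored under three different encodings.
stored = {name: ch.encode(name).hex() for name in ("utf-8", "utf-16-be", "gbk")}
for name, hex_bytes in stored.items():
    print(name, hex_bytes)
```

utf-8 and utf-16-be both start from the number 0x4e2d and differ only in how they store it, while gbk starts from its own mapping and produces unrelated bytes.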

= The encode and decode methods in Python =

First, know that encode converts unicode to str, and decode converts str to unicode.

Below, u stands for a variable of type unicode and s for a variable of type str.

u.encode('...') basically always succeeds, as long as you give a correct encoding (one that can represent the text). It is like the way any file can be compressed into a zip file.

s.decode('...') often fails, because which "encoding" a str is in depends on context, and when decoding you must be sure how s was encoded. It is like opening a zip file: you must be sure it really is a zip file, not just a file with a forged .zip extension.

u.decode() and s.encode() are not recommended. s.encode() is equivalent to s.decode().encode(): s is first converted to unicode using the default encoding (usually ascii) and then encoded.
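A short sketch of the asymmetry described above, in Python 3 terms (`u` is a `str`, `s` is `bytes`): encoding with a suitable codec succeeds, while decoding with the wrong codec fails.

```python
u = "中文"
s = u.encode("gbk")            # encoding text with a codec that covers it succeeds

try:
    s.decode("utf-8")          # wrong assumption about how s was encoded
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(decoded_ok)
print(s.decode("gbk") == u)    # the right codec round-trips
```

Python 3 also removes the two calls the text advises against: `str` has no decode method and `bytes` has no encode method.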

= About # coding=utf8 =

When you write this line at the top of a .py file and actually save the file in that encoding, the line does the following:

1. It lets the lexical analyzer work properly, so Chinese in comments is not reported as an error.

2. For a literal such as u"中文", Python knows that the content between the quotation marks is utf8-encoded and can convert it to unicode correctly.

3. For a literal such as "中文", you know that the content is utf8-encoded, so you can convert it correctly to other encodings or to unicode.

This is not finished. I have typed this much for now and will add more later; this is not a wiki, so editing is a hassle.

= Python encoding and the Windows console =

I have noticed that many beginners run into errors with the print statement, which involves console output. I do not know Linux, so I will only talk about the Windows console.

First, the Windows console really is unicode (utf16_le encoded); more precisely, it outputs text in units of characters.

However, a program's output can be redirected to a file, and the unit of a file is the "byte".

So for functions such as the C runtime's printf, the output must go through an encoding that converts text into bytes. Probably for compatibility with Windows 95/98, this is not a unicode encoding but mbcs (not gbk or the like).

Windows's mbcs, also known as ansi, maps to a different encoding in each language version of Windows; on Chinese Windows it is the gb family of encodings.

As a result, the same text is not compatible across Windows versions in different languages.

So now we know: to output text to the Windows console, it has to be encoded as "mbcs".

For a Python unicode variable, print uses the encoding returned by sys.getfilesystemencoding() to convert it to str.

If it is a utf8-encoded str variable, you need: print s.decode('utf8').encode('mbcs')
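A sketch of what happens when text is printed to a byte-oriented output: the stream's encoding turns the text into bytes. It is written for Python 3 and uses an in-memory stream in place of a real console. `print_to_stream` is a hypothetical helper, and gbk stands in for mbcs, which as a Python codec is available only on Windows.

```python
import io


def print_to_stream(text, encoding):
    """Print text to an in-memory byte stream and return the bytes written."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


gbk_output = print_to_stream("中文", "gbk")
print(gbk_output)

try:
    print_to_stream("中文", "ascii")   # an ascii-only output cannot take the text
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```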

Finally, for str variables: the content read from a file object and the network content fetched with urllib are all in the form of "bytes".

If they really are "text", for example if you want to print them to have a look, then you must know their encoding and decode them into unicode.

How to know their encoding:

1. Agree on it in advance. (For example, you saved the text file yourself as utf8.)

2. By protocol. (The # coding=utf8 on the first line of a Python file, the declaration in html, and so on.)

3. Guess.
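The third option, guessing, can be sketched as trying candidate encodings in order until one decodes without error. This is an illustration in Python 3; the function name `guess_decode` and the candidate list are assumptions, not something the original post prescribes.

```python
def guess_decode(data, candidates=("utf-8", "gb18030", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


print(guess_decode("中文".encode("utf-8")))
print(guess_decode("中文".encode("gbk")))
```

A successful decode does not prove the guess was right: latin-1 accepts any byte sequence, so it only makes sense as a last resort, and the order of the candidates matters.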

This is very good, but there is one part I do not quite understand.

> convert text into a byte stream. (in python: unicode becomes str)

"Finally, for str variables: the content read from a file object and the network content fetched with urllib are all in the form of "bytes"."

Although a file or web page is text, it is encoded into bytes when saved or transmitted, so a file opened with "rb" and a stream read from a socket are byte-based.

"If they really are "text", for example if you want to print them to have a look, then you must know their encoding and decode them into unicode."

The "text" in quotation marks here is actually a byte stream (bytes), not real text (unicode). It only means that we know it can be decoded into text.

When decoding by convention, you can read the encoding directly from the specified place (the BOM, the coding declaration of a Python file, or the meta tag of a web page) and then decode correctly.
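Reading the encoding from a BOM can be sketched with the BOM constants in the standard `codecs` module. The helper name `sniff_bom` is hypothetical, and the sketch covers only the BOM case, not coding declarations or meta tags.

```python
import codecs

# Longer BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM bytes.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return (encoding, payload_without_bom), or (None, data) if no BOM."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


encoding, payload = sniff_bom(codecs.BOM_UTF8 + "中文".encode("utf-8"))
print(encoding, payload.decode(encoding))
```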

However, many files and web pages declare one encoding while the content is actually stored in another (for example, a .py file declares coding=utf8, but you can still save it as ansi, Notepad's default encoding). In that case you have to guess the real encoding.

The decoded text exists only inside the running program. To print it, save it, or send it to a database or the network, you need another encoding step. This encoding has nothing to do with the encoding used above and is your own choice. The choice is not arbitrary, though: if the encoded bytes are passed on to other people or environments and your encoding does not follow the agreed convention, you cause trouble for the next person or environment, and the problem repeats from there.

> There is one point that is very easy to misunderstand:

> Most people think that Unicode (in the broad sense) unifies encoding, but it does not. Unicode is not a single encoding but a general term for many encodings. Under Windows, however, Unicode (in the narrow sense) generally refers to UCS2, that is, UTF-16/LE

Unicode is unique as a character set (ucs), but it has many encoding schemes (utf).

It is important to distinguish the concepts of characters and bytes. Java has always done so, Python is starting to, and Ruby seems to be in a mess.

Let me add a few words as well. I have studied encoding fairly deeply. Because garbled text came up often at work, in 2005 I researched encoding specifically and published an article in the company's internal publication. It eventually became training material that is taught to the company's new employees every year, so garbled-text problems in projects can be located and solved quickly.

In theory, going from a character to a concrete encoding passes through the following concepts:

Character set (abstract character repertoire)

Coded character set (coded character set)

Character encoding (character encoding form)

Character encoding scheme (character encoding scheme)

Character set: simply a collection of abstract characters, such as all Chinese characters. The definition of a character set is abstract and independent of the computer.

Coded character set: a mapping from a subset of the integers to the abstract elements of a character set, that is, a numbering of the abstract characters. For the characters defined in gb2312, each character has an integer that corresponds to it. An integer corresponds to only one character, but the reverse does not necessarily hold. The mapping meant here is a mapping in the mathematical sense. A coded character set is also independent of the computer. The unicode character set is at this level.

Character encoding: this is where the computer comes in. It is the concrete representation in a computer of the code points of a coded character set. Put simply, it is how the integers corresponding to characters are placed in computer memory, files, or the network. Different people have come up with different ways to do this, which is why there are so many encodings. gb2312, utf-8, utf-16, utf-32 and the rest are all at this layer.

Character encoding scheme: this is tied even more closely to the computer and to the operating system. Its main purpose is to solve the problem of big-endian versus little-endian byte order. For UTF-16 and UTF-32, Unicode supports both big-endian and little-endian encoding schemes.
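The byte-order question at the encoding-scheme layer shows up directly in Python's codec names. The sketch below (Python 3) encodes one character with explicit big-endian and little-endian schemes, and then with the plain "utf-16" codec, which prepends a BOM so a reader can tell which order was used.

```python
import codecs

ch = "中"  # code point U+4E2D
print(ch.encode("utf-16-be").hex())  # 4e2d
print(ch.encode("utf-16-le").hex())  # 2d4e
print(ch.encode("utf-32-be").hex())  # 00004e2d

# Without an explicit byte order, the codec writes a BOM first.
with_bom = ch.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
```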

Generally speaking, what we call encoding refers to the third layer. In a concrete software system, things get very complicated.

Between the browser, Apache, Tomcat (including JSP encoding, compilation, and file reading inside Tomcat), and the database, encoding inconsistencies can occur wherever data is exchanged. If data is not decoded and encoded correctly when it is read, garbled text is common.
