
How to Understand Character Encoding in Python

2025-01-18 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains how to understand character encoding in Python. Many people run into encoding problems in day-to-day work, so the editor consulted a range of material and put together a simple, practical walkthrough. We hope it helps clear up your doubts about how encoding works in Python. Let's get started!

Question 1: What is the problem?

A question gives us a target: if we study without a question in mind, we cannot grasp the key points.

The environment used in this article is CentOS 6.7 with Python 2.7. Type python in the shell to open the Python interactive interpreter, then enter the following two statements:

S = "China zg"

E = s.encode ("utf-8")

The question now is: does this code work?

The answer is no. The following error is reported:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

Note the 0xe4 mentioned in the error; it is the key clue for analyzing what went wrong.

Many people have run into this error. That brings us to the next question.

Question 2:Why?

To find out why, we might as well carefully analyze the implementation process of these two sentences:

First, we typed Chinese zg into the python command line interpreter through the keyboard and appended it in English double quotation marks, and then assigned a value to the variable s, which looks ordinary, doesn't it? In fact, there is a lot of mystery in it.

When we enter characters into the program through the keyboard, we accomplish this function through the operating system. The Chinese zg we see on the screen is actually a feedback from the operating system to us humans, telling you, "Hey, man, you typed the character Chinese zg in the program."

What is the feedback from the operating system to the program? The answer is 01 string. What does this 01 string look like and how is it generated?

The answer is that the operating system uses its own default coding method to encode the Chinese zg and string the encoded 01 to the program.

The default code of the centos system we use is utf-8, so as long as we know the utf-8 code of each character in Chinese zg, we can know what the 01 string is.
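To check which encodings are in play on a given machine, the standard library can report them. The sketch below is written for Python 3 (an assumption, since the article itself uses Python 2.7), and the variable names are illustrative only. On Python 2.7, sys.getdefaultencoding() returns 'ascii', which is the default this article relies on; on Python 3 it returns 'utf-8'.

```python
import locale
import sys

# Encoding preferred by the current locale (and usually by the terminal),
# for example "UTF-8" on most Linux systems.
locale_encoding = locale.getpreferredencoding(False)

# Default encoding used by str.encode() and bytes.decode() when none is given.
# Python 2.7 uses this same setting for its implicit str/unicode conversions.
default_encoding = sys.getdefaultencoding()

print("locale encoding :", locale_encoding)
print("default encoding:", default_encoding)
```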

Looking them up gives the following encodings (in hexadecimal and binary):

中 国 z g

E4B8AD E59BBD 7A 67

11100100 10111000 10101101 11100101 10011011 10111101 01111010 01100111

Now we know what the bit sequence passed from the operating system to the program looks like. What does the program do with it?

When the program sees that the bit sequence is enclosed in double quotes, it knows the bits represent a string, and it assigns that string to s.

That is the execution logic of the first statement.

Now let's continue with the second statement.

e = s.encode("utf-8") means: encode the string s as UTF-8 and assign the result to e. The problem is that the program only knows that s holds a bit sequence representing a string; it does not know which encoding that sequence uses. The program must know the existing encoding in order to recognize the characters, and only then can it re-encode them with a new encoding such as UTF-8. The operating system handed the program only the bits and never told it which character encoding they were in.

At this point, Python falls back on its own default encoding to interpret the contents of s. In Python 2.7 the default encoding is ASCII, so Python tries to interpret the bit sequence as ASCII, recognize the characters, and then convert them to UTF-8.

The first byte the program encounters is E4 (11100100), and it gets stuck: no such byte exists in ASCII, because the highest bit of every ASCII byte is 0.

What can it do?

Report an error, which is the error we saw above.

The 0xe4 in the error is the first byte of the UTF-8 encoding of the character "中".

Question 3: How?

Now that we know what the problem is, how do we solve it?

Obviously, as long as we tell the program that the bits in s are UTF-8 encoded, the program should work correctly.

But this solution has one drawback: it is not general enough.

Suppose a program reads many text files, each with a different encoding. Would it have to keep track of the encoding of every file it has read? That is tedious.

Worse, if the contents of those files need to be compared or concatenated with each other while their encodings differ, things get even more troublesome.

How does Python solve this problem neatly?

It's very simple: decode!

Decoding means that you have a string and you know its encoding. Decode the string with that encoding and Python recognizes the characters in it, then creates an array of integers that stores the Unicode code point of each character.

Every string is handled this way, so strings from all sources share the same representation while the program is running and can be compared, concatenated, and otherwise combined easily.

Python wraps the integer array mentioned above in an object: the unicode object.

Question 4: How do we get it done?

Next, enter the following two lines on the Python command line:

e = s.decode("utf-8")

isinstance(e, unicode)

The output is True, which shows that the e returned by decode is indeed a unicode object.

Here, unicode is a class built into Python.

e is called a unicode string: it stores the Unicode code point of each character rather than bytes in any particular encoding.

We can then encode e into any encoding we like, for example:

e.encode("utf-8")

e.encode("gbk")

This works as long as the chosen encoding can represent the characters in e; if it cannot, an error is reported.

For example, if you try this:

e.encode("ascii")

Because ASCII cannot represent the Chinese characters, a UnicodeEncodeError is raised.

So far, we have seen two kinds of errors, the decode error and the encode error, and resolved both.

Question 5: How should we evaluate Python's approach to character encoding?

First of all, the approach is very simple. Any text is decoded once when it enters the program and becomes a unicode object, which stores the Unicode code point of each character as an integer. When the text is to be output, just encode it into whatever encoding we need.

One question is whether it wastes space to represent every character with an integer. After all, in ASCII an English character needs only one byte.

It does waste some space, but memory is plentiful nowadays, and this representation is used only inside the program; the string is encoded again when it is written to a file or sent over the network.

Another question: what about strings hard-coded in the program? Do they have to be decoded every time they are used? Different operating systems use different default encodings. On Linux we would usually decode with UTF-8, and on Chinese-language Windows we would usually decode with GBK. Written that way, our code can only run on a specific platform.

Python offers a very simple way out: put a u in front of the string literal and Python performs the decode for us automatically. In the interactive interpreter it uses the terminal's encoding; in a source file it uses the encoding declared at the top of the file.

Question 6: To sum up, what have we learned?

Starting from a common error, this article analyzed encoding problems in Python in detail. We saw how simply Python handles character data, which helps explain why Python is so capable at text processing.

Test question: see if you really understand.

Suppose there is a file a.txt on a Linux system that contains two Chinese characters and is encoded in UTF-8.

Now, write the following statements in a Python program:

import codecs

s = ""

with codecs.open("a.txt", encoding="utf-8") as f:
    s = f.readline().strip()

with open("b.txt", "w") as f:
    f.write(s)

Can this code run successfully? Why?

Answer: no!

Internally, s is a unicode object, so Python has to encode it when writing it out. The default ASCII encoding cannot encode the two Chinese characters, so an error is reported.
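The failure and one possible fix can be reproduced as follows. This sketch uses io.open, which exists in both Python 2.7 and Python 3. Opening b.txt with encoding="ascii" is an assumption that stands in for the implicit ASCII encoding Python 2.7 applies; the temporary directory and the sample content 中国 are also choices made for the example.

```python
import io
import os
import tempfile

workdir = tempfile.mkdtemp()
a_path = os.path.join(workdir, "a.txt")
b_path = os.path.join(workdir, "b.txt")

# Prepare a.txt: two Chinese characters, encoded as UTF-8.
with io.open(a_path, "w", encoding="utf-8") as f:
    f.write(u"中国\n")

# Reading with an explicit encoding returns decoded text.
with io.open(a_path, encoding="utf-8") as f:
    s = f.readline().strip()

# Writing through an ASCII encoder fails, as in the test question.
failed = False
try:
    with io.open(b_path, "w", encoding="ascii") as f:
        f.write(s)
except UnicodeEncodeError:
    failed = True

# Fix: state the output encoding explicitly.
with io.open(b_path, "w", encoding="utf-8") as f:
    f.write(s)

with io.open(b_path, "rb") as f:
    written = f.read()

print(failed, written)
```

In the original Python 2.7 program, the equivalent fix would be to open b.txt with codecs.open and an encoding argument, or to write s.encode("utf-8") instead of s.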

That concludes this look at how to understand encoding in Python. We hope it has cleared up your doubts. Theory works best when paired with practice, so go and try it yourself!
