What is the parsing process of the Python parser

2025-01-18 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/02

This article walks through how the Python parser handles source-file encodings and string literals, a process that many people find hard to follow. The behavior described here is that of Python 2, where normal (byte) strings and Unicode strings are two separate types.

First, the overall process. We write the source code in an editor and save it to a file. If the source contains an encoding declaration and the editor understands that syntax, the file is saved to disk in the declared encoding.

Note: the encoding declaration does not necessarily match the actual encoding of the source file. You can declare UTF-8 in the encoding declaration and still save the file as GB2312. Nobody would make this mistake on purpose, and a good IDE can force the two to agree, but it can happen by accident when writing code in editors such as Notepad or EditPlus.
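To make the mismatch concrete, here is a minimal sketch written for Python 3, where bytes stands in for the raw file on disk. The sample source line is an assumption made up for illustration, not code from the article. It saves text as GB2312 and then tries to decode it as UTF-8, which is what a declaration of UTF-8 would ask for.

```python
# Hypothetical one-line source file containing Chinese text.
source_text = "s = '中国'\n"

# The editor saves the file as GB2312 ...
saved_bytes = source_text.encode("gb2312")

# ... but the encoding declaration claims UTF-8, so a reader that
# trusts the declaration tries to decode the bytes as UTF-8.
try:
    saved_bytes.decode("utf-8")
    mismatch_detected = False
except UnicodeDecodeError:
    mismatch_detected = True

print(mismatch_detected)
```

The GB2312 bytes for these characters are not valid UTF-8, so the decode fails instead of silently producing wrong text. Other byte sequences may decode without an error and yield garbage, which is harder to notice.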

Once we have a .py file we can run it, that is, hand the code to the Python parser. When the parser reads the file, it first parses the encoding declaration. Assuming the declared encoding is gb2312, the parser decodes the file contents from gb2312 to Unicode and then encodes that Unicode text as a UTF-8 byte string.
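The transcoding step can be sketched as follows. This is an illustration in Python 3, not the parser's actual implementation: the helper names declared_encoding and to_utf8 are hypothetical, the regular expression is modeled on the declaration pattern in PEP 263, and the fallback to ASCII mirrors the Python 2 default.

```python
import re

# Pattern for an encoding declaration in a comment, modeled on PEP 263.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

def declared_encoding(raw, default="ascii"):
    """Return the encoding declared in the first two lines, else default."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default

def to_utf8(raw):
    """Decode raw file bytes with the declared encoding, re-encode as UTF-8."""
    return raw.decode(declared_encoding(raw)).encode("utf-8")

# Hypothetical file: declared as gb2312 and actually saved as gb2312.
raw_file = "# -*- coding: gb2312 -*-\ns = '中国'\n".encode("gb2312")
utf8_source = to_utf8(raw_file)
print(declared_encoding(raw_file))
```

After this step every later stage only has to deal with UTF-8 bytes, whatever encoding the file was saved in.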

(Note: this step is a pure transcoding of the script's source text.) After this step, the parser tokenizes and parses the UTF-8 byte string. When it encounters a Unicode string literal (one written with the u prefix, for example a u'' literal containing Chinese text), it creates a Unicode string object from the corresponding UTF-8 bytes.

When the program uses a normal (non-Unicode) string literal, the parser first converts the UTF-8 bytes, by way of Unicode, back into a byte string in the file's declared encoding (gb2312 in this example) and uses that to create a normal string object. In other words, Unicode strings and normal strings are not stored in memory in the same form: the former is held as Unicode text decoded from the UTF-8 data, the latter as bytes in the source file's encoding (GB2312 here).
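The two kinds of literal can be imitated in Python 3, where str plays the role of the Python 2 Unicode string and bytes plays the role of the normal string. This is a sketch under that assumption; FILE_ENCODING and the variable names are hypothetical.

```python
FILE_ENCODING = "gb2312"  # assumed to come from the encoding declaration

# The literal's text as it appears in the UTF-8 transcoded source.
utf8_literal = "中国".encode("utf-8")

# Unicode literal: the object is built directly from the UTF-8 bytes.
unicode_obj = utf8_literal.decode("utf-8")

# Normal literal: UTF-8 -> Unicode -> bytes in the file's encoding.
byte_str_obj = utf8_literal.decode("utf-8").encode(FILE_ENCODING)

print(type(unicode_obj).__name__, type(byte_str_obj).__name__)
```

The same two characters end up as two different objects: one is text, the other is a sequence of GB2312 bytes that only makes sense to a reader who knows the encoding.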

Now that we know how strings are stored in memory, we need to understand how print works. print is only responsible for handing the corresponding bytes in memory to the operating system, so that a program of the operating system (such as the cmd window) can display them. There are two cases:

1. If the string is a normal string, print simply pushes the corresponding byte string in memory to the operating system, as in code 1 in the example.

2. If the string is a Unicode string, print must encode it before pushing. We can encode it explicitly by calling the Unicode string's encode method with a suitable encoding (code 2 in the example).

Otherwise, Python encodes with the default encoding, which is ASCII (code 3 in the example). ASCII cannot encode Chinese characters, so Python reports an error. This explains the first and third of the three questions above. As for the second question: Python has two kinds of strings, normal strings and Unicode strings, and each has its own way of handling characters.
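The two cases and the ASCII fallback can be modeled with a small helper. This is a Python 3 sketch, not how print is implemented: emit is a hypothetical function, an in-memory io.BytesIO stands in for the console, and the ASCII default is hard-coded to match the behavior described above.

```python
import io

def emit(value, console, encoding=None):
    """Push value to console as bytes, following the two cases above."""
    if isinstance(value, bytes):
        # Case 1: a normal string is already bytes, pass it through.
        console.write(value)
    else:
        # Case 2: a Unicode string must be encoded first; with no
        # explicit encoding this falls back to ASCII.
        console.write(value.encode(encoding or "ascii"))

console = io.BytesIO()
emit("中国".encode("gb2312"), console)    # normal string
emit("中国", console, encoding="gb2312")  # Unicode string, explicit encode

try:
    emit("中国", console)                 # Unicode string, ASCII fallback
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(type(ascii_error).__name__)
```

The first two calls put identical bytes on the console. The third fails before anything is written, because ASCII has no representation for the characters.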

For a normal string, the methods work in units of bytes, and in GB2312 each Chinese character occupies two bytes, so the result is 5. For a Unicode string, every character is treated uniformly as one unit, so the result is the number of characters rather than the number of bytes.
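The difference between counting bytes and counting characters can be checked directly. The sample string below is hypothetical and is not the one used in the article's second question. As before, this is Python 3, with bytes standing in for the normal string.

```python
def lengths(text, encoding="gb2312"):
    """Return (byte length in the given encoding, character length)."""
    return len(text.encode(encoding)), len(text)

byte_len, char_len = lengths("中国")
print(byte_len, char_len)
```

Each of the two Chinese characters takes two bytes in GB2312, so the byte count is double the character count for this sample.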

Although only console programs are discussed above, the Chinese-text problems in file reading and writing and in network transmission are similar in principle. Unicode solves much of the software internationalization problem, and Python has excellent Unicode support. I therefore suggest using Unicode throughout when writing Python programs.

Use the UTF-8 encoding when saving the file. The article How to Use UTF-8 with Python gives a detailed description that you can refer to. There are still many places in Python where Chinese-text problems can arise, such as reading and writing files and transmitting data over the network. I hope we can discuss and solve these problems together.
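For file reading and writing, one way to follow this advice is to keep text as Unicode in the program and name the encoding explicitly at the file boundary. The sketch below assumes io.open from the standard library and a temporary file whose name, sample.txt, is made up for the example.

```python
import io
import os
import tempfile

text = "中国"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Encode to UTF-8 only when writing to disk.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Decode from UTF-8 when reading back.
with io.open(path, "r", encoding="utf-8") as handle:
    round_trip = handle.read()

# The bytes actually stored on disk.
with io.open(path, "rb") as handle:
    raw = handle.read()

print(round_trip == text, len(raw))
```

Stating the encoding at every read and write keeps the result independent of the platform's default encoding.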

To review the whole process: first, write the source code in an editor and save it to a file. If the source contains an encoding declaration and the editor understands that syntax, the file is saved to disk in the declared encoding. The declaration does not necessarily match the file's actual encoding: you can declare UTF-8 and still save the file as GB2312.

That is self-inflicted trouble, and a good IDE should keep the two consistent, but it can happen if you write code in an editor such as Notepad or EditPlus. Once you have a .py file, you can run it, that is, hand the code to the Python parser. When the parser reads the file, it first parses the encoding declaration; assume the declared encoding is gb2312.

The parser then decodes the file contents from gb2312 to Unicode and encodes that Unicode text as a UTF-8 byte string. After this step, it tokenizes and parses the UTF-8 byte string. When it encounters a Unicode string literal, it creates a Unicode string object from the corresponding UTF-8 bytes.

When the program uses a normal string literal, the parser first converts the UTF-8 bytes, by way of Unicode, into a byte string in the file's encoding (gb2312 in this case) and uses it to create a normal string object. In other words, Unicode strings and normal strings are not stored in memory in the same form: the former is held as Unicode text decoded from the UTF-8 data, the latter as bytes in the GB2312 encoding.

