The Principles and Application of Python Encoding

2025-03-31 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/02 Report--

This article explains the principles and use of character encoding in Python. The approach presented here is simple, quick, and practical, so let's work through it step by step.

An in-depth analysis of Python encoding

1. Characters, bytes, and encoding

1.1 Why encode?

An encoding defines the relationship between characters and bytes.

1.2 Why distinguish characters from bytes?

Characters and bytes are kept separate to bridge the gap between people and machines. People read characters; machines only recognize bytes.

1.3 What is the relationship between characters and encoding?

The character 'a' can be recognized by people but not by machines, because machines only recognize bytes.

So a technique is needed to convert the character 'a' into bytes.

That technique is encoding. The most common example is ASCII (American Standard Code for Information Interchange), which covers English letters, digits, and some control characters.

In the ASCII table, the number assigned to the character 'a' is 97, so 'a' is encoded as the binary byte 01100001, which the machine can recognize.
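
The ASCII lookup described above can be checked directly in Python 3. This is a minimal sketch; the helper name `to_bits` is made up for illustration.

```python
def to_bits(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)               # character -> number from the ASCII table
    return format(code, "08b")   # number -> zero-padded 8-bit binary string

print(ord("a"))                  # 97
print(to_bits("a"))              # 01100001
print("a".encode("ascii"))       # b'a', a single byte whose value is 97
```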

When we say "encoding", we usually mean a character set together with the numeric value assigned to each character in it, rather than a separate encoding scheme.

For most encodings, the encoding scheme is simply the binary form of the value that the character set assigns to each character.

Unicode is different: a single character set has several commonly used encoding schemes, which are described in more detail later.
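
To illustrate the point that Unicode has several encoding schemes for one character set, the following Python 3 sketch encodes the same code point, U+4E2D, with three standard codecs. The variable names are illustrative only.

```python
ch = "\u4e2d"  # the single character U+4E2D

# One code point, three different byte sequences
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encodings.items():
    print(name, data.hex(), len(data), "bytes")
```

The character's value never changes; only the byte layout chosen by each scheme does.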

1.4 Why are there different encodings?

Because different people use different characters. Chinese speakers, for example, use Chinese characters. Those characters have no entry in the ASCII table, so they cannot be converted to bytes using ASCII.

New encodings were therefore needed to cover these characters: first GB2312, then GBK, then GB18030.

Similarly, other languages have their own character sets and encodings.
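
A short Python 3 sketch of the situation just described: ASCII has no entry for Chinese characters, while the GB family does. The sample text and variable names are assumptions chosen for illustration.

```python
text = "\u4e2d\u6587"  # two Chinese characters

try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False  # ASCII has no entry for these characters

# Each successive standard extends the previous one, so these common
# characters get the same two-byte codes in all three
gb_results = {name: text.encode(name) for name in ("gb2312", "gbk", "gb18030")}

print(ascii_ok)
print(gb_results)
```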

2. Unicode

Once software goes international, it no longer works for everyone to use their own encoding. A unified encoding that covers the world's characters is needed. An application that uses Unicode internally does not have to be adapted for each locale.

As noted above, an encoding consists of a character set and the value assigned to each character. The character set and its values are usually published together as the code table of that encoding.

Unicode (ISO/IEC 10646) historically defined two fixed-width forms, UCS-2 and UCS-4. Because UCS-4 is a superset of UCS-2, only UCS-4 is described here.

UCS-4 uses 4 bytes per character. The highest bit is always 0, so the highest byte has 7 significant bits.

UCS-4 uses those 7 bits of the highest byte to identify the group: 2^7 = 128 groups.

UCS-4 uses the second-highest byte to identify the plane: 2^8 = 256 planes per group.

UCS-4 uses the third byte to identify the row: 2^8 = 256 rows per plane.

UCS-4 uses the fourth byte to identify the cell: 2^8 = 256 code points per row.

Plane 0 of group 0 is called the Basic Multilingual Plane (BMP).

In UCS-4, the code points whose two high bytes are zero make up the BMP. Dropping those two zero bytes from a BMP code point gives the 2-byte UCS-2 form.
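
The group/plane/row/cell layout can be sketched with a few bit operations in Python 3. The function names `ucs4_fields` and `in_bmp` are hypothetical helpers written for this illustration, not part of any library.

```python
def ucs4_fields(code_point):
    """Split a UCS-4 value into (group, plane, row, cell)."""
    group = (code_point >> 24) & 0x7F  # 7 significant bits of the highest byte
    plane = (code_point >> 16) & 0xFF  # second-highest byte
    row = (code_point >> 8) & 0xFF     # third byte
    cell = code_point & 0xFF           # lowest byte
    return group, plane, row, cell

def in_bmp(code_point):
    """A code point is in the BMP when both group and plane are 0."""
    group, plane, _, _ = ucs4_fields(code_point)
    return group == 0 and plane == 0

print(ucs4_fields(0x4E2D))  # (0, 0, 78, 45)
print(in_bmp(0x4E2D))       # True
print(in_bmp(0x1F600))      # False
```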

From the description above it is easy to calculate that UCS-2 has 2^16 = 65,536 code points in total.

This is why the following, run on Python 2.x (a narrow build, such as the standard Windows build):

    import sys
    print(sys.maxunicode)

prints 65535 (code points start at 0, so the maximum code point is 65535).

Unicode currently uses 17 planes of 65,536 code points each, for a total of 17 × 65,536 = 1,114,112 code points.

This is why the following, run on Python 3.x (3.3 and later):

    import sys
    print(sys.maxunicode)

prints 1114111.
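
The arithmetic behind that number can be checked in a few lines of Python 3. The constant names are illustrative.

```python
import sys

PLANES = 17
POINTS_PER_PLANE = 2 ** 16         # 65,536 code points in each plane

total = PLANES * POINTS_PER_PLANE  # 1,114,112 code points in all
highest = total - 1                # numbering starts at 0

print(total, highest, hex(highest))
print(sys.maxunicode == highest)   # True on Python 3.3 and later
```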

3. The origin of garbled text in Python

Let's start with the encodings of the two characters '中文' ("Chinese").

Unicode code points of '中文': \u4e2d\u6587
UTF-8 encoding of '中文': \xe4\xb8\xad\xe6\x96\x87
GBK encoding of '中文': \xd6\xd0\xce\xc4
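
The three values listed above can be reproduced in Python 3. This is only a verification sketch; the variable names are arbitrary.

```python
text = "\u4e2d\u6587"

code_points = [hex(ord(c)) for c in text]
utf8_bytes = text.encode("utf-8")
gbk_bytes = text.encode("gbk")

print(code_points)  # ['0x4e2d', '0x6587']
print(utf8_bytes)   # b'\xe4\xb8\xad\xe6\x96\x87'
print(gbk_bytes)    # b'\xd6\xd0\xce\xc4'
```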

The following code is not typed into the interactive interpreter. It is saved to a file and run from cmd.

    print("中文")

Save the line above in a file named test-encoding.py. Running that file with Python 2.x on Chinese-locale Windows fails with a SyntaxError.

A hex dump of test-encoding.py shows that the file is saved as UTF-8: its bytes match the UTF-8 encoding of '中文' listed above.

Note that this is not a runtime error. A SyntaxError is raised while the source file is being parsed.

The error message also gives the reason: the file contains non-ASCII characters. Python 2.x assumes ASCII as the default source encoding, so the source cannot contain non-ASCII characters.

To include non-ASCII characters, we have to tell the Python interpreter how the file is encoded:

    # -*- coding: utf-8 -*-
    print("中文")

The added comment tells the Python interpreter to use UTF-8 when converting between the characters and bytes of the source file.
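
Python 3 exposes the same declaration lookup through `tokenize.detect_encoding` in the standard library. The sketch below feeds it two made-up source snippets; `declared_encoding` is a hypothetical wrapper written for this example.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the source encoding Python 3 would use for these file bytes."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

with_declaration = b"# -*- coding: gbk -*-\nprint('x')\n"
without_declaration = b"print('x')\n"

print(declared_encoding(with_declaration))     # gbk
print(declared_encoding(without_declaration))  # utf-8, the Python 3 default
```

Note that Python 3 falls back to UTF-8 when no declaration is present, whereas Python 2.x falls back to ASCII.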

Is it enough to tell the interpreter to use UTF-8?

Clearly not, because other encodings are involved as well. For example, running the file above in cmd produces this output:

涓枃

The output above is what the UTF-8 bytes \xe4\xb8\xad\xe6\x96\x87 of '中文' look like when they are interpreted as GBK. You can check this against a GBK code table or with an online transcoding tool.

This happens because the default code page on Chinese-locale Windows is 936, which is GBK.
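
The mismatch can be imitated in Python 3 by decoding UTF-8 bytes with the GBK codec. This is a rough sketch: `errors="replace"` is an assumption added so the example does not raise on byte pairs that GBK leaves undefined, and a real console may render such pairs differently.

```python
utf8_bytes = "\u4e2d\u6587".encode("utf-8")  # the bytes the script writes out

# A GBK console reads the same bytes with the wrong code table
shown = utf8_bytes.decode("gbk", errors="replace")

print(repr(shown))
print(shown == "\u4e2d\u6587")  # False
```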

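There are two ways to fix this (2.x):

    # -*- coding: utf-8 -*-
    # Way 1: use a unicode literal and let print encode it for the console
    print(u"中文")
    # Way 2: decode the UTF-8 bytes, then encode them as GBK explicitly
    a = "中文"
    print(a.decode('utf-8').encode('gbk'))

4. The confusing default encoding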

The default encoding of Python 2.x is ASCII. The default encoding of Python 3.x is UTF-8.

You can check the default encoding with sys.getdefaultencoding().

So what is the default encoding actually used for?

Some "clever" recipes teach you to fix garbled text like this:

    # -*- coding: utf-8 -*-
    import sys
    print(sys.getdefaultencoding())
    # Python 2.x only: reload(sys) restores setdefaultencoding, which site.py removes
    reload(sys)
    sys.setdefaultencoding('utf-8')
    print(sys.getdefaultencoding())
    a = '中文'
    print(a)

However, this does not help here, because the problem lies at the output end.

So where is this trick useful?

The answer is when dealing with unicode objects:

    # -*- coding: utf-8 -*-
    import sys
    print(sys.getdefaultencoding())
    # reload(sys)
    # sys.setdefaultencoding('utf-8')
    # The 'w' mode is assumed; the original mode string was lost
    with open(r'F:\tmp\test.txt', 'w') as f:
        f.write('测试')   # str (bytes): written as-is
        f.write(u'测试')  # unicode: must be encoded implicitly first

Running the code above under 2.x raises the following error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

If you uncomment the two commented-out lines, the error goes away.

Let's analyze why. f.write(str) expects an argument of type str, but u'测试' is of type unicode.

Converting unicode to str requires an encode step, because every conversion from characters to bytes goes through an encoding. Here the default encoding is used implicitly:

    u'测试'.encode(sys.getdefaultencoding())

5. Encoding in the Python interactive interpreter

In interactive mode, Python takes its encoding from the system by default. On Chinese Windows this is usually GBK.

So garbled text is generally not a problem when working interactively.
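
To see which encodings are in effect on a given machine, a Python 3 sketch like the following can help. `console_encodings` is a hypothetical helper; the values it reports depend entirely on the local environment.

```python
import locale
import sys

def console_encodings():
    """Collect the encodings Python reports for the current environment."""
    return {
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "preferred": locale.getpreferredencoding(False),
        "default": sys.getdefaultencoding(),
    }

for name, value in console_encodings().items():
    print(name, value)
```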

6. Handling web page and file encodings in Python

UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 392477: illegal multibyte sequence

If you run into the error above, it is almost always caused by output to cmd, that is, by a print('xxx') call where xxx contains a character that GBK cannot represent.

In fact, a similar error is easy to construct (2.x):

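    # -*- coding: utf-8 -*-
    # '\xe9\xbe\xa6' is the UTF-8 encoding of U+9FA6, which GBK cannot represent
    print('\xe9\xbe\xa6'.decode("utf-8"))
    # print('\xe9\xbe\xa6'.decode("utf-8").encode("gbk"))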

Run a file containing the code above from the Windows command line in a Chinese locale and you will see the UnicodeEncodeError.

In Unicode:

Basic CJK ideographs: 4E00-9FA5 (20,902 characters)
Basic CJK ideographs supplement: 9FA6-9FEF (74 characters)

The character in the example has Unicode code point 9FA6, and GBK does not contain it.
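
The same limits can be probed in Python 3 by attempting the encode and catching the failure. `can_encode` is a hypothetical helper; the three test characters are U+4E2D, U+9FA6, and the U+00A0 from the error message above.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for ch in ("\u4e2d", "\u9fa6", "\xa0"):
    print(hex(ord(ch)), can_encode(ch, "gbk"), can_encode(ch, "utf-8"))
```

UTF-8 can represent all three characters, while GBK handles only the first.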

Solution: this error means the text contains characters outside the range of GBK, so do not use GBK to encode it.

The Chinese Windows command line defaults to GBK. Unless you change the code page, the problem is bound to occur; on the console, changing the code page is the only fix.

Alternatively, write the output directly to a file using UTF-8 encoding.
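
A minimal Python 3 sketch of the file-based alternative. The temporary directory and the file name output.txt are placeholders chosen for this example; any writable path works.

```python
import io
import os
import tempfile

text = "\u9fa6\xa0\u4e2d\u6587"  # includes characters GBK cannot encode

# Hypothetical output location used only for this sketch
path = os.path.join(tempfile.mkdtemp(), "output.txt")

# An explicit encoding avoids depending on the console code page
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

with io.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

print(read_back == text)  # True
```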

7. Summary

Python 2.x

    # a is a byte string (str) type
    a = '中文'
    # b is a unicode type
    b = u'中文'

The differences between the two are:

The str type holds encoded bytes; implicit conversions use the system default encoding (which can be changed).

The unicode type holds Unicode code points.

Conversion: calling decode on a str gives a unicode object, and calling encode on a unicode object gives a str.

In 2.x, ASCII is the default encoding for converting between characters and bytes, and the character set is UCS-2 (on narrow builds).

Because 2.x has no separate bytes type (str doubles as the byte string), conversion between encodings goes through unicode: first decode to unicode, then encode from unicode into whichever encoding is needed.

Python 3.x

Compared with 2.x, the 3.x version is much more conventional and consistent with how other languages work.

The first and most important change is the addition of the bytes type.

The unicode type is no longer exposed separately. Instead, the model is:

characters <-> encoding <-> bytes

    # characters --> encode --> bytes
    bs = "中文".encode('utf-8')
    # bytes --> decode --> characters
    ch = bs.decode("utf-8")

In 3.x, UTF-8 is the default encoding for converting between bytes and characters, and the character set used is UCS-4.
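
Converting between two byte encodings in Python 3 follows the same pattern: decode to str, then encode again. `transcode` is a hypothetical helper name used for this sketch.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another via str (Unicode)."""
    text = data.decode(source_encoding)   # bytes -> characters
    return text.encode(target_encoding)   # characters -> bytes

utf8_bytes = "\u4e2d\u6587".encode("utf-8")
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")

print(gbk_bytes)                                            # b'\xd6\xd0\xce\xc4'
print(transcode(gbk_bytes, "gbk", "utf-8") == utf8_bytes)   # True
```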

8. The ultimate solution to garbled text

Garbled text means that something went wrong in the bytes -> characters step.

So first work out which encoding the bytes in the file or on the network use, for example ASCII, GBK, or UTF-8.

Then work out which encoding the terminal or editor that displays the characters uses, and make the two match: decode and encode must use the same encoding as a pair.

Some readers may object: isn't this stating the obvious? If I knew the encoding, would the text be garbled at all?

The key questions are where the bytes come from and which encoding the editor or terminal that displays the characters uses. Answering them is enough to locate most problems.

Data that has been converted several times along the way needs a more careful investigation.
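
When the source encoding is unknown, one rough way to narrow it down in Python 3 is to try a few candidates and keep those that decode cleanly. `try_decodings` and its candidate list are assumptions made for this sketch. A clean decode does not prove the encoding is right, so the results still need a visual check.

```python
def try_decodings(data, candidates=("ascii", "utf-8", "gbk")):
    """Return {encoding: text} for every candidate that decodes data without error."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this encoding
    return results

mystery = b"\xd6\xd0\xce\xc4"
print(try_decodings(mystery))  # {'gbk': '中文'}
```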
