What are the coding problems in Python


This article mainly explains "what are the coding problems in Python". The content is simple, clear, and easy to learn and understand; please follow the editor's train of thought to study it.

ASCII

Every novice doing JavaWeb development runs into garbled-text problems, and every novice writing a Python crawler runs into encoding problems. Why does the encoding problem hurt so much? The story begins with Guido van Rossum's creation of Python in the early 1990s. What Guido never expected was that Python would become as popular as it is today, or that computers would develop at such an amazing speed. When he designed the language he did not need to care much about encoding, because in the English-speaking world the number of characters is very limited: 26 letters (uppercase and lowercase), 10 digits, punctuation marks, and control characters. All the characters corresponding to the keys on a keyboard add up to little more than 100, and one byte of storage is more than enough to represent a character, because one byte is 8 bits and 8 bits can represent 256 symbols. So the Americans devised a character encoding standard called ASCII (American Standard Code for Information Interchange), in which each character corresponds to a unique number; for example, the character A corresponds to the binary value 01000001, which is 65 in decimal. At first ASCII defined only 128 codes: 95 printable characters and 33 control characters. Since 128 characters need only 7 bits of a byte, ASCII uses only the lower 7 bits and leaves the highest bit 0. The full correspondence between characters and ASCII codes can be found on the ascii-code website.
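As a quick sanity check, here is a minimal Python 2 session (my own illustration, matching the interpreter examples used later in this article) showing the mapping between the character A and the number 65 with the built-ins ord, chr, and bin:

>>> ord('A')          # character to ASCII code
65
>>> chr(65)           # ASCII code back to character
'A'
>>> bin(ord('A'))     # 0b1000001, i.e. 01000001 with the leading zero dropped
'0b1000001'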

EASCII (ISO/8859-1)

However, as computers slowly spread to Western Europe, people found that many characters unique to Western European languages were missing from the ASCII table, so an extended ASCII called EASCII later appeared. As the name implies, it expands ASCII from 7 bits to 8 bits. It is fully compatible with ASCII, and the added symbols include box-drawing symbols, calculation symbols, Greek letters and special Latin symbols. The EASCII era, however, was a chaotic one: there was no unified standard, and vendors each implemented their own character encoding according to their own needs. The best known is CP437, the character encoding originally used on the IBM PC and in DOS/Windows command-line environments.

Another widely used EASCII variant is ISO/IEC 8859-1 (Latin-1), one of a series of 8-bit character set standards jointly developed by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). ISO 8859-1 does not carry over CP437's graphical characters in the 128-159 range, so its extended printable characters are defined starting from 160. Unfortunately, these many ASCII extensions are not compatible with one another.
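A small illustrative Python 2 session (my own example, not from the original article) shows the single-byte nature of Latin-1: every byte value maps to exactly one character, with the lower 128 codes identical to ASCII:

>>> '\x41'.decode('latin-1')   # 0x41 is ASCII 'A'; codes below 128 are unchanged
u'A'
>>> '\xe9'.decode('latin-1')   # 0xE9 (233) is one of the extended Latin symbols (e with acute accent)
u'\xe9'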

GBK

With the progress of the times, computers reached thousands of households, and Bill Gates's dream of a computer on every desktop came true. But one problem computers had to face when entering China was character encoding. Chinese characters are among the most widely used by human beings, and there are tens of thousands of them in common use, far beyond the range of characters ASCII can express; even EASCII was a drop in the bucket. So an encoding called GB2312, also known as GB0, was created and issued in 1981 by the State Bureau of Standards of China. GB2312 contains 6,763 Chinese characters and is also compatible with ASCII. Its appearance basically met the needs of computer processing of Chinese: the characters it covers account for 99.75% of everyday usage in mainland China. But GB2312 still could not satisfy everyone, because it cannot handle some rare and traditional characters. Later, an encoding called GBK was created on the basis of GB2312. GBK not only contains 27,484 Chinese characters but also covers Tibetan, Mongolian, Uyghur and other major minority scripts. GBK is also compatible with ASCII: English characters are represented with 1 byte and Chinese characters with 2 bytes.
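A short Python 2 sketch (my own illustration; the character 好 reappears throughout the rest of this article) shows GBK's variable width, 1 byte for ASCII characters and 2 bytes for Chinese characters:

>>> u'A'.encode('gbk')           # ASCII characters keep their one-byte codes
'A'
>>> u'\u597d'.encode('gbk')      # u'\u597d' is the character 好 ("good"); GBK uses two bytes
'\xba\xc3'
>>> len(u'\u597d'.encode('gbk'))
2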

Unicode

We can go our own way in handling Chinese characters and define an encoding standard to suit our own needs, but computers are not used only by Americans and Chinese; there are also the scripts of other countries in Europe and Asia, such as Japanese and Korean. Added together, the characters of the whole world number in the hundreds of thousands, far beyond the scope of ASCII or even GBK, and why should other countries adopt the GBK standard anyway? How can such a huge character repertoire be represented? The answer was Unicode, put forward by the Unicode Consortium; its formal name is the "Universal Multiple-Octet Coded Character Set", UCS for short. Unicode has two formats: UCS-2 and UCS-4. UCS-2 is encoded in two bytes, 16 bits in total, and can theoretically represent up to 65,536 characters, but 65,536 is far from enough to cover all the characters in the world: Chinese characters alone number close to 100,000. So the Unicode 4.0 specification defines a set of supplementary characters, and UCS-4 uses 4 bytes (actually only 31 bits are used, since the highest bit must be 0), which in theory can cover the symbols used in all languages. Any character in the world can be represented by a Unicode code point, and once a character's code point is assigned it never changes.

But Unicode has its limitations. When a Unicode character is transmitted over the network or finally stored, it does not necessarily need two bytes per character: the character "A" can be represented in a single byte, and always using two bytes obviously wastes space. The second problem is that a character saved in the computer is just a string of 0s and 1s, so how does the computer know whether two bytes represent one 2-byte character or two 1-byte characters? If you don't tell the computer in advance, it will be confused. Unicode only specifies how characters are numbered, not how those numbers are transmitted or stored. For example, the Unicode code point of the character 汉 is 6C49. I can transmit and store this code as the four ASCII characters "6C49", or as the three consecutive bytes E6 B1 89 of its UTF-8 encoding; the key is that both sides of the communication recognize the format.

Therefore, Unicode can be implemented in different ways, such as UTF-8, UTF-16, and so on. Unicode here is like English serving as the universal standard for communication between countries: every country has its own language, and each translates the standard English documents into its own language. That translation is the implementation, just like UTF-8.
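A quick Python 2 check (my own addition) of the example above: the code point U+6C49 (汉) does encode to the three UTF-8 bytes E6 B1 89:

>>> u'\u6c49'.encode('utf-8')
'\xe6\xb1\x89'
>>> u'\u6c49'.encode('utf-8').encode('hex')   # Python 2 hex codec, for easier reading
'e6b189'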

UTF-8

As an implementation of Unicode, UTF-8 (Unicode Transformation Format) is widely used on the Internet. It is a variable-length character encoding that represents a character with 1 to 4 bytes depending on the situation. For example, English characters, which can already be represented by ASCII, need only one byte in UTF-8, exactly the same as in ASCII. For a multi-byte character encoded in n bytes, the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10; the remaining bits are filled with the character's Unicode code point.

Take the Chinese character 好 ("good") as an example. Its Unicode code point is 597D, which falls in the range 0000 0800 - 0000 FFFF, so it needs 3 bytes when expressed in UTF-8. 597D in binary is 0101 1001 0111 1101. Filling these bits into the template 1110xxxx 10xxxxxx 10xxxxxx gives 11100101 10100101 10111101, which in hexadecimal is E5A5BD. So the UTF-8 encoding of the Unicode character 597D (好) is E5 A5 BD.

Character                好
Unicode (binary)         0101 100101 111101
UTF-8 template           1110xxxx 10xxxxxx 10xxxxxx
UTF-8 (binary)           11100101 10100101 10111101
UTF-8 (hex)              e5 a5 bd

Python character encoding
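The calculation can be reproduced with a few lines of Python 2 bit arithmetic (a minimal sketch of the 3-byte rule above, not a general-purpose encoder):

# Hand-encode a code point in the range U+0800..U+FFFF into 3 UTF-8 bytes.
cp = 0x597D                               # code point of 好
b1 = 0xE0 | (cp >> 12)                    # 1110xxxx: top 4 bits of the code point
b2 = 0x80 | ((cp >> 6) & 0x3F)            # 10xxxxxx: middle 6 bits
b3 = 0x80 | (cp & 0x3F)                   # 10xxxxxx: low 6 bits
print '%02x %02x %02x' % (b1, b2, b3)     # e5 a5 bd
print repr(u'\u597d'.encode('utf-8'))     # the built-in codec agrees: '\xe5\xa5\xbd'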

Now that the theory is finally out of the way, let's talk about the encoding problems in Python. Python was born before Unicode became widespread, and the default encoding of Python 2 is ASCII.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

So if the encoding is not specified explicitly in a Python source file, a syntax error occurs as soon as the file contains non-ASCII characters:

# test.py
print "你好"

The above is the test.py script. Running python test.py reports the following error:

File "test.py", line 1

YntaxError: Non-ASCII character'\ xe4' in file test.py on line 1, but no encoding declared; see http://www.python.org/

Ps/pep-0263.html for details

In order to support non-ASCII characters in the source code, the encoding format must be specified on the first or second line of the source file:

# coding=utf-8

Or:

#!/usr/bin/python
# -*- coding: utf-8 -*-
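Putting the two pieces together, a minimal test.py that runs cleanly under Python 2 might look like this (assuming the file itself is saved as UTF-8):

#!/usr/bin/python
# -*- coding: utf-8 -*-
# The declared encoding must match the encoding the file is actually saved in.
print "你好"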

The string-related data types in Python 2 are str and unicode, both of which are subclasses of basestring. str and unicode are thus two different kinds of string object:

     basestring
      /      \
    str    unicode
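A couple of isinstance checks (my own quick illustration) confirm this hierarchy:

>>> isinstance('good', str), isinstance(u'good', unicode)
(True, True)
>>> isinstance('good', basestring), isinstance(u'good', basestring)
(True, True)
>>> isinstance('good', unicode)
False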

For the same Chinese character 好 ("good"), when represented as a str it is the UTF-8-encoded byte string '\xe5\xa5\xbd', while when represented as unicode it is u'\u597d', which is equivalent to u"好". It is important to add that whether a str literal ends up encoded as UTF-8, GBK or some other format depends on the operating system (more precisely, on the terminal's encoding). For example, on a Windows system the cmd command line shows:

# Windows terminal
>>> a = '好'
>>> type(a)
<type 'str'>
>>> a
'\xba\xc3'

What is displayed on the command line of the Linux system is:

# Linux terminal
>>> a = '好'
>>> type(a)
<type 'str'>
>>> a
'\xe5\xa5\xbd'
>>> b = u'好'
>>> type(b)
<type 'unicode'>
>>> b
u'\u597d'

Whether it is Python 3.x, Java or another programming language, Unicode has become the language's default internal string representation. When data is finally saved to a medium, different implementations may choose different encodings: some people like UTF-8, others like GBK. It doesn't matter, as long as the platform agrees on a unified encoding convention; how it is implemented underneath is not the concern.
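For comparison (my own aside; the rest of this article sticks to Python 2), the same character in Python 3 is a unicode str by default and must be encoded explicitly to obtain bytes:

# Python 3, for comparison only
>>> s = '\u597d'          # the character 好; Python 3 str is unicode by default
>>> type(s)
<class 'str'>
>>> s.encode('utf-8')     # bytes must be produced explicitly
b'\xe5\xa5\xbd'
>>> s.encode('gbk')
b'\xba\xc3'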

Conversion between str and unicode

So how does Python 2 convert between str and unicode? Conversion between these two string types relies on the two methods decode and encode:

# from str to unicode
s.decode(encoding)    # <str> to <unicode>
# from unicode to str
u.encode(encoding)    # <unicode> to <str>

>>> c = b.encode('utf-8')
>>> type(c)
<type 'str'>
>>> c
'\xe5\xa5\xbd'
>>> d = c.decode('utf-8')
>>> type(d)
<type 'unicode'>
>>> d
u'\u597d'

Here '\xe5\xa5\xbd' is the UTF-8-encoded str obtained by calling encode on the unicode string u'好', and vice versa: the str c is decoded into the unicode string d by the decode method.

str(s) and unicode(s)

str(s) and unicode(s) are two factory functions that return a str string object and a unicode string object respectively. When s is a unicode string, str(s) is shorthand for s.encode('ascii'). An experiment:

> S3 = u "Hello" > s3u'\ u4f60\ u597d' > str (S3) Traceback (most recent call last): File ", line 1, in UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range

s3 above is a unicode string, and str(s3) is equivalent to executing s3.encode('ascii'). Because the two Chinese characters 你好 ("hello") cannot be represented in ASCII, the error is raised; specifying a correct encoding, s3.encode('gbk') or s3.encode('utf-8'), avoids the problem. unicode() fails in a similar way:

> S4 = "Hello" > unicode (S4) Traceback (most recent call last): File ", line 1, in UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range >

unicode(s4) is equivalent to s4.decode('ascii'), so for a correct conversion you must specify the actual encoding: s4.decode('gbk') or s4.decode('utf-8').
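Continuing the same sessions (my own addition), the explicit calls suggested above behave as expected; note that s4.decode('gbk') only works because that terminal stored the literal 你好 as GBK bytes:

>>> s3.encode('utf-8')     # encoding the unicode string explicitly succeeds
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> s4.decode('gbk')       # succeeds because s4 holds GBK bytes (0xc4 0xe3 0xba 0xc3)
u'\u4f60\u597d'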

Garbled text

All garbled text can ultimately be attributed to using inconsistent encodings when encoding and decoding characters, for example:

# encoding: utf-8
>>> a = '好'
>>> a
'\xe5\xa5\xbd'
>>> b = a.decode("utf-8")
>>> b
u'\u597d'
>>> c = b.encode("gbk")
>>> c
'\xba\xc3'
>>> print c    # prints mojibake: the GBK bytes '\xba\xc3' are not valid UTF-8 on this terminal

The character 好 encoded in UTF-8 occupies 3 bytes. After decoding it to unicode and re-encoding it with GBK, it occupies only 2 bytes, and when those GBK bytes are printed on a terminal that expects UTF-8, the result is garbled. So the best way to prevent mojibake is to always use the same encoding format to encode and decode the characters.
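As a counter-example (a minimal sketch of the "same encoding both ways" rule, again on a UTF-8 terminal), the round trip is lossless when the encoding used to decode matches the one used to encode:

# encoding: utf-8
>>> a = '好'                                   # UTF-8 bytes on a UTF-8 terminal
>>> a.decode('utf-8').encode('utf-8') == a     # same encoding both ways: lossless
True
>>> print a.decode('utf-8').encode('utf-8')    # and it prints correctly
好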

Other tips

For a str that contains literal unicode escape sequences, such as:

s = 'id\u003d215903184\u0026index\u003d0\u0026st\u003d52\u0026sid'

Converting it to a real unicode string requires:

s.decode('unicode-escape')

Test:

>>> s = 'id\u003d215903184\u0026index\u003d0\u0026st\u003d52\u0026sid\u003d95000\u0026i'
>>> print(type(s))
<type 'str'>
>>> s = s.decode('unicode-escape')
>>> s
u'id=215903184&index=0&st=52&sid=95000&i'
>>> print(type(s))
<type 'unicode'>

The above code and concepts are based on Python 2.x.

Thank you for reading. These are the contents of "what are the coding problems in Python". After studying this article, I believe you have a deeper understanding of the encoding problems in Python; the specific usage still needs to be verified in practice.
