Shulou (Shulou.com), 06/02 report. Updated 2025-01-30. From: SLTechnology News&Howtos
This article shares how to solve the problem of garbled Chinese text in Python. Many people are not very familiar with the topic, so it is offered here for reference; I hope you learn a lot from it. Let's get started!
I. What is character encoding?
To solve character encoding problems thoroughly, we first have to understand what character encoding is. Fundamentally, a computer only understands the binary digits 0 and 1, and every piece of data in a computer is physically represented as 0s and 1s. If you take a storage disk apart, you will not see any digits 0 and 1, only a smooth, shiny surface. Under enough magnification, though, you would find countless tiny regions in one of two distinguishable physical states (pits and flat areas on an optical disc, for example, or opposite magnetic orientations on a hard disk). One state stands for 0 and the other for 1: this is how a computer physically represents binary.
1. ASCII
Now we face the first problem: how can a computer understand human language such as English? English consists of letters (uppercase and lowercase), punctuation marks and special symbols. If we assign each of these a fixed number and store that number in binary, the computer can store the symbols unambiguously, and it can use the same table to turn the binary numbers back into characters for people to read. This gave rise to the familiar ASCII code. Standard ASCII uses 7 bits to represent 128 characters; 8-bit extensions of ASCII use the extra bit to represent up to 256. For most purposes this makes converting between English text and binary straightforward.
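The character-to-number mapping described above can be sketched with Python's built-in ord() and chr(). This is a small illustration in Python3 syntax (an assumption; the article's own sessions use Python2), and the sample characters are arbitrary.

```python
# Map a few characters to their ASCII numbers and 7-bit binary forms.
for ch in "Az!":
    code = ord(ch)                # character -> number
    bits = format(code, "07b")    # number -> 7-bit binary string
    print(ch, code, bits)

# The mapping is reversible: number -> character.
letter = chr(0b1000001)           # 65 is the ASCII number of 'A'
print(letter)
```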
2. GB2312
However, although computers were invented by Americans, people all over the world use them. That raises another problem: how can a computer understand Chinese? This is harder, because Chinese is not written with a small fixed alphabet the way Latin-script languages are; it has thousands of distinct characters, far more than ASCII has room for. To solve this, China's national standards body issued the "Chinese Character Coded Character Set for Information Interchange" in 1980, defining the GB2312 encoding for processing Chinese characters. In 1995 the extended Chinese character encoding specification (GBK) was published. GBK is compatible with the internal code of the GB 2312-1980 national standard and, at the character-set level, supports all 20,902 Chinese, Japanese and Korean (CJK) ideographs of ISO/IEC 10646-1 and GB 13000-1. With this, the problem of processing Chinese characters on a computer was solved.
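As a rough illustration of the point above, the following Python3 sketch (Python3 syntax is an assumption; the article itself works in Python2) encodes a sample Chinese string with the gb2312 and gbk codecs from the standard library. The sample text "小明" is just an example.

```python
name = "小明"                      # sample text; any common Chinese word works

gb2312_bytes = name.encode("gb2312")
gbk_bytes = name.encode("gbk")

# Each of the two characters occupies two bytes in GB2312.
print(len(gb2312_bytes), gb2312_bytes.hex())

# GBK keeps the GB2312 byte values for characters GB2312 already covers.
print(gbk_bytes == gb2312_bytes)

# ASCII has no numbers assigned to these characters at all.
try:
    name.encode("ascii")
    ascii_works = True
except UnicodeEncodeError as exc:
    ascii_works = False
    print("ASCII cannot encode this text:", exc.reason)
```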
3. Unicode
Now English and Chinese are handled, but a new problem emerges. The world has many countries and many languages: not only English and Chinese but also Arabic, Spanish, Japanese, Korean and so on. Do we have to create a separate encoding for every language? Out of this situation a new standard was born: Unicode, also called the Universal Code. Unicode assigns a single, unique number (a code point) to every character of every language, to meet the needs of cross-language and cross-platform text conversion and processing. It covers the scripts of Europe, Africa, the Middle East and Asia (including the unified East Asian ideographs and Korean Hangul). So whether you write English, Chinese, Japanese or Korean, each character is in Unicode and corresponds to one unique code. Everyone is happy: as long as everybody uses Unicode, there is no transcoding problem and any character can be interpreted.
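To make the idea of "one unique number per character" concrete, here is a short Python3 sketch (Python3 is an assumption; the sample characters are arbitrary) that looks up the code point and official name of characters from several scripts using the standard unicodedata module.

```python
import unicodedata

text = "A小あ한"   # Latin, Chinese, Japanese and Korean samples

# Every character, whatever its script, has exactly one code point.
code_points = [(ch, ord(ch)) for ch in text]

for ch, number in code_points:
    # Print the code point in the usual U+XXXX notation with its Unicode name.
    print("U+%04X" % number, unicodedata.name(ch))
```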
4. UTF-8
However, Unicode contains far more characters than ASCII or GB2312, so each character needs more bits, and storing every character at a fixed width is wasteful. The early two-byte form of Unicode extends the ISO Latin-1 character set by adding a high byte; when that high byte is 0, the low byte is simply the ISO Latin-1 character. Storing text that ASCII could represent this way is inefficient: it takes twice the space, and the zero high bytes carry no information. To solve this, intermediate formats were defined, called UTF (Unicode Transformation Format). The most commonly used of them, UTF-8, is one such format: it is variable-length, using one byte for ASCII characters and more bytes for other characters. We will not study how UTF-8 works in detail here; you just need to know how these standards relate to each other.
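The space difference described above can be checked with a small Python3 sketch (Python3 is an assumption, and utf-16-be is used here only as a stand-in for the fixed two-byte form the text mentions). It compares the byte length of three sample characters in both formats.

```python
samples = ["A", "\u00e9", "小"]   # \u00e9 is the Latin-1 letter e with acute accent

sizes = {}
for ch in samples:
    fixed = ch.encode("utf-16-be")   # two bytes for each of these characters
    utf8 = ch.encode("utf-8")        # one to three bytes, depending on the character
    sizes[ch] = (len(fixed), len(utf8))
    print("U+%04X" % ord(ch), "two-byte:", fixed.hex(), "utf-8:", utf8.hex())
```

For the ASCII letter, the two-byte form carries a zero high byte while UTF-8 stores a single byte; the Chinese character costs three bytes in UTF-8.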
Summary:
1. ASCII was created to process English characters.
2. GB2312 was created to process Chinese characters.
3. Unicode was created to handle the characters of all the world's languages.
4. UTF-8 was created to improve the storage and transmission efficiency of Unicode; it is one encoding form of Unicode.
II. Character encoding in Python2
1. The default character encoding in Python2 is ASCII. That is, whenever data reaches Python without a specified encoding, Python treats it as ASCII. The most direct symptom is that a Python file containing Chinese characters fails at run time with a SyntaxError complaining about a non-ASCII character and a missing encoding declaration.
The reason is that Python2 treats the entire script as ASCII. When Chinese characters such as "小明" (Xiaoming) appear in the script, ASCII cannot represent them, so the error occurs. The solution is to add an encoding declaration on the first or second line of the file:
# -*- coding: utf-8 -*-
With this, Python reads the whole script as UTF-8, so the Chinese characters are parsed correctly.
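The effect of an encoding declaration can be reproduced without creating files. The sketch below is a Python3 analogue, not the article's original Python2 session: Python3 assumes UTF-8 instead of ASCII when nothing is declared, so the demo uses source bytes saved as GB2312 (an assumption made purely for illustration) and compiles them with and without a declaration.

```python
# Source bytes of a tiny script saved in GB2312, with an encoding
# declaration on its first line.
declared = '# -*- coding: gb2312 -*-\nname = "小明"\n'.encode("gb2312")
# The same script without the declaration.
undeclared = 'name = "小明"\n'.encode("gb2312")

# With the declaration, the interpreter decodes the source correctly.
namespace = {}
exec(compile(declared, "<declared>", "exec"), namespace)
declared_ok = namespace["name"] == "小明"

# Without it, the default encoding is assumed and the bytes are rejected.
try:
    compile(undeclared, "<undeclared>", "exec")
    undeclared_ok = True
except (SyntaxError, UnicodeDecodeError):
    undeclared_ok = False

print(declared_ok, undeclared_ok)
```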
2. There are two types of strings in Python2: str and unicode.
Consider the following two variables:
name is assigned the string "小明" (Xiaoming).
unicode_name is the unicode form of name, obtained with the decode() method, which is explained in detail later.
Entering the two names in the interactive interpreter echoes different escaped byte sequences, and type() reports different types (str and unicode), yet print produces the same output for both.
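The comparison above can be approximated in Python3, where bytes plays the role of Python2's str and str plays the role of Python2's unicode. This is a sketch under that assumption, not a transcript of the original session; ascii() is used to show the escaped form that Python2 would echo for a unicode object.

```python
name = "小明".encode("utf-8")          # bytes: comparable to a Python2 str
unicode_name = name.decode("utf-8")    # str: comparable to a Python2 unicode

# The two objects have different types and different escaped forms ...
print(type(name).__name__, repr(name))
print(type(unicode_name).__name__, ascii(unicode_name))

# ... but they represent the same two characters.
same_text = name.decode("utf-8") == unicode_name
print(same_text)
```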
Note the term "byte string" here. It refers to the escaped form in which Python displays a string: whatever encoding a string uses, Python can show it as a sequence of escaped bytes (for example '\xe5\xb0\x8f\xe6\x98\x8e' for "小明" in UTF-8). This raw form corresponds to the data that is finally handed to the computer for processing.
3. The byte string of a unicode object can be viewed directly in Python2.
Entering unicode_name in the Python2 interpreter returns its escaped form (u'\u5c0f\u660e'), which we can see directly. In Python3 we no longer see it: the interpreter displays the Chinese "小明", because Python3 strings are Unicode by default and are shown as readable text.
Summary:
1. The default character encoding in Python2 is ASCII.
2. There are two types of strings in Python2: str and unicode. A str holds bytes in some particular encoding (UTF-8, GB2312 and so on), while unicode is the standard, encoding-independent form.
3. The byte string of unicode can be viewed directly in Python2.
III. The decode() and encode() methods
Everything so far lays the groundwork for this section, where we actually deal with character encoding problems in Python2. First we need to learn the two conversion methods Python provides, decode() and encode().
The decode() method converts characters in some other encoding into Unicode.
The encode() method converts Unicode characters into some other encoding.
Without further ado, here is the experiment:
The chardet module can detect the encoding of a string; if you do not have it, install it with pip install chardet.
First, why is the "小明" in name a UTF-8 encoded string? I am using Ubuntu 14.04, whose default character encoding is UTF-8, so when I type Chinese characters in the terminal, the system passes them to Python as UTF-8. If you are on Simplified Chinese Windows, where the default system encoding is usually GBK (code page 936, a superset of GB2312), the same "小明" would be GB2312-encoded in this test.
We convert the UTF-8 encoded name to unicode_name with decode(), and then convert unicode_name to gb2312_name with encode(). When we then print the GB2312-encoded string, the output is strange. That is because my operating system uses UTF-8, so GB2312-encoded bytes cannot be interpreted correctly. If we print the same GB2312 byte string under Windows, we get the Chinese we want.
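The decode-then-encode chain described above might look like the following in Python3 syntax (an assumption; in Python2 the same methods are called on str and unicode objects). The helper name transcode is hypothetical and only wraps the two steps.

```python
utf8_name = "小明".encode("utf-8")              # bytes as a UTF-8 terminal would supply them

unicode_name = utf8_name.decode("utf-8")        # step 1: other encoding -> Unicode
gb2312_name = unicode_name.encode("gb2312")     # step 2: Unicode -> other encoding

# Same text, different bytes.
print(utf8_name.hex(), gb2312_name.hex())


def transcode(data, source_encoding, target_encoding):
    """Convert bytes between two encodings, going through Unicode."""
    return data.decode(source_encoding).encode(target_encoding)


# Going back the other way restores the original UTF-8 bytes.
round_trip = transcode(gb2312_name, "gb2312", "utf-8")
print(round_trip == utf8_name)
```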
So-called garbled text (mojibake) is essentially caused by a mismatch between the encoding the system expects and the encoding of the characters it is given. An example:
Xiao Ming's computer stores the character 小 in UTF-8, as the three bytes E5 B0 8F.
Xiao Hong's computer stores the same character in GB2312, as the two bytes D0 A1.
When Xiao Ming and Xiao Hong exchange data, each computer interprets the other's bytes using its own encoding, so neither recognizes them as 小; each shows some other, meaningless characters instead. (Plain ASCII letters such as A have the same byte value in both encodings, which is why English text rarely gets garbled.)
So for the operating system to display a character correctly, we need to know not only the encoding of the character but also the encoding our system uses. If the system uses UTF-8, GB2312-encoded characters will come out "garbled".
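The mismatch can be simulated in a few lines. This Python3 sketch (Python3 is an assumption) takes UTF-8 bytes and decodes them once with the wrong codec and once with the right one; errors="replace" is passed so that the wrong decoding yields placeholder or unrelated characters rather than raising an exception.

```python
utf8_bytes = "小明".encode("utf-8")

# Wrong: the receiver assumes GBK although the bytes are UTF-8.
garbled = utf8_bytes.decode("gbk", errors="replace")

# Right: the receiver uses the encoding the bytes were written in.
correct = utf8_bytes.decode("utf-8")

# ascii() shows the code points without depending on the terminal's encoding.
print(ascii(garbled))
print(ascii(correct))
```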
Tips: calling decode() gives the same result as prefixing a string literal with u, such as u'小明'.
Summary:
When converting between character encodings in Python2, use unicode as the "middleman".
Know the character encoding of your system (Linux defaults to UTF-8; Simplified Chinese Windows defaults to GBK/GB2312) and choose the fix accordingly.
IV. An example of character encoding
Use Python2 under Linux to fetch the title of the NetEase home page and display it in correct Chinese.
The 163 home page uses the gb2312 character encoding, while, as mentioned earlier, the default encoding under Linux is UTF-8. Let's test whether extracting the title directly produces garbled text.
The extracted title did not display correctly, because the page declares gb2312 while my system's default encoding is UTF-8. Clearly, the title has to be converted to UTF-8.
In fact, because UTF-8 is an encoding of Unicode, under Linux we can also decode the title to unicode and print the unicode string directly.
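The conversion can be sketched without any network access. In the Python3 example below (Python3 is an assumption), the page bytes are a made-up stand-in for a downloaded GB2312 page, and the regular expression is a simplification; a real page would be fetched over HTTP and parsed with an HTML parser.

```python
import re

# Hypothetical page content, encoded the way a gb2312 site would send it.
html = ('<html><head><meta charset="gb2312">'
        '<title>网易</title></head></html>').encode("gb2312")

# Extract the raw title bytes; they are still GB2312 at this point.
match = re.search(rb"<title>(.*?)</title>", html)
raw_title = match.group(1)

# Decode with the page's declared encoding to get Unicode text ...
title = raw_title.decode("gb2312")

# ... and re-encode as UTF-8 if UTF-8 bytes are what the system expects.
utf8_title = title.encode("utf-8")
print(raw_title.hex(), utf8_title.hex())
```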
Now let's run another experiment with Python2 under Windows, this time with the title of the Baidu home page.
This time the page's character encoding is utf-8, so under Windows the title comes out garbled.
So, once again: garbled text is essentially caused by a mismatch between the system's encoding and the encoding of the characters it is given.
Character encoding has improved greatly in Python3. The main points are:
The default encoding of a Python3 source .py file is UTF-8, so the coding declaration is no longer required in Python3 scripts. Characters passed to Python by the system are no longer left in the system's default encoding; they are uniformly handled as Unicode.
Strings and byte sequences are clearly distinguished. The str type is the standard string form, like unicode in 2.x; bytes is like str in 2.x and holds data in some particular encoding. bytes is converted to str by decoding, and str to bytes by encoding.
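A short Python3 sketch of the str/bytes split described above; the sample text is arbitrary.

```python
text = "小明"                     # str: a sequence of Unicode characters
data = text.encode("utf-8")       # str -> bytes by encoding
restored = data.decode("utf-8")   # bytes -> str by decoding

# Lengths differ: str counts characters, bytes counts bytes.
print(type(text).__name__, len(text))
print(type(data).__name__, len(data))

# Python3 refuses to mix the two types implicitly.
try:
    text + data
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
print(mixing_allowed)
```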
PS: there is a small issue that puzzles many beginners.
In Python2, when a Chinese string sits inside a list (or tuple or dict) and the container is printed, the string is displayed not as Chinese characters but as an escaped byte string. Yet if the string is taken out of the list and printed by itself, the Chinese displays normally. The byte string is the underlying form of all characters in Python, so you can simply think of the form shown inside the list as the one meant for the computer.
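The cause is that a container displays the repr() of each element rather than its printable text. The Python3 sketch below uses a bytes element to imitate a Python2 str (an assumption, since Python3 lists of str already display Chinese directly).

```python
names = ["小明".encode("utf-8")]   # a list holding one encoded string

# Printing the container shows each element's escaped repr() form.
shown_in_list = str(names)
print(shown_in_list)

# Taking the element out and decoding it gives readable text again.
shown_alone = names[0].decode("utf-8")
print(shown_alone == "小明")
```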
That is all for "How to solve garbled Chinese encoding in Python". Thank you for reading! I hope it has given you a working understanding of the topic and that it proves helpful.