2025-01-23 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains how Python handles string encoding: the string types in Python 3 and Python 2, how to detect an unknown encoding, how encoding and decoding work, and how to declare the encoding of a source file. The methods covered are simple and practical, so let's walk through them.
1. str and bytes in Python 3
Python 3 has two string types, str and bytes.
Here is the difference between the two:
Unicode string (str type): stored as Unicode code points, the form humans read
Byte string (bytes type): stored as bytes, the form machines read
Every string literal you write in Python 3 without a prefix is a Unicode string (str). You can tell the two types apart with type and isinstance:
# python3
>>> str_obj = "你好"
>>> type(str_obj)
<class 'str'>
>>> isinstance("你好", str)
True
>>> isinstance("你好", bytes)
False
bytes, by contrast, is a sequence of raw bytes. Putting a b prefix in front of a string literal defines an object of type bytes.
# python3
>>> byte_obj = b"Hello World!"
>>> type(byte_obj)
<class 'bytes'>
>>> isinstance(byte_obj, str)
False
>>> isinstance(byte_obj, bytes)
True
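To make the "code points versus bytes" distinction concrete, here is a minimal sketch. The variable names are illustrative only; it relies on nothing beyond built-in behaviour of str and bytes in Python 3.

```python
text = "Hello"   # str: a sequence of Unicode characters
data = b"Hello"  # bytes: a sequence of integers in the range 0-255

# Indexing a str yields a one-character str; indexing bytes yields an int.
first_char = text[0]
first_byte = data[0]

# Converting bytes to a list exposes the raw integer values.
byte_values = list(data)

print(first_char, first_byte, byte_values)
```

This is why the two types cannot be mixed freely: one holds characters, the other holds numbers.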
A bytes literal may contain only ASCII characters, so you cannot put b directly in front of a string that contains Chinese characters. Use encode to convert the string instead.
>>> byte_obj = b"你好"
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
>>> str_obj = "你好"
>>> str_obj.encode("utf-8")
b'\xe4\xbd\xa0\xe5\xa5\xbd'
2. str and unicode in Python 2
The string types in Python 2 differ from those in Python 3, so take care to tell them apart.
Python 2 also has just two string types, unicode and str.
The only distinction is between unicode objects and non-unicode objects (that is, str objects):
Unicode string (unicode type): stored as Unicode code points, the form humans read
Byte string (str type): stored as bytes, the form machines read
A string defined with plain double or single quotes is a str object, that is, a byte string:
# python2
>>> str_obj = "你好"
>>> type(str_obj)
<type 'str'>
>>> str_obj
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> isinstance(str_obj, bytes)
True
>>> isinstance(str_obj, str)
True
>>> isinstance(str_obj, unicode)
False
>>> str is bytes
True
Putting a u in front of the quotes defines a unicode string object:
# python2
>>> unicode_obj = u"你好"
>>> unicode_obj
u'\u4f60\u597d'
>>> type(unicode_obj)
<type 'unicode'>
>>> isinstance(unicode_obj, bytes)
False
>>> isinstance(unicode_obj, str)
False
>>> isinstance(unicode_obj, unicode)
True
3. How to detect the encoding of an object
Every character has a corresponding numeric value in the Unicode character set, called its code point.
An encoding, such as UTF-8 or GB2312, is a set of rules for storing those code points as bytes.
In other words, to persist an in-memory string to disk you must choose an encoding, and to read it back you must specify the same encoding (this step is called decoding); otherwise the text comes out garbled.
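The effect of decoding with the wrong encoding can be sketched in memory, without touching the disk. This is only an illustration: the choice of UTF-8 for storing and GBK for the mismatched read is an assumption made for the example, and errors="replace" is used so that the mismatched decode never raises.

```python
text = "你好"

# Encoding: turn code points into bytes using UTF-8.
stored = text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
recovered = stored.decode("utf-8")

# Decoding with a different encoding (GBK here) yields garbled text.
# errors="replace" substitutes U+FFFD for any bytes GBK cannot map.
garbled = stored.decode("gbk", errors="replace")

print(recovered == text)
print(garbled == text)
```

The bytes are identical in both cases; only the rule used to interpret them changes.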
This raises a problem: decoding works when we know the encoding, but we do not always know which encoding to use.
For that case there is a Python library called chardet. It has to be installed before use:
Python3-m pip install chardet
Chardet has a detect method that can predict its encoding format
> import chardet > chardet.detect ('Wechat official account: Python programming time' .encode ('gbk')) {' encoding': 'GB2312',' confidence': 0.99, 'language':' Chinese'}
Why call it a prediction? The output above includes a confidence field that indicates how reliable the guess is.
With only a few characters, however, chardet may misdiagnose the encoding. For example, the two-character Chinese string below is encoded with gbk, but chardet identifies it as KOI8-R.
>>> str_obj = "中文"
>>> byte_obj = bytes(str_obj, encoding='gbk')  # first get gbk-encoded bytes
>>> chardet.detect(byte_obj)
{'encoding': 'KOI8-R', 'confidence': 0.682639754276994, 'language': 'Russian'}
>>> str_obj2 = str(byte_obj, encoding='KOI8-R')
>>> str_obj2
'жпнд'
So for an accurate diagnosis, give chardet as many characters as possible.
chardet supports many languages, as listed in the official documentation (https://chardet.readthedocs.io/en/latest/supported-encodings.html).
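Since a low confidence value signals an unreliable guess, one possible safeguard is to accept a detection only above some threshold. The sketch below is an assumption, not part of chardet: pick_encoding, min_confidence and fallback are hypothetical names, and the 0.9 threshold is arbitrary. It operates on the result dictionary, so the two results quoted above are reused here as plain dicts.

```python
def pick_encoding(detection, min_confidence=0.9, fallback="utf-8"):
    """Return the detected encoding only if its confidence is high enough.

    `detection` is a dict shaped like the results shown above.
    The threshold and fallback are illustrative assumptions.
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback


# The two results quoted in this section, copied as plain dicts.
long_text_result = {"encoding": "GB2312", "confidence": 0.99, "language": "Chinese"}
short_text_result = {"encoding": "KOI8-R", "confidence": 0.682639754276994, "language": "Russian"}

print(pick_encoding(long_text_result))
print(pick_encoding(short_text_result, fallback="gbk"))
```

In real use the argument would be the value returned by chardet.detect; a suitable threshold and fallback depend on where your data comes from.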
4. The difference between encoding and decoding
Encoding and decoding are the conversions between str and bytes (Python 2 is long retired, so from here on only Python 3 is used in examples).
Encoding: the encode method converts a string object into a sequence of bytes
Decoding: the decode method converts a sequence of bytes into a string object
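The two definitions above can be sketched as a round trip. The example string and the choice of UTF-8 and GBK are illustrative; the point is that the same str produces different bytes under different encodings, and each byte sequence decodes back correctly only with its own encoding.

```python
text = "中文"

# Encoding: str -> bytes. The result depends on the encoding chosen.
utf8_bytes = text.encode("utf-8")  # 3 bytes per character for this text
gbk_bytes = text.encode("gbk")     # 2 bytes per character for this text

print(len(text), len(utf8_bytes), len(gbk_bytes))

# Decoding: bytes -> str, using the encoding that produced the bytes.
from_utf8 = utf8_bytes.decode("utf-8")
from_gbk = gbk_bytes.decode("gbk")

print(from_utf8 == text, from_gbk == text)
```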
So once we do know the encoding, how do we convert the bytes to Unicode?
There are two ways.
The first is to call the decode method directly:
>>> byte_obj.decode('gbk')
'中文'
The second is to pass the bytes to the str constructor:
>>> str_obj = str(byte_obj, encoding='gbk')
>>> str_obj
'中文'
5. How to set file encoding
Python 2 reads source files as ASCII by default, so a Python 2 source file that contains Chinese characters raises an error:
SyntaxError: Non-ASCII character '\xe4' in file demo.py
The reason is that ASCII is too small a character set to represent Chinese characters.
Python 3 reads source files as UTF-8 by default, which saves a lot of trouble.
There are usually two solutions to this problem:
The first method
In Python 2, you can declare the encoding in a header comment on the first or second line of the file.
It can be written like this, which looks tidy:
# -*- coding: utf-8 -*-
But that is tedious to type. I usually use one of the following two forms instead:
# coding:utf-8
# coding=utf-8
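To check that these header forms are interchangeable, the standard library's tokenize.detect_encoding can be used in Python 3 to read the declaration from a source file's bytes. This is a sketch for illustration: declared_encoding is a hypothetical helper, and the tiny source snippets are made up for the example.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding Python 3's tokenizer detects for this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


body = b"print('hi')\n"
headers = [
    b"# -*- coding: utf-8 -*-\n",
    b"# coding:utf-8\n",
    b"# coding=utf-8\n",
]

# All three header styles declare the same encoding.
results = [declared_encoding(header + body) for header in headers]
print(results)

# A different declaration is picked up too; no declaration means the default.
print(declared_encoding(b"# coding: gbk\n" + body))
print(declared_encoding(body))
```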
The second method
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Here reload(sys) must run before sys.setdefaultencoding('utf-8'), because Python removes sys.setdefaultencoding from the sys module during startup, and reloading sys makes it available again. Note that this call changes the default encoding used for implicit conversions between str and unicode at run time; it does not change how the source file itself is read.
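For comparison, Python 3 needs none of this. The sketch below only inspects the sys module; the variable names are illustrative.

```python
import sys

# In Python 3 the default string encoding is UTF-8.
default_encoding = sys.getdefaultencoding()

# setdefaultencoding does not exist in Python 3, so there is nothing to reload.
has_setter = hasattr(sys, "setdefaultencoding")

print(default_encoding, has_setter)
```

In Python 3, conversions between str and bytes are always explicit, through encode and decode.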
That covers the basics of string encoding in Python. The best way to make it stick is to try the examples yourself.