Source code analysis of the Python built-in type str

2025-07-09 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article walks through the source code of Python's built-in str type. The material is simple and clear, so follow along step by step.

1 Unicode

The basic unit of computer storage is the byte, which consists of eight bits. Because English uses only 26 letters plus a handful of symbols, an English character fits in a single byte. Other languages, such as Chinese, Japanese, and Korean, have far more characters and must use multiple bytes per character.

As computing spread, encodings for non-Latin scripts kept developing, but they had two major limitations:

No multilingual support: an encoding scheme designed for one language cannot be used for another.

No uniform standard: Chinese alone, for example, has several encoding standards, such as GBK, GB2312, and GB18030.

Because the encodings differ, developers had to convert back and forth between them, which inevitably caused many errors. The Unicode standard was proposed to resolve this fragmentation. Unicode collects and encodes most of the world's writing systems so that computers can process text in a uniform way. It currently contains more than 140,000 characters and supports multiple languages by design. (The "uni" in Unicode is the root of "unity".)
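As a small illustration of the problem, the same character maps to different byte sequences under different encodings, while its Unicode code point stays fixed. The sketch below uses only Python's built-in codecs; the sample character is an arbitrary choice, not one taken from the article.

```python
# One character, one Unicode code point, several byte encodings.
ch = '中'

code_point = ord(ch)             # Unicode code point, independent of encoding
gbk_bytes = ch.encode('gbk')     # a Chinese national encoding
utf8_bytes = ch.encode('utf-8')  # a Unicode encoding

print(hex(code_point))  # 0x4e2d
print(gbk_bytes)        # b'\xd6\xd0'
print(utf8_bytes)       # b'\xe4\xb8\xad'

# Decoding the GBK bytes with an unrelated codec does not give back '中'.
wrong = gbk_bytes.decode('latin-1')
print(wrong == ch)      # False
```

The byte sequences only make sense together with the name of the encoding that produced them, which is the source of the conversion errors described above.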

2 Unicode in Python

2.1 Benefits of Unicode objects

Since Python 3, the str object is represented internally as Unicode, which is why it is called a Unicode object in the source code. The advantage is that the program's core logic works uniformly with Unicode; decoding and encoding are only needed at the input and output layers, which avoids most encoding problems.
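A minimal sketch of this pattern follows. The function name `shout` and the choice of GBK input and UTF-8 output are assumptions made for illustration; the point is only that bytes are decoded once on the way in and encoded once on the way out.

```python
def shout(raw: bytes, in_encoding: str, out_encoding: str) -> bytes:
    """Decode at the input edge, work on str, encode at the output edge."""
    text = raw.decode(in_encoding)      # bytes -> str (Unicode)
    result = text.upper() + '!'         # core logic sees only str
    return result.encode(out_encoding)  # str -> bytes


incoming = 'hello 中文'.encode('gbk')    # pretend this arrived from a GBK source
outgoing = shout(incoming, 'gbk', 'utf-8')
print(outgoing.decode('utf-8'))         # HELLO 中文!
```

The core logic in the middle never needs to know which encodings are used at the edges.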

2.2 Optimization of Unicode by Python

Problem: Unicode contains more than 140,000 characters, so a fixed-width representation needs 4 bytes per character (2 bytes are not enough, and 3-byte units are generally not used). English characters need only 1 byte each in ASCII, so storing them this way would quadruple the cost of the most frequently used characters.
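The fourfold cost can be checked with a quick calculation. The sketch below uses the `utf-32-le` codec as a stand-in for a fixed 4-byte representation (it writes no byte order mark); the sample text is arbitrary.

```python
text = 'hello world'

ascii_size = len(text.encode('ascii'))       # 1 byte per character
fixed4_size = len(text.encode('utf-32-le'))  # 4 bytes per character, no BOM

print(ascii_size, fixed4_size, fixed4_size // ascii_size)  # 11 44 4
```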

First, let's look at the differences in the size of different forms of str objects in Python:

>>> import sys
>>> sys.getsizeof('ab') - sys.getsizeof('a')
1
>>> sys.getsizeof('一二') - sys.getsizeof('一')
2
>>> sys.getsizeof('😄😄') - sys.getsizeof('😄')
4

Thus it can be seen that the Unicode object is optimized internally in Python: the underlying storage unit is selected based on the text content.

The underlying storage of Unicode objects is divided into three categories according to the range of Unicode code points of text characters:

PyUnicode_1BYTE_KIND: all characters have code points between U+0000 and U+00FF

PyUnicode_2BYTE_KIND: all characters have code points between U+0000 and U+FFFF, and at least one character has a code point greater than U+00FF

PyUnicode_4BYTE_KIND: all characters have code points between U+0000 and U+10FFFF, and at least one character has a code point greater than U+FFFF

The corresponding enumerations are as follows:

enum PyUnicode_Kind {
/* String contains only wstr byte characters.  This is only possible
   when the string was created with a legacy API and _PyUnicode_Ready()
   has not been called yet.  */
    PyUnicode_WCHAR_KIND = 0,
/* Return values of the PyUnicode_KIND() macro: */
    PyUnicode_1BYTE_KIND = 1,
    PyUnicode_2BYTE_KIND = 2,
    PyUnicode_4BYTE_KIND = 4
};

A different storage unit is selected for each category:

/* Py_UCS4 and Py_UCS2 are typedefs for the respective
   unicode representations. */
typedef uint32_t Py_UCS4;
typedef uint16_t Py_UCS2;
typedef uint8_t Py_UCS1;

The correspondence is as follows:

| Text type | Character storage unit | Character storage unit size (bytes) |
|---|---|---|
| PyUnicode_1BYTE_KIND | Py_UCS1 | 1 |
| PyUnicode_2BYTE_KIND | Py_UCS2 | 2 |
| PyUnicode_4BYTE_KIND | Py_UCS4 | 4 |
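The classification rule can be sketched in pure Python. The helper name `storage_kind` is hypothetical; it does not read the object's real `kind` field, it only applies the code point ranges listed above to the largest character in the text.

```python
def storage_kind(text: str) -> int:
    """Return the per-character storage size (1, 2 or 4 bytes) implied by
    the largest code point in text, following the three categories above."""
    max_code_point = max(map(ord, text), default=0)
    if max_code_point <= 0xFF:
        return 1   # PyUnicode_1BYTE_KIND
    if max_code_point <= 0xFFFF:
        return 2   # PyUnicode_2BYTE_KIND
    return 4       # PyUnicode_4BYTE_KIND


for sample in ('abc', 'caf\u00e9', 'abc\u4e2d', 'abc\U0001F600'):
    print(ascii(sample), storage_kind(sample))
```

Note that a single wide character raises the storage unit for the whole string, since all characters of one object share the same unit size.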

Because the internal storage layout of a Unicode object depends on the text type, the kind must be saved as a common field of the Unicode object. Python defines several flag bits internally as common fields of Unicode objects. (This article does not cover all of them; readers can explore the remaining ones on their own.)

interned: whether the object is maintained by the interned mechanism

kind: the type, used to distinguish the size of the underlying character storage unit

compact: the memory allocation scheme, that is, whether the object and its text buffer are allocated separately

ascii: whether the text is pure ASCII

A Unicode object is initialized by the PyUnicode_New function from the number of characters, size, and the largest character, maxchar. Based on maxchar, the function selects the most compact character storage unit and underlying structure for the object. (The source code is fairly long, so it is not listed here; the result is summarized in table form below.)

| maxchar | kind | ascii | Character storage unit size (bytes) | Underlying structure |
|---|---|---|---|---|
| maxchar < 128 | PyUnicode_1BYTE_KIND | 1 | 1 | PyASCIIObject |
| 128 <= maxchar < 256 | PyUnicode_1BYTE_KIND | 0 | 1 | PyCompactUnicodeObject |
| 256 <= maxchar < 65536 | PyUnicode_2BYTE_KIND | 0 | 2 | PyCompactUnicodeObject |
| 65536 <= maxchar <= MAX_UNICODE | PyUnicode_4BYTE_KIND | 0 | 4 | PyCompactUnicodeObject |

The interned mechanism is implemented in the source code by PyUnicode_InternInPlace:

/* ... s->ob_refcnt + (s->state ? 2 : 0) */
static PyObject *interned = NULL;

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;
    PyObject *t;
#ifdef Py_DEBUG
    assert(s != NULL);
    assert(_PyUnicode_CHECK(s));
#else
    if (s == NULL || !PyUnicode_Check(s))
        return;
#endif
    /* If it's a subclass, we don't really know what putting
       it in the interned dict might do. */
    if (!PyUnicode_CheckExact(s))
        return;
    if (PyUnicode_CHECK_INTERNED(s))
        return;
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear(); /* Don't leave an exception */
            return;
        }
    }
    Py_ALLOW_RECURSION
    t = PyDict_SetDefault(interned, s, s);
    Py_END_ALLOW_RECURSION
    if (t == NULL) {
        PyErr_Clear();
        return;
    }
    if (t != s) {
        Py_INCREF(t);
        Py_SETREF(*p, t);
        return;
    }
    /* The two references in interned are not counted by refcnt.
       The deallocator will take care of this */
    Py_REFCNT(s) -= 2;
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

As you can see, the function begins with some basic checks. The interesting parts are the PyDict_SetDefault() call and the Py_REFCNT(s) -= 2 statement: when s is added to the interned dictionary, it serves as both key and value (the author is not quite sure why), so the dictionary holds two references to s (see the source code of PyDict_SetDefault() for details). The function therefore subtracts 2 afterwards to keep the reference count of s correct.

Consider the following scenario:

>>> class User:
...     def __init__(self, name, age):
...         self.name = name
...         self.age = age
...
>>> user = User('Tom', 21)
>>> user.__dict__
{'name': 'Tom', 'age': 21}

Because an object's attributes are stored in a dict, every User object would have to hold its own str object 'name', which wastes a lot of memory. Since str is immutable, Python turns strings that are likely to be duplicated into singletons; this is the interned mechanism. Concretely, Python maintains a global dict internally that holds every str object for which interning is enabled. When such a string is needed later, a new object is created first; if an equal string is already in the dict, the existing one is used and the newly created object is reclaimed.
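From Python code, the same mechanism is reachable through the standard `sys.intern` function. The sketch below builds two equal strings at run time (so that they are not compile-time constants) and shows that interning maps both to one shared object; the sample strings are arbitrary.

```python
import sys

# Build two equal strings at run time so they are not compile-time constants.
parts = ['user', 'name']
a = '_'.join(parts)
b = '_'.join(parts)

print(a == b)                          # True: same content
print(sys.intern(a) is sys.intern(b))  # True: one shared object after interning
```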

Example:

The string 'abc' produced by different operations ends up as the same object:

>>> a = 'abc'
>>> b = 'ab' + 'c'
>>> id(a), id(b), a is b
(2752416949872, 2752416949872, True)