Code Analysis of the Python Built-in Type bytes, with Examples

2025-07-16 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains the Python built-in type bytes by analyzing its CPython source code alongside small examples. The content is kept simple, so follow along step by step.

1 The relationship between bytes and str

Strings in many languages are represented by an array of characters (or byte sequences), such as C:

char str[] = "Hello World!";

Since one byte has only 256 possible values, a single byte can represent at most 256 distinct characters. To cover larger character sets (such as Chinese characters), a character must be represented by several bytes, which is multi-byte encoding. However, a raw byte sequence carries no information about its encoding, so careless handling easily produces garbled text.

Python's solution is the Unicode object (that is, the str object). Unicode can represent characters from all kinds of languages, so a str does not need to care about encoding. For storage or network communication, however, a string object has to be serialized into a byte sequence. For this purpose Python provides a separate byte sequence object, bytes.

The relationship between str and bytes is shown in the figure:

A str object represents a string uniformly and does not need to care about encoding. Computers work with storage and network media through byte sequences, which are represented by bytes objects. To store or transmit a str object, it must be serialized into a byte sequence, and this serialization is the encoding process.
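As a small illustration of the str/bytes relationship described above, the following sketch encodes a string to bytes and decodes it again. The sample text and the choice of UTF-8 and Latin-1 are assumptions made for the example, not something taken from the original article.

```python
# A str holds Unicode characters; bytes holds the encoded byte sequence.
text = "Hello, 世界"

# Serializing (encoding) for storage or transmission.
data = text.encode("utf-8")

# Decoding with the same encoding restores the original string.
restored = data.decode("utf-8")

# Decoding with a different encoding does not fail here, but the
# result is garbled, because bytes carries no encoding information.
garbled = data.decode("latin-1")

print(len(text), len(data))   # character count vs. byte count
print(restored == text)
print(garbled == text)
```

The two Chinese characters take three bytes each in UTF-8, which is why the byte count is larger than the character count.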

2 Structure of the bytes object: PyBytesObject

C source code:

typedef struct {
    PyObject_VAR_HEAD
    Py_hash_t ob_shash;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     */
} PyBytesObject;

Source code analysis:

The character array ob_sval stores the bytes themselves, but its length is ob_size + 1 rather than ob_size. Python allocates one extra byte so that a '\0' can be stored at the end, which keeps the buffer compatible with C strings.

ob_shash: holds the hash value of the byte sequence. Computing the hash of a bytes object requires traversing its internal character array, which is relatively expensive, so Python caches the hash, trading space for time (a ubiquitous idea) to avoid recomputing it.
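The layout above can be observed indirectly from Python. The sketch below assumes CPython, where sys.getsizeof reports the fixed header (including the trailing '\0' byte) plus one byte per element. The absolute numbers depend on the build, so only the differences are examined.

```python
import sys

# Size of an empty bytes object: header fields plus the trailing '\0'.
empty_size = sys.getsizeof(b"")

# Each additional byte of content should add exactly one byte (CPython).
extra = {}
for n in (1, 3, 10):
    extra[n] = sys.getsizeof(b"x" * n) - empty_size

print(empty_size)
print(extra)
```

On CPython the extra size equals the number of content bytes, which matches an ob_sval array of ob_size + 1 elements.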

The figure is as follows:

3 Behavior of bytes objects

3.1 PyBytes_Type

C source code:

PyTypeObject PyBytes_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "bytes",
    PyBytesObject_SIZE,
    sizeof(char),
    // ...
    &bytes_as_number,           /* tp_as_number */
    &bytes_as_sequence,         /* tp_as_sequence */
    &bytes_as_mapping,          /* tp_as_mapping */
    (hashfunc)bytes_hash,       /* tp_hash */
    // ...
};

Numeric operations, bytes_as_number:

static PyNumberMethods bytes_as_number = {
    0,              /* nb_add */
    0,              /* nb_subtract */
    0,              /* nb_multiply */
    bytes_mod,      /* nb_remainder */
};

bytes_mod:

static PyObject *
bytes_mod(PyObject *self, PyObject *arg)
{
    if (!PyBytes_Check(self)) {
        Py_RETURN_NOTIMPLEMENTED;
    }
    return _PyBytes_FormatEx(PyBytes_AS_STRING(self), PyBytes_GET_SIZE(self),
                             arg, 0);
}

As you can see, the bytes object only uses the % operator for formatting, which is not a numeric operation in the real sense. (Strictly speaking this does not fit the original classification, which would call for a separate "format operation" category, but some flexibility is needed here.)

>>> b'msg: a=%d, b=%d' % (1, 2)
b'msg: a=1, b=2'
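A few more cases of % formatting on bytes, as a minimal sketch. The template strings and values are made up for illustration. The last case shows that a str argument is rejected for %s, since bytes formatting expects bytes-like data.

```python
# Integer conversions work as in str formatting.
formatted = b"%d items, hex %x" % (3, 255)

# %s accepts a bytes-like argument and inserts it unchanged.
embedded = b"payload=%s" % b"raw"

# A str argument is not bytes-like, so formatting raises TypeError.
template = b"%s"
try:
    template % "text"
    str_rejected = False
except TypeError:
    str_rejected = True

print(formatted)
print(embedded)
print(str_rejected)
```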

Sequence operations, bytes_as_sequence:

static PySequenceMethods bytes_as_sequence = {
    (lenfunc)bytes_length,          /* sq_length */
    (binaryfunc)bytes_concat,       /* sq_concat */
    (ssizeargfunc)bytes_repeat,     /* sq_repeat */
    (ssizeargfunc)bytes_item,       /* sq_item */
    0,                              /* sq_slice */
    0,                              /* sq_ass_item */
    0,                              /* sq_ass_slice */
    (objobjproc)bytes_contains      /* sq_contains */
};

bytes supports the following five sequence operations:

bytes_length: returns the length of the sequence

bytes_concat: merges two sequences into one

bytes_repeat: repeats the sequence a number of times

bytes_item: fetches the element at a given index

bytes_contains: tests membership
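The five operations listed above roughly correspond to the following Python expressions. This is a sketch: the comments name the slot that implements each behavior, but the interpreter's exact dispatch path is an assumption here (for example, the data[1] syntax may be routed through the mapping slot described next rather than sq_item).

```python
data = b"abc"

length = len(data)          # sq_length   -> bytes_length
merged = data + b"def"      # sq_concat   -> bytes_concat
repeated = data * 2         # sq_repeat   -> bytes_repeat
item = data[1]              # element access; an int, not a 1-byte bytes
has_sub = b"bc" in data     # sq_contains -> bytes_contains (subsequence)
has_int = 98 in data        # sq_contains also accepts an integer byte value

print(length, merged, repeated, item, has_sub, has_int)
```

Note that indexing returns an integer (98 is the byte value of "b"), while membership accepts either an integer or a bytes-like subsequence.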

Mapping operations, bytes_as_mapping:

static PyMappingMethods bytes_as_mapping = {
    (lenfunc)bytes_length,
    (binaryfunc)bytes_subscript,
    0,
};

You can see that bytes supports two mapping operations: getting the length, and subscripting, which covers slicing.
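A short sketch of subscripting from the Python side, assuming nothing beyond standard bytes behavior: an integer subscript yields an int, while a slice yields a new bytes object.

```python
data = b"abcdef"

by_index = data[1]      # integer subscript: a single byte value as int
by_slice = data[1:4]    # slice subscript: a bytes object
stepped = data[::2]     # slices may also carry a step
one_byte = data[1:2]    # a length-1 slice is still bytes, not int

print(by_index, by_slice, stepped, one_byte)
```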

3.2 bytes_as_sequence

This section looks more closely at the operations in bytes_as_sequence.

None of the operations in bytes_as_sequence is complex, but there is a "trap". We examine it through the bytes_concat operation. The C source code is as follows:

/* This is also used by PyBytes_Concat() */
static PyObject *
bytes_concat(PyObject *a, PyObject *b)
{
    Py_buffer va, vb;
    PyObject *result = NULL;

    va.len = -1;
    vb.len = -1;
    if (PyObject_GetBuffer(a, &va, PyBUF_SIMPLE) != 0 ||
        PyObject_GetBuffer(b, &vb, PyBUF_SIMPLE) != 0) {
        PyErr_Format(PyExc_TypeError, "can't concat %.100s to %.100s",
                     Py_TYPE(b)->tp_name, Py_TYPE(a)->tp_name);
        goto done;
    }

    /* Optimize end cases */
    if (va.len == 0 && PyBytes_CheckExact(b)) {
        result = b;
        Py_INCREF(result);
        goto done;
    }
    if (vb.len == 0 && PyBytes_CheckExact(a)) {
        result = a;
        Py_INCREF(result);
        goto done;
    }

    if (va.len > PY_SSIZE_T_MAX - vb.len) {
        PyErr_NoMemory();
        goto done;
    }

    result = PyBytes_FromStringAndSize(NULL, va.len + vb.len);
    if (result != NULL) {
        memcpy(PyBytes_AS_STRING(result), va.buf, va.len);
        memcpy(PyBytes_AS_STRING(result) + va.len, vb.buf, vb.len);
    }

  done:
    if (va.len != -1)
        PyBuffer_Release(&va);
    if (vb.len != -1)
        PyBuffer_Release(&vb);
    return result;
}

You can work through the bytes_concat source code yourself. Here it is shown as a diagram, mainly to illustrate the "trap". The figure is as follows:

Py_buffer provides a unified interface for operating object buffers to shield the internal differences of different types of objects.

bytes_concat copies the buffers of the two objects, one after the other, into a new bytes object.

The copying itself is straightforward, but it hides a problem: the data-copy trap.

Take merging three bytes objects as an example:

>>> a = b'abc'
>>> b = b'def'
>>> c = b'ghi'
>>> result = a + b + c
>>> result
b'abcdefghi'

In essence, this performs two merges:

>>> t = a + b
>>> result = t + c

During this process, the data of both a and b is copied twice, as shown below:

It is not hard to see that when n bytes objects are merged this way, the first two objects are each copied n-1 times and only the last object is copied once. On average, each object is copied about n/2 times. Therefore, the following code:

>>> result = b''
>>> for s in segments:
...     result += s

is very inefficient. We can use join() to optimize it:

>>> result = b''.join(segments)

join() is a built-in method of the bytes object that merges multiple bytes objects efficiently. It optimizes the data copying: it first traverses the objects to be merged and computes the total length, then creates the target object with that length, and finally traverses the objects again and copies their data one by one. This way each object is copied only once, which avoids the repeated-copy trap. (You can check the specific source code yourself.)
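To make the difference concrete, here is a rough sketch under stated assumptions. bytes_copied_by_plus is a simplified cost model of the += loop (it ignores the empty-operand shortcut in bytes_concat), and join_two_pass is a hypothetical pure-Python imitation of the two-pass strategy, not the actual C implementation. Both names are invented for this example.

```python
def bytes_copied_by_plus(segments):
    """Model the += loop: every step writes a complete new result."""
    copied = 0
    size = 0
    for s in segments:
        size += len(s)    # length of the newly created result
        copied += size    # the old result and s are both copied into it
    return copied


def join_two_pass(segments):
    """Imitate the join strategy: measure first, then copy each part once."""
    total = sum(len(s) for s in segments)   # pass 1: total length
    out = bytearray(total)                  # allocate the target once
    pos = 0
    for s in segments:                      # pass 2: one copy per segment
        out[pos:pos + len(s)] = s
        pos += len(s)
    # Converting to bytes copies once more; that is an artifact of doing
    # this in Python, where bytes is immutable.
    return bytes(out)


segments = [b"ab", b"cd", b"ef", b"gh"]
plus_cost = bytes_copied_by_plus(segments)
join_cost = sum(len(s) for s in segments)
joined = join_two_pass(segments)

print(plus_cost, join_cost, joined)
```

In this model the += loop moves 20 bytes to build an 8-byte result, while the two-pass approach moves each byte once. The gap grows quadratically with the number of segments.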

4 Character buffer pool

Like small integers, character objects (that is, single-byte bytes objects) are few in number but used very frequently, so trading space for time can noticeably improve execution efficiency. The source code of the character buffer pool is as follows:

static PyBytesObject *characters[UCHAR_MAX + 1];

Let's look at how the character buffer pool is used when a bytes object is created. PyBytes_FromStringAndSize() is a common interface for creating bytes objects. The source code is as follows:

PyObject *
PyBytes_FromStringAndSize(const char *str, Py_ssize_t size)
{
    PyBytesObject *op;
    if (size < 0) {
        PyErr_SetString(PyExc_SystemError,
            "Negative size passed to PyBytes_FromStringAndSize");
        return NULL;
    }
    if (size == 1 && str != NULL &&
        (op = characters[*str & UCHAR_MAX]) != NULL)
    {
#ifdef COUNT_ALLOCS
        one_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }

    op = (PyBytesObject *)_PyBytes_FromSize(size, 0);
    if (op == NULL)
        return NULL;
    if (str == NULL)
        return (PyObject *) op;

    memcpy(op->ob_sval, str, size);
    /* share short strings */
    if (size == 1) {
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

The key steps involved in maintaining the character buffer pool are as follows:

Lines 10-17 (the first size == 1 check): if the object being created is a single-byte object, its byte value is used as an index into the characters array to check whether the object is already in the pool. If so, the cached object is returned directly.

Lines 28-31 (the "share short strings" block): if the object being created is a single-byte object and was not found in the pool, it is stored in the corresponding slot of the character buffer pool.

Thus, when a Python program starts running, the character buffer pool is empty. As single-byte bytes objects are created, more and more objects accumulate in the pool. When the pool has cached the characters b'1', b'2', b'3', b'a', b'b', b'c', the internal structure is as follows:
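The effect of the pool can be probed from Python with identity checks. This is a sketch that relies on CPython implementation details: it assumes run-time slicing goes through PyBytes_FromStringAndSize, and the helper names are invented for this example. Other interpreters or versions may behave differently, so treat the results as an observation, not a guarantee.

```python
def first_and_last(data):
    # Two independent single-byte slices, created at run time.
    return data[:1], data[-1:]


def two_prefixes(data):
    # Two independent two-byte slices, created at run time.
    return data[:2], data[:2]


x, y = first_and_last(b"abca")     # both are b'a'
same_single = x is y               # expected True on CPython: pooled object

p, q = two_prefixes(b"abcabc")     # both are b'ab'
same_multi = p is q                # expected False: two separate objects

print(x == y, same_single)
print(p == q, same_multi)
```

The slices are taken inside functions so that the values are computed when the code runs, which keeps compile-time constant handling out of the picture.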

Example:

Note: you may get different results in IDLE and in PyCharm, as mentioned in earlier posts. From what I could find, the two environments run the code differently. Here I show the disassembly of the code object for the code as run in PyCharm. My understanding of IDLE is still limited, so I will add the details when I have the chance. For now the goal is to understand the concept of the character buffer pool, and some knowledge of bytecode also helps. The following is the result of the PyCharm run:

You can see this blog for an explanation of the following operations:

Python source code learning notes: Python program execution process and bytecode

Example 1:

Let's look at the disassembly. (I have omitted part of the file path below; use the correct path when you try this yourself.)

>>> text = open('D:\\...\\test2.py').read()
>>> result = compile(text, 'D:\\...\\test2.py', 'exec')
>>> import dis
>>> dis.dis(result)
  1           0 LOAD_CONST               0 (b'a')
              2 STORE_NAME               0 (a)

  2           4 LOAD_CONST               0 (b'a')
              6 STORE_NAME               1 (b)

  3           8 LOAD_NAME                2 (print)
             10 LOAD_NAME                0 (a)
             12 LOAD_NAME                1 (b)
             14 IS_OP                    0
             16 CALL_FUNCTION            1
             18 POP_TOP
             20 LOAD_CONST               1 (None)
             22 RETURN_VALUE

You can clearly see that the two LOAD_CONST instructions (at offsets 0 and 4) both load the constant b'a' at index 0, so a and b refer to the same object. Let's print that constant:

>>> result.co_consts[0]
b'a'

Example 2:

To confirm that only single-byte bytes objects are cached, I tried a multi-byte bytes object, again in the PyCharm environment:

The result is quite unexpected: the two multi-byte bytes objects are also the same object. To find out why, let's look at the disassembly of the code object:

>>> text = open('D:\\...\\test3.py').read()
>>> result = compile(text, 'D:\\...\\test3.py', 'exec')
>>> import dis
>>> dis.dis(result)
  1           0 LOAD_CONST               0 (b'abc')
              2 STORE_NAME               0 (a)

  2           4 LOAD_CONST               0 (b'abc')
              6 STORE_NAME               1 (b)

  3           8 LOAD_NAME                2 (print)
             10 LOAD_NAME                0 (a)
             12 LOAD_NAME                1 (b)
             14 IS_OP                    0
             16 CALL_FUNCTION            1
             18 POP_TOP
             20 LOAD_CONST               1 (None)
             22 RETURN_VALUE
>>> result.co_consts[0]
b'abc'

As you can see, the disassembly is no different from the single-byte case: both names load the same entry of the code object's constant table. The identity observed here therefore comes from the compiler storing the equal literals as one constant, not from the character buffer pool.
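The experiment can be repeated without a source file by compiling a string directly. This is a sketch: the source text and the "<example>" filename are placeholders chosen for illustration, and the behavior shown is that of CPython's compiler.

```python
# Same shape as the test scripts above, but kept in a string.
source = "a = b'abc'\nb = b'abc'\nsame = a is b\n"
code = compile(source, "<example>", "exec")

# Count how many times b'abc' appears in the constant table.
abc_consts = [c for c in code.co_consts if isinstance(c, bytes) and c == b"abc"]

# Run the compiled code and read back the identity check.
namespace = {}
exec(code, namespace)

print(len(abc_consts))
print(namespace["same"])
```

Both literals end up as a single entry in co_consts, so a is b holds within one compiled unit regardless of the length of the bytes literal.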

Thank you for reading. That concludes this analysis of the Python built-in type bytes. The specific behavior is best verified by experimenting yourself.
