This article explains how strings are stored in Python, walking through the actual behavior with simple, practical examples. I hope it helps you understand what Python does under the hood.
The three internal encodings of Unicode strings
Starting with Python 3, strings are Unicode. Depending on the character, a Unicode character can occupy up to 4 bytes, which is sometimes expensive from a memory point of view.
To reduce memory consumption and improve performance, three encodings are used internally in Python to represent Unicode:
Latin-1 encoding: one byte per character
UCS2 encoding: two bytes per character
UCS4 encoding: four bytes per character
In Python code, all strings behave the same way, and most of the time we don't notice any difference. When dealing with large text, however, the difference can become significant, even surprising.
To observe the differences in the internal representation, we use the sys.getsizeof function to check how many bytes an object occupies.
import sys

print(sys.getsizeof("a"))   # 50
print(sys.getsizeof("憨"))  # 76
print(sys.getsizeof("💻"))  # 80
Each of these strings contains a single character, yet they occupy different amounts of memory, because Python picks a different encoding depending on the character, and the per-character size differs accordingly.
But note that every Python string also carries 49-80 bytes of fixed overhead, which stores bookkeeping information such as the object header, hash, length, byte length, and encoding kind.
import sys

# ASCII characters occupy 1 byte each; the encoding here is clearly Latin-1
print(sys.getsizeof("ab") - sys.getsizeof("a"))    # 1
# characters such as Chinese and Japanese occupy 2 bytes each; this is UCS2
print(sys.getsizeof("憨憨") - sys.getsizeof("憨"))  # 2
# characters like emoji occupy 4 bytes each; this is UCS4
print(sys.getsizeof("💻💻") - sys.getsizeof("💻"))  # 4
Under different encodings, the fixed part of the underlying structure instance also occupies a different amount of memory.
With Latin-1, the fixed part of the structure instance occupies 49 bytes; with UCS2, 74 bytes; with UCS4, 76 bytes. The total size of a string is therefore: fixed overhead + number of characters * bytes per character.
import sys

# an empty string uses Latin-1, the most memory-efficient encoding,
# so it occupies just the 49-byte overhead
print(sys.getsizeof(""))        # 49
# "憨" uses UCS2; subtracting the character's 2 bytes leaves the 74-byte overhead
print(sys.getsizeof("憨") - 2)  # 74
# "💻" uses UCS4; subtracting the character's 4 bytes leaves the 76-byte overhead
print(sys.getsizeof("💻") - 4)  # 76

Why not use utf-8 encoding?
The three encodings above are what Python uses under the hood. But Unicode also has the utf-8 encoding, which is far more widespread, so why doesn't Python use it?
Let's start with a question. Python lets us fetch the character at a given position in a string by index, and the index counts characters, not bytes (more on this below). For example, s[2] fetches the third character of the string s.
s = "古明地觉"
print(s[2])  # 地
So here is the question: indexing into a string has time complexity O(1), so how does Python locate the specified character instantly through the index?
Obviously through pointer arithmetic: multiply the index by the number of bytes each character occupies to get the offset, then advance that many bytes from the start of the buffer. This locates the specified character while keeping the time complexity at O(1).
But this has a prerequisite: every character in the string must occupy the same number of bytes. If character sizes vary, say 1 byte here and 3 bytes there, simple pointer offsetting no longer works; to locate a character accurately you would have to scan the characters one by one from the start, and the time complexity becomes O(n) instead of O(1).
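To make the pointer-offset idea concrete, here is a minimal sketch that simulates fixed-width indexing over a byte buffer. It uses utf-32-le (a fixed 4 bytes per character) as a stand-in for a fixed-width internal layout; char_at is an illustrative helper, not a real CPython API.

buf = "古明地觉".encode("utf-32-le")  # fixed width: every character is 4 bytes
width = 4

def char_at(index):
    start = index * width  # O(1): offset = index * bytes per character
    return buf[start:start + width].decode("utf-32-le")

print(char_at(2))  # 地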
Let's take Go as an example, since Go strings use utf-8 encoding by default:
package main

import "fmt"

func main() {
	s := "古明地觉"
	fmt.Println(s[2])         // 164
	fmt.Println(string(s[2])) // ¤
}
We are surprised to see that the output is not what we wanted. Because Go uses utf-8 underneath, different characters may occupy different numbers of bytes; yet Go's index access is still O(1), so the index counts bytes, not characters, and s[2] fetches a single byte rather than a character.
So s[2] in Go refers to the third byte, not the third character. Under utf-8 a Chinese character occupies 3 bytes, so s[2] is the third byte of "古", and when printed, the value of that byte is 164.
s = "古明地觉"
print(s.encode("utf-8")[2])  # 164
This is the drawback of utf-8: although it saves memory in storage, it does not let us locate a character accurately with O(1) time complexity.
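To see why, here is a hypothetical O(n) lookup over utf-8 bytes: you must walk from the start, because each character's width is only revealed by its leading byte. utf8_width and utf8_char_at are purely illustrative helpers, not how any real runtime indexes strings.

def utf8_width(lead):
    # the leading byte of a utf-8 sequence tells us the character's total width
    if lead < 0x80:
        return 1  # ASCII
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4

def utf8_char_at(data, index):
    pos = 0
    for _ in range(index):  # O(n): skip over `index` whole characters
        pos += utf8_width(data[pos])
    n = utf8_width(data[pos])
    return data[pos:pos + n].decode("utf-8")

s = "hello古明地觉"
print(utf8_char_at(s.encode("utf-8"), 6))  # 明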
Which of Latin-1, UCS2, and UCS4 is used?
We said that Python uses three encodings to represent Unicode, whose characters occupy 1, 2, and 4 bytes respectively.
When Python creates a string, it scans the characters and first tries Latin-1, the encoding with the smallest footprint. Latin-1's range is limited, so if it meets a character that cannot be stored, it switches to UCS2 and continues scanning. If it then meets a character that even UCS2 cannot hold (two bytes can only represent code points up to 65535), it switches once more, to UCS4. Four bytes per character is always enough.
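Here is a sketch of that scan-and-upgrade decision; pick_kind is an illustrative stand-in for what CPython actually does in C when the string object is created.

def pick_kind(s):
    highest = max((ord(c) for c in s), default=0)  # the widest code point decides
    if highest < 256:
        return "Latin-1 (1 byte per character)"
    if highest < 65536:
        return "UCS2 (2 bytes per character)"
    return "UCS4 (4 bytes per character)"

print(pick_kind("hello"))      # Latin-1 (1 byte per character)
print(pick_kind("hello憨"))    # UCS2 (2 bytes per character)
print(pick_kind("hello憨💻"))  # UCS4 (4 bytes per character)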
Once the encoding is upgraded, every character in the string uses the same encoding; there is no variable-length mixing. For example, the string "hello古明地觉" will definitely use UCS2 throughout. There is no arrangement where "hello" uses Latin-1 while "古明地觉" uses UCS2, because a string can only have one encoding.
When indexing, the index is multiplied by the number of bytes each character occupies, which jumps to the exact position, because every character in the string occupies the same number of bytes; then that many bytes are read. For example, with UCS2, locating a character reads two bytes to produce the complete character.
import sys

# all ASCII characters, so Latin-1 can store them;
# the fixed part of the structure instance occupies 49 bytes
s1 = "hello"  # 5 characters at 1 byte each, 49 + 5 = 54 bytes in total
print(sys.getsizeof(s1))  # 54

# now Latin-1 certainly cannot store "憨", so UCS2 is used;
# the fixed part of the structure instance now occupies 74 bytes.
# Don't forget: the English characters are now also UCS2, 2 bytes each
s2 = "hello憨"  # 6 characters, 74 + 6 * 2 = 86
print(sys.getsizeof(s2))  # 86

# an emoji is beyond even UCS2; only UCS4 can store it,
# and the fixed part of the structure instance occupies 76 bytes
s3 = "hello憨💻"  # all 7 characters occupy 4 bytes each, 76 + 7 * 4 = 104
print(sys.getsizeof(s3))  # 104
Let's look at one more example that illustrates this behavior even more vividly.
import sys

s1 = "a" * 1000
s2 = "a" * 1000 + "💻"
print(sys.getsizeof(s1), sys.getsizeof(s2))  # 1049 4080
The only difference between s2 and s1 is one extra character, yet s2 occupies 3031 more bytes than s1. Those 3031 bytes obviously cannot be the size of that one character; no character occupies over three thousand bytes.
Even so, that character is the culprit, and the first 1000 characters are its accomplices. As we said, Python chooses the encoding based on the whole string: s1 is all ASCII characters, so Latin-1 suffices and each character occupies a single byte, giving 49 + 1000 = 1049 bytes.
For s2, however, Python finds that Latin-1 could store the first 1000 characters but not the last one, so it must use UCS4. Since all characters of a string share one encoding, and index lookup has to stay O(1), the characters that would fit in one byte must also be stored in 4 bytes. This is Python's design strategy.
As we said, with UCS4 the fixed part of the structure occupies 76 bytes, so the size of s2 is 76 + 1001 * 4 = 4080.
print(sys.getsizeof("爷的青春回来了"))  # 88
print(sys.getsizeof("爷的青春回来💻"))  # 104
The number of characters is the same, but the memory footprint differs: the first string fits in UCS2 (74 + 7 * 2 = 88), while the emoji forces the second into UCS4 (76 + 7 * 4 = 104). I'm sure you can work through the reasoning yourself.
So if all characters in a string are ASCII, it is encoded with 1-byte Latin-1. Latin-1 can represent the first 256 Unicode characters and covers many Latin-script languages, such as English, Swedish, Italian, and Norwegian. But it cannot store non-Latin scripts such as Chinese, Japanese, Hebrew, or Cyrillic, because their code points (numeric indexes) fall outside the 1-byte range (0-255).
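A quick demonstration of that boundary (the characters here are just examples):

print("é".encode("latin-1"))  # b'\xe9': code point 233 fits in one byte
try:
    "憨".encode("latin-1")
except UnicodeEncodeError as e:
    print(e)  # 'latin-1' codec can't encode character '\u61a8' ...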
Most popular natural languages fit in the 2-byte (UCS2) encoding, but the 4-byte (UCS4) encoding is needed when a string contains special symbols, emoji, or rare scripts. The Unicode standard has nearly 300 blocks (ranges); the blocks that require 4 bytes sit above code point 0xFFFF.
Suppose we have 10 GB of ASCII text to load into memory. If a single emoji is inserted into that text, the size of the string increases fourfold. This is a huge difference you may run into in practice, for example when working on NLP problems.
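Here is a scaled-down illustration of that blow-up, using 10 million characters instead of 10 GB; keeping such text as utf-8 bytes and decoding pieces on demand is one way to sidestep it.

import sys

text = "a" * 10_000_000
print(sys.getsizeof(text))         # ~10 MB: Latin-1, one byte per character
print(sys.getsizeof(text + "💻"))  # ~40 MB: one emoji forces UCS4 for every character
print(sys.getsizeof((text + "💻").encode("utf-8")))  # ~10 MB again when kept as utf-8 bytes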
print(ord("a"))   # 97
print(ord("憨"))  # 25000
print(ord("💻"))  # 128187
The most famous and popular Unicode encoding is utf-8, yet Python does not use it internally, preferring Latin-1, UCS2, and UCS4. The reason, as explained above, is that Python's string indexing is based on characters, not bytes.
When a string is stored in utf-8, each character takes however many bytes it needs. This is very storage-efficient, but it has an obvious drawback: because characters may have different byte lengths, it is impossible to jump to a character instantly by index; an accurate lookup requires scanning the characters one by one.
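The varying widths are easy to see (again, the characters are just examples):

for ch in "a憨💻":
    print(ch, len(ch.encode("utf-8")))  # a: 1 byte, 憨: 3 bytes, 💻: 4 bytes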
Suppose you perform even a simple operation like s[5] on a utf-8 encoded string: the runtime would have to scan character by character until it reaches the one you want, which is inefficient.
With a fixed-length encoding there is no such problem. So when "hello", which alone would be stored in Latin-1, is combined with "古明地觉", which needs UCS2, every character is widened to the larger size and becomes 2 bytes.
This way, locating a character only requires index * 2 to compute the byte offset, followed by a jump of that many bytes. If "hello" stayed at one byte per character while the Chinese characters took 2, the index alone could not pinpoint a character: with mixed sizes the whole string would have to be scanned, which is inefficient. That is why Python uses this scheme internally instead of utf-8.
So in Go, if you want behavior like Python's, you need to do this:
package main

import "fmt"

func main() {
	s := "hello古明地觉"
	// the length is 17 because utf-8 is used:
	// 5 bytes for "hello" plus 3 bytes for each of the 4 Chinese characters
	fmt.Println(s, len(s)) // hello古明地觉 17

	// to behave like Python, Go provides rune, which is equivalent to int32;
	// every character then occupies 4 bytes, so the length becomes 9
	r := []rune(s)
	fmt.Println(string(r), len(r)) // hello古明地觉 9

	// the printed content is the same, but each character is now stored in 4 bytes;
	// indexing offsets by 5 * 4 bytes and then reads 4 bytes,
	// just like Python, because every character occupies 4 bytes
	fmt.Println(string(r[5])) // 古
}
So with utf-8, the characters of a Unicode string may occupy different numbers of bytes, and there is no way to achieve the index-lookup behavior that Python strings have today. That is why Python does not use utf-8 internally.
Python's approach is to make all characters of a string occupy the same number of bytes: first try Latin-1, which uses the least memory, and fall back to UCS2 and then UCS4 when necessary. In short, it guarantees a uniform character width. The reason has been analyzed thoroughly above: indexing, slicing, and length are all defined in terms of characters, which matches how humans think.
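As a closing check, indexing, slicing, and len() all count characters, whichever internal kind Python chose:

s = "hello古明地觉"
print(len(s))  # 9: characters, not bytes
print(s[5])    # 古
print(s[5:9])  # 古明地觉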
That's all for "how to store strings in Python". Thank you for reading.