How can we solve the garbled Chinese text produced when MySQL's JSON type is used with utf8 encoding? This article analyzes the problem in detail and offers a practical solution, in the hope of helping readers who run into the same issue find a simple, workable fix.
Let's start with character encoding. While studying it, I found that the way MySQL's JSON type handles encoding, and how that leads to garbled Chinese data, is worth digging into further. Let's analyze it step by step; if anything below is wrong, please point it out.
A reader pointed out in the comments that any Unicode character outside the Basic Multilingual Plane cannot be stored in MySQL's utf8 character set. That includes emoji (special Unicode characters common on iOS and Android phones), many less common Chinese characters, and any newly added Unicode characters. This is the shortcoming of MySQL's utf8.
First, what is Unicode? Unicode is a universal character set: a standard that defines, in one place, all the characters needed to write most of the languages in active use on computers. Its goal is to be, and to a large extent it is, a superset of all other character sets that have been encoded. Text in a computer or on a network is made up of characters representing letters, punctuation, and other symbols. Different organizations have collected different character sets and created encodings for them: one character set may cover only Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), and others are designed for one of the many remaining languages of the world. However, we cannot guarantee that an application will support every encoding, nor that a given encoding will meet all the requirements for representing a given language. In addition, it is usually impossible to mix different encodings in the same web page or database, so supporting multilingual pages with "traditional" encodings is often very difficult.
The Unicode Consortium provides a single, large character set that aims to include the characters of every writing system in the world, including ancient scripts such as cuneiform, Gothic, and Egyptian hieroglyphs. This unified character encoding underpins the Web and modern operating system architectures and is supported by all major web browsers and applications. Unicode characters are currently divided into 17 groups called planes, numbered 0 through 16; each plane has 65,536 code points, from 0x0000 to 0xFFFF, and plane 0 is the Basic Multilingual Plane (BMP). Each script, or its supplement or extension, is assigned a code range within the BMP, for example Basic Latin [0000-007F] and Latin-1 Supplement [0080-00FF]. Having said all that, we only need to remember one thing: the first 65,536 code points of the Unicode character set make up the Basic Multilingual Plane (BMP), which contains most commonly used characters; the Unicode character set also reserves space for roughly a million additional code points, which we call supplementary characters.
We need to distinguish three concepts: a character set, a coded character set, and an encoding. A character set is simply the set of characters that may be used for a particular purpose, for example the characters required to support Western European languages; by itself it says nothing about how a computer represents them. A coded character set is a character set in which each character has been assigned a unique number; it is sometimes called a code page. Each unit of a coded character set is a code point, and the code point value gives the character's position within the coded character set. For example, the code point of the letter á in the Unicode coded character set is 225 in decimal, or 0xE1 in hexadecimal. A character encoding describes how the coded character set is mapped to the bytes manipulated by the computer, and one character set can have multiple encodings. Many character encoding standards, such as those in the ISO 8859 series, use a single byte per character, and the encoding is a direct mapping of the character's position in the coded character set. For example, the letter A sits at position 65 (counting from zero) in the ISO 8859-1 coded character set and is encoded on the computer as a byte with the value 65; for ISO 8859-1, this will never change.
For Unicode, however, things are not that simple. Although the code point of the letter á in the Unicode coded character set is always 225 (decimal), in UTF-8 it is represented by two bytes on the computer. In other words, there is no simple one-to-one mapping between a code point value and its encoded bytes, and the same character can be encoded in more than one way in Unicode: the letter á may take two bytes in one encoding and four bytes in another. The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.
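To make the distinction between code point and encoded bytes concrete, here is a minimal sketch you can run in any MySQL client, assuming the connection character set is utf8mb4; it shows the same character á in two different encoding forms:

-- The code point stays 0xE1 (225), visible via the fixed-width UTF-32 form,
-- while the UTF-8 form of the same character occupies two bytes.
SELECT HEX(CONVERT('á' USING utf32))      AS utf32_bytes,     -- 000000E1
       HEX(CONVERT('á' USING utf8mb4))    AS utf8_bytes,      -- C3A1
       LENGTH(CONVERT('á' USING utf8mb4)) AS utf8_byte_count; -- 2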
UTF-8 uses 1 byte for characters in the ASCII set, 2 bytes for characters in several other alphabetic blocks, 3 bytes for the rest of the BMP, and 4 bytes for supplementary characters. UTF-16 uses 2 bytes for any BMP character and 4 bytes for supplementary characters. UTF-32 uses 4 bytes for all characters. Commonly used characters sit at code points within the Basic Multilingual Plane. As an example, let's encode the character 好 ("good") in UTF-8 and look at its byte values and byte count, as follows:
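A minimal sketch of that check in MySQL itself (again assuming a utf8mb4 connection):

-- 好 (U+597D) lies inside the BMP but outside the ASCII/Latin blocks,
-- so UTF-8 needs 3 bytes for it.
SELECT HEX(CONVERT('好' USING utf8mb4))    AS utf8_bytes,  -- E5A5BD
       LENGTH(CONVERT('好' USING utf8mb4)) AS byte_count;  -- 3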
Now that we have a rough understanding of character encoding, let's return to the question from the previous section: I stored my name as JSON in the database, but the data came back garbled because the table was encoded as utf8, and the problem went away once the table encoding was changed to utf8mb4. So why not utf8? By the definition above, UTF-8 can use up to 4 bytes per character and supports supplementary characters, so MySQL's utf8 does not implement the standard UTF-8 encoding; in other words, it implements it only partially. MySQL's utf8 is also known as utf8mb3: each character takes at most 3 bytes, so it covers BMP characters but not supplementary characters. Both my surname and my given name are common BMP characters of 3 bytes each, so the garbling is not caused by the characters themselves. But when the column type is JSON, does reading Chinese really come back garbled? The MySQL version I used above was the 5.7 series; a sketch of the setup, and of the same test on 8.0, follows the next paragraph.
Next, we run the same test on MySQL 8.0 and find that the text is no longer garbled; the class creation and table configuration used for the test boil down to the setup sketched below.
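The original post configured this through application-side class and table mapping code; a minimal SQL sketch of an equivalent setup, with illustrative table and column names (person, name) and a placeholder name instead of the author's data, might look like this:

-- Table declared with the utf8 (utf8mb3) character set, name stored as JSON.
CREATE TABLE person (
    id   INT PRIMARY KEY AUTO_INCREMENT,
    name JSON
) ENGINE = InnoDB DEFAULT CHARSET = utf8;

-- '张' / '三' are placeholder Chinese characters, not the author's actual name.
INSERT INTO person (name) VALUES (JSON_OBJECT('last', '张', 'first', '三'));

-- On MySQL 5.7 this read-back is where the garbled text appeared;
-- on MySQL 8.0 the Chinese comes back intact.
SELECT JSON_UNQUOTE(JSON_EXTRACT(name, '$.last'))  AS last_name,
       JSON_UNQUOTE(JSON_EXTRACT(name, '$.first')) AS first_name
FROM person;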
With the rise of mobile, emoji appeared, so starting with MySQL 5.5.3 the utf8mb4 character set was introduced, which allows up to 4 bytes per character and supports supplementary characters. For BMP characters, utf8 (utf8mb3) and utf8mb4 have identical storage characteristics: the same code values, the same encoding, the same length. For supplementary characters, utf8 (utf8mb3) cannot store the character at all, while utf8mb4 needs 4 bytes; because utf8 (utf8mb3) cannot store them, a utf8 (utf8mb3) column never contains supplementary characters. Next, with the table still configured as utf8, we insert an emoji into the JSON column and find that this also works. We can check the byte count of individual characters, for example 4 bytes for the laugh-cry emoji:
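As a sketch (again with the illustrative names from above, and assuming a utf8mb4 connection so the client can transmit the emoji at all):

-- 😂 (U+1F602) lies outside the BMP, so utf8mb4 needs 4 bytes for it,
-- while utf8mb3 cannot store it at all.
SELECT HEX(CONVERT('😂' USING utf8mb4))    AS utf8mb4_bytes, -- F09F9882
       LENGTH(CONVERT('😂' USING utf8mb4)) AS byte_count;    -- 4

-- Inserting the emoji into the JSON column of the utf8 table from the
-- earlier sketch also succeeds, as described above.
INSERT INTO person (name) VALUES (JSON_OBJECT('nickname', '😂'));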
In fact, a bug was reported long ago about garbled data read back from JSON columns; for details see https://bugs.mysql.com/bug.php?id=81677. There has been no official reply, but at least the tests above show that storing emoji works even under utf8. Now let's look at how the 8.0 documentation describes the utf8 character set; see https://dev.mysql.com/doc/refman/8.0/en/charset-unicode.html. The documentation still describes utf8 as storing at most 3 bytes per character, and adds the following note:
Note: the utf8 (utf8mb3) character set is deprecated and will be removed in a future MySQL release. Please use utf8mb4 instead. Although utf8 is currently an alias for utf8mb3, at some point utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.
So now we understand: although the utf8 character set in version 8.0 still stores at most 3 bytes per character, MySQL is moving toward making utf8 a reference to utf8mb4. With that in mind, it is not hard to see why, even when the table is declared with utf8 encoding, storing emoji and less commonly used characters in a JSON column works without problems.
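For completeness, one common way to apply the advice above to an existing table (using the illustrative person table from the earlier sketches) is to convert it to utf8mb4 explicitly; the server and client defaults can be switched as well:

-- Convert the table and its existing character columns to utf8mb4.
ALTER TABLE person CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- Optional my.cnf settings so new connections and objects default to utf8mb4:
--   [mysqld]
--   character-set-server = utf8mb4
--   [client]
--   default-character-set = utf8mb4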
That is the answer to the question of how to solve garbled Chinese text caused by MySQL's JSON type and utf8 encoding. I hope the content above has been helpful.