
The Causes of Garbled Characters in MySQL and Their Solutions


This article explains the causes of garbled characters (mojibake) in MySQL and how to resolve them. The methods described are simple, fast, and practical; interested readers may wish to follow along.

The causes of garbled characters in MySQL

To understand why garbled characters appear, we first need to understand which links in the chain encode or decode data: from the request the client initiates, to the data stored in MySQL, and back out again the next time the client reads it from the table. To explain the process more clearly, the author made two flow charts, corresponding to the write stage and the read stage.

The encoding conversions on the way into MySQL

There are three encoding/decoding steps (red arrows) in the figure above: the client encodes the input, MySQL Server decodes it, and the server converts from the client encoding to the table encoding. Here, Terminal can be a Bash shell, a web page, or an app; in this article we assume Bash is our terminal, i.e. the client's input and display interface. The behavior of each box in the figure is as follows (a way to inspect the settings involved is sketched after the list):

Characters are typed in the terminal using an input method

The terminal converts the input into a binary stream according to its character encoding

The binary stream is transmitted to MySQL Server through the MySQL client

The server decodes it using character-set-client

The server checks whether the character-set-client charset matches that of the target table

If they differ, the data is converted from the client charset to the table charset

The converted binary stream is written to the table file
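The settings that drive the decoding and conversion steps above can be inspected directly. A minimal sketch (the table name charset_test_utf8 is taken from an example later in this article; any table works):

mysql> SHOW VARIABLES LIKE 'character_set%';
-- character_set_client / character_set_connection / character_set_results are the
-- session-level settings; SET NAMES changes all three at once
mysql> SHOW CREATE TABLE charset_test_utf8\G
-- the DEFAULT CHARSET clause at the end is the table encoding used in the last conversion step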

The encoding conversions when fetching data from a MySQL table

There are likewise three encoding/decoding steps (red arrows) in the figure above. They correspond to: MySQL Server decoding the stored data using the table's encoding, the server converting and encoding it with the character-set-client encoding, and the client decoding the result for display. (A small demonstration of the server-side conversion follows the list.) The steps are:

A binary data stream is read from the table file

The server decodes it using the table's character set

The data is converted to the character-set-client encoding

The result is encoded into a binary stream using character-set-client

The server transmits the stream over the network to the remote client

The client displays the query result using the character encoding configured for bash
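The server-side conversion step can be observed directly with CONVERT(), which re-encodes a string into a given character set. A small sketch using only literal values (assuming a UTF-8 terminal and SET NAMES utf8 so the literal is read correctly):

mysql> SELECT HEX(CONVERT('中文' USING utf8)) AS utf8_bytes, HEX(CONVERT('中文' USING gbk)) AS gbk_bytes;
-- expected: utf8_bytes = E4B8ADE69687, gbk_bytes = D6D0CEC4
-- the same two characters have different byte streams depending on the target charset,
-- which is why a conversion must happen before the result is sent to the client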

The two causes of garbled characters

1. The encodings at corresponding steps differ between writing and reading

Obviously this will produce garbled output. Number the character sets used in the three encode/decode steps of the write stage C1, C2, C3 (left to right in the first figure), and those of the read stage C1', C2', C3'. If bash used UTF-8 (C1) when the data was written, but we read it back from a Windows terminal whose default is GBK (C1'), the result is almost certainly garbled. Likewise, if set names utf8 (C2) was in effect when writing and set names gbk (C2') when reading, the result is guaranteed to be garbled. (A small demonstration of this first cause is sketched below.)
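A minimal sketch of cause 1, assuming a UTF-8 terminal throughout (the table name t_cause1 is made up for illustration): the table and the write-side settings are consistent, and only the read-side client charset differs.

mysql> SET NAMES utf8;
mysql> CREATE TABLE t_cause1 (c VARCHAR(50)) CHARSET = utf8;
mysql> INSERT INTO t_cause1 VALUES ('中文');
mysql> SET NAMES gbk;            -- read back as if from a GBK client
mysql> SELECT c FROM t_cause1;   -- the server correctly converts UTF-8 to GBK,
                                 -- but a UTF-8 terminal renders the GBK bytes as mojibake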

2. The encodings of the three steps within a single stage are inconsistent

In other words, if two or more of the three steps in either direction of the figures above use different encodings, encoding or decoding errors can occur. If the conversion between the two character sets involved is not lossless (described in more detail below), garbling is guaranteed. For example, if our shell uses UTF-8, MySQL's character-set-client is configured as GBK, and the table is created with charset=utf8, the output will without question be garbled.

Here, let's briefly demonstrate this situation.

master [localhost] {msandbox} (test) > create table charset_test_utf8 (id int primary key auto_increment, char_col varchar(50)) charset = utf8;
Query OK, 0 rows affected (0.04 sec)

master [localhost] {msandbox} (test) > set names gbk;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > insert into charset_test_utf8 (char_col) values ('中文');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (test) > show warnings;
+---------+------+---------------------------------------------------------------------------+
| Level   | Code | Message                                                                   |
+---------+------+---------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xAD\xE6\x96\x87' for column 'char_col' at row 1 |
+---------+------+---------------------------------------------------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > select id, hex(char_col), char_col from charset_test_utf8;
+----+----------------+----------+
| id | hex(char_col)  | char_col |
+----+----------------+----------+
|  1 | E6B6933FE69E83 | ?        |
+----+----------------+----------+
1 row in set (0.01 sec)

On encoding and decoding inside MySQL

Since data is transferred between systems as binary streams anyway, why not simply write the binary stream straight into the table file? Why encode and decode twice along the way?

The encoding/decoding between Client and Server is needed because MySQL must perform lexical and syntactic parsing on the incoming binary stream. Without decoding and validating it against a character set, the server could not even tell whether a given stream is an insert or an update.

The encoding/decoding between File and Engine is needed so that character boundaries in the binary stream can be identified. A simple example: we want the first two characters of a column, and execute a statement like select left(col, 2) from table. The storage engine reads the column value E4B8ADE69687 from the file. If this value is split according to GBK into the three characters E4B8, ADE6 and 9687, the value returned to the client should be E4B8ADE6; if it is split according to UTF-8 into the two characters E4B8AD and E69687, the returned value should be E4B8ADE69687. Clearly, without encoding and decoding after reading the data from the file, the storage engine cannot operate at the character level at all. (A small experiment illustrating this is sketched below.)
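A small sketch of that example, assuming a UTF-8 connection; the table names t_utf8 and t_gbk are made up for illustration. The same six bytes are stored under two column charsets, and LEFT() counts characters according to each column's charset:

mysql> CREATE TABLE t_utf8 (col VARCHAR(10) CHARACTER SET utf8);
mysql> CREATE TABLE t_gbk  (col VARCHAR(10) CHARACTER SET gbk);
mysql> INSERT INTO t_utf8 VALUES (UNHEX('E4B8ADE69687'));  -- raw bytes, no client-side conversion
mysql> INSERT INTO t_gbk  VALUES (UNHEX('E4B8ADE69687'));
mysql> SELECT HEX(LEFT(col, 2)) FROM t_utf8;  -- expected: E4B8ADE69687 (two UTF-8 characters)
mysql> SELECT HEX(LEFT(col, 2)) FROM t_gbk;   -- expected: E4B8ADE6 (two GBK characters)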

About "wrong-in, wrong-out"

The most common scenario behind garbled-data problems in MySQL is so-called "wrong-in, wrong-out": the character encoding of the client (web page or shell) differs from that of the table, yet un-garbled output can still be obtained as long as the same character set is used both when writing and when reading. However, "wrong-in, wrong-out" does not work for every combination of two character sets. Suppose the client's encoding is C and the MySQL table's character set is S. For "wrong-in, wrong-out" to work, the following two conditions must both hold (a byte-level illustration with latin1 follows them):

When MySQL receives the request, the binary stream encoded with C must survive being decoded with S without loss

The data MySQL returns, a binary stream encoded with S, must survive being decoded with C without loss
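A byte-level sketch of why latin1 satisfies both conditions (literal bytes only, using standard MySQL functions): every byte value maps to some latin1 character, so a round trip through latin1 reproduces the original bytes exactly.

mysql> SELECT HEX(CONVERT(CONVERT(UNHEX('E4B8ADE69687') USING latin1) USING binary)) AS round_trip;
-- expected: E4B8ADE69687 — the UTF-8 bytes of '中文' pass through latin1 unchanged,
-- which is what makes "wrong-in, wrong-out" possible with latin1 tables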

Lossless encoding conversion

So what makes a conversion lossy or lossless? Suppose we want to convert a character X represented in encoding A into its representation in encoding B, but X does not exist in B's character repertoire; we then call the conversion lossy. Why do two encodings differ in the set of characters they can represent? If you have read the author's earlier post, "Figure out character sets and character encodings in ten minutes", or have a basic understanding of character encodings, you will know that each character set supports a limited number of characters, and the characters covered differ from one character set to another. The ranges that UTF-8 and GBK can represent are roughly as follows.

GBK: a single encoded character takes values in the range 8140-FEFE (the second byte never takes the value 0x7F), for a total of about 27,000 characters.

UTF-8: a single encoded character occupies one to four bytes depending on the code point: 1 byte for U+0000-U+007F, 2 bytes for U+0080-U+07FF, 3 bytes for U+0800-U+FFFF, and 4 bytes for U+10000-U+10FFFF.

Since UTF-8 can represent far more characters than GBK, it is easy to find a conversion from UTF-8 to GBK that is lossy. Using a character map tool (see the figure below), we pick a character that is clearly not in the GBK code table, try to store it into a GBK-encoded table, and then read it back to observe the lossy conversion.

The character information is: "GURMUKHI LETTER A Unicode: U+0A05, UTF-8: E0A8 85"

The details stored in MySQL are as follows:

master [localhost] {msandbox} (test) > create table charset_test_gbk (id int primary key auto_increment, char_col varchar(50)) charset = gbk;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > set names utf8;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > insert into charset_test_gbk (char_col) values ('ਅ');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (test) > show warnings;
+---------+------+-----------------------------------------------------------------------+
| Level   | Code | Message                                                               |
+---------+------+-----------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xE0\xA8\x85' for column 'char_col' at row 1 |
+---------+------+-----------------------------------------------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > select id, hex(char_col), char_col, char_length(char_col) from charset_test_gbk;
+----+---------------+----------+-----------------------+
| id | hex(char_col) | char_col | char_length(char_col) |
+----+---------------+----------+-----------------------+
|  1 | 3F            | ?        |                     1 |
+----+---------------+----------+-----------------------+
1 row in set (0.00 sec)

The error occurs at the third encoding/decoding step; see the figure below for details.

As can be seen, when MySQL cannot find a GBK character corresponding to a UTF-8 character, it converts it to an error marker (a question mark here). Each character-set implementation defines its own behavior and conversion rules for this situation; for example, when a corresponding character cannot be found during a conversion to UTF-8, the character is replaced with � (U+FFFD) if no error is raised. (A one-line illustration of the replacement behavior is sketched below.)
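The question-mark replacement can be reproduced with CONVERT() alone, without creating a table (assuming a UTF-8 connection so the literal is read correctly):

mysql> SELECT HEX(CONVERT('ਅ' USING gbk)) AS gbk_bytes;
-- expected: 3F — GBK has no mapping for GURMUKHI LETTER A, so MySQL substitutes '?'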

Does that mean every conversion between two character sets is lossy? Not at all; whether a conversion is lossy depends on the following points:

Whether the character being converted exists in both character sets

Whether the target character set can preserve the original representation of characters it does not support

The first point was demonstrated by the experiment above. Here is an explanation of the second factor that decides whether a conversion is lossy. As the earlier example showed, GBK handles characters it cannot represent by replacing them with an error marker, 0x3F (the question mark). Some character sets, however, such as latin1, keep the original encoded bytes when they encounter characters they cannot represent, simply skipping over the character and continuing with the rest of the data. If the target character set has this property, then the "wrong-in, wrong-out" effect mentioned at the beginning of this section becomes achievable.

Let's take a look at the following example

master [localhost] {msandbox} (test) > create table charset_test (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
Query OK, 0 rows affected (0.03 sec)

master [localhost] {msandbox} (test) > set names latin1;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > insert into charset_test (char_col) values ('中文');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (test) > select id, hex(char_col), char_col from charset_test;
+----+---------------+----------+
| id | hex(char_col) | char_col |
+----+---------------+----------+
|  2 | E4B8ADE69687  | 中文     |
+----+---------------+----------+
2 rows in set (0.00 sec)

The flow is shown in the figure below. The encoding mismatch has in fact already happened by the time MySQL Server receives the data; but because the latin1 character set does not touch characters outside its own repertoire, it keeps the original bytes. This behavior is also what makes "wrong-in, wrong-out" possible.

How to avoid garbled characters

With the above understood, avoiding garbled characters is easy: keep the "three in one", i.e. make the client encoding, MySQL's character-set-client, and the table's charset exactly the same, and no garbling can occur. (A minimal configuration sketch follows.) Data that is already garbled, or that has gone through a lossy conversion, is much harder to repair; the next sections describe the specific methods in detail.
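A minimal sketch of keeping the three settings aligned on UTF-8 (the table name t_example is made up; the exact terminal setting depends on your environment):

# make sure the terminal itself is UTF-8, e.g. a shell where LANG ends in .UTF-8
mysql> SET NAMES utf8;   -- sets character_set_client, character_set_connection and character_set_results
mysql> CREATE TABLE t_example (id INT PRIMARY KEY AUTO_INCREMENT, c VARCHAR(50)) CHARSET = utf8;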

How to repair data already corrupted by encoding errors

Before introducing the right methods, let's look at the serious consequences of some so-called "correct methods" that circulate on the Internet.

Wrong method one

Judged by both its syntax and its wording, ALTER TABLE ... CHARSET=xxx looks like the obvious cure for garbled data. In reality it does nothing for data that is already corrupted, and it does not even change the character set of columns that already exist in the table. Let's look at the following example.

master [localhost] {msandbox} (test) > show create table charset_test;
+--------------+----------------------------------------------------------+
| Table        | Create Table                                             |
+--------------+----------------------------------------------------------+
| charset_test | CREATE TABLE `charset_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `char_col` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1 |
+--------------+----------------------------------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > alter table charset_test charset=gbk;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (test) > show create table charset_test;
+--------------+----------------------------------------------------------+
| Table        | Create Table                                             |
+--------------+----------------------------------------------------------+
| charset_test | CREATE TABLE `charset_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `char_col` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=gbk |
+--------------+----------------------------------------------------------+
1 row in set (0.00 sec)

As can be seen, this statement merely changes the table's default character set; that is, it only affects the default for columns created later, and changes neither the existing columns nor the existing data.

Wrong method two

ALTER TABLE ... CONVERT TO CHARACTER SET ... is even more destructive than method one, because, per the official documentation, its purpose is to transcode the data in a table. Here is a short excerpt from the documentation:

To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

ALTER TABLE tbl_name

CONVERT TO CHARACTER SET charset_name [COLLATE collation_name]
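For contrast, a hedged sketch of the statement's intended use (the table ok_table and its data are made up): data that was stored correctly in GBK and simply needs to move to UTF-8 is converted properly.

mysql> CREATE TABLE ok_table (c VARCHAR(50)) CHARSET = gbk;
mysql> SET NAMES gbk;                 -- assume a genuinely GBK client here
mysql> INSERT INTO ok_table VALUES ('中文');
mysql> ALTER TABLE ok_table CONVERT TO CHARACTER SET utf8;
mysql> SET NAMES utf8;
mysql> SELECT HEX(c), c FROM ok_table;   -- expected: E4B8ADE69687, displayed correctly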

In reality, this statement is only appropriate for tables whose data is not currently garbled and was not stored via "wrong-in, wrong-out". For tables that already contain mis-encoded data, it makes things worse. Let's use a concrete example to explain what this SQL actually does and what results it produces. Suppose we have a latin1-encoded table into which UTF-8 data was stored via "wrong-in, wrong-out" but which still displays correctly through the terminal, i.e. the example given in the "wrong-in, wrong-out" section above. After using it for a while, we discover the mistake and decide to change the table's character set to UTF-8 without disturbing the display of the existing data. The consequences of using ALTER TABLE ... CONVERT TO CHARACTER SET in this situation are as follows:

master [localhost] {msandbox} (test) > create table charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
Query OK, 0 rows affected (0.01 sec)

master [localhost] {msandbox} (test) > set names latin1;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > insert into charset_test_latin1 (char_col) values ('这是中文');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (test) > select id, hex(char_col), char_col, char_length(char_col) from charset_test_latin1;
+----+--------------------------+----------+-----------------------+
| id | hex(char_col)            | char_col | char_length(char_col) |
+----+--------------------------+----------+-----------------------+
|  1 | E8BF99E698AFE4B8ADE69687 | 这是中文 |                    12 |
+----+--------------------------+----------+-----------------------+
1 row in set (0.01 sec)

master [localhost] {msandbox} (test) > alter table charset_test_latin1 convert to character set utf8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (test) > set names utf8;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > select id, hex(char_col), char_col, char_length(char_col) from charset_test_latin1;
+----+--------------------------------------------------------+--------------+-----------------------+
| id | hex(char_col)                                          | char_col     | char_length(char_col) |
+----+--------------------------------------------------------+--------------+-----------------------+
|  1 | C3A8C2BFE284A2C3A6CB9CC2AFC3A4C2B8C2ADC3A6E28093E280A1 | è¿™æ˜¯ä¸­æ–‡ |                    12 |
+----+--------------------------------------------------------+--------------+-----------------------+
1 row in set (0.00 sec)

As this example shows, for a table containing mis-encoded data this command not only fails to "set things right", it thoroughly ruins the data: even the binary encoding of the stored bytes has been changed.

Correct method one: Dump & Reload

This method is fairly brute-force, but it is easy to carry out and to understand. Simply put, it consists of the following three steps:

Export the data to a file, reading it out with the same "wrong-in, wrong-out" character set

Recreate the table with the correct character set

Import the previously exported file into the new table

Using the example above again, we "wrong-in" stored UTF-8 data into a latin1-encoded table. To change the table's encoding to UTF-8, the following commands can be used:

shell> mysqldump -uroot -p -d --skip-set-charset --default-character-set=utf8 test charset_test_latin1 > data.sql
# make sure the exported file shows no garbled text when viewed in a UTF-8 text editor
shell> mysql -uroot -p -e 'create table charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = utf8' test
shell> mysql -uroot -p --default-character-set=utf8 test < data.sql

Correct method two: Convert to Binary & Convert Back

This method is a bit of a trick: it uses binary data as an intermediate form. When MySQL converts a data stream that carries encoding semantics into binary data that carries none, it performs no actual data conversion; and when converting binary data back into encoded data, it performs a conversion check against the target encoding. Together, these two properties amount to simulating a "wrong-out" inside MySQL, which turns the garbled data back into correct data.

Using the same example again, where UTF-8 data was "wrong-in" stored into a latin1-encoded table, the table encoding can be changed to UTF-8 with the following commands:

mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col VARBINARY(50);
mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col VARCHAR(50) CHARACTER SET utf8;

At this point, you should have a deeper understanding of the causes of garbled characters in MySQL and their solutions; the best way to consolidate it is to try the operations yourself.
