The Foundation of MySQL Database (2)-- MySQL character set and garbled parsing 07/06 Update SLTechnology News&Howtos

The Foundation of MySQL Database (2)-- MySQL character set and garbled parsing

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

The Foundation of MySQL Database (2)-- MySQL character set and garbled parsing 1, character set and coding 1, character set introduction

Character is the general name of all kinds of characters and symbols, including national characters, punctuation marks, graphic symbols, numbers and so on.

Character set (Character set) is a collection of multiple characters. There are many kinds of character sets, and each character set contains different numbers of characters. Common character set names: ASCII character set, GB2312 character set, BIG5 character set, GB18030 character set, Unicode character set and so on. If the computer wants to process all kinds of character set text accurately, it needs to carry on the character coding, so that the computer can recognize and store all kinds of characters.

Character coding (Character encoding) encodes a character in a character set into a character in a specified character set so that text can be stored in a computer and transmitted through a communication network. Common examples include coding the Latin alphabet into ASCII,ASCII, numbering letters, numbers, and other symbols, and representing them in a 7-bit binary.

Character order (collation) refers to the comparison rules between characters in the same character set. Only by determining the character order can you define what the equivalent characters are and the size relationship between the characters on a character set. A character can contain multiple character orders. The MySQL character order naming rules are: start with the character set name corresponding to the character order, center with the country name (or centered with general), and end with ci, cs, or bin. Character order ending with ci indicates case insensitivity, character order ending with cs indicates case sensitivity, and character order ending with bin indicates comparison by binary coded value.

2. ASCII coding

ASCII is not only a coded character set, but also a character encoding. ASCII directly stores the sequence number of a character in the encoded character set as a character in the computer.

For example, in ASCII, the A character ranks 65th in the table, the sequence number is 65, and the value of encoded An is 0100 0001, that is, the binary conversion result of decimal 65.

3. Latin1 character set

The Latin1 character set extends the ASCII character set, still using one byte to represent characters, but with high bits enabled, expands the representation range of the character set.

4. UTF-8 coding

UTF-8 (8-bit Unicode Transformation Format) is a kind of variable length character coding for Unicode, also known as universal code. Founded by Ken Thompson in 1992. It has now been standardized to RFC 3629. UTF-8 encodes Unicode characters with 1 to 6 bytes.

UTF-8 is a variable length byte encoding. For the UTF-8 encoding of a character, if there is only one byte, the highest binary bit is 0; if it is multibyte, the first byte starts from the highest bit, the number of consecutive binary bits 1 determines the number of bits encoded, and the remaining bytes start with 10. UTF-8 can use up to 6 bytes. As shown in the table:

1 byte 0xxxxxxx

2-byte 110xxxxx 10xxxxxx

3-byte 1110xxxx 10xxxxxx 10xxxxxx

4-byte 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

5-byte 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

6 byte 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Therefore, the actual number of bits that can be used to represent character coding in UTF-8 is up to 31 bits, that is, the bits represented by x in the above table. Except for the control bits (10 at the beginning of each byte, etc.), the bits represented by x correspond to the UNICODE code one-to-one, and the order of the bits is the same.

When actually converting UNICODE to UTF-8 coding, the high bit 0 should be removed first, and then the minimum number of UTF-8 bits required should be determined according to the number of bits left. Therefore, characters in the basic ASCII character set (UNICODE compatible ASCII) can be represented with only one byte of UTF-8 encoding (seven binary bits).

5. Character set compatibility

View the encoding of the specified encoding of any character:

Select hex (convert ('string' using code))

View the characters whose coding values are in the character set:

Select convert (0xABCDXXX using charsetname)

ASCII coding example:

Select hex (convert ('hello' using ascii))

ASCII encoding of the string "hello": 0x68656C6C6F

Latin1 coding example:

Select hex (convert ('hello' using latin1))

Latin1 encoding of the string "hello": 0x68656C6C6F

UTF-8 coding example:

Select hex (convert ('hello' using utf8))

UTF-8 encoding of the string "hello": 0x68656C6C6F

GBK coding example:

Select hex (convert ('hello' using gbk))

GBK encoding of the string "hello": 0x68656C6C6F

GB2312 coding example:

Select hex (convert ('hello' using gb2312))

GB2312 encoding of the string "hello": 0x68656C6C6F

BIG5 coding example:

Select hex (convert ('hello' using big5))

BIG5 encoding of the string "hello": 0x68656C6C6F

As can be seen from the above examples, the Latin1 character set is compatible with the ASCII character set; the UTF-8, GBK, GB2312 and BIG5 character sets are all compatible with the Latin1 character set.

Chinese "very cool" UTF-8 coding example:

Select hex (convert ('awesome' using utf8))

"awesome" UTF-8 code: 0xE5BE88E5B18C

Select CONVERT (0xE5BE88E5B18C USING utf8)

Convert the UTF-8 code value 0xE5BE88E5B18C of the "awesome" character to a character in UTF-8

Chinese "very cool" GBK coding example:

Select hex (convert ('awesome' using gbk))

"awesome" GBK code: 0xBADC8CC5

Select CONVERT (0xBADC8CC5 USING gbk)

Convert the GBK code value 0xBADC8CC5 of the "awesome" character to a character in GBK

Chinese "very cool" GB2312 coding example:

Select hex (convert ('awesome' using gb2312))

"awesome" GB2312 code: 0xBADC3F

Select CONVERT (0xBADC3F USING gbk)

Convert the GB2312 code value 0xBADC3F of the "awesome" character to the character in GBK, and the result is "very?" The character "cool" does not exist in the GB2312 character set.

Chinese "very cool" BIG5 coding example:

Select hex (convert ('awesome' using big5))

"awesome" BIG5 code: 0xABDCCE78

Chinese "very cool" Latin1 coding example:

Select hex (convert ('awesome' using latin1))

"awesome" Latin1 code: 0x3F3F

Chinese "very cool" ASCII coding example:

Select hex (convert ('awesome' using ascii))

"awesome" ASCII code: 0x3F3F

As can be seen from the above examples, for Chinese characters, UTF-8, GBK, GB2312 and BIG5 are incompatible with each other, and direct conversion will lead to garbled codes; when UTF-8, GBK, GB2312 and BIG5 are converted into ASCII and Latin1 formats, each Chinese character will be converted to 0x3F, that is, Chinese character'?' .

GB2312 supports simplified Chinese, BIG5 supports traditional Chinese, GBK supports simplified Chinese and traditional Chinese, and UTF-8 supports almost all characters.

GBK is a standard compatible with GB2312 after expansion based on the national standard GB2312. GB2312 is a subset of GBK and GBK is a subset of GB18030.

2. MySQL character set 1, MySQL environment variables

Session session variable:

Use show variables like'% char%'; to view the character set variables for a Session session:

Set character_set_server=utf8

Set character_set_database=utf8

Using SET, you can set different character sets. However, the character set set using SET is at the Session session level, and if you open a new session, the new session uses the default character set.

Global global variable:

Use show global variables like'% char%'; to view the character set variables of Global:

Set global character_set_database=utf8

Set global character_set_server=utf8

Using SET GLOBAL, you can set the character set for multiple sessions.

Use show charset; to view the character sets supported by MySQL and the character order of the corresponding character sets.

When the MySQL service is restarted, the value of Global is reset to the default value. The method to permanently modify the value of Global is as follows:

Modify the mysql configuration file / etc/my.cnf.

[mysqld]

Character-set-server=utf8

[client]

Default-character-set=utf8

[mysql]

Default-character-set=utf8

2. MySQL character set

MySQL servers can support multiple character sets and provide different levels of settings, including server level, database level, table level, and column level.

MySQL database environment variable view using SQL statement show variables like'% char%'

Character_set_client: the character set used by the client. When the client sends a request to the server, the request is encoded in the client character set.

Character_set_connection: the character set used by the client / database communication connection. After receiving the query request from the client, the MySQL server converts it to the character set specified by the character_set_connection variable.

Character_set_database: the character set of a database in the database server. If there is no default database character set, the character set specified by character_set_server is used.

Character_set_results: the character set returned by the database to the client. The MySQL database converts the result set and error messages to the character set specified by character_set_results and sends it to the client.

Character_set_server: the character set of the database server, the internal operation character set.

Character_set_system: the character set used by system metadata (field names, etc.)

When the client connects to the server, the client will send the name of the character set it wants to the MySQL server, and then the server will use the character set to set character_set_connection, character_set_client, and character_set_results.

If you do not specify the character set of the database when you create the database, the character_set_server character set is used by default.

If you do not specify the character set of the table when you create the table, the current database character set is used by default.

If you do not specify a character set when creating a column, the character set of the current table is used by default.

3. The setting of MySQL character set

A, MySQL server-level character set

Modify the MySQL server configuration file / etc/my.cnf file.

[mysqld]

Character_set_server=utf8

Restarting the MySQL database service takes effect.

B, MySQL database-level character set:

Specify when you create the database:

CREATE DATABASE db_name

[[DEFAULT] CHARACTER SET charset_name]

[[DEFAULT] COLLATE collation_name]

Modify the character set of an existing database:

ALTER DATABASE db_name

[[DEFAULT] CHARACTER SET charset_name]

[[DEFAULT] COLLATE collation_name]

ALTER modifications are only valid for modified operations on the database.

C, MySQL table-level character set:

Specify when you create a table:

CREATE TABLE tbl_name (column_list)

[[DEFAULT] CHARACTER SET charset_name]

[COLLATE collation_name]]

Modify the character set of the table:

ALTER TABLE tbl_name

[[DEFAULT] CHARACTER SET charset_name]

[COLLATE collation_name]

D, MySQL field-level character set:

Modify the character set of an existing field:

ALTER TABLE table_name MODIFY

Column_name {CHAR | VARCHAR | TEXT} (column_length)

[CHARACTER SET charset_name]

[COLLATE collation_name]

MySQL client settings: set names utf8; is equivalent to:

Set character_set_client=utf8

Set character_set_connection=utf8

Set character_set_results=utf8

E, client character set

Modify the MySQL server configuration file / etc/my.cnf file.

[client]

Default-character-set=utf8

Equivalent to set names utf8

Affects the values of the variables character_set_client,character_set_connection and character_set_results in the session.

After modification, there is no need to restart MySQL database service to take effect.

4. The conversion process of MySQL character set

A, when the MySQL server receives the request, it converts the request data from character_set_client character set to character_set_connection character set.

B. convert the request data from the character_set_connection character set to the internal operation character set before the internal operation. Determine the steps:

-- set the value using the CHARACTER SET of each data field

-- if the above value does not exist, use the DEFAULT CHARACTER SET of the corresponding data table to set the value.

-- if the above value does not exist, use the DEFAULT CHARACTER SET of the corresponding database to set the value.

-- if the above value does not exist, use the character_set_server character set to set the value

C, convert the operation result from the internal operation character set to the character_set_results character set.

D, convert the execution result of the character_set_results character set to the character_set_client character set, send it to the client, and the client uses the set character set to display the result.

Third, the generation of garbled codes in MySQL 1. The causes of garbled codes in MySQL

The reasons for garbled codes are as follows:

A. the coding of the corresponding link is inconsistent when it is checked in and out.

B. if there is no lossless conversion between the two character sets, garbled is bound to occur.

2. Lossless coding conversion

If a character X represented by code An is converted into the representation of code B, and there are no X characters in the character set of code B, the transcoding is lossy, otherwise the transcoding is lossless.

Because the number of characters supported by each character set is limited, and there are differences between the characters covered by each character set. When converting UTF-8 characters to GBK characters, if the MySQL cannot find a character in the UTF8 character set in the GBK character set, it will be converted to an error mark (0x3F, question mark).

Conditions for lossless conversion of coding:

A. whether the converted characters are in two character sets at the same time.

B, whether the target character set can retain its original expression for unsupported characters.

Refer to blog posts:

Http://cenalulu.github.io/

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.