Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What are the MySQL character set and comparison rules

2025-01-28 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Share

Shulou(Shulou.com)05/31 Report--

This article mainly introduces the relevant knowledge of "what is the MySQL character set and comparison rules". The editor shows you the operation process through an actual case. The operation method is simple, fast and practical. I hope this article "what is the MySQL character set and comparison rules" can help you solve the problem.

Introduction to character set and comparison rules introduction to character set

We know that only binary data can be stored in a computer, so how do we store strings? Of course, it is to establish a mapping relationship between characters and binary data, and to establish this relationship, you need to understand at least two things:

Which characters do you want to map to binary data?

That is, clearly define the range of characters.

How to map?

The process of mapping a character to a binary data is also called encoding, and the process of mapping a binary data to a character is called decoding.

People abstract the concept of a character set to describe the coding rules of a range of characters. For example, we customize a character set named xiaohaizi, which contains the following character range and encoding rules:

Contains the characters'a','b','A','B'.

The coding rules are as follows:

Use 1 byte to encode one character. The mapping relationship between character and byte is as follows:

'a'-> 00000001 (hexadecimal: 0x01)'b'-> 00000010 (hexadecimal: 0x02)'A'-> 00000011 (hex: 0x03)'B'-> 00000100 (hexadecimal: 0x04)

With the xiaohaizi character set, we can represent some strings in binary form. Here are some strings encoded in the xiaohaizi character set:

'bA'-> 0000001000000011 (hexadecimal: 0x0203)' baB'-> 000000100000000100000100 (hexadecimal: 0x020104) 'cd'-> cannot be expressed. Character set xiaohaizi does not contain characters' c' and'd'. Introduction to comparison rules.

After we have determined the range of characters represented by the xiaohaizi character set and the encoding rules, how do we compare the size of the two characters? The easiest thing to think of is to directly compare the size of the binary code corresponding to these two characters. For example, the encoding of the character'a' is 0x01, and the coding of the character'b' is 0x02, so'a' is less than 'bread. this simple comparison rule can also be called the binary comparison rule, which is called binary collation in English.

The binary comparison rule is simple, but sometimes it does not meet the practical requirements. For example, in many cases, we are not case-sensitive for English characters, that is,'a 'and' A 'are equal. In this case, we cannot simply and rudely use the binary comparison rule. At this time, we can specify the comparison rule as follows:

Convert two characters with different case to uppercase or lowercase.

Then compare the binary data corresponding to these two characters.

This is a slightly more complicated comparison rule, but there are more than one kind of English characters in real life, for example, we have tens of thousands of Chinese characters. For a certain character set, there are many rules for comparing the size of two characters, that is to say, there can be multiple comparison rules for the same character set. Later, we will introduce various character sets used in real life and some of their comparison rules.

Some important character sets

Unfortunately, the world is so big that different people have developed many character sets, which may represent a different range of characters and use different coding rules. Let's take a look at some common character sets:

ASCII character set

It contains 128 characters, including spaces, punctuation marks, numbers, uppercase and lowercase letters and some invisible characters. Since the total is only 128 characters, you can use 1 byte to encode. Let's look at how some characters are encoded:

'L'-> 01001100 (hexadecimal: 0x4C, decimal: 76)'M'-> 01001101 (hexadecimal: 0x4D, decimal: 77)

ISO 8859-1 character set

A total of 256characters are included, which extends 128characters commonly used in Western Europe (including letters from Germany and France) on the basis of the ASCII character set, and can also be encoded with 1 byte. This character set also has an alias latin1.

GB2312 character set

Contains Chinese characters as well as Latin alphabet, Greek alphabet, Japanese hiragana and katakana letters, Russian Cyrillic alphabet. Among them, there are 6763 Chinese characters and 682 other symbols. At the same time, this character set is compatible with the ASCII character set, so it seems a bit strange in the way it is encoded:

This encoding method, which represents the number of bytes required for a character may be different, is called variable length encoding. For example, the string 'love', in which 'love' needs to be encoded with 2 bytes, the encoded hexadecimal representation as 0xB0AErecoveryu' needs to be encoded with 1 byte, and the encoded hexadecimal representation is 0x75, so it is 0xB0AE75.

Tip: how can we tell whether a byte represents a single character or a part of a character? Don't forget that the `ASCII` character set contains only 128 characters, and you can represent all characters with 0x127, so if a byte is within 0x127, it means that a byte represents a single character, otherwise two bytes represent a single character.

If the character is in the ASCII character set, 1-byte encoding is used.

Otherwise, 2-byte encoding is used.

GBK character set

The GBK character set only extends the GB2312 character set in the range of included characters, and is compatible with GB2312 in the coding mode.

Utf8 character set

It contains all the characters that can be thought of on earth, and it is constantly expanding. This character set is compatible with the ASCII character set and uses variable length encoding, which requires 1 to 4 bytes to encode a character, such as this:

'L'-> 01001100 (hexadecimal: 0x4C)'ah'- > 111001011001010110001010 (hexadecimal: 0xE5958A) Tip: actually, utf8 is just a coding scheme of the Unicode character set, Unicode character set can use utf8, utf16, utf32 these coding schemes, utf8 uses 1 inch 4 bytes to encode a character, utf16 uses 2 or 4 bytes to encode a character, utf32 uses 4 bytes to encode a character. More detailed knowledge of Unicode and its coding scheme is not the focus of this book, we check on the Internet ha ~ MySQL does not distinguish between the concept of character set and coding scheme, so later nagging when utf8, utf16, utf32 are treated as a character set.

For the same character, different character sets may have different encoding methods. For example, for the Chinese character "I", the ASCII character set does not include this character at all. The utf8 and gb2312 character sets encode Chinese characters as follows:

Utf8 Encoding: 111001101000100010010001 (3 bytes, hexadecimal representation: 0xE68891) gb2312 Encoding: 1011000010101110 (2 bytes, hexadecimal representation is: 0xB0AE) character set supported in MySQL and utf8 and utf8mb4 in collation MySQL

We said above that the utf8 character set represents a character using 1x 4 bytes, but some of our commonly used characters can be represented with 1 x 3 bytes. In MySQL, the character set represents the maximum byte length used by a character will affect the storage and performance of the system in some ways, so the uncle who designed MySQL secretly defined two concepts:

Utf8mb3: a castrated utf8 character set that uses only 1 to 3 bytes to represent characters.

Utf8mb4: authentic utf8 character set that uses 1 to 4 bytes to represent characters.

It is important to note that utf8 is an alias for utf8mb3 in MySQL, so mentioning utf8 in MySQL means using 1'3 bytes to represent a character. If you use 4 bytes to encode a character, such as storing some emoji emoticons, please use utf8mb4.

View of character set

MySQL supports many character sets. To view the character sets supported in the current MySQL, you can use the following statement:

SHOW (CHARACTER SET | CHARSET) [pattern matched by LIKE]

"CHARACTER SET" and "CHARSET" are synonyms and can be used either way. Let's check (there are too many character sets supported, we omitted some):

Mysql > SHOW CHARSET +-+ | Charset | Description | Default collation | Maxlen | +-+- -+ | big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |. | latin1 | cp1252 West European | latin1_swedish_ci | 1 | | latin2 | ISO 8859-2 Central European | latin2_general_ci | 1 |... | ascii | US ASCII | ascii_general_ci | 1 |. | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |. | gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 | | latin5 | ISO 8859-9 Turkish | latin5_turkish_ci | 1 |. | | utf8 | UTF-8 Unicode | utf8_general_ci | 3 | | ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |. | latin7 | ISO 8859-13 Baltic | latin7_general_ci | 1 | | utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci | 4 | | utf16 | UTF-16 Unicode | utf16_general_ci | 4 | | utf16le | UTF-16LE Unicode | utf16le_general_ci | 4 |. | utf32 | UTF-32 Unicode | utf32_general_ci | 4 | | binary | Binary pseudo charset | binary | 1 |. | gb18030 | China National Standard GB18030 | gb18030_chinese_ci | 4 | +-+-+ 41 rows in set (0.01 sec)

As you can see, the MySQL version I'm using supports a total of 41 character sets, and the Default collation column represents a default comparison rule for this character set. Notice the last column Maxlen in the returned result, which represents the character set that up to a few bytes is needed for a character. In order to impress you more, I have copied down the Maxlen columns of several commonly used character sets, which you must remember:

View of Maxlenascii1latin11gb23122gbk2utf83utf8mb44 comparison rules for character set names

The commands to view the comparison rules supported in MySQL are as follows:

SHOW COLLATION [pattern matched by LIKE]

As we mentioned earlier, a character set may correspond to several comparison rules. There are already many character sets supported by MySQL, so more comparison rules are supported. Let's just take a look at the comparison rules under the utf8 character set:

Mysql > SHOW COLLATION LIKE 'utf8\ _%' +-- +-+ | Collation | Charset | Id | Default | Compiled | Sortlen | +- +-+ | utf8_general_ci | utf8 | 33 | Yes | Yes | 1 | | utf8_bin | utf8 | 83 | | Yes | 1 | utf8_unicode_ci | utf8 | | | Yes | 8 | utf8_icelandic_ci | utf8 | 193 | Yes | 8 | utf8_latvian_ci | utf8 | 194 | Yes | 8 | utf8_romanian_ci | utf8 | 195 | Yes | 8 | utf8_slovenian_ci | utf8 | 196 | | | Yes | 8 | utf8_polish_ci | utf8 | 197 | Yes | 8 | utf8_estonian_ci | utf8 | 198 | Yes | 8 | utf8_spanish_ci | utf8 | 199 | Yes | 8 | utf8_swedish_ci | utf8 | 200 | | Yes | | | 8 | | utf8_turkish_ci | utf8 | 201 | | Yes | 8 | utf8_czech_ci | utf8 | 202 | Yes | 8 | utf8_danish_ci | utf8 | 203 | Yes | 8 | utf8_lithuanian_ci | utf8 | 204 | | Yes | 8 | | utf8_slovak_ci | utf8 | 205| Yes | 8 | utf8_spanish3_ci | utf8 | 206| Yes | 8 | utf8_roman_ci | utf8 | 207| Yes | 8 | utf8_persian_ci | utf8 | 208| Yes | 8 | | utf8_esperanto_ci | utf8 | 209 | | Yes | 8 | utf8_hungarian_ci | utf8 | 210 | Yes | 8 | utf8_sinhala_ci | utf8 | 211 | Yes | 8 | utf8_german2_ci | utf8 | 212 | Yes | 8 | | utf8_croatian_ Ci | utf8 | 213 | | Yes | 8 | utf8_unicode_520_ci | utf8 | 214 | Yes | 8 | utf8_vietnamese_ci | utf8 | 215 | Yes | 8 | utf8_general_mysql500_ci | utf8 | 223 | Yes | 1 | +- -+-+ 27 rows in set (0.00 sec)

The naming of these comparative rules is quite regular, the specific rules are as follows:

The comparison rule name begins with the name of the character set associated with it. As shown in the figure above, the names of the comparison rules for query results all start with utf8.

This is followed by the language in which the comparison rule is mainly used, such as utf8_polish_ci for comparison with Polish rules, utf8_spanish_ci with Spanish rules, and utf8_general_ci is a general comparison rule.

The name suffix means whether the comparison rule distinguishes stress, case and so on in the language. The specific values that can be used are as follows:

Suffix English definition description _ aiaccent insensitive is not stress-sensitive _ asaccent sensitive is stress-sensitive _ cicase insensitive is case-insensitive _ cscase sensitive is case-sensitive _ binbinary is compared in binary

For example, the comparison rule of utf8_general_ci ends with ci, indicating that it is not case-sensitive.

Each character set corresponds to several comparison rules, and each character set has a default comparison rule. The default comparison rule of the character set is that the value of the Default column in the returned result of SHOW COLLATION is YES. For example, the default comparison rule of the utf8 character set is utf8_general_ci.

Application of character sets and comparison rules at all levels

MySQL has four levels of character sets and comparison rules, which are:

Server level

Database level

Table level

Column level

Next, let's take a closer look at how to set and view these levels of character sets and comparison rules.

Server level

MySQL provides two system variables to represent the server-level character set and comparison rules:

System variable describes character_set_server server-level character sets collation_server server-level comparison rules

Let's take a look at the values of these two system variables:

Mysql > SHOW VARIABLES LIKE 'character_set_server' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_server | utf8 | +-- +-+ 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'collation_server' +-+-+ | Variable_name | Value | +-+-+ | collation_server | utf8_general_ci | + -+ 1 row in set (0.00 sec)

You can see that on my computer, the default character set at the server level is utf8, and the default comparison rule is utf8_general_ci.

We can modify the values of these two variables through the startup option when starting the server program or by using the set statement while the server program is running. For example, we can write this in the configuration file:

[server] character_set_server=gbkcollation_server=gbk_chinese_ci

The values of these two system variables are modified after the configuration file is read when the server starts.

Database level

We can specify the character set and comparison rules of the database when creating and modifying the database. The specific syntax is as follows:

CREATE DATABASE database name [[DEFAULT] CHARACTER SET character set name] [[DEFAULT] COLLATE comparison rule name]; ALTER DATABASE database name [[DEFAULT] CHARACTER SET character set name] [[DEFAULT] COLLATE comparison rule name]

The DEFAULT can be omitted and does not affect the semantics of the statement. For example, when we create a new database called charset_demo_db, we specify that the character set it uses is gb2312, and the comparison rule is gb2312_chinese_ci:

Mysql > CREATE DATABASE charset_demo_db-> CHARACTER SET gb2312-> COLLATE gb2312_chinese_ci;Query OK, 1 row affected (0.01sec)

If you want to see the character set and comparison rules used by the current database, you can view the values of the following two system variables (provided you select the current default database using the user statement, if there is no default database, the variable has the same value as the corresponding server-level system variable):

System variable describes the character set of the current character_set_database database collation_database comparison rules of the current database

Let's take a look at the character set and comparison rules of the charset_demo_db database we just created:

Mysql > USE charset_demo_db;Database changedmysql > SHOW VARIABLES LIKE 'character_set_database' +-+-+ | Variable_name | Value | +-+-+ | character_set_database | gb2312 | +-+- + 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'collation_database' +-+-+ | Variable_name | Value | +-- + | collation_database | gb2312_chinese_ci | +- -+-+ 1 row in set (0.00 sec) mysql >

You can see that the character set and comparison rules of the charset_demo_db database are the ones we specified in the creation statement. It is important to note that the two system variables character_set_database and collation_database are read-only, and we cannot change the character set and comparison rules of the current database by changing the values of these two variables.

You can also not specify character sets and comparison rules in the database creation statement, such as this:

CREATE DATABASE database name

In this way, server-level character sets and comparison rules will be used as database character sets and comparison rules.

Table level

We can also specify the character set and comparison rules of the table when creating and modifying the table. The syntax is as follows:

CREATE TABLE table name (column information) [[DEFAULT] CHARACTER SET character set name] [COLLATE comparison rule name]] ALTER TABLE table name [[DEFAULT] CHARACTER SET character set name] [COLLATE comparison rule name]

For example, let's create a table named t in the charset_demo_db database we just created and specify the character set and comparison rules for this table:

Mysql > CREATE TABLE t (- > col VARCHAR (10)->) CHARACTER SET utf8 COLLATE utf8_general_ci;Query OK, 0 rows affected (0.03 sec)

If the character set and comparison rules are not specified in the statement that creates and modifies the table, the character set and comparison rules of the database in which the table resides are used as the character set and comparison rules of the table. Suppose our statement to create table t reads as follows:

CREATE TABLE t (col VARCHAR (10))

Because the character set and comparison rules are not explicitly specified in the table building statement of table t, the character set and comparison rules of table t will inherit the character set and comparison rules of the database charset_demo_db, namely gb2312 and gb2312_chinese_ci.

Column level

It is important to note that for columns that store strings, different columns in the same table can have different character sets and comparison rules. We can specify the character set and comparison rules for the column when we create and modify the column definition. The syntax is as follows:

CREATE TABLE table name (column name string type [CHARACTER SET character set name] [COLLATE comparison rule name], other columns.); ALTER TABLE table name MODIFY column name string type [CHARACTER SET character set name] [COLLATE comparison rule name]

For example, let's modify the character set and comparison rules of column col in table t, which can be written as follows:

Mysql > ALTER TABLE t MODIFY col VARCHAR (10) CHARACTER SET gbk COLLATE gbk_chinese_ci;Query OK, 0 rows affected (0.04 sec) Records: 0 Duplicates: 0 Warnings: 0mysql >

For a column, if the character set and comparison rules are not specified in the statement created and modified, the character set and comparison rules of the table in which the column is located will be used as the character set and comparison rules for the column. For example, the character set of table t is utf8, and the comparison rule is utf8_general_ci. The statement that modifies the column col is written as follows:

ALTER TABLE t MODIFY col VARCHAR (10)

The character set and encoding of that column col will use the character set and comparison rules of table t, namely utf8 and utf8_general_ci.

Tips

When converting the character set of a column, it should be noted that an error will occur if the data stored in the front column of the conversion cannot be represented by the converted character set. For example, the original character set used in the column is utf8, and some Chinese characters are stored in the column. Now converting the character set of the column to ascii will make an error, because the ascii character set does not represent Chinese characters.

Modify only the character set or only the comparison rules

Because the character set and the comparison rules are related to each other, if we only modify the character set, the comparison rules will also change. If only the comparison rules are modified, the character set will also change. The specific rules are as follows:

If only the character set is modified, the comparison rule becomes the default comparison rule for the modified character set.

If only the comparison rule is modified, the character set will become the character set corresponding to the modified comparison rule.

These two rules apply regardless of the level of character set and comparison rules. Let's take a look at the detailed process of server-level character sets and comparison rules:

If only the character set is modified, the comparison rule becomes the default comparison rule for the modified character set.

Mysql > SET character_set_server = gb2312;Query OK, 0 rows affected (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_server' +-- +-+ | Variable_name | Value | +-+-+ | character_set_server | gb2312 | +-- +-+ 1 row in set (0. 00 sec) mysql > SHOW VARIABLES LIKE 'collation_server' +-+-+ | Variable_name | Value | +-- + | collation_server | gb2312_chinese_ci | +- +-+ 1 row in set (0.00 sec)

We only changed the value of character_set_server to gb2312,collation_server and automatically changed it to gb2312_chinese_ci.

If only the comparison rule is modified, the character set will become the character set corresponding to the modified comparison rule.

Mysql > SET collation_server = utf8_general_ci;Query OK, 0 rows affected (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_server' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_server | utf8 | +-- +-+ 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'collation_server' +-+-+ | Variable_name | Value | +-+-+ | collation_server | utf8_general_ci | + -+ 1 row in set (0.00 sec) mysql >

We only changed the value of collation_server to UTF8 _ general_ci,character_set_server and automatically changed the value to utf8.

Summary of character sets and comparison rules at different levels

The links between the four levels of character sets and comparison rules are as follows:

If you create or modify a column without explicitly specifying the character set and comparison rules, the column defaults to the character set and comparison rules of the table

If the table is created without an explicit specified character set and comparison rules, the table defaults to the database character set and comparison rules

If you create a database without explicitly specifying the character set and comparison rules, the database defaults to the server's character set and comparison rules

After knowing these rules, for a given table, we should know what the character set and comparison rules of its columns are, so as to determine the amount of storage space occupied by the actual data of each column when storing data according to the type of this column. For example, let's insert a record into table t:

Mysql > INSERT INTO t (col) VALUES ('I'); Query OK, 1 row affected (0. 00 sec) mysql > SELECT * FROM tbot's | +-+ | me | +-+ 1 row in set (0. 00 sec)

The character set used by the first column col is gbk, and the encoding of a character'I'in gbk is 0xCED2, which takes up two bytes, and the actual data of two characters takes up four bytes. If the character set of the column is changed to utf8, these two characters will actually occupy 6 bytes.

Character set in client-server communication

The consequences of the inconsistency of character sets used in encoding and decoding

After all, the representation of a string on a computer is a byte string, and if you use a different character set to decode the byte string, the result may make you scratch your head.

We know that the length of the byte string of the character'I 'encoded by the utf8 character set is as follows: 0xE68891, if one program sends this byte string to another program, and another program uses a different character set to decode the byte string, suppose the gbk character set is used to interpret the byte string, the decoding process is like this:

First of all, look at the first byte 0xE6, whose value is greater than 0x7F (decimal: 127), indicating that it is a two-byte code, and continue to read a byte followed by 0xE688. Then look up the character corresponding to the byte 0xE688 from the gbk coding table, and find that it is the character 'character'.

Continue to read a byte 0x91, its value is also greater than 0x7F, and then read another byte to find that there is nothing, so this is half a character.

So 0xE68891 is interpreted by the gbk character set as a character of 'characters' and half a character.

Suppose the string of bytes is interpreted with the iso-8859-1, that is, the latin1 character set, the decoding process is as follows:

Read the first byte 0xE6 first, and its corresponding latin1 character is æ.

Read the second byte 0x88, which corresponds to the latin1 character.

Read the third byte 0x91, which corresponds to the latin1 character'.

So the string after the whole string of bytes 0xE68891 is interpreted by the latin1 character set is'æ characters'.

It can be seen that if we use different character sets for the same string encoding and decoding, it will produce unexpected results, and as human beings, we seem to have generated garbled codes.

The concept of character set conversion

If the program that receives the 0xE68891 byte string decodes it according to the utf8 character set, and then encodes it according to the gbk character set, the final encoded byte string is 0xCED2. We call this process the character set conversion, that is, the string'I'is converted from the utf8 character set to the gbk character set.

Conversion of character set in MySQL

We know that the request from the client to the server is essentially a string, and the result returned by the server to the client is essentially a string, and the string is actually binary data encoded using a certain character set. This string does not use a character set encoding method to go black. The process from sending a request to returning a result is accompanied by multiple character set conversions. Three system variables are used in this process. Let's write them out and have a look at them:

The system variable describes the character set used by the character_set_client server to decode the request. When the character_set_connection server processes the request, it changes the request string from character_set_client to the character set used by the character_set_connectioncharacter_set_results server to return data to the client.

The default values of these system variables on my computer are as follows (the default values may vary from operating system to operating system):

Mysql > SHOW VARIABLES LIKE 'character_set_client' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_client | utf8 | +-- +-+ 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_connection' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_connection | utf8 | +-- + -+ 1 row in set (0.01sec) mysql > SHOW VARIABLES LIKE 'character_set_results' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_results | utf8 | +-- + 1 row in set (0.00 sec)

You can see that the values of these system variables are all utf8. In order to reflect the change of the character set during the request processing, we specially modify the value of a system variable:

Mysql > set character_set_connection = gbk;Query OK, 0 rows affected (0.00 sec)

So now the value of the system variables character_set_client and character_set_results is still utf8, while the value of character_set_connection is gbk. Now suppose the request sent by our client is the following string:

SELECT * FROM t WHERE s ='I'

In order to make it easier for you to understand this process, we only analyze the conversion of the character set of the character'I'in this process.

Now take a look at the change in the character set as the request is sent to the result return:

The character set used by the client to send the request

In general, the character set used by the client is the same as the current operating system, and different operating systems may use different character sets, as follows:

For example, when I use the macOS operating system, the client uses the utf8 character set. So the byte form of the character'I'in the request sent to the server is: 0xE68891

Tip: if you are using visualization tools such as navicat, these tools may use custom character sets to encode strings sent to the server instead of the operating system's default character set (so try to use black boxes when learning).

Unix-like systems use utf8

Windows uses gbk

The server receives that the request sent by the client is actually a string of binary bytes, it will think that the character set of the string of bytes is character_set_client, and then convert the string of bytes into characters encoded by the character_set_connection character set.

Because the value of character_set_client on my computer is utf8, the byte string 0xE68891 is first decoded according to the utf8 character set, and the resulting string is'I', then encoded according to the character set represented by character_set_connection, that is, gbk, and the result is the byte string 0xCED2.

Because the column col of table t uses the gbk character set, which is consistent with character_set_connection, go directly to the column to find a record with a byte value of 0xCED2, and finally find a record.

Tip: if the character set used by a column does not match the character set represented by character_set_connection, you also need to perform a character set conversion.

The col column in the record found in the previous step is actually a byte string 0xCED2Gol column is encoded using gbk, so first the byte string will be decoded using gbk to get the string'I', and then the string will be encoded using the character set represented by character_set_results, that is, utf8, to get a new byte string: 0xE68891, and then send it to the client.

Because the character set used by the client is utf8, the 0xE68891 can be interpreted as a character I and displayed on our monitor, so we humans can also read the returned results.

If you are a little dizzy in reading the above text, you can refer to this picture to carefully analyze these steps:

From this analysis, we can draw the following points that need to be noted:

The server believes that the request sent by the client is encoded in character_set_client.

Assuming that your client uses a different character set than character_set_client, this will lead to unexpected situations. For example, my client uses the utf8 character set. If you set the system variable character_set_client to ascii, the server may not be able to understand the request we sent, let alone process it.

The server will send the resulting result set to the client after character_set_results encoding.

Assuming that your client uses a different character set than character_set_results, it is possible that the client cannot decode the result set, resulting in garbled code on your screen. For example, my client uses the utf8 character set. If you set the value of the system variable character_set_results to ascii, it may generate garbled code.

Character_set_connection is only used by the server when converting the requested byte string from character_set_client to character_set_connection, it doesn't really matter what it is, but it must be noted that the character set must cover the characters in the request, otherwise some characters cannot be encoded using the character set represented by character_set_connection. For example, if you set character_set_client to utf8 and character_set_connection to ascii, if you send a Chinese character from the client to the server, the server cannot use the ascii character set to encode the Chinese character, which will issue a warning to the user.

Know about the various character set conversions that occur in MySQL from sending a request to returning a result, but why turn around? Aren't you dizzy?

A: yes, it's very dizzy, so we usually set the three system variables character_set_client, character_set_connection and character_set_results to be consistent with the character set used by the client, which reduces a lot of unnecessary character set conversion. To facilitate our setup, MySQL provides a very simple statement:

SET NAMES character set name

The effect of this statement is the same as that of our execution of these three:

SET character_set_client = character set name; SET character_set_connection = character set name; SET character_set_results = character set name

For example, my client uses the utf8 character set, so I need to set the values of these system variables to utf8:

Mysql > SET NAMES utf8;Query OK, 0 rows affected (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_client' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_client | utf8 | +-- +-+ 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_connection' +-- +-+ | Variable_name | Value | +-- +-+ | character_set_connection | utf8 | +-- + -+ 1 row in set (0.00 sec) mysql > SHOW VARIABLES LIKE 'character_set_results' +-- +-+ | Variable_name | Value | +-+-+ | character_set_results | utf8 | +-- + 1 row in set (0.00 sec) mysql >

In addition, if you want to set the values of the three system variables character_set_client, character_set_connection and character_set_results to the same when starting the client, we can specify a startup option called default-character-set when starting the client. For example, you can write this in the configuration file:

[client] default-character-set=utf8

It has the same effect as executing SET NAMES utf8, setting the values of those three system variables to utf8.

Application of comparison rules

At the end of the roaming of the character set, we focus again on the comparison rules, which usually reflect the expression of comparing the string size and sorting a string column, so it is sometimes called a collation rule. For example, the character set used in the column col of table t is gbk, and the comparison rule is gbk_chinese_ci, into which we insert several records:

Mysql > INSERT INTO t (col) VALUES ('a'), ('b'), ('A'), ('B'); Query OK, 4 rows affected (0.00 sec) Records: 4 Duplicates: 0 Warnings: 0mysql >

When we query, sort by t column:

Mysql > SELECT * FROM t ORDER BY col;+-+ | col | +-+ | a | | A | b | | B | | I | +-+ 5 rows in set (0.00 sec)

You can see that the default comparison rule, gbk_chinese_ci, is case-insensitive. We now change the comparison rule of column col to gbk_bin:

Mysql > ALTER TABLE t MODIFY col VARCHAR (10) COLLATE gbk_bin;Query OK, 5 rows affected (0.02 sec) Records: 5 Duplicates: 0 Warnings: 0

Because gbk_bin compares character encodings directly, it is case-sensitive, so let's take a look at the sorted query results:

Mysql > SELECT * FROM t ORDER BY FROM t ORDER BY + | s | +-+ | A | | B | | a | b | | I | +-+ 5 rows in set (0.00 sec) mysql >

So if you don't get the imaginary result when comparing strings or sorting a string column in the future, you need to think about whether it is more regular.

Tip: the corresponding numbers for each character in column `col` after encoding using the gbk character set are as follows:'A'-> 65 (decimal)'B'-> 66 (decimal)'a'-> 97 (decimal)'b'-> 98 (decimal)'I'- > 25105 (decimal) summary

A character set refers to the coding rules for a range of characters.

A comparison rule is a rule that compares the size of characters in a character set.

In MySQL, a character set can have several comparison rules, among which there is a default comparison rule, and a comparison rule must correspond to a character set.

To view the supported character sets and comparison rules in MySQL, the statements are as follows:

SHOW (CHARACTER SET | CHARSET) [LIKE matching pattern]; SHOW COLLATION [LIKE matching pattern]

MySQL has four levels of character sets and comparison rules

Server level

Character_set_server represents the server-level character set, and collation_server represents the server-level comparison rules.

Database level

You can specify character sets and comparison rules when creating and modifying databases:

CREATE DATABASE database name [[DEFAULT] CHARACTER SET character set name] [[DEFAULT] COLLATE comparison rule name]; ALTER DATABASE database name [[DEFAULT] CHARACTER SET character set name] [[DEFAULT] COLLATE comparison rule name]

Character_set_database represents the character set of the current database, and collation_database represents the comparison rules of the current default database. These two system variables are read-only and cannot be modified. If the current default database is not specified, the variable has the same value as the corresponding server-level system variable.

Table level

Specify the character set and comparison rules for the table when you create and modify the table:

CREATE TABLE table name (column information) [[DEFAULT] CHARACTER SET character set name] [COLLATE comparison rule name]]; ALTER TABLE table name [[DEFAULT] CHARACTER SET character set name] [COLLATE comparison rule name]

Column level

You can specify the character set and comparison rules for the column when you create and modify the column definition:

CREATE TABLE table name (column name string type [CHARACTER SET character set name] [COLLATE comparison rule name], other columns.); ALTER TABLE table name MODIFY column name string type [CHARACTER SET character set name] [COLLATE comparison rule name]

Character set conversion that occurs from sending a request to receiving the result:

In this process, the meanings of each system variable are as follows:

The system variable describes the character set used by the character_set_client server to decode the request. When the character_set_connection server processes the request, it changes the request string from character_set_client to the character set used by the character_set_connectioncharacter_set_results server to return data to the client.

In general, you want to keep the values of these three variables the same as the character set used by the client.

The client encodes the request string using the operating system's character set, and sends an encoded byte string to the server.

The server decodes the byte string sent by the client using the character set represented by character_set_client, and encodes the decoded string according to the character set represented by character_set_connection.

If the character set represented by character_set_connection is the same as the character set used by the column of the specific operation, do the corresponding operation directly, otherwise you need to convert the string in the request from the character set represented by character_set_connection to the character set used by the column of the specific operation.

The byte string obtained from a column is converted from the character set used by the column to the character set represented by character_set_results and sent to the client.

The client parses the received result set byte string using the character set of the operating system.

The role of comparison rules is usually reflected in expressions that compare string sizes and sort a string column.

So much for the introduction of "what are the MySQL character set and comparison rules". Thank you for reading. If you want to know more about the industry, you can follow the industry information channel. The editor will update different knowledge points for you every day.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Database

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report