MySQL character set utf8, utf8mb3, utf8mb4 07/19 Update SLTechnology News&Howtos

MySQL character set utf8, utf8mb3, utf8mb4

2025-07-19 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)06/01 Report--

First of all, to understand the MySQL character set, you need to go to the official documentation to see how the character set is introduced. If English is good, it should be no problem to read the official documents. Search the search box and you can find the relevant explanation. I'll sort it out here so that I can check it later. The character set is in the following section of the official document:

Chapter 10 Character Sets, Collations, Unicode

Https://dev.mysql.com/doc/refman/5.6/en/charset.html

I. character set setting

The MySQL database can do this:

1. Use multiple character sets to store strings.

2. Use multiple collations to compare strings.

3. Mix strings with different character sets or collations on the same server, the same database, or even the same table.

4. Enable the specification of character sets and collations at any level.

MySQL can be set to the following 40 characters:

Mysql > SHOW CHARACTER SET

+-- +

+-- +

| | big5 | Big5 Traditional Chinese | big5_chinese_ci | 2 |

| | dec8 | DEC West European | dec8_swedish_ci | 1 |

| | cp850 | DOS West European | cp850_general_ci | 1 |

| | hp8 | HP West European | hp8_english_ci | 1 |

| | swe7 | 7bit Swedish | swe7_swedish_ci | 1 |

| | tis620 | TIS620 Thai | tis620_thai_ci | 1 |

| | gb2312 | GB2312 Simplified Chinese | gb2312_chinese_ci | 2 |

| | cp1250 | Windows Central European | cp1250_general_ci | 1 |

| | gbk | GBK Simplified Chinese | gbk_chinese_ci | 2 |

| | utf8 | UTF-8 Unicode | utf8_general_ci | 3 |

| | ucs2 | UCS-2 Unicode | ucs2_general_ci | 2 |

| | cp866 | DOS Russian | cp866_general_ci | 1 |

| | cp852 | DOS Central European | cp852_general_ci | 1 |

| | cp1251 | Windows Cyrillic | cp1251_general_ci | 1 |

| | utf16 | UTF-16 Unicode | utf16_general_ci | 4 |

| | cp1256 | Windows Arabic | cp1256_general_ci | 1 |

| | cp1257 | Windows Baltic | cp1257_general_ci | 1 |

| | utf32 | UTF-32 Unicode | utf32_general_ci | 4 |

| | cp932 | SJIS for Windows Japanese | cp932_japanese_ci | 2 |

+-- +

40 rows in set (0.00 sec)

String expressions have a repertoire attribute, which can have two values:

ASCII: The expression can contain only characters in the Unicode range Utility 0000 to Utility 007F.

UNICODE: The expression can contain characters in the Unicode range Utility 0000 to U+10FFFF. This includes characters in the Basic Multilingual Plane (BMP) range (Utility 0000 to U+FFFF) and supplementary characters outside the BMP range (UBG 10000 to U+10FFFF).

Mentioned here: Basic Multilingual Plane (BMP) and supplementary characters

Basic Multilingual Plane (BMP): basic multilingual plane

Supplementary Multilingual Plane (SMP): multilingual supplementary plane

BMP already contains commonly used characters, while SMP is just some less commonly used characters, code points (characters). Such as the symbol of Emoji avatar, the symbol of playing cards and so on.

For more information about BMP and SMP, please see the explanation on wiki: https://en.wikipedia.org/wiki/Plane_(Unicode)

By default, the character set of the metadata table is utf8, which is set by the parameter character_set_system. The character_set_results parameter defaults to utf8, and when the query table data is returned to the client, this parameter is the character set that controls the returned structural data. If you want the server to pass the metadata results back to a different character set, use the SET NAMES statement to force the server to perform character set conversion. The client program can perform the conversion after receiving the result from the server. Clients are more efficient in performing transformations, but this option does not always apply to all clients.

SETNAMES'utf8'

There are default settings for character sets and collations at four levels: server, database, table, and column.

SuffixMeaning_aiAccent insensitive stress insensitive _ asAccent sensitive stress sensitive _ ciCase insensitive is not case sensitive _ cscase-sensitive is case sensitive _ binBinary binary

Set _ ci, as the name implies, _ ai is also included, explicitly case-insensitive and implicit stress insensitive.

Set _ cs, as the name implies, _ as is also included, explicit case sensitivity, implicit stress sensitivity.

Set up the MySQL server character:

Character-set-server

Method 1:

Mysqld

Mysqld-character-set-server=latin1

Mysqld-- character-set-server=latin1\

-- collation-server=latin1_swedish_ci

Method 2:

Cmake.-DDEFAULT_CHARSET=latin1

Or cmake.-DDEFAULT_CHARSET=latin1\

-DDEFAULT_COLLATION=latin1_german1_ci

The current server character set and collation can be determined from the values of the character_set_server and collation_server system variables. These variables can be changed at runtime.

Database Character Set and CollationCREATEDATABASEdb_name [[DEFAULT] CHARACTERSETcharset_name] [[DEFAULT] COLLATEcollation_name]

ALTERDATABASEdb_name [[DEFAULT] CHARACTERSETcharset_name] [[DEFAULT] COLLATEcollation_name]

The keyword SCHEMA can be used instead of DATABASE.

All database options are stored in a text file named db.opt that can be found in the database directory.

The CHARACTER SET and COLLATE clauses make it possible to create databases with different character sets and collations on the same MySQL server.

Check these two parameter settings in your database:

USEdb_name

SELECT@@character_set_database,@@collation_database

Table Character Set and Collation

The CREATE TABLE and ALTER TABLE statements have optional clauses for specifying the table character set and collation:

CREATETABLEtbl_name (column_list) [[DEFAULT] CHARACTERSETcharset_name] [COLLATEcollation_name]]

ALTERTABLEtbl_name [[DEFAULT] CHARACTERSETcharset_name] [COLLATEcollation_name]

Column Character Set and CollationEvery "character" column (that is, a column of type CHAR, VARCHAR, or TEXT) has a column character set and a column collation. Column definition syntax for CREATE TABLE and ALTER TABLE has optional clauses for specifying the column character set and collation:

Col_name {CHAR | VARCHAR | TEXT} (col_length) [CHARACTERSETcharset_name] [COLLATEcollation_name]

Col_name {ENUM | SET} (val_list) [CHARACTERSETcharset_name] [COLLATEcollation_name]

Character String Literal Character Set and CollationFor the simple statement SELECT 'string', the string has the connection default character set and collation defined by the character_set_connection and collation_connection system variables.

A character string literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:

[_ charset_name] 'string' [COLLATEcollation_name]

Examples:

SELECT'abc'

SELECT _ latin1'abc'

SELECT _ binary'abc'

SELECT _ utf8'abc'COLLATE utf8_danish_ci

The National Character SetStandard SQL defines NCHAR or NATIONAL CHAR as a way to indicate that a CHAR column should use some predefined character set. MySQL usesutf8 as this predefined character set. For example, these data type declarations are equivalent:

CHAR (10) CHARACTERSET utf8

NATIONALCHARACTER (10)

NCHAR (10)

As are these:

VARCHAR (10) CHARACTERSET utf8

NATIONALVARCHAR (10)

NVARCHAR (10)

NCHARVARCHAR (10)

NATIONALCHARACTERVARYING (10)

NATIONALCHARVARYING (10)

Character Set IntroducersA character string literal, hexadecimal literal, or bit-value literal may have an optional character set introducer and COLLATE clause, to designate it as a string that uses a particular character set and collation:

[_ charset_name] literal[COLLATEcollation _ name]

Character set introducers and the COLLATE clause are implemented according to standard SQL specifications.

Examples:

SELECT'abc'

SELECT _ latin1'abc'

SELECT _ binary'abc'

SELECT _ utf8'abc'COLLATE utf8_danish_ci

SELECT _ latin1 Xerox 4D7953514C;-- hexadecimal

SELECT _ utf8 0x4D7953514CCOLLATE utf8_danish_ci

SELECT _ latin1 bang 1000001m;-- binary

SELECT _ utf8 0b1000001COLLATE utf8_danish_ci

II. The conversion between utf8 (utf8mb3) and utf8mb4

10.9.8 Converting Between 3-Byte and 4-Byte Unicode Character Sets

The utf8mb3 and utf8mb4 character sets differ as follows:

Utf8mb3 supports only characters in the Basic Multilingual Plane (BMP). Utf8mb4 additionally supports supplementary characters that lie outside the BMP.

Utf8mb3 uses a maximum of three bytes per character. Utf8mb4 uses a maximum of four bytes per character.

Note

This discussion refers to the utf8mb3 and utf8mb4 character set names to be explicit about referring to 3-byte and 4-byte UTF-8 character set data. The exception is that in table definitions, utf8 is used because MySQL converts instances of utf8mb3specified in such definitions to utf8, which is an alias for utf8mb3.

Utf8mb4 and utf8 (utf8mb3) conversions are also easy to convert:

Converting 1.utf8 (utf8mb3) to utf8mb4 can store supplementary characters

Conversion of 2.utf8 (utf8mb3) to utf8mb4 may increase data storage space

3. For BMP character characters, utf8 (utf8mb3) is converted to utf8mb4 with the same code value, the same encoding, the same length, and no change.

4. For supplementary character characters, utf8mb4 is stored in 4 bytes, and because utf8mb3 cannot store supplementary character characters, there is no need to worry about character conversion during character set conversion.

5. The table structure needs to be adjusted during the conversion process: the field length of the columns in the table set by the utf8 (utf8mb3) character set variable length character data type (VARCHAR and text type), and fewer characters will be stored in utf8mb4. For all character data types (CHAR, VARCHAR, and text types), UTF8Mb4 columns can be indexed up to fewer characters than UTF8Mb3 columns. So before you convert, check the field type. Prevents the converted table from storing more data than the field defines, and the maximum number of bytes that can be stored in the field type length. Innodb index column: the maximum index column length is 767 bytes. For utf8mb3, it can index 255characters. For utf8mb4, it can index 191characters. If it cannot be satisfied after the conversion, then you need to change the column to index. The following is the number of bytes that make the index more compressed.

Note

For InnoDB tables that use COMPRESSED or DYNAMIC row format, you can enable the innodb_large_prefix option to permit index key prefixes longer than 767bytes (up to 3072 bytes). Creating such tables also requires the option valuesinnodb_file_format=barracuda and innodb_file_per_table=true.) In this case, enabling the innodb_large_prefixoption enables you to index a maximum of 1024 or 768 characters for utf8mb3 or utf8mb4 columns, respectively. For related information, see Section 14.8.1.7, "Limits on InnoDB Tables".

The preceding types of changes are most likely to be required only if you have very long columns or indexes. Otherwise, you should be able to convert your tables from utf8mb3 to utf8mb4 without problems, using ALTER TABLE as described previously.

6. One-to-one correspondence is also required for MySQL server character sets.

If the 7.master instance changes the character set, then slave needs to be changed accordingly.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.