
How to use UTF-8 encoding in MySQL

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to use UTF-8 encoding in MySQL. It covers what an encoding is, how MySQL's "utf8" character set came to differ from real UTF-8, and what to use instead.

So what is an encoding? What is UTF-8?

As we all know, computers use zeros and ones to store text. For example, if the character "C" is saved as "01000011", then the computer needs to go through two steps to display this character:

1. The computer reads "01000011" and gets the number 67, because 67 is encoded as "01000011".

2. The computer looks up 67 in the Unicode character set and finds "C".

The same happens in the other direction:

1. My computer maps "C" to 67 in the Unicode character set.

2. My computer encodes 67 as "01000011" and sends it to the web server.
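The two directions described above can be sketched in a few lines of Python. This is only an illustration of the bit pattern, number and character relationship; the variable names are made up for the example.

```python
# Decode: bits -> number -> character.
bits = "01000011"
code_point = int(bits, 2)        # interpret the bits as a number
char = chr(code_point)           # look the number up in Unicode

# Encode: character -> number -> bits.
number = ord("C")                # map the character to its Unicode number
encoded = "C".encode("utf-8")    # a single byte for this character
round_trip_bits = format(encoded[0], "08b")

print(code_point, char, number, round_trip_bits)  # 67 C 67 01000011
```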

Almost all web applications use the Unicode character set, because there is no reason to use any other.

The Unicode character set has room for more than a million characters. The simplest encoding is UTF-32, which uses 32 bits for every character. It is the easiest to work with, because computers have long treated 32-bit values as numbers, and computers are best at dealing with numbers. The problem is that it wastes space.

UTF-8 saves space. In UTF-8, the character "C" needs only 8 bits, while some less commonly used characters, such as emoji, need 32 bits. Other characters use 16 or 24 bits. An article like this one, encoded in UTF-8, takes up only about 1/4 of the space it would need in UTF-32.
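The size difference is easy to check. The sketch below compares the two encodings for a few sample characters; the helper name encoded_sizes and the sample characters are chosen for illustration only.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-32 byte count) for text."""
    # "utf-32-be" is used so that no byte-order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))

# C, e with acute accent, euro sign, and an emoji.
samples = ["C", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-32 {utf32_len} bytes")

# For ASCII-only English text, UTF-8 needs a quarter of the bytes.
english = "An article like this one"
print(encoded_sizes(english))
```

Text that is mostly ASCII gets the full 4x saving; text in other scripts saves less, because those characters take 2 to 4 bytes each in UTF-8.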

MySQL's "utf8" character set is not compatible with other programs: it cannot store characters that need four bytes in UTF-8, such as emoji.
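Because MySQL's "utf8" accepts sequences of at most three bytes, one way to spot trouble before it reaches the database is to check text on the application side. The following is a minimal sketch, not part of MySQL or any driver; the function names are hypothetical.

```python
def fits_mysql_utf8(text):
    """Return True if every character needs at most 3 bytes in UTF-8."""
    # Code points above U+FFFF are exactly the ones that need 4 bytes.
    return all(ord(ch) <= 0xFFFF for ch in text)

def four_byte_chars(text):
    """List the characters that a 3-byte character set cannot hold."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

print(fits_mysql_utf8("C \u00e9 \u20ac"))    # True
print(fits_mysql_utf8("hi \U0001F600"))      # False
```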

A brief history of MySQL

Why did MySQL developers make "utf8" invalid? We may be able to find the answer in the commit logs.

MySQL has supported UTF-8 since version 4.1, that is, since 2003. The UTF-8 standard in use today (RFC 3629) came later.

The older UTF-8 standard (RFC 2279) supported up to 6 bytes per character. MySQL developers implemented RFC 2279 in a preview version of MySQL 4.1 on March 28, 2002.

In September of the same year, they made an adjustment to the MySQL source code: "UTF8 now supports a sequence of up to 3 bytes."

Who committed this change, and why? Nobody knows. When MySQL migrated to Git (it originally used BitKeeper), many committer names in the MySQL code base were lost. Nothing on the mailing list around September 2003 explains the change.

But I can try to guess.

In 2002, MySQL offered a large performance improvement to users who could guarantee that every row of a table used the same number of bytes. To get it, the user had to define text columns as "CHAR", and every "CHAR" column always holds the same number of characters. If fewer characters are inserted than the defined number, MySQL pads the value with trailing spaces; if more are inserted, the excess is truncated.

When MySQL developers first tried UTF-8 with up to 6 bytes per character, CHAR(1) used 6 bytes, CHAR(2) used 12 bytes, and so on.
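The fixed-width behavior and the byte arithmetic can be modeled in a few lines. This is a simplified sketch of what the paragraphs above describe, not MySQL's actual storage code; store_char and reserved_bytes are hypothetical names, and a real server in strict SQL mode rejects an over-long value instead of truncating it.

```python
def store_char(value, width):
    """Model a CHAR(width) column: truncate the excess, pad with spaces."""
    return value[:width].ljust(width)

def reserved_bytes(width, max_bytes_per_char):
    """Bytes a fixed-width row must set aside for a CHAR(width) column."""
    return width * max_bytes_per_char

print(repr(store_char("ab", 4)))       # 'ab  '
print(repr(store_char("abcdef", 4)))   # 'abcd'

for width in (1, 2):
    # 6 bytes per character (RFC 2279) versus the 3-byte limit.
    print(width, reserved_bytes(width, 6), reserved_bytes(width, 3))
```

Under this model, capping a character at 3 bytes halves the space a fixed-width row reserves, which is consistent with the guess that the limit was a space and speed optimization.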

To be clear, that initial behavior was correct, but unfortunately that version was never released. It was documented, though, and the documentation circulated widely, and everyone who understood UTF-8 agreed with what it said.

But MySQL developers or vendors were evidently worried about users who would do these two things:

1. Define columns as CHAR (CHAR is an antique today, but at the time it made MySQL faster, which has not been true since 2005).

2. Set the encoding of those CHAR columns to "utf8".

My guess is that MySQL developers wanted to help users who were after both space and speed savings, and in doing so they broke the "utf8" encoding.

So there were no winners. Users who wanted both space and speed savings got CHAR columns that, with "utf8", were bigger and slower than expected. Users who wanted correctness could not store four-byte characters such as emoji in "utf8" columns.

Once this invalid character set had been released, MySQL could not fix it, because a fix would have forced every user to rebuild every database. Finally, in 2010, MySQL released "utf8mb4", which supports real UTF-8.
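In practice, using real UTF-8 means switching databases and tables to "utf8mb4". The sketch below only builds the SQL text for such a switch; the helper names, the example database and table names, and the choice of the utf8mb4_unicode_ci collation are assumptions for illustration, and the generated statements should be tried on a copy of the data first.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")

def _quote(name):
    """Backtick-quote a simple identifier, rejecting anything unusual."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"

def utf8mb4_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements that move a database and its tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {_quote(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {_quote(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements

# "shop" and "users" are made-up names for the example.
for statement in utf8mb4_statements("shop", ["users"]):
    print(statement)
```

The client connection also has to use utf8mb4 (for example by running SET NAMES utf8mb4), otherwise four-byte characters can still be lost on the way to the server.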

Why does this make people so crazy?

Because of this problem, I was frantic for a whole week. I was fooled by "utf8", and it took me a long time to find the bug. But I am surely not the only one. Almost every article on the Internet treats "utf8" as real UTF-8.

"utf8" can only be regarded as a proprietary character set. It has created new problems that have yet to be solved.

That is how to use UTF-8 encoding in MySQL. If you have run into similar problems, the analysis above may help you understand them.
