What's the difference between MySQL encoding formats?

2025-04-17 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains the differences between the character encodings commonly encountered in MySQL. It is intended as a reference for readers who need it.

1. Introduction to Character Sets

A character is a general term for letters and symbols of all kinds, including the scripts of various countries, punctuation marks, graphic symbols, digits, and so on.

A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common character sets include ASCII, GB2312, BIG5, GB18030, and Unicode. For a computer to process text from these character sets accurately, the characters must be encoded so that the computer can recognize and store them.

A character encoding maps each character in a character set to a specific binary representation so that text can be stored in a computer and transmitted over a communication network. A common example is ASCII, which numbers the Latin letters, digits, and other symbols and represents each one with 7 binary bits.

A collation is a set of comparison rules for the characters in one character set. Only once a collation has been chosen is it defined which characters count as equal and how characters are ordered relative to each other. A character set can have multiple collations. MySQL collation names follow this rule: they start with the name of the character set the collation belongs to, continue with a language or country name (or general), and end with ci, cs, or bin. A collation ending in ci is case-insensitive, one ending in cs is case-sensitive, and one ending in bin compares characters by their binary encoded values.
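The naming rule above can be illustrated with a short Python sketch. The helper `describe_collation` is hypothetical and written only for this article; it is not part of MySQL or any client library, and it only handles names that follow the simple three-part pattern described here.

```python
# Hypothetical helper (not a MySQL API): split a collation name into the
# parts described above: character set, language/"general", and suffix.
SUFFIX_MEANING = {
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "bin": "binary (compares encoded values)",
}


def describe_collation(name: str) -> dict:
    parts = name.split("_")
    charset, suffix = parts[0], parts[-1]
    if len(parts) < 2 or suffix not in SUFFIX_MEANING:
        raise ValueError(f"unrecognized collation name: {name!r}")
    # Everything between the character set and the suffix is the
    # language part; it is empty for names such as utf8_bin.
    language = "_".join(parts[1:-1])
    return {
        "charset": charset,
        "language": language,
        "comparison": SUFFIX_MEANING[suffix],
    }


for collation in ("utf8_general_ci", "latin1_general_cs", "utf8_bin"):
    print(collation, "->", describe_collation(collation))
```

The sketch only interprets the name; the actual comparison behavior is implemented inside the MySQL server.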

2. ASCII encoding

ASCII is both a coded character set and a character encoding: the number assigned to each character in the set is stored directly in the computer as that character's encoded value.

For example, the character A sits at position 65 in the ASCII table, so its number is 65 and its encoded value is 0100 0001, which is simply decimal 65 written in binary.
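As a quick check of this example, the following Python sketch uses only built-in functions to show that the table position, the stored byte, and the binary form of A all agree.

```python
ch = "A"

code = ord(ch)                # position of the character in the ASCII table
encoded = ch.encode("ascii")  # the single byte that is actually stored
bits = format(code, "08b")    # the same number written as 8 binary digits

print(code)        # 65
print(encoded[0])  # 65
print(bits)        # 01000001
```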

3. Latin1 character set

The Latin1 character set extends ASCII. It still uses a single byte per character, but it also uses the highest bit of the byte, which extends the range of characters the set can represent.
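A minimal sketch of the difference, using Python's built-in `latin-1` codec (ISO 8859-1) as a stand-in. This is an assumption for illustration: MySQL's own latin1 character set may differ from ISO 8859-1 in a few code points.

```python
ch = "\u00e9"  # the letter e with an acute accent, outside 7-bit ASCII

latin1_bytes = ch.encode("latin-1")
print(len(latin1_bytes))               # 1
print(format(latin1_bytes[0], "08b"))  # 11101001 (the high bit is set)

# Plain ASCII has no code for this character.
try:
    ch.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)                   # False

# Characters that are in ASCII keep the same byte value in Latin1.
same_for_ascii = "A".encode("latin-1") == "A".encode("ascii")
print(same_for_ascii)                  # True
```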

4. UTF-8 encoding

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It was designed by Ken Thompson and Rob Pike in 1992 and is now standardized as RFC 3629. The original design encoded a character in 1 to 6 bytes; RFC 3629 restricts UTF-8 to 1 to 4 bytes, which is enough to cover every Unicode code point up to U+10FFFF.

UTF-8 is a variable-length byte encoding scheme. If a character is encoded in a single byte, the most significant bit of that byte is 0. If it is encoded in multiple bytes, the number of consecutive 1 bits at the start of the first byte gives the number of bytes in the encoding, and each of the remaining bytes starts with 10. In the original design a character could be up to 6 bytes long, as shown in the table:

1 byte 0xxxxxxx

2 bytes 110xxxxx 10xxxxxx

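3 bytes 1110xxxx 10xxxxxx 10xxxxxx

4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (original design only; not valid under RFC 3629)

6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx (original design only; not valid under RFC 3629)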
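The role of the first byte can be sketched in Python as follows. The function name `sequence_length` is made up for this illustration, and the sketch assumes the RFC 3629 limit of 4 bytes, so it rejects the 5-byte and 6-byte forms.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    # Count the consecutive 1 bits starting at the most significant bit.
    ones = 0
    mask = 0b10000000
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: a single-byte character
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    if ones > 4:
        raise ValueError("sequences longer than 4 bytes are not valid in RFC 3629")
    return ones


for text in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = text.encode("utf-8")
    print(format(encoded[0], "08b"), sequence_length(encoded[0]), len(encoded))
```

For each sample character, the length derived from the first byte matches the actual length of the encoded sequence.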

Therefore, the number of bits actually available for the character code in UTF-8 is at most 31 in the original 6-byte form (21 in the 4-byte form allowed by RFC 3629), namely the bits marked x in the table above. Apart from the control bits (the leading bits of the first byte and the 10 at the beginning of each following byte), the bits marked x correspond one-to-one to the bits of the Unicode code, in the same order.

To convert a Unicode code to UTF-8, first drop the high-order zero bits, then choose the minimum number of UTF-8 bytes required for the bits that remain. Thus characters in the basic ASCII character set (Unicode is compatible with ASCII) need only one UTF-8 byte (7 binary bits).
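The conversion can be sketched by hand in Python and compared against the built-in codec. The function `utf8_encode` is written only for this illustration; it assumes the RFC 3629 limit of 4 bytes rather than the older 6-byte form, and real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (1 to 4 bytes)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")

    # Pick the smallest form whose x bits can hold the code point.
    if code_point < 0x80:        # fits in 7 bits
        return bytes([code_point])
    if code_point < 0x800:       # fits in 11 bits
        n_bytes, lead = 2, 0b11000000
    elif code_point < 0x10000:   # fits in 16 bits
        n_bytes, lead = 3, 0b11100000
    else:                        # fits in 21 bits
        n_bytes, lead = 4, 0b11110000

    # Fill the continuation bytes from the lowest 6 bits upward,
    # then put the remaining high bits into the first byte.
    out = []
    for _ in range(n_bytes - 1):
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    out.append(lead | code_point)
    return bytes(reversed(out))


for text in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    encoded = utf8_encode(ord(text))
    assert encoded == text.encode("utf-8")  # agrees with the built-in codec
    print(hex(ord(text)), " ".join(format(b, "08b") for b in encoded))
```

The printed bit patterns line up with the 1-byte to 4-byte rows of the table: the control bits come first in each byte, and the remaining bits are the bits of the Unicode code in order.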
