What kind of encoding is UTF-8?

2025-01-18 Update From: SLTechnology News&Howtos shulou

This article explains what UTF-8 encoding is: what it is designed for, how many bytes it uses for each character, and the rules it follows.

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and it is backward compatible with ASCII: the code points U+0000 to U+007F are each encoded as a single byte with the same value as in ASCII. Software written to handle ASCII text can therefore continue to be used with few or no modifications. As a result, UTF-8 has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.

Basic features

The UCS characters U+0000 to U+007F (ASCII) are encoded as the bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is byte-for-byte identical under ASCII and UTF-8.
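The ASCII compatibility described above can be checked directly. The following is a minimal sketch using Python's built-in codecs; the sample string is an arbitrary choice for illustration.

```python
# Encode the same ASCII-only text with both codecs and compare the bytes.
text = "Hello, UTF-8!"  # arbitrary sample containing only 7-bit ASCII characters

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII text the two encodings should produce identical bytes,
# and every byte should fall in the range 0x00 to 0x7F.
same = ascii_bytes == utf8_bytes
all_below_0x80 = all(b < 0x80 for b in utf8_bytes)

print(same, all_below_0x80)
```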

All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character. The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD in the original definition (0xC2 to 0xF4 under the current RFC 3629 definition), and it indicates how many bytes the character occupies. All remaining bytes of a multibyte sequence are in the range 0x80 to 0xBF. This makes resynchronization easy and makes the encoding stateless and robust against lost bytes.
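Because the byte ranges for ASCII, lead, and continuation bytes do not overlap, a reader that lands in the middle of a character can find the next character boundary by skipping continuation bytes. The sketch below illustrates this; `byte_kind` and `resync` are hypothetical helper names introduced for this example, not part of any library.

```python
def byte_kind(b: int) -> str:
    """Classify one byte by the ranges described above."""
    if b < 0x80:
        return "ascii"
    if b <= 0xBF:
        return "continuation"  # 0x80 to 0xBF
    # Everything from 0xC0 up is treated as a lead byte here; this sketch
    # does not reject values that a strict decoder would refuse.
    return "lead"


def resync(data: bytes, pos: int) -> int:
    """Return the index of the first character boundary at or after pos."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos


data = "aé中".encode("utf-8")  # bytes: 61 | C3 A9 | E4 B8 AD
kinds = [byte_kind(b) for b in data]

# Start in the middle of "é" (index 2) and skip forward to the next boundary.
boundary = resync(data, 2)
rest = data[boundary:].decode("utf-8")
print(kinds, boundary, rest)
```

No state from earlier bytes is needed to recover, which is what makes the encoding stateless in the sense used above.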

In the original definition, UTF-8 encoded characters can theoretically be up to 6 bytes long; the current definition (RFC 3629) limits them to 4 bytes, which is enough to cover U+0000 to U+10FFFF. 16-bit BMP characters are at most 3 bytes long. The sorting order of big-endian UCS-4 byte strings is preserved, and the bytes 0xFE and 0xFF are never used in UTF-8.
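These properties can be spot-checked with Python's built-in encoder. This is a sketch over a random sample of code points, not a proof; the sample size and seed are arbitrary, and surrogates (U+D800 to U+DFFF) are skipped because they cannot be encoded as UTF-8.

```python
import random

random.seed(0)  # fixed seed so the sample is reproducible


def random_scalar() -> int:
    """Pick a random code point, skipping the surrogate range."""
    while True:
        cp = random.randint(0, 0x10FFFF)
        if not 0xD800 <= cp <= 0xDFFF:
            return cp


code_points = [random_scalar() for _ in range(1000)]

# Sorting by code point and sorting by encoded bytes should give the same order.
by_code_point = sorted(code_points)
by_utf8_bytes = sorted(code_points, key=lambda cp: chr(cp).encode("utf-8"))
order_preserved = by_code_point == by_utf8_bytes

# The bytes 0xFE and 0xFF should not appear in any encoded character.
uses_fe_ff = any(
    b in (0xFE, 0xFF) for cp in code_points for b in chr(cp).encode("utf-8")
)

# No BMP character should need more than 3 bytes.
max_bmp_len = max(
    len(chr(cp).encode("utf-8"))
    for cp in range(0x10000)
    if not 0xD800 <= cp <= 0xDFFF
)

print(order_preserved, uses_fe_ff, max_bmp_len)
```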

Number of encoded bytes

UTF-8 uses 1 to 4 bytes to encode each character:

A US-ASCII character needs only 1 byte (Unicode range U+0000 to U+007F).

Latin letters with diacritics, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana letters, need 2 bytes (Unicode range U+0080 to U+07FF).

The other characters in the Basic Multilingual Plane (BMP), which covers most characters in common use, including Chinese, Japanese and Korean, need 3 bytes (Unicode range U+0800 to U+FFFF).

Characters in the rarely used Unicode supplementary planes need 4 bytes (Unicode range U+10000 to U+10FFFF).
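The byte counts listed above follow directly from the code point ranges. As a minimal sketch, the hypothetical helper `utf8_length` below maps a code point to its expected length and compares the result with Python's built-in encoder; the sample characters are arbitrary.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point, by range."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1  # US-ASCII
    if code_point <= 0x7FF:
        return 2  # e.g. Latin with diacritics, Greek, Cyrillic
    if code_point <= 0xFFFF:
        return 3  # rest of the BMP, including most CJK characters
    return 4      # supplementary planes


samples = ["A", "é", "Ж", "中", "😀"]  # arbitrary examples, one or more per range
lengths = {ch: utf8_length(ord(ch)) for ch in samples}
matches_builtin = all(
    utf8_length(ord(ch)) == len(ch.encode("utf-8")) for ch in samples
)

print(lengths, matches_builtin)
```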

UTF-8 coding rules:

If a character is encoded as a single byte, the highest bit of that byte is 0. If it is encoded as several bytes, the number of consecutive 1 bits at the start of the first byte (counting from the highest bit) gives the total number of bytes in the sequence, and every following byte starts with the bits 10.
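The rules above can be turned into a small hand-written encoder. This is an illustrative sketch limited to the 1 to 4 byte forms; `encode_utf8` and `sequence_length` are hypothetical names for this example, and in practice Python's built-in `str.encode("utf-8")` should be used instead.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by applying the bit-prefix rules by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])                                # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),                   # 110xxxxx
                      0x80 | (cp & 0x3F)])                # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),                  # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                      # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def sequence_length(first_byte: int) -> int:
    """Count the leading 1 bits of a first byte to get the sequence length."""
    ones = 0
    mask = 0x80
    while mask and first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    return 1 if ones == 0 else ones


encoded = encode_utf8(ord("中"))  # U+4E2D
print(encoded.hex(), sequence_length(encoded[0]))
```

Reading the length from the first byte is what lets a decoder know how many continuation bytes to expect before it has seen them.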
