
How to deal with multi-byte characters in Go language

2025-04-19 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article explains how to handle multi-byte characters in Go. It first reviews how UTF-8 relates to Unicode, then summarizes the APIs of the unicode and unicode/utf8 packages.

1 Overview

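Go strings are encoded as UTF-8: Go source files are UTF-8, so string literals hold UTF-8 bytes, and the standard library treats string contents as UTF-8 text. UTF-8 is one of the encodings that implement Unicode.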

2 The relationship between UTF-8 and Unicode

Unicode is a character set, maintained by the Unicode Consortium and kept synchronized with the ISO/IEC 10646 standard, that aims to cover the letters and symbols of all the world's writing systems. ISO named it the Universal Multiple-Octet Coded Character Set, UCS for short. Unicode assigns each character a unique number called a code point. For example, the code point of 康 is 24247, or 5EB7 in hexadecimal (written U+5EB7).
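In Go a code point is represented by the rune type, so the value quoted above can be checked directly. This is only a minimal sketch; codePoint is a hypothetical helper name.

```go
package main

import "fmt"

// codePoint is a hypothetical helper that formats a rune as a decimal
// number followed by the conventional U+XXXX notation.
func codePoint(r rune) string {
	return fmt.Sprintf("%d %U", r, r)
}

func main() {
	fmt.Println(codePoint('康')) // 24247 U+5EB7
	fmt.Println(codePoint('A')) // 65 U+0041
}
```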

The Unicode character set only defines the mapping between characters and code points; it does not define how a code point is encoded (stored) as bytes, and that leaves two problems. Because code points differ in magnitude, the storage each one needs differs, so a decoder cannot tell how many bytes the next character occupies. Alternatively, storing every code point in a fixed 4 bytes wastes space, because ASCII characters need only one byte.

UTF-8 is an encoding scheme that solves these problems; it is one of the ways to implement Unicode. It is a variable-length encoding that uses 1 to 4 bytes per character, depending on the code point. UTF-8 has two encoding rules:

For a single-byte character, the first bit of the byte is 0 and the remaining seven bits hold the character's code point. For ASCII characters, the UTF-8 encoding is therefore identical to the ASCII code.

For an n-byte character (n > 1), the first n bits of the first byte are 1 and bit n + 1 is 0; the first two bits of every following byte are 10. All remaining bits hold the character's code point.

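The following table summarizes the encoding rules:

Code point range (hex)   UTF-8 encoding (binary)
0000 0000 - 0000 007F    0xxxxxxx
0000 0080 - 0000 07FF    110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx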

In Go, the unicode and unicode/utf8 packages provide Unicode and UTF-8 support. The rest of this article summarizes their APIs.
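To connect the two UTF-8 encoding rules with these packages, the sketch below applies the bit layout by hand and compares the result with utf8.EncodeRune. The function encodeUTF8 is a hypothetical helper written for this illustration; it assumes a valid code point and skips the range and surrogate checks that the standard library performs.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// encodeUTF8 is a hypothetical helper that applies the UTF-8 bit layout
// by hand. It assumes r is a valid code point (no surrogate or range checks).
func encodeUTF8(r rune) []byte {
	switch {
	case r < 0x80: // 1 byte: 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 2 bytes: 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | (byte(r) & 0x3F),
		}
	case r < 0x10000: // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	default: // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | (byte(r>>12) & 0x3F),
			0x80 | (byte(r>>6) & 0x3F),
			0x80 | (byte(r) & 0x3F),
		}
	}
}

func main() {
	buf := make([]byte, utf8.UTFMax)
	for _, r := range []rune{'A', 'é', '康', '😀'} {
		n := utf8.EncodeRune(buf, r) // reference encoding from the standard library
		got := encodeUTF8(r)
		// Print the character, its hand-encoded bytes in hex, and whether
		// they match the standard library's encoding.
		fmt.Printf("%c: % x match=%v\n", r, got, bytes.Equal(got, buf[:n]))
	}
}
```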

3 unicode package

Go provides the unicode package for Unicode-related operations, as follows:

Is(rangeTab *RangeTable, r rune) bool

Reports whether rune r is within the character range specified by rangeTab.

RangeTable is a set of Unicode code points; the tables predefined in the unicode package are normally used.

To check whether a character belongs to the Han (Chinese character) set:

unicode.Is(unicode.Scripts["Han"], 'k') // returns false
unicode.Is(unicode.Scripts["Han"], '康') // returns true

In(r rune, ranges ...*RangeTable) bool

Reports whether rune r is within any of the character ranges specified by ranges.

RangeTable is a set of Unicode code points; the tables predefined in the unicode package are normally used.

unicode.In('康', unicode.Scripts["Han"], unicode.Scripts["Latin"]) // returns true
unicode.In('k', unicode.Scripts["Han"], unicode.Scripts["Latin"]) // returns true

IsOneOf(ranges []*RangeTable, r rune) bool

Reports whether rune r is within any of the character ranges in ranges. It is equivalent to In, and In is the recommended function.

IsSpace(r rune) bool

Reports whether rune r is a white space character. In the Latin-1 space, the white space characters are:

'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)

Other white space characters are defined by category Z and the property Pattern_White_Space.
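Because IsSpace has the signature func(rune) bool, it can be passed directly to functions such as strings.FieldsFunc. The sketch below splits a string on any Unicode white space, including the no-break space and the ideographic space U+3000; splitOnSpace is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitOnSpace splits s around runs of Unicode white space characters.
func splitOnSpace(s string) []string {
	return strings.FieldsFunc(s, unicode.IsSpace)
}

func main() {
	// Tab, no-break space (U+00A0) and ideographic space (U+3000) all separate fields.
	fmt.Printf("%q\n", splitOnSpace("a\tb\u00a0c\u3000d")) // ["a" "b" "c" "d"]
}
```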

IsDigit(r rune) bool

Reports whether rune r is a decimal digit.

unicode.IsDigit('9') // returns true
unicode.IsDigit('k') // returns false

IsNumber(r rune) bool

Reports whether rune r is a Unicode number character (category N).

IsLetter(r rune) bool

Reports whether rune r is a letter (category L).

unicode.IsLetter('9') // returns false
unicode.IsLetter('k') // returns true

IsGraphic(r rune) bool

Reports whether rune r is a Unicode graphic character. Graphic characters include letters, marks, numbers, symbols, punctuation, and spaces.

unicode.IsGraphic('9') // returns true
unicode.IsGraphic(',') // returns true

IsControl(r rune) bool

Reports whether rune r is a Unicode control character.

IsMark(r rune) bool

Reports whether rune r is a mark character (category M).

IsPrint(r rune) bool

Reports whether rune r is a printable character. The classification is the same as for IsGraphic, except that the only space character counted as printable is the ASCII space, U+0020.

IsPunct(r rune) bool

Reports whether rune r is a Unicode punctuation character.

unicode.IsPunct('9') // returns false
unicode.IsPunct(',') // returns true

IsSymbol(r rune) bool

Reports whether rune r is a Unicode symbol character.
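The classification functions above combine naturally in a switch over the runes of a string. The sketch below tallies a few categories; the counts type, the classify function, and the sample text are all invented for this illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// counts is a hypothetical tally of character classes found in a string.
type counts struct {
	Letters, Digits, Puncts, Spaces, Others int
}

// classify walks s rune by rune and counts each rune under the first
// matching class; anything unmatched (symbols, controls, ...) is "Others".
func classify(s string) counts {
	var c counts
	for _, r := range s {
		switch {
		case unicode.IsLetter(r):
			c.Letters++
		case unicode.IsDigit(r):
			c.Digits++
		case unicode.IsPunct(r):
			c.Puncts++
		case unicode.IsSpace(r):
			c.Spaces++
		default:
			c.Others++
		}
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", classify("Go 1.22, 你好!"))
	// {Letters:4 Digits:3 Puncts:3 Spaces:2 Others:0}
}
```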

IsLower(r rune) bool

Reports whether rune r is a lowercase letter.

unicode.IsLower('h') // returns true
unicode.IsLower('H') // returns false

IsUpper(r rune) bool

Reports whether rune r is an uppercase letter.

unicode.IsUpper('h') // returns false
unicode.IsUpper('H') // returns true

IsTitle(r rune) bool

Reports whether rune r is a title case letter. For most letters the title case mapping is simply the uppercase form, but IsTitle reports true only for the few characters in the title case category (Lt), such as Dž (U+01C5). An ordinary uppercase letter such as 'H' is not a title case letter.

unicode.IsTitle('\u01C5') // returns true (Dž)
unicode.IsTitle('h') // returns false
unicode.IsTitle('H') // returns false

To(_case int, r rune) rune

Converts rune r to the specified case. _case may be unicode.UpperCase, unicode.LowerCase, or unicode.TitleCase.

unicode.To(unicode.UpperCase, 'h') // returns H

ToLower(r rune) rune

Converts rune r to lowercase.

unicode.ToLower('H') // returns h

func (special SpecialCase) ToLower(r rune) rune

Converts rune r to lowercase, giving priority to the mappings in the SpecialCase table.

A SpecialCase is a case mapping table for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToLower('İ') // returns i

ToUpper(r rune) rune

Converts rune r to uppercase.

unicode.ToUpper('h') // returns H

func (special SpecialCase) ToUpper(r rune) rune

Converts rune r to uppercase, giving priority to the mappings in the SpecialCase table.

A SpecialCase is a case mapping table for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToUpper('i') // returns İ

ToTitle(r rune) rune

Converts rune r to title case.

unicode.ToTitle('h') // returns H

func (special SpecialCase) ToTitle(r rune) rune

Converts rune r to title case, giving priority to the mappings in the SpecialCase table.

A SpecialCase is a case mapping table for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).

unicode.TurkishCase.ToTitle('i') // returns İ
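As a rough sketch of how the case conversion functions are used on whole strings, the example below upper-cases only the first character of a string and contrasts the default mapping of 'i' with the Turkish one. The helpers upperFirst and turkishUpper are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// upperFirst returns s with its first rune converted to upper case.
// An empty string or an invalid leading byte is returned unchanged.
func upperFirst(s string) string {
	r, size := utf8.DecodeRuneInString(s)
	if r == utf8.RuneError && size <= 1 {
		return s
	}
	return string(unicode.ToUpper(r)) + s[size:]
}

// turkishUpper upper-cases r using the Turkish special-case table.
func turkishUpper(r rune) rune {
	return unicode.TurkishCase.ToUpper(r)
}

func main() {
	fmt.Println(upperFirst("élan")) // Élan

	// The default mapping gives a plain I; the Turkish table gives dotted İ (U+0130).
	fmt.Println(string(unicode.ToUpper('i')), string(turkishUpper('i'))) // I İ
}
```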

SimpleFold(r rune) rune

Iterates over the code points that are equivalent to rune r under Unicode simple case folding, that is, the different written forms of the same character. It returns the smallest equivalent rune greater than r if there is one, and otherwise wraps around to the smallest equivalent rune.

unicode.SimpleFold('H') // returns h
unicode.SimpleFold('Φ') // returns φ
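Since SimpleFold cycles through the equivalent code points and eventually returns to its starting point, calling it repeatedly enumerates every written form of a character. The sketch below collects that cycle; foldOrbit is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit returns r followed by the other runes that are equivalent to
// it under simple case folding, in the order SimpleFold visits them.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	// 'K' folds to 'k', then to the Kelvin sign U+212A, then back to 'K'.
	fmt.Printf("%U\n", foldOrbit('K')) // [U+004B U+006B U+212A]
	// A digit has no case variants, so its orbit is just itself.
	fmt.Printf("%U\n", foldOrbit('1')) // [U+0031]
}
```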

4 unicode/utf8 package

DecodeLastRune(p []byte) (r rune, size int)

Decodes the last UTF-8 sequence in p and returns the rune and its length in bytes.

utf8.DecodeLastRune([]byte("小韩说课")) // returns 35838, 3; 35838 is the code point of 课

DecodeLastRuneInString(s string) (r rune, size int)

Decodes the last UTF-8 sequence in string s and returns the rune and its length in bytes.

utf8.DecodeLastRuneInString("小韩说课") // returns 35838, 3; 35838 is the code point of 课

DecodeRune(p []byte) (r rune, size int)

Decodes the first UTF-8 sequence in p and returns the rune and its length in bytes.

utf8.DecodeRune([]byte("小韩说课")) // returns 23567, 3; 23567 is the code point of 小

DecodeRuneInString(s string) (r rune, size int)

Decodes the first UTF-8 sequence in string s and returns the rune and its length in bytes.

utf8.DecodeRuneInString("小韩说课") // returns 23567, 3; 23567 is the code point of 小

EncodeRune(p []byte, r rune) int

Writes the UTF-8 encoding of rune r into p and returns the number of bytes written. p must be long enough to hold the encoding (at most utf8.UTFMax, that is 4, bytes).

buf := make([]byte, 3)
n := utf8.EncodeRune(buf, '康')
fmt.Println(buf, n) // prints [229 186 183] 3
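When the runes are not known in advance, one way to satisfy the length requirement is a scratch buffer of utf8.UTFMax bytes, which is large enough for any rune. The sketch below takes that approach; encodeAll is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeAll returns the concatenated UTF-8 encoding of the given runes.
// The scratch buffer is utf8.UTFMax bytes, so EncodeRune always has room.
func encodeAll(rs []rune) []byte {
	var out []byte
	var scratch [utf8.UTFMax]byte
	for _, r := range rs {
		n := utf8.EncodeRune(scratch[:], r)
		out = append(out, scratch[:n]...)
	}
	return out
}

func main() {
	fmt.Println(encodeAll([]rune{'H', '康'})) // [72 229 186 183]
}
```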

FullRune(p []byte) bool

Reports whether p begins with a complete UTF-8 encoding of a rune.

buf := []byte{229, 186, 183} // 康
utf8.FullRune(buf) // returns true
utf8.FullRune(buf[:2]) // returns false

FullRuneInString(s string) bool

Reports whether string s begins with a complete UTF-8 encoding of a rune.

buf := "康"
utf8.FullRuneInString(buf) // returns true
utf8.FullRuneInString(buf[:2]) // returns false
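FullRune is mostly useful when data arrives in pieces and a multi-byte character may be split across two reads. The sketch below, with a hypothetical decodeChunks helper and invented input, holds back an incomplete trailing sequence until the next chunk completes it. Any incomplete bytes left after the final chunk are simply dropped here, which a real reader would need to handle.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeChunks decodes runes from a sequence of byte chunks, carrying an
// incomplete trailing sequence over until a later chunk completes it.
// Incomplete bytes remaining after the last chunk are discarded.
func decodeChunks(chunks [][]byte) []rune {
	var out []rune
	var pending []byte
	for _, c := range chunks {
		pending = append(pending, c...)
		for len(pending) > 0 && utf8.FullRune(pending) {
			r, size := utf8.DecodeRune(pending)
			out = append(out, r)
			pending = pending[size:]
		}
	}
	return out
}

func main() {
	// The three bytes of 康 are split across the two chunks.
	chunks := [][]byte{{229, 186}, {183, 'H'}}
	fmt.Println(string(decodeChunks(chunks))) // 康H
}
```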

RuneCount(p []byte) int

Returns the number of UTF-8-encoded runes in p.

buf := []byte("小韩说课")
len(buf) // returns 12
utf8.RuneCount(buf) // returns 4

RuneCountInString(s string) (n int)

Returns the number of UTF-8-encoded runes in string s.

buf := "小韩说课"
len(buf) // returns 12
utf8.RuneCountInString(buf) // returns 4

RuneLen(r rune) int

Returns the number of bytes needed to encode rune r.

utf8.RuneLen('康') // returns 3
utf8.RuneLen('H') // returns 1

RuneStart(b byte) bool

Reports whether byte b could be the first byte of an encoded rune.

buf := "小韩说课"
utf8.RuneStart(buf[0]) // returns true
utf8.RuneStart(buf[1]) // returns false
utf8.RuneStart(buf[3]) // returns true

Valid(p []byte) bool

Reports whether p consists entirely of valid UTF-8-encoded runes.

valid := []byte("小韩说课")
invalid := []byte{0xff, 0xfe, 0xfd}
utf8.Valid(valid) // returns true
utf8.Valid(invalid) // returns false

ValidRune(r rune) bool

Reports whether rune r can be legally encoded as UTF-8. Code points that are out of range or that are surrogate halves are invalid.

valid := 'a'
invalid := rune(0xfffffff)
fmt.Println(utf8.ValidRune(valid)) // prints true
fmt.Println(utf8.ValidRune(invalid)) // prints false

ValidString(s string) bool

Reports whether string s consists entirely of valid UTF-8-encoded runes.

valid := "小韩说课"
invalid := string([]byte{0xff, 0xfe, 0xfd})
fmt.Println(utf8.ValidString(valid)) // prints true
fmt.Println(utf8.ValidString(invalid)) // prints false

Thank you for reading. This concludes the overview of how to handle multi-byte characters in Go. How these functions behave in a specific situation still needs to be verified in practice.
