Shulou (Shulou.com), 05/31 report. Updated 2025-03-29. From: SLTechnology News&Howtos.
This article explains how to handle multi-byte characters in Go. It first reviews how Unicode and UTF-8 relate, then summarizes the APIs of the unicode and unicode/utf8 packages.
1 Overview
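Go source code is UTF-8, so string literals in Go hold UTF-8 encoded bytes, and the language's string facilities (such as range loops) treat strings as UTF-8. UTF-8 is one of the encodings that implement Unicode.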
2 The relationship between UTF-8 and Unicode
Unicode is a character set that aims to cover the characters and symbols of all the world's writing systems. It is maintained by the Unicode Consortium and kept in sync with the International Organization for Standardization (ISO) standard ISO/IEC 10646, the Universal Multiple-Octet Coded Character Set, or UCS for short. Unicode assigns each character a unique numeric value, its code point. For example, the code point of 康 is 24247, or 5EB7 in hexadecimal.
The Unicode character set only defines the mapping between characters and code points. It does not define how a code point is encoded (stored), and that raises practical problems. Because code points differ in magnitude, they need different amounts of storage, and a computer cannot tell how many bytes the next character occupies. Storing every code point in a fixed 4 bytes avoids the ambiguity but wastes space, because ASCII characters need only one byte.
UTF-8 is an encoding scheme that solves this storage problem, which is why it is described as one of the ways to implement Unicode. It is a variable-length encoding: each character takes 1 to 4 bytes, depending on its code point. UTF-8 has two encoding rules:
For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the symbol's Unicode code point. For ASCII characters the UTF-8 encoding is therefore identical to the ASCII encoding.
For an n-byte symbol (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and every following byte starts with the two bits 10. All remaining bits hold the symbol's Unicode code point.
The following table summarizes the encoding rules:
Unicode range (hex) | UTF-8 byte sequence (binary)
--- | ---
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
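The bit patterns in the table can be applied by hand with shifts and masks. The sketch below does that and prints the result next to the output of utf8.EncodeRune. The function name encodeByHand is hypothetical, and the sketch assumes a valid, non-negative code point: unlike the standard library it does not reject surrogates or values above U+10FFFF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodeByHand applies the four bit patterns of UTF-8 to one code point.
// Teaching sketch only: no validation of surrogates or out-of-range values.
func encodeByHand(r rune) []byte {
	switch {
	case r <= 0x7F: // 0xxxxxxx
		return []byte{byte(r)}
	case r <= 0x7FF: // 110xxxxx 10xxxxxx
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r <= 0xFFFF: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte((r>>12)&0x3F),
			0x80 | byte((r>>6)&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'k', 'é', '康', '😀'} {
		std := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(std, r)
		fmt.Printf("U+%04X hand=% x std=% x\n", r, encodeByHand(r), std[:n])
	}
}
```

For 康 (U+5EB7) both columns show the three bytes e5 ba b7, which are 229, 186 and 183 in decimal.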
In Go, the unicode and unicode/utf8 packages provide the Unicode and UTF-8 functionality. The rest of this article summarizes and explains their APIs.
3 unicode package
The unicode package handles Unicode-related operations such as classifying runes and converting case:
Is(rangeTab *RangeTable, r rune) bool
Reports whether rune r is in the set of code points described by rangeTab.
A RangeTable is a set of Unicode code points; usually one of the tables predefined in the unicode package is used.
Check whether a character is a Chinese (Han) character:
unicode.Is(unicode.Scripts["Han"], 'k') // returns false
unicode.Is(unicode.Scripts["Han"], '康') // returns true
In(r rune, ranges ...*RangeTable) bool
Reports whether rune r is in at least one of the given range tables.
A RangeTable is a set of Unicode code points; usually one of the tables predefined in the unicode package is used.
unicode.In('康', unicode.Scripts["Han"], unicode.Scripts["Latin"]) // returns true
IsOneOf(ranges []*RangeTable, r rune) bool
Reports whether rune r is in at least one of the range tables in ranges. It does the same job as In, and In is the recommended function.
IsSpace(r rune) bool
Reports whether rune r is a white space character. In the Latin-1 range, the white space characters are:
'\t', '\n', '\v', '\f', '\r', ' ', U+0085 (NEL), U+00A0 (NBSP)
Other white space characters are defined by category Z and the property Pattern_White_Space.
IsDigit(r rune) bool
Reports whether rune r is a decimal digit.
unicode.IsDigit('9') // returns true
unicode.IsDigit('k') // returns false
IsNumber(r rune) bool
Reports whether rune r is a Unicode number character (category N).
IsLetter(r rune) bool
Reports whether rune r is a letter.
unicode.IsLetter('9') // returns false
unicode.IsLetter('k') // returns true
IsGraphic(r rune) bool
Reports whether rune r is a Unicode graphic character. Graphic characters include letters, marks, numbers, symbols, punctuation, and spaces.
unicode.IsGraphic('9') // returns true
unicode.IsGraphic(',') // returns true
IsControl(r rune) bool
Reports whether rune r is a Unicode control character.
IsMark(r rune) bool
Reports whether rune r is a mark character (category M).
IsPrint(r rune) bool
Reports whether rune r is a printable character. This is the same classification as IsGraphic, except that the only space character counted as printable is the ASCII space, U+0020.
IsPunct(r rune) bool
Reports whether rune r is a Unicode punctuation character.
unicode.IsPunct('9') // returns false
unicode.IsPunct(',') // returns true
IsSymbol(r rune) bool
Reports whether rune r is a Unicode symbol character.
IsLower(r rune) bool
Reports whether rune r is a lowercase letter.
unicode.IsLower('h') // returns true
unicode.IsLower('H') // returns false
IsUpper(r rune) bool
Reports whether rune r is an uppercase letter.
unicode.IsUpper('h') // returns false
unicode.IsUpper('H') // returns true
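The classification functions are typically combined in a switch over the runes of a string. The following sketch counts a few categories; the counts type and the classify function are hypothetical names, and the order of the cases is a design choice of this example rather than anything the package prescribes.

```go
package main

import (
	"fmt"
	"unicode"
)

// counts holds how many runes of each kind were seen.
type counts struct {
	letters, digits, spaces, puncts, others int
}

// classify tallies the runes of s by category. The first matching case wins.
func classify(s string) counts {
	var c counts
	for _, r := range s {
		switch {
		case unicode.IsLetter(r):
			c.letters++
		case unicode.IsDigit(r):
			c.digits++
		case unicode.IsSpace(r):
			c.spaces++
		case unicode.IsPunct(r):
			c.puncts++
		default: // symbols, control characters, marks, ...
			c.others++
		}
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", classify("Go 1.22, 康!"))
}
```

Because the loop ranges over runes rather than bytes, the three-byte character 康 is counted once, as a letter.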
IsTitle(r rune) bool
Reports whether rune r is a title case character. For most characters the title case form is simply the uppercase form, and such characters are not reported as title case by IsTitle. Only a few dedicated title case characters qualify, for example ǅ (U+01C5).
unicode.IsTitle('ǅ') // returns true
unicode.IsTitle('h') // returns false
unicode.IsTitle('H') // returns false
To(_case int, r rune) rune
Converts rune r to the specified case. _case can be unicode.UpperCase, unicode.LowerCase, or unicode.TitleCase.
unicode.To(unicode.UpperCase, 'h') // returns H
ToLower(r rune) rune
Converts rune r to lowercase.
unicode.ToLower('H') // returns h
func (special SpecialCase) ToLower(r rune) rune
Converts rune r to lowercase, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a table of case mappings for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToLower('İ') // returns i
ToUpper(r rune) rune
Converts rune r to uppercase.
unicode.ToUpper('h') // returns H
func (special SpecialCase) ToUpper(r rune) rune
Converts rune r to uppercase, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a table of case mappings for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToUpper('i') // returns İ
ToTitle(r rune) rune
Converts rune r to title case.
unicode.ToTitle('h') // returns H
func (special SpecialCase) ToTitle(r rune) rune
Converts rune r to title case, giving priority to the mappings in the SpecialCase table.
A SpecialCase is a table of case mappings for a particular locale. It is mainly needed for a few languages, such as Turkish (unicode.TurkishCase).
unicode.TurkishCase.ToTitle('i') // returns İ
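To see why a locale-specific table matters, the sketch below maps whole strings through unicode.TurkishCase using strings.Map and compares the result with the default mapping. The helper names upperTurkish and lowerTurkish are made up for this example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// upperTurkish uppercases s with the Turkish rules: i becomes İ (dotted capital I).
func upperTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToUpper, s)
}

// lowerTurkish lowercases s with the Turkish rules: I becomes ı (dotless small i).
func lowerTurkish(s string) string {
	return strings.Map(unicode.TurkishCase.ToLower, s)
}

func main() {
	fmt.Println(strings.ToUpper("istanbul"), upperTurkish("istanbul"))
	fmt.Println(strings.ToLower("ISPARTA"), lowerTurkish("ISPARTA"))
}
```

Runes that have no entry in the special table, such as s or t, fall back to the ordinary unicode.ToUpper and unicode.ToLower mappings.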
SimpleFold(r rune) rune
Iterates over the code points that are equivalent to rune r under Unicode simple case folding, that is, the different ways of writing the same character. Among the equivalent code points it returns the smallest one greater than r; if there is none, it wraps around to the smallest equivalent code point.
unicode.SimpleFold('H') // returns h
unicode.SimpleFold('Φ') // returns φ
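Because SimpleFold wraps around, calling it repeatedly walks through all equivalent code points and eventually returns to the start. A minimal sketch of that loop follows; the name foldOrbit is invented here.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit returns r followed by every other rune that is equivalent to it
// under simple case folding, in the order SimpleFold visits them.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	return orbit
}

func main() {
	for _, r := range []rune{'H', 'K', 'Φ', '1'} {
		fmt.Printf("%c: %q\n", r, foldOrbit(r))
	}
}
```

For K the orbit has three members: K, k, and the Kelvin sign U+212A. A character without case variants, such as 1, forms an orbit of one.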
4 unicode/utf8 package
DecodeLastRune(p []byte) (r rune, size int)
Decodes the last UTF-8 sequence in p and returns the rune and its width in bytes.
utf8.DecodeLastRune([]byte("小韩说课")) // returns 35838, 3; 35838 is the code point of 课
DecodeLastRuneInString(s string) (r rune, size int)
Decodes the last UTF-8 sequence in string s and returns the rune and its width in bytes.
utf8.DecodeLastRuneInString("小韩说课") // returns 35838, 3; 35838 is the code point of 课
DecodeRune(p []byte) (r rune, size int)
Decodes the first UTF-8 sequence in p and returns the rune and its width in bytes.
utf8.DecodeRune([]byte("小韩说课")) // returns 23567, 3; 23567 is the code point of 小
DecodeRuneInString(s string) (r rune, size int)
Decodes the first UTF-8 sequence in string s and returns the rune and its width in bytes.
utf8.DecodeRuneInString("小韩说课") // returns 23567, 3; 23567 is the code point of 小
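The returned size is what makes these functions useful in loops: it tells the caller how far to advance. The sketch below walks a string forward with DecodeRuneInString and builds a reversed copy with DecodeLastRuneInString. The function name reverse and the sample string are choices made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reverse returns s with its runes in reverse order. It repeatedly decodes the
// last rune and copies that rune's bytes to the output.
func reverse(s string) string {
	out := make([]byte, 0, len(s))
	for len(s) > 0 {
		_, size := utf8.DecodeLastRuneInString(s)
		out = append(out, s[len(s)-size:]...)
		s = s[:len(s)-size]
	}
	return string(out)
}

func main() {
	s := "小韩说课"
	// Forward walk: advance by the width of each decoded rune.
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		fmt.Printf("offset %d: %c (U+%04X), %d bytes\n", i, r, r, size)
		i += size
	}
	fmt.Println(reverse(s))
}
```

The forward loop is equivalent to a range loop over the string; the explicit form is mainly useful when the data arrives as a []byte or has to be consumed piece by piece.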
EncodeRune(p []byte, r rune) int
Writes the UTF-8 encoding of rune r into p and returns the number of bytes written. p must be large enough to hold the encoding (at most utf8.UTFMax, that is 4, bytes).
buf := make([]byte, 3)
n := utf8.EncodeRune(buf, '康')
fmt.Println(buf, n) // prints [229 186 183] 3
FullRune(p []byte) bool
Reports whether p begins with a complete UTF-8 encoding of a rune. An invalid encoding counts as complete, because it decodes as a one-byte error rune.
buf := []byte{229, 186, 183} // 康
utf8.FullRune(buf) // returns true
utf8.FullRune(buf[:2]) // returns false
FullRuneInString(s string) bool
Reports whether string s begins with a complete UTF-8 encoding of a rune.
buf := "康"
utf8.FullRuneInString(buf) // returns true
utf8.FullRuneInString(buf[:2]) // returns false
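FullRune is mostly useful when data arrives in chunks that may cut a multi-byte sequence in half. As a rough sketch of that situation, the hypothetical helper decodeAvailable below decodes every complete rune and hands back the unfinished tail so it can be joined with the next chunk. The 4-byte chunk size is an arbitrary choice for the demonstration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAvailable decodes runes from buf for as long as buf starts with a
// complete encoding. It returns the decoded runes and the undecoded remainder.
func decodeAvailable(buf []byte) (runes []rune, rest []byte) {
	for len(buf) > 0 && utf8.FullRune(buf) {
		r, size := utf8.DecodeRune(buf)
		runes = append(runes, r)
		buf = buf[size:]
	}
	return runes, buf
}

func main() {
	data := []byte("小韩说课")
	var pending []byte
	// Feed the data in 4-byte chunks, which cuts through the 3-byte sequences.
	for start := 0; start < len(data); start += 4 {
		end := start + 4
		if end > len(data) {
			end = len(data)
		}
		pending = append(pending, data[start:end]...)
		var runes []rune
		runes, pending = decodeAvailable(pending)
		fmt.Printf("decoded %q, %d byte(s) pending\n", string(runes), len(pending))
	}
}
```

Since invalid bytes count as a full rune, the loop never stalls on garbage input; such bytes simply come out as utf8.RuneError.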
RuneCount(p []byte) int
Returns the number of runes in the UTF-8 encoded bytes of p.
buf := []byte("小韩说课")
len(buf) // returns 12
utf8.RuneCount(buf) // returns 4
RuneCountInString(s string) (n int)
Returns the number of runes in the UTF-8 encoded string s.
buf := "小韩说课"
len(buf) // returns 12
utf8.RuneCountInString(buf) // returns 4
RuneLen(r rune) int
Returns the number of bytes needed to encode rune r, or -1 if r is not a valid value to encode in UTF-8.
utf8.RuneLen('康') // returns 3
utf8.RuneLen('H') // returns 1
RuneStart(b byte) bool
Reports whether byte b could be the first byte of an encoded rune. Second and subsequent bytes of a sequence always start with the bits 10.
buf := "小韩说课"
utf8.RuneStart(buf[0]) // returns true
utf8.RuneStart(buf[1]) // returns false
utf8.RuneStart(buf[3]) // returns true
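One common use of RuneStart is cutting a string at a byte limit without splitting a character. The sketch below backs up from the limit until it reaches a rune boundary. The function name truncate and its handling of negative limits are assumptions of this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncate returns the longest prefix of s that is at most max bytes long and
// does not end in the middle of a UTF-8 sequence.
func truncate(s string, max int) string {
	if max >= len(s) {
		return s
	}
	if max < 0 {
		max = 0
	}
	// s[max] would be the first byte cut off. If it is a continuation byte,
	// the cut falls inside a rune, so move the cut back to that rune's start.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	s := "小韩说课"
	for _, limit := range []int{2, 3, 4, 7, 12} {
		fmt.Printf("limit %2d: %q\n", limit, truncate(s, limit))
	}
}
```

A plain s[:max] would happily return a prefix that ends with half a character, which later fails utf8.ValidString.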
Valid(p []byte) bool
Reports whether p consists entirely of valid UTF-8 encoded runes.
valid := []byte("小韩说课")
invalid := []byte{0xff, 0xfe, 0xfd}
utf8.Valid(valid) // returns true
utf8.Valid(invalid) // returns false
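When Valid returns false, a program usually has to decide what to do with the bad bytes. One option, sketched below under the assumption that replacing them is acceptable, is to decode rune by rune and let each invalid byte become utf8.RuneError (U+FFFD). The function name sanitize is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitize returns p as a string in which every invalid byte has been
// replaced by the Unicode replacement character U+FFFD.
func sanitize(p []byte) string {
	if utf8.Valid(p) {
		return string(p) // fast path: nothing to repair
	}
	var b strings.Builder
	for len(p) > 0 {
		// For an invalid encoding DecodeRune returns (utf8.RuneError, 1),
		// so the loop advances one byte at a time over bad input.
		r, size := utf8.DecodeRune(p)
		b.WriteRune(r)
		p = p[size:]
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", sanitize([]byte("小韩说课")))
	fmt.Printf("%q\n", sanitize([]byte{'a', 0xff, 'b'}))
}
```

Whether to replace, drop, or reject invalid input depends on the application; this sketch only shows the replacing variant.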
ValidRune(r rune) bool
Reports whether rune r can be legally encoded as UTF-8. Code points that are out of range or are surrogate halves are illegal.
valid := 'a'
invalid := rune(0xfffffff)
fmt.Println(utf8.ValidRune(valid)) // prints true
fmt.Println(utf8.ValidRune(invalid)) // prints false
ValidString(s string) bool
Reports whether string s consists entirely of valid UTF-8 encoded runes.
valid := "小韩说课"
invalid := string([]byte{0xff, 0xfe, 0xfd})
fmt.Println(utf8.ValidString(valid)) // prints true
fmt.Println(utf8.ValidString(invalid)) // prints false
That concludes this walk-through of handling multi-byte characters in Go. How the functions behave in a specific program is still best verified in practice.