
Example Analysis of Perl Unicode

2025-04-07 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/02 Report--

This article walks through Perl's Unicode handling in detail, with examples. It is shared here as a practical reference.

The Complete Guide to Perl Unicode

This article applies to Perl 5.8 and later.

Perl internal form

From Perl's point of view, a string takes one of only two forms. One is octets, an 8-bit sequence, commonly called a byte array. The other is a utf8-encoded string, which perl simply calls a string. In other words, Perl knows only two kinds of data: octets (bytes, which include plain ASCII) and utf8 (string).

utf8flag

So how does perl decide whether a string is octets or utf8-encoded? perl does not guess; it relies entirely on the utf8flag attached to the string. Inside perl, a string consists of two parts: the data and the utf8flag. For example, the string "中国" is stored inside perl as follows:

utf8flag    data
On          中国

If the utf8flag is On, perl treats 中国 as a utf8 string; if the utf8flag is Off, perl treats it as octets. All string-related functions, including regular expressions, are affected by the utf8flag. Let's look at an example:

Program code:

use Encode;
use strict;
my $str = "中国";
Encode::_utf8_on($str);       # treat the data as a utf8 string
print length($str) . "\n";    # counts characters
Encode::_utf8_off($str);      # treat the data as octets
print length($str) . "\n";    # counts bytes

The result of the operation is:

Program code:

2
6

Here we use the _utf8_on and _utf8_off functions of the Encode module to switch the utf8flag of the string "中国" on and off. When the utf8flag is on, "中国" is treated as a utf8 string, so its length is 2. When the utf8flag is off, "中国" is treated as octets (a byte array), so its length is 6. (My editor saves files as utf8; if your editor uses gb2312, the length will be 4.)

Let's look at an example of a regular expression:

Program code:

UseEncode; usestrict; my$a= "china---- China"; my$b= "china---- China"; Encode::_utf8_on ($a); Encode::_utf8_off ($b); $a=~s/\ Ware and Greg; $b=~s/\ Ware Grease; print$a, "\ n"; print$b, "\ n"

Running result:

Program code:

WidecharacterinprintatPerl Unicode.plline10.

China China

China

The result * line is a warning, which we will discuss later. The second line of the result indicates that when utf8flag is on,\ w in the regular expression can match Chinese, and vice versa.

How do you find out whether the utf8flag of a string is on? Use Encode::is_utf8($str). This function does not detect whether the data is utf8-encoded; it only reports whether the utf8flag is on.
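The distinction between "data that is valid utf8" and "utf8flag is on" can be sketched as follows. This is a minimal illustration, not part of the original article: the variable names are made up, and the bytes are written as hex escapes so the result does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode;

# Six bytes that are the UTF-8 encoding of two Chinese characters
# (U+4E2D U+56FD). Written as escapes, so the flag starts out off.
my $octets = "\xe4\xb8\xad\xe5\x9b\xbd";

# The data is valid UTF-8, but is_utf8 only reports the flag.
my $flag_before = Encode::is_utf8($octets) ? 1 : 0;

# decode() returns a character string with the flag turned on.
my $string     = Encode::decode("UTF-8", $octets);
my $flag_after = Encode::is_utf8($string) ? 1 : 0;

print "before: flag=$flag_before length=", length($octets), "\n";
print "after:  flag=$flag_after length=", length($string), "\n";
```

The same bytes report length 6 with the flag off and length 2 once decoded, which matches the first example in this section.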

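eq

eq is the string comparison operator. It compares the two strings character by character, as perl sees them. As a result, the same non-ASCII data compares unequal when one copy has the utf8flag on and the other has it off: the first is a short sequence of characters, the second a longer sequence of bytes. To get a reliable comparison, make sure both strings are in the same form first.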
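A small sketch of how eq reacts to the utf8flag, under the assumption that both operands start from the same six UTF-8 bytes (the variable names are illustrative only). One nuance worth knowing: ASCII-only strings compare equal whether or not the flag is set, because they contain the same characters either way.

```perl
use strict;
use warnings;
use Encode;

# The same data twice: once as raw octets, once decoded to characters.
my $octets = "\xe4\xb8\xad\xe5\x9b\xbd";
my $string = Encode::decode("UTF-8", $octets);

# 6 byte-characters versus 2 real characters: not equal.
my $mixed_equal = ($octets eq $string) ? 1 : 0;

# ASCII-only data: utf8::upgrade turns the flag on without
# changing the characters, so eq still reports equality.
my $plain    = "abc";
my $upgraded = "abc";
utf8::upgrade($upgraded);
my $ascii_equal = ($plain eq $upgraded) ? 1 : 0;

print "octets eq string: $mixed_equal\n";
print "ascii  eq ascii:  $ascii_equal\n";
```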

That is the theory. Make sure you understand and remember it. Now for the practical applications.

Unicode transcoding

Suppose you have the string "中国", encoded as gb2312. If its utf8flag is off, it is treated as octets and length() returns 4, which is usually not what you want. If you simply turn its utf8flag on, it is treated as a utf8-encoded string. Because its real encoding is gb2312, not utf8, this can lead to errors. The byte ranges of gb2312 and utf8 partly overlap, so in many cases no error is reported, yet perl may have split the characters in the wrong places. In serious cases, perl warns that a byte is not a legal utf8 sequence.

The solution is clear: if your string is not utf8-encoded, first convert it to utf8 and make sure its utf8flag is on. For a gb2312-encoded string, you can use

Program code:

$str = Encode::decode("gb2312", $str);

to convert it to utf8 and turn the utf8flag on. If your string is already utf8-encoded but its utf8flag is off, you can turn the flag on in any of the following three ways:

Program code:

$str = Encode::decode_utf8($str);
$str = Encode::decode("utf8", $str);
Encode::_utf8_on($str);

The last way is the most efficient, but it is not officially recommended. Functions whose names begin with an underscore are internal, and out of politeness they are generally not called from outside the module.
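As a hedged illustration of the decode step, the sketch below starts from the four GB2312 bytes of 中国 (0xD6D0 and 0xB9FA, written as escapes) and checks what comes back. The variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode;

# GB2312 bytes for the two characters U+4E2D U+56FD.
my $gb_octets = "\xd6\xd0\xb9\xfa";

# As octets the length is 4, as described in the text above.
my $byte_length = length($gb_octets);

# decode() converts to perl's internal form and turns the flag on.
my $string      = Encode::decode("gb2312", $gb_octets);
my $char_length = length($string);

# Show the resulting code points in a terminal-safe, ASCII-only form.
my $code_points = join " ", map { sprintf "U+%04X", ord } split //, $string;

print "bytes=$byte_length chars=$char_length ($code_points)\n";
```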

String concatenation

. is the string concatenation operator. When concatenating two strings, if the utf8flag of both is Off, the utf8flag of the result is also Off. If the utf8flag of either string is On, the utf8flag of the result is On. Concatenation does not transcode anything for you: an operand whose flag is off is taken byte by byte, each byte becoming one character. So if you concatenate two strings in different encodings, one part of the result will be garbled no matter how you transcode it afterwards. This situation must be avoided. Before concatenating two strings, make sure they are in the same encoding. If necessary, transcode first and concatenate afterwards.
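The effect of mixing a decoded string with raw octets can be sketched like this. It is an illustration with made-up variable names; the two halves are the UTF-8 bytes of 中 and 国.

```perl
use strict;
use warnings;
use Encode;

my $first_octets  = "\xe4\xb8\xad";   # UTF-8 bytes of U+4E2D
my $second_octets = "\xe5\x9b\xbd";   # UTF-8 bytes of U+56FD

# One character with the flag on.
my $decoded = Encode::decode("UTF-8", $first_octets);

# Wrong: flag-on string plus raw octets. The result has the flag on,
# but the three raw bytes became three separate characters.
my $mixed = $decoded . $second_octets;

# Right: decode both sides first, then concatenate.
my $clean = $decoded . Encode::decode("UTF-8", $second_octets);

print "mixed length: ", length($mixed), "\n";
print "clean length: ", length($clean), "\n";
```

The mixed string reports four characters instead of two, which is the "piece of garbled code" the paragraph above warns about.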

Basic principles of Perl Unicode programming

For any Unicode string you are going to process: 1) convert its encoding to utf8; 2) turn on its utf8flag.

String source

To apply the basic principles above, we first need to know the original encoding of a string and the state of its utf8flag. There are several cases.

1) Command-line arguments and standard input. The encoding of a string that comes from a command-line argument or from standard input (STDIN) depends on the locale. If your locale is zh_CN or zh_CN.gb2312, the incoming string is gb2312-encoded; if your locale is zh_CN.gbk, it is gbk-encoded; if your locale is zh_CN.UTF8, it is utf8-encoded. Whatever the encoding, the utf8flag of the incoming string is off.

2) Strings in your source code. This depends on the encoding you used when writing the source file. In EditPlus, you can view and change the encoding through File -> Save as. Under Linux, you can cat the source file: if the Chinese text displays correctly, the source file uses the same encoding as the locale. The utf8flag of strings in the source code is also off.

If your source code contains Chinese, you had better follow this rule: 1) save the source file as utf8, and 2) add a use utf8; statement at the beginning of the file. The strings in your source code will then be utf8-encoded with the utf8flag already on.

3) Reading from a file. Data read from a file is in whatever encoding the file uses. After reading, the utf8flag is off.

4) Fetching web pages. The data is in whatever encoding the page uses, and the utf8flag is off. The page's encoding can be taken from the HTTP response header or from the HTML <head>. Some badly behaved pages declare no encoding in either place. In that case the program can only guess:

Program code:

use Encode;
use LWP::Simple qw(get);
use strict;
my $str = get "http://www.sina.com.cn";
# Try each encoding on a copy; with CHECK set to 1, decode dies on bad input.
eval { my $str2 = $str; Encode::decode("gbk", $str2, 1) };
print "not gbk: $@\n" if $@;
eval { my $str2 = $str; Encode::decode("utf8", $str2, 1) };
print "not utf8: $@\n" if $@;
eval { my $str2 = $str; Encode::decode("big5", $str2, 1) };
print "not big5: $@\n" if $@;

Output:

Program code:

not utf8: utf8 "\xD0" does not map to Unicode at /usr/local/lib/perl/5.8.8/Encode.pm line 162.
not big5: big5-eten "\xC8" does not map to Unicode at /usr/local/lib/perl/5.8.8/Encode.pm line 162.

We pass a third argument to decode, which makes it raise an error when it meets an invalid character. We use eval to catch the error; a failed decode means the string is not in that encoding. Also notice that we copy $str to $str2 each time. Because the third argument is 1, decode empties the string passed as its second argument. By working on a copy, only $str2 is emptied and $str stays unchanged.

Now look at the results. Since the data is neither utf8 nor big5, it should be gbk. You can use the same method to guess the encoding of other strings. However, the byte ranges of several encodings are similar, so a short string may contain no invalid characters in any of them. This method is only suitable for large blocks of text.
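The guessing technique above can be wrapped in a small helper. The sketch below is an assumption-laden illustration: candidate_encodings is a hypothetical function name, not part of Encode, and it uses a fixed byte string instead of a downloaded page so that it runs offline. It relies on Encode::FB_CROAK, the named constant for the CHECK value 1 used earlier.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: return the names from @names under which
# $octets decodes without error.
sub candidate_encodings {
    my ($octets, @names) = @_;
    my @ok;
    for my $name (@names) {
        # Work on a copy: with CHECK set, decode may empty its argument.
        my $copy = $octets;
        my $decoded_ok = eval {
            Encode::decode($name, $copy, Encode::FB_CROAK);
            1;
        };
        push @ok, $name if $decoded_ok;
    }
    return @ok;
}

# GB2312 bytes for U+4E2D U+56FD; 0xD6 0xD0 is not valid UTF-8.
my $gb_octets = "\xd6\xd0\xb9\xfa";
my @guess = candidate_encodings($gb_octets, "UTF-8", "gb2312");
print "possible encodings: @guess\n";
```

A helper like this returns every encoding that fits, so for short or ASCII-only input it will usually return more than one name, which is the limitation described above.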

Output

Once a string has been processed correctly inside the program, it has to be presented to the user. At this point we need to convert the string from Perl's internal form into a form the user can accept. Put simply, we convert the string from utf8 to the output encoding or interface encoding, using $str = Encode::encode('charset', $str);. Again there are several cases.

1) Standard output. The encoding of standard output follows the locale. The utf8flag should be off when you print, or you will get the warning we saw earlier:

Program code:

Wide character in print at unicode.pl line 10.

2) GUI programs. Nothing should be needed here: utf8 encoding with the utf8flag on should just work. This has not actually been tested.

3) HTTP POST. It does not matter whether the utf8flag is on or off, because an HTTP POST sends only the data portion of the string, regardless of the utf8flag.
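A minimal sketch of the encode step for output, assuming the target encoding is either gb2312 or UTF-8 (variable names are illustrative). The bytes are printed as hex so that the example itself produces only ASCII output.

```perl
use strict;
use warnings;
use Encode;

# A character string in perl's internal form (flag on).
my $string = "\x{4e2d}\x{56fd}";

# encode() returns octets in the requested encoding, with the flag off.
my $gb_octets   = Encode::encode("gb2312", $string);
my $utf8_octets = Encode::encode("UTF-8",  $string);

my $gb_hex   = unpack("H*", $gb_octets);
my $utf8_hex = unpack("H*", $utf8_octets);

print "gb2312: $gb_hex\n";
print "UTF-8:  $utf8_hex\n";
```

Printing $gb_octets or $utf8_octets directly to a terminal that uses the matching encoding would not trigger the "Wide character" warning, because the flag on the encoded result is off.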

PerlIO

PerlIO makes input and output transcoding convenient. On an input file handle it can automatically transcode the data and turn the utf8flag on; on an output file handle it can automatically transcode the data and turn the utf8flag off. Suppose your terminal locale is gb2312, and look at the following example:

Program code:

use strict;
binmode(STDIN,  ":encoding(gb2312)");   # decode input from gb2312
binmode(STDOUT, ":encoding(gb2312)");   # encode output to gb2312
while (<STDIN>) {
    chomp;
    print $_, length, "\n";
}

Run it and enter "中国". The result:

Program code:

中国2

This saves us the trouble of transcoding input and output by hand. PerlIO layers can be applied to any file handle; for details, see perldoc PerlIO.
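The same layers can be tried without a terminal by opening in-memory file handles. The sketch below is only an illustration under that assumption: it reads GB2312 bytes through an :encoding(gb2312) layer and writes the line back out through an :encoding(UTF-8) layer. The variable names are made up.

```perl
use strict;
use warnings;

# Input "file": GB2312 bytes for U+4E2D U+56FD plus a newline.
my $gb_input = "\xd6\xd0\xb9\xfa\n";

# The input layer decodes while reading.
open(my $in, '<:encoding(gb2312)', \$gb_input) or die "open input: $!";
my $line = <$in>;
close($in);
chomp $line;
my $char_count = length($line);

# The output layer encodes while writing.
my $utf8_output = '';
open(my $out, '>:encoding(UTF-8)', \$utf8_output) or die "open output: $!";
print {$out} $line;
close($out);

print "characters read: $char_count\n";
print "bytes written:   ", length($utf8_output), "\n";
```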

Related API

All of these belong to the Encode module:

$octets = encode(ENCODING, $string [, CHECK]) converts the string from utf8 to the specified encoding and turns the utf8flag off.

$string = decode(ENCODING, $octets [, CHECK]) converts the octets from another encoding to utf8 and turns the utf8flag on, with one exception: the utf8flag is not turned on if the string contains only ASCII or EBCDIC data.

is_utf8(STRING [, CHECK]) reports whether the utf8flag is on. If the second parameter is true, it also checks whether the data is well-formed utf8. This check is not necessarily accurate; its effect is the same as that of decode.

_utf8_on(STRING) turns on the utf8flag of the string.

_utf8_off(STRING) turns off the utf8flag of the string.

The last two are internal functions and are not recommended.

Reference Perl Unicode

utf8 and utf-8

Everything above has referred to utf8. In perl, utf8 is not the same as utf-8. utf-8 is UTF-8 as defined by the international standard, while utf8 is perl's extended version of it, which accepts more byte sequences than the standard allows. Perl's internal form uses utf8. By the way, encoding names are case-insensitive, and "_" and "-" are equivalent.
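One way to see the difference is to ask Encode which encoding object each name resolves to. This short sketch uses Encode::find_encoding and the name method; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode;

# "UTF-8" resolves to the strict, standard-conforming encoding.
my $strict_name = Encode::find_encoding("UTF-8")->name;

# "utf8" resolves to perl's own, more permissive variant.
my $lax_name = Encode::find_encoding("utf8")->name;

# Case and "_" versus "-" do not matter when looking up a name.
my $alias_name = Encode::find_encoding("utf_8")->name;

print "UTF-8 -> $strict_name\n";
print "utf8  -> $lax_name\n";
print "utf_8 -> $alias_name\n";
```

When decoding data from outside the program, the strict name "UTF-8" is usually the safer choice, since it rejects sequences that the standard does not allow.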

EBCDIC

EBCDIC is a legacy family of 8-bit character encodings used on IBM mainframes. Unlike Unicode as handled above, it is not a superset of ASCII, so the scheme described in this article does not fully apply to it. For EBCDIC, see perldoc perlebcdic.
