Due to project requirements, the character environment of a system needed to be changed to Chinese. I had not paid much attention to this before, so I am recording the procedure here; readers who need it can get a general picture from it.
First, some basic knowledge: information stored in a computer is represented as binary numbers, while the characters we see on screen, such as English letters and Chinese characters, are the result of converting those binary numbers. In plain terms, the rule by which characters are stored in the computer, for example how 'a' is represented, is called "encoding"; conversely, parsing the binary numbers stored in the computer and displaying them is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rule is used during decoding, 'a' may be parsed as 'b', or the output may be garbled altogether.
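As a rough illustration (a minimal Python 3 sketch, not part of the original article; the characters are arbitrary examples), encoding turns characters into bytes according to a rule, and decoding the same bytes with the wrong rule produces garbled text:

# encoding: characters -> bytes, according to a chosen rule
print("a".encode("ascii"))        # b'a' (stored as the single byte 0x61)
print("中".encode("utf-8"))        # b'\xe4\xb8\xad'
print("中".encode("gb2312"))       # b'\xd6\xd0'
# decoding: stored bytes -> characters
print(b"\xd6\xd0".decode("gb2312"))   # 中
# decoding with the wrong rule gives mojibake
print(b"\xd6\xd0".decode("latin-1"))  # ÖÐ  -- garbled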
Character set (Charset): the collection of all abstract characters supported by a system. "Character" is the general term for all kinds of writing and symbols, including the characters of various countries, punctuation marks, graphic symbols, digits and so on.
Character encoding (Character Encoding): a set of rules that maps a set of natural-language characters (such as an alphabet or a syllabary) onto a set of other things (such as numbers or electrical pulses). In other words, it establishes a correspondence between a symbol set and a numeric system, and it is a basic technique of information processing. People normally express information with symbols (usually written characters), whereas computer-based information processing systems store and process information using combinations of the states of hardware components. These combinations of states can represent numbers in a numeric system, so character encoding is the conversion of symbols into numbers of a numeric system that the computer can accept, i.e. into digital codes.
Common character sets and character coding
Common character set names include the ASCII character set, the GB2312 character set, the BIG5 character set, the GB18030 character set, the Unicode character set, and so on. For a computer to process text from these character sets accurately, it needs a character encoding so that it can recognize and store the various characters.
ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet. It is mainly used to display modern English, while its extended version, EASCII, can barely cover other Western European languages. It is still the most common single-byte coding system today (although there are signs of it being superseded by Unicode) and is equivalent to the international standard ISO/IEC 646.
ASCII character set: it mainly includes control characters (carriage return, backspace, line feed, etc.) and printable characters (upper- and lower-case English letters, Arabic numerals and Western punctuation symbols).
ASCII encoding: the rule that converts the ASCII character set into numbers of a numeric system acceptable to the computer. It uses 7 bits to represent a character, for a total of 128 characters; but a 7-bit encoding can only support 128 characters, so to represent more of the characters commonly used in Europe, the extended ASCII character set uses 8 bits per character, for a total of 256 characters. (The full ASCII code table, which maps the character set to numeric codes, is not reproduced here.)
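For example (a small Python 3 check, added here for illustration), the letter 'a' is stored as the number 97 (0x61), which fits within 7 bits:

print(ord("a"))                 # 97, i.e. 0x61
print(chr(65))                  # A
print("a".encode("ascii"))      # b'a' -- a single byte
print(max(ord(c) for c in "Hello") < 128)   # True: plain English text fits in 7-bit ASCII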
The biggest disadvantage of ASCII is that it can only display the 26 basic Latin letters, Arabic numerals and English punctuation, so it can only be used for modern American English (and when dealing with loanwords such as naïve, café and élite, all the accents have to be dropped, even though doing so violates the spelling rules). EASCII solved the display problem for some Western European languages, but it is still powerless for most other languages. That is why today's Apple computers have long abandoned ASCII in favour of Unicode.
GBXXXX character set & coding
For a long time after the computer was invented, it was used only in the United States and some developed Western countries, and ASCII served users well. However, once China also had computers, a set of encoding rules had to be designed to display Chinese, converting Chinese characters into numbers of a numeric system acceptable to the computer.
Chinese experts dropped the odd symbols after code 127 (that is, EASCII) and stipulated that a byte below 128 keeps its original meaning, but that two consecutive bytes greater than 127 together represent one Chinese character: the first byte (called the high byte) ranges from 0xA1 to 0xF7 and the second byte (the low byte) from 0xA1 to 0xFE, which allows roughly 7,000 simplified Chinese characters to be combined. Mathematical symbols, Roman and Greek letters, and Japanese kana were also encoded in this scheme, and even the digits, punctuation and letters already present in ASCII were re-encoded as two-byte codes; these are the characters commonly called "full-width", while those below 128 are called "half-width".
The coding rule above is GB2312. GB2312, or GB2312-80, is China's national standard simplified Chinese character set, whose full name is "Chinese coded character set for information interchange - basic set", also known as GB0. It was issued by the State Administration of Standards of China and came into force on May 1, 1981. GB2312 is widely used in mainland China and is also used in Singapore and other places; almost all Chinese systems and internationalized software in mainland China support it. Its appearance basically met the need to process Chinese characters by computer, and the characters it contains cover 99.75% of usage frequency in mainland China. However, GB2312 cannot handle the rare characters that appear in people's names and in classical Chinese, which led to the later GBK and GB 18030 character sets. (The GB2312 code table is very large and is not reproduced here; see the GB2312 simplified Chinese coding table for details.)
Because GB 2312-80 contains only 6,763 Chinese characters, many characters are missing: some simplified characters introduced after GB 2312-80 was published, characters used in personal names (such as the "róng" character in the name of former Chinese Premier Zhu Rongji), traditional characters used in Taiwan and Hong Kong, Japanese and Korean characters, and so on. Microsoft therefore made use of the unused code space of GB 2312-80 and included all the characters of GB 13000.1-93 to create the GBK encoding. According to Microsoft, GBK is an extension of GB2312-80, i.e. an extension of code page CP936 (Code Page 936, which previously was identical to GB2312-80), first implemented in the simplified Chinese version of Windows 95. Although GBK contains all the characters of GB 13000.1-93, the code values differ. GBK itself is not a national standard, but it was published as a "technical specification guidance document" by the Standardization Department of the State Bureau of Technical Supervision and the Department of Science, Technology and Quality Supervision of the Ministry of Electronic Industry. The original GB13000 was never adopted by industry, and the later national standard GB18030 is technically compatible with GBK rather than with GB13000.
GB 18030, full name GB 18030-2005 "Information technology - Chinese coded character set", is the most recent internal-code character set and is the revision of GB 18030-2000 "Information technology - Chinese ideograms coded character set for information interchange - Extension for the basic set". It is fully compatible with GB 2312-1980 and basically compatible with GBK, and it supports all the unified Han ideographs of GB 13000 and Unicode, 70,244 Chinese characters in total. The main features of GB 18030 are as follows (a short encoding sketch follows this list):
Like UTF-8, it is a multi-byte encoding; each character may consist of 1, 2, or 4 bytes.
The coding space is huge and a maximum of 1.61 million characters can be defined.
The writing systems of China's ethnic minorities can be supported without resorting to the user-defined (private use) areas.
The range of Chinese characters covered includes traditional characters as well as Japanese and Korean characters.
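A small sketch of the compatibility chain described above (Python 3; the sample characters are arbitrary, and the byte values in the comments are what the standard codecs produce):

print("中".encode("gb2312"))      # b'\xd6\xd0'
print("中".encode("gbk"))         # b'\xd6\xd0' -- same bytes: GBK extends GB2312
print("中".encode("gb18030"))     # b'\xd6\xd0' -- GB18030 is fully compatible with GB2312
# a character missing from GB2312 (the róng in Zhu Rongji's name) is covered by GBK/GB18030
try:
    "镕".encode("gb2312")
except UnicodeEncodeError:
    print("not in GB2312")
print(len("镕".encode("gbk")))     # 2 -- encodable in GBK
# characters outside the BMP need the four-byte form of GB18030
print(len("😀".encode("gb18030")))  # 4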
BIG5 character set & coding
Big5, also known as the Big-5 code or the "big five" code, is the most commonly used character set standard for computing in the traditional Chinese community, covering 13,060 Chinese characters. Chinese character codes are divided into internal codes and interchange codes; Big5 is an internal code, while the best-known Chinese interchange codes are CCCII and CNS11643. Although Big5 is widely used in traditional-Chinese-speaking regions such as Taiwan, Hong Kong and Macao, for a long time it was not a national standard there but only an industry standard. The character sets of major systems such as the ETen (Yitian) Chinese System and Windows are based on Big5, but vendors added their own user-defined extension areas, producing a variety of different versions. In 2003, Big5 was included in an appendix of CNS11643, the Chinese Standard Interchange Code, gaining a more official status; this latest version is called Big5-2003.
Big5 is a double-byte character set: it uses a "double eight-bit" storage scheme in which one character occupies two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte uses 0x81-0xFE, and the low byte uses 0x40-0x7E and 0xA1-0xFE. The layout of Big5 is as follows (a quick check in code follows the table):
0x8140-0xA0FE: reserved for user-defined characters (user-defined area)
0xA140-0xA3BF: punctuation marks, Greek letters and special symbols, including, at 0xA259-0xA261, nine Chinese characters used as units of measurement
0xA3C0-0xA3FE: reserved; this area is not open for user-defined characters
0xA440-0xC67E: frequently used Chinese characters, sorted first by stroke count and then by radical
0xC6A1-0xC8FE: reserved for user-defined characters (user-defined area)
0xC940-0xF9D5: less frequently used Chinese characters, likewise sorted by stroke count and then by radical
0xF9D6-0xFEFE: reserved for user-defined characters (user-defined area)
A great idea: Unicode
Just as China did, every country that computers spread to designed and implemented coding schemes like GB2312/GBK/GB18030/BIG5 to suit its own language and script. Used locally, such schemes cause no problems; but as soon as text goes onto the network, mutual visits produce garbled characters because the encodings are incompatible.
To solve this problem, a great idea was born: Unicode. The Unicode coding system is designed to be able to express characters of any language. It uses a number of up to 4 bytes to represent each letter, symbol or ideograph, and each number denotes a unique symbol used in at least one language. Not all of the numbers are used, but the total already exceeds 65,535, so 2-byte numbers are not enough. Characters shared by several languages are usually encoded with the same number, unless there is a good etymological reason not to. Either way, each character corresponds to one number and each number to one character; there is no ambiguity, and there is no need to keep track of "modes" any more. U+0041 always stands for 'A', even in a language that has no character 'A'.
In computing, Unicode (also rendered as Universal Code, Single Code or Standard Universal Code in Chinese sources) is an industry standard that allows computers to represent text in dozens of the world's writing systems. Unicode was developed on the basis of the Universal Character Set standard and is also published in book form. It keeps expanding, with more new characters added in each version. As of its sixth version, Unicode contains more than 100,000 characters (in 2005 the 100,000th character was adopted into the standard), along with a set of code charts for visual reference, a set of encoding methods and standard character encodings, and a set of enumerated character properties such as superscript and subscript. The Unicode Consortium is a non-profit organization that leads the further development of Unicode; its goal is to replace existing character encoding schemes with Unicode, especially because the existing schemes offer only limited space and are incompatible with each other in multilingual environments.
(It can be understood this way: Unicode is a character set, while UTF-32, UTF-16 and UTF-8 are three character encoding schemes for it.)
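A brief illustration of that distinction (Python 3, added for illustration): the code point is a single abstract number, but its byte representation depends on which encoding scheme is chosen:

print(hex(ord("中")))            # 0x4e2d -- the Unicode code point U+4E2D
print("中".encode("utf-8"))      # b'\xe4\xb8\xad'      (3 bytes)
print("中".encode("utf-16-be"))  # b'N-', i.e. 4E 2D    (2 bytes)
print("中".encode("utf-32-be"))  # b'\x00\x00N-'        (4 bytes)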
UCS & UNICODE
The Universal Character Set (UCS) is a standard character set defined by ISO 10646 (ISO/IEC 10646). Historically there were two separate organizations trying to create a single character set: the International Organization for Standardization (ISO), which developed the ISO/IEC 10646 project, and the Unicode Consortium, an association of mostly multilingual software vendors, which developed the Unicode project. As a result, two different standards were established at the beginning.
Around 1991, the participants of both projects realized that the world did not need two incompatible character sets, so they began to merge their work and cooperate on a single code table. Starting with Unicode 2.0, Unicode has used the same repertoire and code points as ISO 10646-1, and ISO has promised that ISO 10646 will never assign values beyond U+10FFFF, so that the two remain consistent. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep their code tables compatible and to coordinate any future extensions closely. When publishing code charts, Unicode generally uses the most common glyph for each character, whereas ISO 10646 generally uses the Century typeface wherever possible.
UTF-16
Although there are a great many Unicode characters, in practice most people will never use more than the first 65,535. Hence there is another Unicode encoding, UTF-16 (so called because 16 bits = 2 bytes). UTF-16 encodes characters in the range 0-65535 as 2 bytes; if you really need to express a character beyond 65535, in the rarely used "astral" (supplementary) planes, you have to resort to some slightly odd tricks. The most obvious advantage of UTF-16 is that it is twice as space-efficient as UTF-32, because every character (outside the supplementary range) needs only 2 bytes of storage instead of UTF-32's 4 bytes. And if we assume that a string contains no supplementary-plane characters, we can still find the n-th character in constant time, which is always a nice assumption right up until the moment it is no longer true. The encoding method is as follows:
If the code point U of the character is less than 0x10000, i.e. between 0 and 65535 in decimal, it is encoded directly as one 16-bit unit (two bytes).
If the code point U is at least 0x10000, then, since the maximum Unicode code point is 0x10FFFF, there are 0x100000 (2^20) code points from 0x10000 to 0x10FFFF, so 20 bits are needed to distinguish them. Let U' = U - 0x10000, a value from 0 to 0xFFFFF. The top 10 bits of U' are combined (logically ORed) with the 16-bit value 0xD800 to form the high surrogate, and the bottom 10 bits are ORed with 0xDC00 to form the low surrogate; these two 16-bit units, four bytes in all, make up the UTF-16 code for U.
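A small numeric check of this rule (Python 3; U+1F600 is an arbitrary character outside the first 65,535):

U = 0x1F600                        # a code point above 0xFFFF
Up = U - 0x10000                   # the 20-bit value U'
high = 0xD800 | (Up >> 10)         # high surrogate from the top 10 bits
low = 0xDC00 | (Up & 0x3FF)        # low surrogate from the bottom 10 bits
print(hex(high), hex(low))         # 0xd83d 0xde00
print(chr(U).encode("utf-16-be"))  # b'\xd8=\xde\x00' -- the same two 16-bit units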
UTF-32 and UTF-16 have some other, less obvious drawbacks. Different computer systems store bytes in different orders, which means the character U+4E2D may be saved in UTF-16 as either 4E 2D or 2D 4E, depending on whether the system is big-endian or little-endian. (For UTF-32 there are even more possible byte arrangements.) As long as a document never leaves your computer this is safe, because different programs on the same machine use the same byte order. But when we need to transfer the document between systems, perhaps over the World Wide Web, we need some way of indicating the order in which our bytes are stored; otherwise the receiving computer cannot know whether the two bytes 4E 2D mean U+4E2D or U+2D4E.
To solve this problem, the multi-byte Unicode encodings define a "byte order mark" (Byte Order Mark, BOM), a special non-printing character that can be placed at the beginning of a document to indicate the byte order in use. For UTF-16 the byte order mark is U+FEFF. If you receive a UTF-16 document that starts with the bytes FF FE, you know the byte order is little-endian; if it starts with FE FF, the byte order is big-endian.
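For instance (Python 3; the byte values are those produced by the standard codecs):

print("中".encode("utf-16-be"))  # b'N-'  -- 4E 2D, big-endian
print("中".encode("utf-16-le"))  # b'-N'  -- 2D 4E, little-endian
print("中".encode("utf-16"))     # b'\xff\xfe-N' on a little-endian machine: BOM FF FE, then 2D 4E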
UTF-8
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode and also a prefix code. It can represent any character in the Unicode standard, and its encoding of the ASCII characters is byte-for-byte identical to ASCII, so software written to handle ASCII text can continue to be used with little or no modification. For these reasons it has gradually become the preferred encoding for storing and transmitting text in e-mail, web pages and many other applications. The Internet Engineering Task Force (IETF) requires all Internet protocols to support UTF-8.
UTF-8 encodes each character with one to four bytes (see the short sketch after this list):
The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
Latin letters with diacritics and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and Thaana letters need two bytes (Unicode range U+0080 to U+07FF).
All other characters of the Basic Multilingual Plane (BMP), which contains the vast majority of characters in common use, need three bytes.
Characters in the other, rarely used, supplementary planes of Unicode need four bytes.
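These four cases can be checked directly (Python 3; the sample characters are arbitrary):

for ch in ("a", "é", "中", "😀"):   # US-ASCII, Latin-1 range, BMP CJK, supplementary plane
    print(hex(ord(ch)), len(ch.encode("utf-8")), "bytes")
# 0x61     1 bytes
# 0xe9     2 bytes
# 0x4e2d   3 bytes
# 0x1f600  4 bytes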
This is extremely efficient for the ASCII characters that turn up all the time, is no worse than UTF-16 for the extended Latin characters, and for Chinese characters it is better than UTF-32; you will have to trust me on that one, because I am not going to show you the maths. Moreover, because of the way the bits are arranged, UTF-8 has no byte-order problem at all: a document encoded in UTF-8 is exactly the same stream of bits on any computer.
In general, the number of code points in a Unicode string does not determine the width needed to display it, nor where the cursor should be placed in a text buffer after the string is shown; combining characters, variable-width fonts, non-printing characters and right-to-left text all contribute to this. So although the relationship between the number of displayed characters and the number of code points is more complicated for UTF-8 than for UTF-32, the cases where that actually matters are rarely encountered in practice.
Advantages
UTF-8 is a superset of ASCII. Because a pure ASCII string is also a valid UTF-8 string, existing ASCII text needs no conversion, and software designed for traditional extended ASCII character sets can usually be used with UTF-8 with little or no modification.
Sorting UTF-8 strings with standard byte-oriented sorting routines gives the same result as sorting by Unicode code point. (This is of limited usefulness, though, since the resulting order is unlikely to be acceptable for any particular language or culture.)
UTF-8 and UTF-16 are the standard encodings for XML (Extensible Markup Language) documents. Any other encoding must be specified explicitly, either externally or through a text declaration.
Any byte-oriented string-search algorithm can be used on UTF-8 data (as long as the input consists only of complete UTF-8 characters). Care must be taken, however, with regular expressions or other constructs that count characters.
UTF-8 strings can be identified fairly reliably by a simple algorithm: the probability that text in any other encoding also happens to be valid UTF-8 is low, and it decreases as the string gets longer. For example, the byte values C0, C1 and F5 to FF never appear in UTF-8. For even better reliability, regular expressions can be used to detect illegal overlong forms and surrogate values (see the regular expression for validating UTF-8 strings in the W3 FAQ on Multilingual Forms).
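As an illustration of that last point (Python 3; the byte strings are deliberately invalid UTF-8):

# a byte sequence that can never occur in UTF-8 is rejected immediately
try:
    b"\xc0\xaf".decode("utf-8")            # overlong encoding of '/'
except UnicodeDecodeError as e:
    print("rejected:", e.reason)
# text in another encoding rarely happens to be valid UTF-8
print(b"\xd6\xd0".decode("utf-8", errors="replace"))   # '��' -- GB2312 bytes, not valid UTF-8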
Disadvantages
Because different characters are encoded with different numbers of bytes, finding the n-th character in a string is an O(n) operation: the longer the string, the more time it takes to locate a particular character. Bit manipulation is also needed both to encode characters into bytes and to decode bytes back into characters.
The above is an overview of character sets and encoding schemes; the concrete steps I took follow.
First, install the required character set. For convenience, install it directly as a package group; for example, I need the Chinese character set.
yum groupinstall chinese-support (the configuration of the yum repositories themselves is not covered here)
Then change the system's character set configuration file:
vim /etc/sysconfig/i18n
Add the following configuration and comment out the original one:
LANG= "zh_CN.UTF-8"
LANGUAGE= "zh_CN.GB18030:zh_CN.GB2312:zh_CN"
SUPPORTED= "zh_CN.UTF-8:zh_CN:zh:en_US.UTF-8:en_US:en"
SYSFONT= "lat0-sun16"
Next, upload the font you need; in my case it is SimSun (Song typeface).
cd /usr/share/fonts/default
mkdir -p ./truetype/simsun
Obtain the simsun.ttc file; if it cannot be downloaded online, copy it from a Windows operating system.
Upload the simsun.ttc file to the /usr/share/fonts/default/truetype/simsun directory.
Execute mkfontscale to generate the fonts.scale file.
Execute mkfontdir to generate the fonts.dir file.
Execute fc-cache -fv to rebuild the system font cache.
Execute fc-list :lang=zh to check whether the newly added fonts are present.
Execute fc-match -s Arial to check how fonts are matched and in what order they will be used.