What are the optional character encodings in Windows Notepad

Updated 2025-01-15. Source: SLTechnology News&Howtos, Shulou (Shulou.com), 06/02.

This article describes the character encoding options offered by Windows Notepad and what each of them means.

A Brief Analysis of the Optional Character Encodings in Windows Notepad

@lizheming asked in the group chat what the encoding options for saving files in Windows Notepad (notepad) mean.

This article will briefly test the behavior of Windows Notepad.

▲ The encodings offered by Windows Notepad are ANSI, Unicode, Unicode big endian, and UTF-8.

Warning

This article only describes the technical facts of a widely used software and does not mean that the author supports or opposes the use of the software.

In fact, the author recommends that you never use Windows Notepad to work on program code.

The results in this article were verified only on a single installation of the Simplified Chinese edition of 64-bit Windows 7 and are for reference only. There is no guarantee that the same results can be reproduced on other systems, whether configured identically or differently.

Note

This article makes a strict distinction between Unicode encoding and byte serialization.

Unicode encoding refers only to assigning each character a unique number (usually written in hexadecimal). The range of these numbers is constrained only by the Unicode standard and has nothing to do with computers.

Unicode byte serialization refers to turning a number within the range of the Unicode standard into N bytes so that it can be written to computer memory.
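
The distinction can be illustrated with a short Python sketch. It assumes Python 3 and its built-in codecs; the character (U+951F) and the helper name hex_bytes are chosen only for illustration. The code point is a single number, and each serialization turns that same number into a different byte sequence.

```python
# One code point, several byte serializations.
ch = "\u951f"                 # a single character, written by its code point

code_point = ord(ch)          # the Unicode encoding: just a number
print(f"U+{code_point:04X}")


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


# Byte serializations of that one number.
serializations = {
    name: hex_bytes(ch.encode(name))
    for name in ("utf-16-le", "utf-16-be", "utf-8")
}
for name, value in serializations.items():
    print(f"{name:10} {value}")
```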

Test case

The test case is: "锟斤拷[line break]a[line break]". (锟斤拷 is a kind of faith.)

The GBK and Unicode codes for all characters are as follows:

锟 GBK=EFBF Unicode=U+951F

斤 GBK=BDEF Unicode=U+65A4

拷 GBK=BFBD Unicode=U+62F7

For the following ASCII characters, the GBK and Unicode codes are the same as in ASCII:

a=0x61 CR=0x0D LF=0x0A

(A Windows line break consists of two characters: CR+LF.)

ANSI

On a Simplified Chinese system, ANSI means the GBK encoding defined by the national standards of the People's Republic of China.

The result of Windows Notepad storing this file using ANSI is as follows:

EF BF BD EF BF BD 0D 0A 61 0D 0A

All characters are simply stored in GBK. A byte whose highest bit is not 1 is a single-byte character equivalent to ASCII; otherwise the character takes two bytes.
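
The high-bit rule can be sketched in Python as follows. This is a minimal illustration that assumes Python's built-in "gbk" codec matches what Notepad writes on a Simplified Chinese system; split_gbk and hex_bytes are hypothetical helper names, and the rule is the simplified one described above rather than a validating GBK parser.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


def split_gbk(data: bytes) -> list:
    """Split GBK bytes into characters using the high-bit rule."""
    units = []
    i = 0
    while i < len(data):
        # High bit set: first byte of a double-byte character.
        # High bit clear: a single byte equivalent to ASCII.
        width = 2 if data[i] & 0x80 else 1
        units.append(data[i:i + width])
        i += width
    return units


# The test case, written with code point escapes: three Chinese
# characters, CR LF, the letter a, CR LF.
text = "\u951f\u65a4\u62f7\r\na\r\n"
ansi_bytes = text.encode("gbk")

print(hex_bytes(ansi_bytes))
print([hex_bytes(unit) for unit in split_gbk(ansi_bytes)])
```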

Byte order (endianness) deserves attention here [Note A]. As you can see, the byte order here is big-endian.

However, there is no need to say "big-endian GBK": ever since GB2312, the standard has stipulated big-endian storage [Note B], and the later GBK and GB18030-2000 are backward compatible with it.

The trouble with ANSI is that it depends on the system: on systems in other languages ANSI is not GBK, so a GBK file opened there is bound to come out garbled. In addition, the character set of GBK itself is too small.

(Never say "I only use Chinese": without Unicode symbols, you cannot even type the emoticons popular on the Internet.)
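
Both problems can be reproduced with a small Python sketch. It assumes Python's built-in "gbk" and "cp1252" codecs; code page 1252 (the ANSI code page of Western European Windows) merely stands in for "another language system", and fits_in_gbk is a hypothetical helper name.

```python
# The six GBK bytes of the three Chinese characters in the test case.
gbk_bytes = bytes.fromhex("EFBFBDEFBFBD")

as_gbk = gbk_bytes.decode("gbk")        # read on a Simplified Chinese system
as_cp1252 = gbk_bytes.decode("cp1252")  # read as if "ANSI" meant code page 1252

# Print code points instead of the characters themselves, so the output
# does not depend on the console's own encoding.
print([f"U+{ord(c):04X}" for c in as_gbk])
print([f"U+{ord(c):04X}" for c in as_cp1252])


def fits_in_gbk(text: str) -> bool:
    """Return True if every character of text exists in GBK."""
    try:
        text.encode("gbk")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_gbk("\u951f"))       # a character from the test case
print(fits_in_gbk("\U0001F600"))   # an emoji, outside GBK's character set
```

The same bytes yield three characters under GBK and six unrelated Latin characters under code page 1252, which is exactly the garbling described above.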

Unicode series

What Windows Notepad calls "Unicode", "Unicode big endian" and "UTF-8" are all different byte serializations of the same Unicode encoding.

UTF-16 and BOM

"Unicode" here refers to UTF-16 [Note C]. UTF-16 is an extremely simple and crude serialization method: the vast majority of Unicode characters lie in the range U+0000~U+FFFF [Note D], so it just writes the raw value of the code point, two bytes per character.

Note that ASCII characters also waste twice the space in order to store a high byte of 0x00, because if the zero high byte were skipped, the parser would have no other basis for finding character boundaries.

UTF-16 has an endianness problem: it does not specify whether the high byte or the low byte comes first. Nor does UTF-16 data itself carry any information about its byte order, and you can hardly check by hand which interpretation is not garbled.
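
A minimal Python sketch of the byte-order problem, assuming Python's built-in "utf-16-le" and "utf-16-be" codecs (which write no byte order information). The letter a takes two bytes either way, and reading the bytes in the wrong order silently produces a different, valid character.

```python
le = "a".encode("utf-16-le")   # low byte first
be = "a".encode("utf-16-be")   # high byte first

print(le.hex(), be.hex())

# Decoding little-endian bytes as big-endian does not fail;
# it yields a different code point.
misread = le.decode("utf-16-be")
print(f"U+{ord(misread):04X}")
```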

The solution provided by Unicode is to serialize a zero-width no-break space (U+FEFF ZERO WIDTH NO-BREAK SPACE) as UTF-16 and insert it at the very beginning of the file. The UTF-16 parser then reads the first two bytes of the file: FE FF means big-endian, and FF FE means little-endian.

This inserted character is called the BOM (byte order mark).
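
The check a parser performs can be sketched in Python like this. The constants codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE come from the standard library; detect_utf16_byte_order is a hypothetical helper that only considers UTF-16 and ignores every other encoding.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Guess the UTF-16 byte order from the first two bytes."""
    prefix = data[:2]
    if prefix == codecs.BOM_UTF16_BE:    # FE FF
        return "big-endian"
    if prefix == codecs.BOM_UTF16_LE:    # FF FE
        return "little-endian"
    return "unknown"                     # no BOM present


print(detect_utf16_byte_order(b"\xfe\xff\x00\x61"))
print(detect_utf16_byte_order(b"\xff\xfe\x61\x00"))
print(detect_utf16_byte_order(b"\x61\x00"))
```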

It is worth mentioning that the zero-width no-break space is also often used as a valid character to get around length limits in various places, including SegmentFault's Q&A and comments.

"Unicode" and "Unicode big endian" in Notepad

Writing "Unicode" alone does not fully describe a storage method at all, because it names only the encoding and not the byte serialization.

I am not surprised that M$ made such a mistake. Just memorize the conclusion: the "Unicode" of Windows Notepad is UTF-16.

The "Unicode" of Windows Notepad = little-endian UTF-16 with a leading BOM, and the result of storing this file is as follows:

FF FE   1F 95   A4 65   F7 62   0D 00   0A 00   61 00   0D 00   0A 00
-BOM-
U+FEFF  951F    65A4    62F7    000D    000A    0061    000D    000A
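
These bytes can be reproduced with a short Python sketch. It assumes that Notepad's "Unicode" output equals the UTF-16 BOM followed by the little-endian serialization, as described above; the BOM is prepended explicitly because Python's "utf-16-le" codec does not write one.

```python
import codecs

# The test case, written with code point escapes: three Chinese
# characters, CR LF, the letter a, CR LF.
text = "\u951f\u65a4\u62f7\r\na\r\n"

# BOM (FF FE) followed by each code point as two bytes, low byte first.
notepad_unicode = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
dump = " ".join(f"{b:02X}" for b in notepad_unicode)
print(dump)

# Python's generic "utf-16" decoder reads the BOM, picks the byte order
# from it, and strips it from the result.
round_trip = notepad_unicode.decode("utf-16")
print(round_trip == text)
```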
