Example Analysis of the Unicode Signature (BOM) in UTF-8 Files

2025-02-24 Update From: SLTechnology News&Howtos

This article explains, through a worked example, the Unicode signature (BOM) in UTF-8 files. I found it very practical, so I am sharing it here for reference.

Recently I ran into something strange while testing a UTF-8-encoded Chinese Zen Cart website. The page displayed its text normally, but when I used IE to view the source (which opens in Notepad), the text was garbled. Firefox did not have this problem. After checking several sources online and running many tests, I solved it: the cause was the Unicode signature, or BOM (Byte Order Mark), of the UTF-8 files.

The BOM (Byte Order Mark) is the standard marker that the UTF encoding schemes use to identify the encoding. It is the character U+FEFF: in UTF-16 it is stored as FF FE (little-endian) or FE FF (big-endian), and in UTF-8 it is stored as EF BB BF. In UTF-8 the marker is optional, because UTF-8 is a byte-oriented encoding with no byte-order ambiguity; there it only serves as a signature for detecting whether a byte stream is UTF-8 encoded. Microsoft software performs this check, but some other software does not and treats the three bytes as ordinary characters.
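As a quick check of the byte values quoted above, the following minimal Python sketch encodes the single code point U+FEFF in UTF-8 and in both byte orders of UTF-16. The names `BOM_CHAR` and `bom_bytes` are illustrative only; the `codecs.BOM_*` constants are part of the Python standard library.

```python
import codecs

BOM_CHAR = "\ufeff"  # the byte order mark is the single code point U+FEFF

# Encode the same character with three encodings and keep the raw bytes.
bom_bytes = {
    name: BOM_CHAR.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be")
}

for name, raw in bom_bytes.items():
    print(name, raw.hex())
# utf-8 efbbbf
# utf-16-le fffe
# utf-16-be feff

# The standard library exposes the same byte sequences as constants.
print(bom_bytes["utf-8"] == codecs.BOM_UTF8)          # True
print(bom_bytes["utf-16-le"] == codecs.BOM_UTF16_LE)  # True
print(bom_bytes["utf-16-be"] == codecs.BOM_UTF16_BE)  # True
```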

Microsoft programs add the three bytes EF BB BF to the front of text files they save as UTF-8. Notepad and other Windows programs use these three bytes to decide whether a text file is ASCII (ANSI) or UTF-8. However, this is mainly a Microsoft convention: the Unicode standard permits a BOM in UTF-8 but does not require it, and UTF-8 text files on other platforms usually carry no such mark.

In other words, a UTF-8 file may or may not have a BOM, so how can you tell the difference? There are three ways.
1. Open the file with UltraEdit-32, switch to hex editing mode, and check whether the file starts with EF BB BF.
2. Open it with Dreamweaver and look at the page properties to see whether "Include Unicode Signature (BOM)" is checked.
3. Open it with Windows Notepad and choose "Save As" to see whether the default encoding offered is UTF-8 or ANSI; if it is ANSI, there is no BOM.
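The hex-editor check can also be done programmatically. The sketch below is one possible way to do it in Python, not a tool the article itself uses: `has_utf8_bom` is a hypothetical helper name, and the two demo files are throwaway files created only for illustration.

```python
import codecs
import os
import tempfile


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the bytes EF BB BF."""
    with open(path, "rb") as f:
        return f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8


# Demo on two throwaway files: one written with a BOM, one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.php")
    without_bom = os.path.join(tmp, "without_bom.php")

    payload = "<?php // \u4e2d\u6587 ?>".encode("utf-8")
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + payload)
    with open(without_bom, "wb") as f:
        f.write(payload)

    print(has_utf8_bom(with_bom))     # True
    print(has_utf8_bom(without_bom))  # False
```

Reading the file in binary mode matters here: opening it in text mode would let the decoder interpret or drop the first bytes before they can be compared.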

I located html_header.php in the Zen Cart template files and found that the file had no BOM. After I added a BOM by using Save As in UltraEdit-32 and uploaded html_header.php again, everything was fine.
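For readers without UltraEdit-32, adding the signature amounts to prefixing the file's bytes with EF BB BF. A minimal Python sketch follows; `add_utf8_bom` is a hypothetical helper, and the sample text is invented for illustration. Python's built-in `utf-8-sig` codec writes the same three bytes when encoding.

```python
import codecs


def add_utf8_bom(data: bytes) -> bytes:
    """Prefix UTF-8 bytes with EF BB BF unless they already start with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


text = "<title>\u4e2d\u6587</title>"  # illustrative content only
plain = text.encode("utf-8")
signed = add_utf8_bom(plain)

print(signed[:3].hex())                    # efbbbf
print(signed == text.encode("utf-8-sig"))  # True
print(add_utf8_bom(signed) == signed)      # True (no second BOM is added)
```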

Note that when you use ConvertZ to convert GB2312 files to UTF-8, the default setting is no BOM. Without a BOM, the garbled-text problem described above may occur. With a BOM, however, be careful with PHP include files: the extra EF BB BF at the front of each file is sent to the browser as output before the PHP code runs, and this early output can cause program errors (for example, headers can no longer be sent). One workaround is to save all included files as ANSI and keep only the main file as UTF-8. To remove the BOM from a file, open it in UltraEdit, switch to hex editing mode, replace the first three bytes (the damn EF BB BF) with 20, save (make sure automatic backup is turned off when saving), then switch back to the default editing mode and delete the three leading spaces.
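The hex-editor procedure for removing the BOM can be replaced by a few lines of code. This is a sketch under the assumption that the file is small enough to read into memory; `strip_utf8_bom` and `strip_utf8_bom_from_file` are hypothetical helper names, and the demo file is a throwaway created for illustration.

```python
import codecs
import os
import tempfile


def strip_utf8_bom(data: bytes) -> bytes:
    """Return `data` without a leading EF BB BF, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_utf8_bom_from_file(path):
    """Rewrite the file in place without its BOM. Return True if it changed."""
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False
    with open(path, "wb") as f:
        f.write(stripped)
    return True


# Demo on a throwaway include file that starts with a BOM.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "included.php")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"<?php $x = 1; ?>")

    print(strip_utf8_bom_from_file(path))  # True  (BOM removed)
    print(strip_utf8_bom_from_file(path))  # False (nothing left to remove)
    with open(path, "rb") as f:
        print(f.read())                    # b'<?php $x = 1; ?>'
```

When only reading is needed, decoding with Python's `utf-8-sig` codec drops a leading BOM if present and otherwise behaves like `utf-8`, so the file itself does not have to be modified.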

In addition, I learned a little about encodings along the way. A file saved as so-called "Unicode" is actually UTF-16; for most common characters its code units happen to equal the Unicode code points, but conceptually Unicode and UTF are two different things. Unicode is the character set that assigns a number (code point) to each character, and the UTF encodings define how those code points are stored and transmitted as bytes. UTF-16 comes in two byte orders: low-order byte first (LE, little-endian) and high-order byte first (BE, big-endian). UTF-32 is another official UTF encoding and likewise comes in LE and BE variants. UTF-7 is not part of the Unicode standard; it was designed mainly for mail transmission. The single-byte part of UTF-8 is compatible with ASCII (the first 128 characters of ISO-8859-1). UTF-8 exists mainly because some old systems and library functions cannot handle UTF-16 correctly, and it also saves file space for English text (at the expense of some non-English characters). ASCII characters take one byte in both UTF-8 and ISO-8859-1; for all other characters UTF-8 uses two to four bytes.
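To make the size and byte-order differences concrete, the following Python sketch encodes a few sample characters and reports how many bytes each encoding needs. The helper `encoded_lengths` and the choice of sample characters are illustrative assumptions, not part of the original article.

```python
def encoded_lengths(ch):
    """Return the number of bytes `ch` occupies in three UTF encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "A": "A",                  # ASCII letter
    "e-acute": "\u00e9",       # Latin-1 character
    "zhong": "\u4e2d",         # common Chinese character
    "emoji": "\U0001F600",     # character outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    print(label, encoded_lengths(ch))
# A {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# e-acute {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# zhong {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# emoji {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# Byte order only changes the order of the bytes inside each code unit.
print("\u4e2d".encode("utf-16-le").hex())  # 2d4e
print("\u4e2d".encode("utf-16-be").hex())  # 4e2d

# A non-ASCII Latin-1 character: one byte in ISO-8859-1, two bytes in UTF-8.
print("\u00e9".encode("iso-8859-1").hex())  # e9
print("\u00e9".encode("utf-8").hex())       # c3a9
```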
