Shulou (Shulou.com), 06/03. Updated 2025-01-17.
This article describes a character-encoding conversion bug in simple_html_dom, a PHP library for parsing HTML, and a quick way to work around it.
I have been using simple_html_dom to scrape articles over the past few days. Websites in China mostly use one of three encodings: gbk, gb2312, or utf-8. Most of the sites I scraped used gb2312 or utf-8.
My version of simple_html_dom has a method called convert_text, which looks like this:
// PaperG - Function to convert the text from one character set to another if the two sets are not the same.
function convert_text($text)
{
    global $debug_object;
    if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}
    $converted_text = $text;
    $sourceCharset = "";
    $targetCharset = "";
    if ($this->dom)
    {
        $sourceCharset = strtoupper($this->dom->_charset);
        $targetCharset = strtoupper($this->dom->_target_charset);
    }
    if (is_object($debug_object)) {$debug_object->debug_log(3, "source charset: " . $sourceCharset . " target charaset: " . $targetCharset);}
    if (!empty($sourceCharset) && !empty($targetCharset) && (strcasecmp($sourceCharset, $targetCharset) != 0))
    {
        // Check if the reported encoding could have been incorrect and the text is actually already UTF-8
        if ((strcasecmp($targetCharset, 'UTF-8') == 0) && ($this->is_utf8($text)))
        {
            $converted_text = $text;
        }
        else
        {
            $converted_text = iconv($sourceCharset, $targetCharset, $text);
        }
    }
    // Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.
    if ($targetCharset == 'UTF-8')
    {
        if (substr($converted_text, 0, 3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 3);
        }
        if (substr($converted_text, -3) == "\xef\xbb\xbf")
        {
            $converted_text = substr($converted_text, 0, -3);
        }
    }
    return $converted_text;
}
Look at this line:
$converted_text = iconv($sourceCharset, $targetCharset, $text);
It can convert text incorrectly. For example, gb2312 text comes out as:
24-year-old Han Zhuangzhuang not only got zero penalty points at the 2014 Langqin International Equestrian World Cup Chinese League qualifying match held at the FIFA Equestrian Park on April 26. Zhao Zhiwen, the seventh Olympic rider, scored a zero penalty in 77.07 seconds.
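The article does not say why iconv produces this result. One possible cause, offered here only as an assumption, is that pages declared as gb2312 sometimes contain characters that exist only in the GBK superset, which iconv rejects when told the source is GB2312. A minimal sketch of a more lenient conversion follows; the helper name lenient_to_utf8 is hypothetical and not part of simple_html_dom.

```php
<?php
// Hypothetical helper for illustration; it is not part of simple_html_dom.
function lenient_to_utf8(string $text, string $declaredCharset): string
{
    // Text that is already valid UTF-8 is returned untouched.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    $source = strtoupper(trim($declaredCharset));
    if ($source === '') {
        return $text;
    }
    // Assumption: a page declared as GB2312 may contain GBK-only characters,
    // so decode it with the GBK superset instead.
    if ($source === 'GB2312') {
        $source = 'GBK';
    }
    // iconv() returns false (and raises a notice) on bytes it cannot convert;
    // in that case fall back to the unconverted text.
    $converted = @iconv($source, 'UTF-8', $text);
    return $converted === false ? $text : $converted;
}

$gbBytes = "\xD6\xD0\xCE\xC4"; // two Chinese characters encoded as GB2312
echo lenient_to_utf8($gbBytes, 'gb2312'), PHP_EOL;
```

Falling back to the original bytes on failure keeps the text intact rather than losing it, at the cost of leaving it in the source encoding.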
This output is wrong, which shows that the conversion is not handled properly. Since I only use simple_html_dom to build the DOM, I did not take the time to fix the bug properly. I simply changed
$converted_text = iconv($sourceCharset, $targetCharset, $text);
to
$converted_text = $text;
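With this change the library returns text in whatever encoding the page used, so any conversion has to happen elsewhere. One option, sketched here under assumptions, is to convert the whole page to UTF-8 once before parsing it. The helpers sniff_meta_charset and html_to_utf8 are hypothetical names, and the regular expression only handles a charset declared in a meta tag.

```php
<?php
// Hypothetical helpers for illustration; they are not part of simple_html_dom.

// Read the charset declared in a <meta> tag, or return null if none is found.
function sniff_meta_charset(string $html): ?string
{
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }
    return null;
}

// Convert a whole page to UTF-8 once, before it is parsed.
function html_to_utf8(string $html): string
{
    $charset = sniff_meta_charset($html);
    if ($charset === null || $charset === 'UTF-8') {
        return $html;
    }
    // iconv() returns false on failure; keep the original bytes in that case.
    $converted = @iconv($charset, 'UTF-8', $html);
    return $converted === false ? $html : $converted;
}

$page = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">"
      . "<p>\xD6\xD0\xCE\xC4</p>";
$utf8 = html_to_utf8($page);
// $utf8 could now be handed to the parser, for example str_get_html($utf8).
echo $utf8, PHP_EOL;
```

The meta tag still declares gb2312 after conversion. With the one-line change above the library no longer converts text itself, so the stale declaration should not cause a second conversion.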