Scraping a page with PHP produces unexpected characters

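OK, I'm using PHP to scrape some data from a web page, and somehow I'm pulling in unexpected characters that do not exist in the source document. I assume this is because the character encoding is being interpreted incorrectly, but I'm not sure how to fix it.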

Here is the HTML that is giving me the error:

<tr> 
    <td>Aug 2013</td> 
    <td>TEDxColbyCollege</td> 
    <td> 
     <a href="/talks/daniel_h_cohen_for_argument_s_sake.html">Daniel H. Cohen: For argument’s sake</a>  </td> 
    . 
    . 
    . 
// more of the table

The resulting string, which I echo and store in a DB, looks like this: Daniel H. Cohen: For argumentÃ¢ÂÂs sake

I'm using the following code to load the HTML document and scrape it:

$html = file_get_contents('url_of_html_page_being_scrapped'); 
$doc = new DOMDocument(); 
$doc->loadHTML($html); 
$sxml = simplexml_import_dom($doc); 
$table = $sxml->xpath('//table'); 
foreach ($table[0]->tr as $vid) // xpath() returns an array, so take the first matching table
{ 
. 
. 
echo $vid->td[2]->a; // line giving me the problem
. 
. 
}

The head of the document indicates:

<!doctype html> 
<html lang="en"> 
<head> 
<meta charset="utf-8"> 
. 
. 
</head>

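So I assume my approach is not interpreting the character set correctly, though I'm not sure how I can specify this, or whether that is even the problem... The same thing also seems to happen with the ' character. Any insight into what is going on and how I can resolve it would be awesome.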
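One way the encoding can be specified when loading the document, offered as a minimal sketch rather than a confirmed fix for this particular page: DOMDocument::loadHTML() is commonly reported to fall back to ISO-8859-1 when the underlying libxml2 build does not pick up the HTML5 `<meta charset>` declaration, and a widely used workaround is to prepend an XML encoding declaration. The inline HTML, the helper name load_utf8_html(), and the XPath expression below are assumptions made for illustration.

```php
<?php
// Inline stand-in for the scraped page (assumption: the real page is UTF-8).
$html = '<!doctype html><html><head><meta charset="utf-8"></head><body>'
      . '<table><tr><td>Aug 2013</td><td>TEDxColbyCollege</td>'
      . "<td><a href=\"/talks/x.html\">For argument\u{2019}s sake</a></td></tr></table>"
      . '</body></html>';

// Hypothetical helper: hint the encoding to the parser before loading.
function load_utf8_html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    return $doc;
}

$doc = load_utf8_html($html);
$xpath = new DOMXPath($doc);

// XPath positions are 1-based, so td[3] matches $vid->td[2] in SimpleXML.
$titles = [];
foreach ($xpath->query('//table//tr/td[3]/a') as $a) {
    $titles[] = $a->textContent;        // DOM returns text as UTF-8
}

echo implode("\n", $titles), "\n";
```

If the title still looks wrong after this, the bytes are probably fine and the fault lies in how the output is displayed or stored.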

UPDATE: following some suggestions from @Patrick Manser, I tried solutions found elsewhere on SO,

mainly:

$html =stripslashes(mb_convert_encoding(file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8")); 
//AND 
$html = mb_convert_encoding(file_get_contents('http://www.ted.com/talks/quick-list?sort=date&order=desc&page=1'), "HTML-ENTITIES", "UTF-8");

Both result in output that looks like this: Daniel H. Cohen: For argumentâ€™s sake
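The sequence â€™ is what the three UTF-8 bytes of ’ look like when they are read as Windows-1252. That suggests, though it does not prove, that the scraped bytes are correct and are only being displayed in the wrong encoding. A small sketch that reproduces the effect (variable names are illustrative):

```php
<?php
// The right single quotation mark as UTF-8: bytes E2 80 99.
$apostrophe = "\u{2019}";

// Misread those three bytes as Windows-1252, then re-encode as UTF-8
// so the misreading can be printed on a UTF-8 terminal.
$misread = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');

echo bin2hex($apostrophe), "\n";   // hex dump of the original bytes
echo $misread, "\n";               // the three characters a Windows-1252 reader would show
```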

Source

2013-08-06 brendosthoughts

`$html = file_get_contents('url_of_html_page_being_scrapped');` Is that the page where you put ''? –

No, I didn't put anything there; the head of the 'url_of_html_page_being_scrapped' file is as shown above, `<!doctype html> ...` – brendosthoughts

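That's what I meant :) Hmm, I don't know whether this will work for you, but I had a similar problem, and calling utf8_encode() on the loaded content did the trick. I don't know whether this is a dirty hack... but try it: `$doc->loadHTML(utf8_encode($html));` –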

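The text still appears garbled in my database table and when echoed, but with this line in the head of the HTML file (the one that displays the data), the ' renders correctly: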

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
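As a rough sketch of the display side, assuming the page that shows the data is itself generated by PHP: the same charset can be declared both in the HTTP header and in the markup. The function name render_page() and the page layout are hypothetical.

```php
<?php
// Hypothetical helper: wrap scraped text in a page that declares UTF-8.
function render_page(string $text): string
{
    // Escape markup characters; the UTF-8 bytes of the text are left untouched.
    $safe = htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    return "<!doctype html>\n<html>\n<head>\n"
         . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />\n"
         . "</head>\n<body><p>{$safe}</p></body>\n</html>\n";
}

// Declare the charset in the HTTP response as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

echo render_page("Daniel H. Cohen: For argument\u{2019}s sake");
```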

Source

2013-08-06 12:19:53 brendosthoughts

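Even with proper use of htmlspecialchars_decode(), html_entity_decode() and mb_convert_encoding(), this problem can be hard to get rid of.

I use a modified version of Sebastián Grignoli's forceUTF8() function to clean strings up thoroughly. I know of nothing else like it for PHP.

You can find a version of the function here on GitHub.

If you really need a thorough cleanup, whatever characters are involved, this works wonders.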

Here is an example from the readme.

Example usage:

$utf8_string = Encoding::fixUTF8($garbled_utf8_string);

Examples:

echo Encoding::fixUTF8("FÃ©dÃ©ration Camerounaise de Football"); 
echo Encoding::fixUTF8("FÃÃ©dÃÃ©ration Camerounaise de Football"); 
echo Encoding::fixUTF8("FÃÃÃ©dÃÃÃ©ration Camerounaise de Football"); 
echo Encoding::fixUTF8("FÃÃÃÃ©dÃÃÃÃ©ration Camerounaise de Football");

will output:

Fédération Camerounaise de Football 
Fédération Camerounaise de Football 
Fédération Camerounaise de Football 
Fédération Camerounaise de Football
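To illustrate the idea behind this kind of repair, here is a minimal sketch that undoes repeated "UTF-8 read as ISO-8859-1" encoding using only mbstring. It is not the library's implementation, and the function name undo_double_utf8() is hypothetical. It assumes the damage came from ISO-8859-1 misreads only, so it does not cover Windows-1252 cases, and it can mis-repair text in which a sequence such as Ã© was meant literally.

```php
<?php
// Hypothetical helper: peel off layers of accidental re-encoding.
function undo_double_utf8(string $text, int $maxRounds = 5): string
{
    for ($i = 0; $i < $maxRounds; $i++) {
        // Map each code point back to a single ISO-8859-1 byte.
        $candidate = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
        // Re-apply the suspected mistake; it must reproduce the input exactly.
        $roundTrip = mb_convert_encoding($candidate, 'UTF-8', 'ISO-8859-1');

        if ($candidate === $text                        // pure ASCII, nothing to undo
            || $roundTrip !== $text                     // characters outside ISO-8859-1 were lost
            || !mb_check_encoding($candidate, 'UTF-8')  // result is no longer valid UTF-8
        ) {
            break;
        }
        $text = $candidate;
    }
    return $text;
}

// Build a doubly garbled sample by making the mistake twice on purpose.
$clean = "F\u{00E9}d\u{00E9}ration Camerounaise de Football";
$once  = mb_convert_encoding($clean, 'UTF-8', 'ISO-8859-1');
$twice = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');

echo undo_double_utf8($twice), "\n";
```

The round-trip check is what keeps correctly encoded text such as ’ from being replaced with a substitution character.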

EDIT

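Also, note that if you are using a web-based database browser (such as phpMyAdmin), you may see character discrepancies between the encoding stored in the DB and the encoding declared by the web page. I have run into cases where what was stored in the database was perfectly correct and only looked wrong in the interface.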

Source

2013-08-06 20:04:07 David

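Thanks for the suggestion. I tried it and I'm still not getting a correctly encoded string back. It seems there is already an open issue on the project for this; I'll keep an eye on it and may use it in the future! – brendosthoughts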

Glad to help! Also, if the issue in question is the [non-break space issue](https://github.com/neitanod/forceutf8/issues/9), I seem to recall using [unicode preg_replace](http://www.php.net/manual/en/regexp.reference.unicode.php) to convert those characters into something manageable (i.e. `preg_replace('/\p{Zs}/', '', $htmlString)`). It would seem odd if that were your problem, though. – David
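A small sketch of that kind of replacement, with two assumptions that are not taken from the comment: the `u` modifier is added so that PCRE treats the subject as UTF-8, and each Unicode space separator is replaced with a plain ASCII space. The function name normalize_spaces() is hypothetical.

```php
<?php
// Hypothetical helper: turn every Unicode space separator (including the
// non-breaking space U+00A0) into a regular ASCII space.
function normalize_spaces(string $text): string
{
    // preg_replace() returns null on error; fall back to the input in that case.
    return preg_replace('/\p{Zs}/u', ' ', $text) ?? $text;
}

$sample = "For\u{00A0}argument\u{2019}s sake";
echo normalize_spaces($sample), "\n";
```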
