How do I convert Unicode code points to hexadecimal HTML entities?

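I have a data file (an Apple plist, to be exact) that contains Unicode code points written as escapes such as \U00e8 and \U2019. I need to turn these into valid hexadecimal HTML entities using PHP.

What I'm doing right now is a long string of: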

$fileContents = str_replace("\U00e8", "&#xe8;", $fileContents); 
$fileContents = str_replace("\U2019", "&#x2019;", $fileContents);

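Which is clearly dreadful. I could use a regular expression to convert \U and all the leading zeros that follow it to &#x, then stick on the trailing ;, but that also seems heavy-handed.

Is there a clean, simple way to take a string and replace all the Unicode code points with HTML entities?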

Source

2010-08-13 Tina Marie

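PCRE regular expressions are fast and safe; I would use them. (Any other "official" solution would probably also use regular expressions, or a lookup table, which is what you have now.) – MvanGeest 2010-08-13 19:30:29

According to [this page](http://code.google.com/p/networkpx/wiki/PlistSpec), those escape sequences represent UTF-16 code units, not Unicode code points. This means you may have to combine two consecutive code units (if they form a surrogate pair) into a single HTML entity. – Artefacto 2010-08-13 21:30:56

You can use preg_replace: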

$fileContents = preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $fileContents);

Testing the regular expression (in PowerShell):

PS> 'some \U00e8 string with \U2019 embedded Unicode' -replace '\\U0*([0-9a-f]{1,5})','&#x$1;' 
some &#xe8; string with &#x2019; embedded Unicode
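The same check can be run in PHP itself. The following is only a minimal sketch: `fixed_width_entities` is a hypothetical helper name, and it assumes every escape in the file is exactly four hex digits (one UTF-16 code unit). Under that assumption a fixed-width pattern avoids one pitfall of the looser `{1,5}` pattern, which would also swallow a hex-looking letter that happens to follow the escape.

```php
<?php
// Sample input from the question; single quotes keep the backslashes literal.
$input = 'some \U00e8 string with \U2019 embedded Unicode';

// Pattern from the answer above: \U, optional leading zeros, 1-5 hex digits.
$loose = preg_replace('/\\\\U0*([0-9a-fA-F]{1,5})/', '&#x\1;', $input);
echo $loose, "\n"; // some &#xe8; string with &#x2019; embedded Unicode

// Assumption: each escape is exactly four hex digits.
// Hypothetical helper, not part of the original answer.
function fixed_width_entities(string $s): string
{
    return preg_replace_callback(
        '/\\\\U([0-9a-fA-F]{4})/',
        function ($m) {
            // hexdec/dechex round trip strips the leading zeros.
            return '&#x' . dechex(hexdec($m[1])) . ';';
        },
        $s
    );
}

// The trailing "e" is ordinary text and must survive the replacement.
echo fixed_width_entities('caf\U00e9e'), "\n"; // caf&#xe9;e
```

With the looser pattern the second input would come out as `caf&#xe9e;`, because the trailing letter is consumed as a third hex digit.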

Source

2010-08-13 19:34:03 Joey

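Seems like a clear use case for a regular expression. @Tina Marie, if you need more plist processing, have a look at http://code.google.com/p/cfpropertylist/. – 2010-08-13 19:37:12

Yes, I'm using CFPropertyList. It's great! – 2010-08-13 19:52:24

Here is a correct answer: one that deals with the fact that these are code units rather than code points, and that allows decoding supplementary characters.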

function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first */
    $string = preg_replace_callback(
        '/\\\\U(D[89ab][0-9a-f]{2})\\\\U(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            $scalar = (0x10000 + (($hi_surr & 0x3FF) << 10) |
                ($lo_surr & 0x3FF));
            return "&#x" . dechex($scalar) . ";";
        }, $string);
    /* now the rest */
    $string = preg_replace_callback('/\\\\U([0-9a-f]{4})/i',
        function ($matches) {
            // just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        }, $string);
    return $string;
}
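For illustration, the two passes can also be folded into a single regular expression with one callback. This is a sketch of a variation, not the function above: `utf16_escapes_to_entities` is a hypothetical name, and like the original it does nothing special for unpaired surrogates, which fall through to the single-unit branch.

```php
<?php
// Hypothetical single-pass variant: surrogate pairs are tried first,
// then any remaining single code unit.
function utf16_escapes_to_entities(string $s): string
{
    $pattern = '/\\\\U(D[89AB][0-9A-F]{2})\\\\U(D[C-F][0-9A-F]{2})'
             . '|\\\\U([0-9A-F]{4})/i';

    return preg_replace_callback($pattern, function ($m) {
        if (($m[3] ?? '') !== '') {
            // Single code unit: the value is the code point itself.
            $scalar = hexdec($m[3]);
        } else {
            // Surrogate pair: combine the low 10 bits of each half.
            $scalar = 0x10000
                + ((hexdec($m[1]) & 0x3FF) << 10)
                + (hexdec($m[2]) & 0x3FF);
        }
        return '&#x' . dechex($scalar) . ';';
    }, $s);
}

// \UD83D\UDE00 is the surrogate pair for U+1F600.
echo utf16_escapes_to_entities('\U00e8 \U2019 \UD83D\UDE00'), "\n";
// &#xe8; &#x2019; &#x1f600;
```

Putting the pair alternative first matters: the regex engine tries alternatives left to right, so a high surrogate followed by a low surrogate is consumed as one match before the single-unit branch can split it.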

Source

2010-08-24 00:41:18 Artefacto
