Perl: utf8::decode vs. Encode::decode

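I got some interesting results while trying to work out the difference between Encode::decode("utf8", $var) and utf8::decode($var). I have already found that calling the former several times on the same variable eventually produces the error "Cannot decode string with wide characters at...", whereas the latter will happily run as many times as you like and simply return false.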
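A minimal sketch of that difference, assuming a string that already holds decoded characters (the variable names are made up for illustration): utf8::decode works in place and reports failure through its return value, while Encode::decode returns a new string and dies on input containing wide characters.

```perl
use strict;
use warnings;
use Encode ();

# A string that is already decoded: one wide character plus a newline.
my $wide = "\x{8acb}\n";

# utf8::decode modifies its argument in place and returns a boolean.
my $copy = $wide;
my $ok   = utf8::decode($copy);
print "utf8::decode returned false\n" unless $ok;

# Encode::decode returns the decoded string; trap the exception with eval.
my $decoded = eval { Encode::decode("utf8", $wide) };
my $err     = $@;
print "Encode::decode died: $err" if $err;
```

The eval wrapper is only there so the sketch keeps running after the exception; in real code you would normally let the error propagate.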

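What I am having trouble understanding is that the length function returns different results depending on which decoding method you use. The problem came up because I am dealing with "doubly encoded" utf8 text from an external file. To demonstrate it, I created a text file "test.txt" containing the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double encoding of the Unicode character U+8acb, followed by a newline. The file is encoded to disk as UTF8. I then run the following perl script: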

#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";

# Read the raw octets; no I/O layer is set, so nothing is decoded implicitly.
open FILE, "test.txt" or die $!;
my @lines = <FILE>;
my $test = $lines[0];

# State before any decoding.
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my @unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
my @hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# First decode.
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

print "==============\n";

# Second decode.
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
@unicode = (unpack('U*', $test));
print "Unicode:\n@unicode\n";
@hex = (unpack('H*', $test));
print "Hex:\n@hex\n";

This gives the following output:

Length: 7 
utf8 flag: 
Unicode: 
195 168 194 171 194 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
232 171 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 2 
utf8 flag: 1 
Unicode: 
35531 10 
Hex: 
e8ab8b0a

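This is what I would expect. The length is initially 7 because perl thinks $test is just a series of bytes. After decoding once, perl knows that $test is a series of utf8-encoded characters (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test still occupies 7 bytes in memory). After the second decode, $test contains 4 bytes interpreted as 2 characters, which is what I would expect, since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is what happens when I modify the code to call utf8::decode instead (replacing every $test = Encode::decode("utf8", $test) with utf8::decode($test)).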

This gives almost identical output; only the length results differ:

 
Length: 7 
utf8 flag: 
Unicode: 
195 168 194 171 194 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
232 171 139 10 
Hex: 
c3a8c2abc28b0a 
============== 
Length: 4 
utf8 flag: 1 
Unicode: 
35531 10 
Hex: 
e8ab8b0a

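It seems that perl counts bytes before the first decode (as expected), then counts characters after the first decode (as expected), but then goes back to counting bytes after the second decode (not expected). Why would this switch happen? Is there a gap in my understanding of how these decoding functions work?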

Thanks,
Matt

2010-12-02 Matt

Why do you require the modules instead of use-ing them? – 2010-12-02 21:08:50

I don't use utf8 because doing so tells perl that your code itself is utf8-encoded, which I don't need (http://perldoc.perl.org/utf8.html). I suppose I could use Encode, but I just happen not to. – Matt 2010-12-02 21:41:36

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question "Checklist for going the Unicode way with Perl". unpack is too low-level; it does not even give you error checking.

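You are going wrong with the assumption that the octets E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. They are the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.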
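To make the single versus double encoding distinction concrete, here is a small sketch (variable names are illustrative only) that encodes U+8AC6 once and then encodes the resulting octets a second time, as if each octet were a Latin-1 character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{8ac6}";              # the single character 諆
my $once  = encode('UTF-8', $char);  # proper UTF-8: three octets
my $twice = encode('UTF-8', $once);  # each octet re-encoded as U+00XX

my $hex_once  = unpack('H*', $once);
my $hex_twice = unpack('H*', $twice);

print "encoded once:  $hex_once\n";   # e8ab86
print "encoded twice: $hex_twice\n";  # c3a8c2abc286
```

The second form has the same shape as the c3 a8 c2 ab ... octets in the question's hex dump, which is what genuinely double-encoded data looks like on disk.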

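length is inappropriately overloaded: at certain times it determines the length in characters, at others the length in octets. Use better tools such as Devel::Peek.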

#!/usr/bin/env perl 
use strict; 
use warnings FATAL => 'all'; 
use Devel::Peek qw(Dump); 
use Encode qw(decode); 

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}"; 
# or read the octets without implicit decoding from a file, does not matter 

Dump $test; 
# FLAGS = (PADMY,POK,pPOK) 
# PV = 0x8d8520 "\350\253\206\n"\0 

$test = decode('UTF-8', $test, Encode::FB_CROAK); 
Dump $test; 
# FLAGS = (PADMY,POK,pPOK,UTF8) 
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
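If you only need the two counts rather than a full dump, one possible approach (a sketch, not the only way) is to ask for them explicitly: length on the decoded string for characters, and length on a freshly encoded copy for octets. LEAVE_SRC is passed so that decode does not consume the input variable.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xe8\xab\x86\x0a";
my $chars  = decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);

my $char_count  = length $chars;                   # characters
my $octet_count = length encode('UTF-8', $chars);  # octets when encoded

printf "characters: %d, octets: %d\n", $char_count, $octet_count;
# characters: 2, octets: 4
```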

2010-12-03 14:04:04 daxim

It turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.

2011-10-21 18:45:00 Matt
