问题的解码UTF-8 JSON在Perl当使用JSON库处理

UTF-8字符被破坏（也许这是类似于Problem with decoding unicode JSON in perl，然而设置binmode仅创建另一个问题）。问题的解码UTF-8 JSON在Perl当使用JSON库处理

我已经减少了问题，下面的这个例子：

(hlovdal) localhost:/tmp/my_test>cat my_test.pl 
#!/usr/bin/perl -w 
use strict; 
use warnings; 
use JSON; 
use File::Slurp; 
use Getopt::Long; 
use Encode; 

my $set_binmode = 0; 
GetOptions("set-binmode" => \$set_binmode); 

if ($set_binmode) { 
     binmode(STDIN, ":encoding(UTF-8)"); 
     binmode(STDOUT, ":encoding(UTF-8)"); 
     binmode(STDERR, ":encoding(UTF-8)"); 
} 

sub check { 
     my $text = shift; 
     return "is_utf8(): " . (Encode::is_utf8($text) ? "1" : "0") . ", is_utf8(1): " . (Encode::is_utf8($text, 1) ? "1" : "0"). ". "; 
} 

my $my_test = "hei på deg"; 
my $json_text = read_file('my_test.json'); 
my $hash_ref = JSON->new->utf8->decode($json_text); 

print check($my_test), "\$my_test = $my_test\n"; 
print check($json_text), "\$json_text = $json_text"; 
print check($$hash_ref{'my_test'}), "\$\$hash_ref{'my_test'} = " . $$hash_ref{'my_test'} . "\n"; 

(hlovdal) localhost:/tmp/my_test>

当运行测试的文本是出于某种原因crippeled成ISO-8859-1。设置binmode排序解决了它，但然后导致其他字符串的双重编码。

(hlovdal) localhost:/tmp/my_test>cat my_test.json 
{ "my_test" : "hei på deg" } 
(hlovdal) localhost:/tmp/my_test>file my_test.json 
my_test.json: UTF-8 Unicode text 
(hlovdal) localhost:/tmp/my_test>hexdump -c my_test.json 
0000000 {  " m y _ t e s t "  :  " h 
0000010 e i  p 303 245  d e g "  } \n   
000001e 
(hlovdal) localhost:/tmp/my_test> 
(hlovdal) localhost:/tmp/my_test>perl my_test.pl 
is_utf8(): 0, is_utf8(1): 0. $my_test = hei på deg 
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" } 
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg 
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode 
is_utf8(): 0, is_utf8(1): 0. $my_test = hei pÃ¥ deg 
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei pÃ¥ deg" } 
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg 
(hlovdal) localhost:/tmp/my_test>

这是什么原因造成的以及如何解决？

这是在一个新安装的和最新的Fedora 15系统上。

(hlovdal) localhost:/tmp/my_test>perl --version | grep version 
This is perl 5, version 12, subversion 4 (v5.12.4) built for x86_64-linux-thread-multi 
(hlovdal) localhost:/tmp/my_test>rpm -q perl-JSON 
perl-JSON-2.51-1.fc15.noarch 
(hlovdal) localhost:/tmp/my_test>locale 
LANG=en_US.UTF-8 
LC_CTYPE="en_US.UTF-8" 
LC_NUMERIC="en_US.UTF-8" 
LC_TIME="en_US.UTF-8" 
LC_COLLATE="en_US.UTF-8" 
LC_MONETARY="en_US.UTF-8" 
LC_MESSAGES="en_US.UTF-8" 
LC_PAPER="en_US.UTF-8" 
LC_NAME="en_US.UTF-8" 
LC_ADDRESS="en_US.UTF-8" 
LC_TELEPHONE="en_US.UTF-8" 
LC_MEASUREMENT="en_US.UTF-8" 
LC_IDENTIFICATION="en_US.UTF-8" 
LC_ALL= 
(hlovdal) localhost:/tmp/my_test>

更新：添加use utf8不解决它，字符仍然没有处理的右（虽然略有来自之前不同）：

(hlovdal) localhost:/tmp/my_test>perl my_test.pl 
is_utf8(): 1, is_utf8(1): 1. $my_test = hei p� deg 
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei på deg" } 
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei p� deg 
(hlovdal) localhost:/tmp/my_test>perl my_test.pl --set-binmode 
is_utf8(): 1, is_utf8(1): 1. $my_test = hei på deg 
is_utf8(): 0, is_utf8(1): 0. $json_text = { "my_test" : "hei pÃ¥ deg" } 
is_utf8(): 1, is_utf8(1): 1. $$hash_ref{'my_test'} = hei på deg 
(hlovdal) localhost:/tmp/my_test>

如perlunifaq

注意到我可以在我的Perl源代码中使用Unicode吗？

是的，你可以！如果你的来源是 UTF-8编码，您可以指示与使用UTF8编译。
use utf8; 
这不会对您的输入或输出产生任何影响。它只影响你的来源是的阅读方式。您可以在字符串中使用Unicode 文字，在标识符（但他们仍然必须按照\ w为“单词字符” ），甚至在定制分隔符。

来源

2011-06-30 hlovdal

-3

问题的核心是一个八位组阵列的JSON的期望，而不是字符串（在this question解决）。但是我也错过了与unicode相关的几件事情，比如“use utf8”。这是使该示例中的代码完全工作所需的DIFF：

--- my_test.pl.orig 2011-08-03 15:44:44.217868886 +0200 
+++ my_test.pl 2011-08-03 15:55:30.152379269 +0200 
@@ -1,19 +1,14 @@ 
-#!/usr/bin/perl -w 
+#!/usr/bin/perl -CSAD 
use strict; 
use warnings; 
use JSON; 
use File::Slurp; 
use Getopt::Long; 
use Encode; 
- 
-my $set_binmode = 0; 
-GetOptions("set-binmode" => \$set_binmode); 
- 
-if ($set_binmode) { 
-  binmode(STDIN, ":encoding(UTF-8)"); 
-  binmode(STDOUT, ":encoding(UTF-8)"); 
-  binmode(STDERR, ":encoding(UTF-8)"); 
-} 
+use utf8; 
+use warnings qw< FATAL utf8 >; 
+use open qw(:encoding(UTF-8) :std); 
+use feature qw<unicode_strings>; 

sub check { 
     my $text = shift; 
@@ -21,8 +16,9 @@ 
} 

my $my_test = "hei på deg"; 
-my $json_text = read_file('my_test.json'); 
-my $hash_ref = JSON->new->utf8->decode($json_text); 
+my $json_text = read_file('my_test.json', binmode => ':encoding(UTF-8)'); 
+my $json_bytes = encode('UTF-8', $json_text); 
+my $hash_ref = JSON->new->utf8->decode($json_bytes); 

print check($my_test), "\$my_test = $my_test\n"; 
print check($json_text), "\$json_text = $json_text";

来源

2011-08-03 14:07:05 hlovdal

为什么选择投票-1？ – hlovdal

您将程序保存为UTF-8，但忘记告诉Perl。添加use utf8;。

而且，你是编程太复杂。 JSON函数DWYM。要检查的东西，使用Devel :: Peek。

use utf8; # for the following line 
my $my_test = 'hei på deg'; 

use Devel::Peek qw(Dump); 
use File::Slurp (read_file); 
use JSON qw(decode_json); 

my $hash_ref = decode_json(read_file('my_test.json')); 

Dump $hash_ref; # Perl character strings 
Dump $my_test; # Perl character string

来源

2011-06-30 23:11:00 daxim

谢谢你的回答，但它不回答我的问题。添加'print $$ hash_ref {'my_test'}，“\ n”，打印$ my_test，“\ n”;'在程序的最后以iso-8859-1格式打印两次。正如perlunifaq所指出的那样，'use utf8'只影响源代码，而不影响I/O。 – hlovdal

将Perl字符串打印到STDOUT是一个错误;你必须首先将它们编码为八位字节，明确地说'使用Encode qw（encode）;打印编码'UTF-8'，$ my_test;'或隐式'binmode STDOUT'：encoding（UTF-8）';打印$ my_test;'。此外，这不是ISO-8859-1，但Perl的内部编码由于历史原因而非常复杂，并且只会发生*看起来像你所说的那样。您需要对编码主题进行严格的介绍，请阅读http://p3rl.org/UNI – daxim

daxim的这个答案在简单性方面相当完美（我认为）。谢谢！ – Wick

难道只有我的印象中，还是这Perl库指望你写UTF-8字节的代码放到一个isoLatin1字符串（UTF-8标志在字符串上禁用）; 同样地，它返回到您UTF-8字节码在ISO拉丁字符串：

#! /usr/bin/perl -w 
use strict; 
use Encode; 
use Data::Dumper qw(Dumper); 
use JSON; # imports encode_json, decode_json, to_json and from_json. 
use utf8; 

############### 
## EXAMPLE 1: 
################ 
my $json = JSON->new->allow_nonref; 
my $exampleAJsonObj = { key1 => 'a'}; 
my $exampleAText = $json->utf8->encode($exampleAJsonObj); 
my $exampleAJsonObfUtf = { key1 => 'ä'}; 
my $exampleATextUtf = $json->utf8->encode($exampleAJsonObfUtf); 


#binmode(STDOUT, ":utf8"); 
print "EXAMPLE1: "; 
print "\n"; 
print encode 'UTF-8', "exampleAText: $exampleAText and as object: " . Dumper($exampleAJsonObj); 
print "\n"; 
print encode 'UTF-8', "exampleATextUtf: $exampleATextUtf and as object: " . Dumper($exampleAJsonObfUtf) . " Key1 was: " . $exampleAJsonObfUtf->{key1}; 
print "\n"; 
print hexdump($exampleAText); 
print "\n"; 
print hexdump($exampleATextUtf); 
print "\n"; 

############################# 
## SUB. 
############################# 
# For a given string parameter, returns a string which shows 
# whether the utf8 flag is enabled and a byte-by-byte view 
# of the internal representation. 
# 
sub hexdump 
{ 
    my $str = shift; 
    my $flag = Encode::is_utf8($str) ? 1 : 0; 
    use bytes; # this tells unpack to deal with raw bytes 
    my @internal_rep_bytes = unpack('C*', $str); 
    return 
     $flag 
     . '(' 
     . join(' ', map { sprintf("%02x", $_) } @internal_rep_bytes) 
     . ')'; 
}

最后，输出是：

exampleAText: {"key1":"a"} and as object: $VAR1 = { 
      'key1' => 'a' 
     }; 

exampleATextUtf: {"key1":"Ã¤"} and as object: $VAR1 = { 
      'key1' => "\x{e4}" 
     }; 
Key1 was: ä 
0(7b 22 6b 65 79 31 22 3a 22 61 22 7d) 
0(7b 22 6b 65 79 31 22 3a 22 c3 a4 22 7d)

所以，我们看到，在这个过程结束输出字符串都不是UTF-8字符串，它是错误的。至少，0（7b 22 6b 65 79 31 22 3a 22 c3 a4 22 7d）。注意，C3 A4是一个 http://www.utf8-chartable.de/

因此正确的字节代码，库似乎想到了一个矿脉到非UTF-8字符串中的UTF-8字节代码，结果，它会做同样的事情，它将输出一个utf-8字节码的NON utf-8字符串。

我错了吗？

进一步的实验使我得出结论： perlObjects返回并消耗了标记为UTF-8的字符串（正如我所预料的那样）。从解码/编码中消耗并返回的perl字符串必须以ISO拉丁文1字符串的形式出现在Perl中，但具有utf8字节码。因此，打开包含UTF8 json的文件时，请勿使用“<：encoding（UTF-8）”。

来源

2012-07-18 11:01:30 99Sono

问题的解码UTF-8 JSON在Perl当使用JSON库处理

回答

相关问题