2009-08-19 81 views
68

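Examples of invalid UTF-8 strings?

I'm testing how some of my code handles bad data, and I need a few byte sequences that are invalid UTF-8.

Can you post some, ideally with an explanation of why they are bad or where you got them?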

+3

Possible duplicate of [Really Good, Bad UTF-8 example test data](http://stackoverflow.com/questions/1319022/really-good-bad-utf-8-example-test-data) – Claudiu 2016-04-18 15:04:22

Answers

-2

Fuzz testing: generate a random sequence of octets. Chances are you'll get some illegal sequences sooner rather than later.
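A minimal sketch of this idea in Python, with a fixed seed so the generated test data is the same on every run. The answer names no language; the function names, the default length of 8 bytes, and the seed value are illustrative assumptions.

```python
import random


def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def random_invalid_sequences(count, length=8, seed=12345):
    """Collect `count` random byte strings that are not valid UTF-8.

    A private Random instance with a fixed seed makes the output
    reproducible, so a failing test can be replayed exactly.
    """
    rng = random.Random(seed)
    found = []
    while len(found) < count:
        candidate = bytes(rng.randrange(256) for _ in range(length))
        if not is_valid_utf8(candidate):
            found.append(candidate)
    return found


if __name__ == "__main__":
    for sample in random_invalid_sequences(5):
        print(sample.hex(" "))
```

Logging the seed alongside any failure is usually enough to turn a flaky fuzz run into a repeatable test case.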

+1

There's nothing worse than a heisenbug or a flaky test. The test passes 10 times, you release the product, and then the test fails. – 2017-11-21 15:03:33

+0

@EricDuminil Ever heard of srand()? – shoosh 2017-11-21 19:22:16

+0

Fair enough. Could you mention it in your answer so that I can revert my downvote? – 2017-11-21 19:26:42

40

In PHP:

$examples = array(
    'Valid ASCII' => "a", 
    'Valid 2 Octet Sequence' => "\xc3\xb1", 
    'Invalid 2 Octet Sequence' => "\xc3\x28", 
    'Invalid Sequence Identifier' => "\xa0\xa1", 
    'Valid 3 Octet Sequence' => "\xe2\x82\xa1", 
    'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1", 
    'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28", 
    'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc", 
    'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc", 
    'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc", 
    'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28", 
    'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1", 
    'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1", 
); 

http://www.php.net/manual/en/reference.pcre.pattern.modifiers.php#54805
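As a cross-check, the same byte strings can be run through a strict decoder. The sketch below uses Python's built-in UTF-8 codec, which is an assumption about the reader's tooling rather than part of the original answer. The 4th-octet case is written as `\xf0\x90\x8c\x28` so that only the last byte is wrong, and the 5- and 6-octet entries are labelled neutrally because a decoder limited to four-byte sequences (as RFC 3629 requires) rejects them.

```python
# Byte strings from the PHP array, expressed as Python bytes literals.
examples = {
    "Valid ASCII": b"a",
    "Valid 2 Octet Sequence": b"\xc3\xb1",
    "Invalid 2 Octet Sequence": b"\xc3\x28",
    "Invalid Sequence Identifier": b"\xa0\xa1",
    "Valid 3 Octet Sequence": b"\xe2\x82\xa1",
    "Invalid 3 Octet Sequence (in 2nd Octet)": b"\xe2\x28\xa1",
    "Invalid 3 Octet Sequence (in 3rd Octet)": b"\xe2\x82\x28",
    "Valid 4 Octet Sequence": b"\xf0\x90\x8c\xbc",
    "Invalid 4 Octet Sequence (in 2nd Octet)": b"\xf0\x28\x8c\xbc",
    "Invalid 4 Octet Sequence (in 3rd Octet)": b"\xf0\x90\x28\xbc",
    "Invalid 4 Octet Sequence (in 4th Octet)": b"\xf0\x90\x8c\x28",
    "5 Octet Sequence": b"\xf8\xa1\xa1\xa1\xa1",
    "6 Octet Sequence": b"\xfc\xa1\xa1\xa1\xa1\xa1",
}


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Map each label to whether the decoder accepted it.
results = {label: decodes_as_utf8(data) for label, data in examples.items()}

if __name__ == "__main__":
    for label, ok in results.items():
        print(f"{label}: {'accepted' if ok else 'rejected'}")
```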

1

The patterns of ill-formed byte sequences can be derived from the table of well-formed byte sequences. See "Table 3-7. Well-Formed UTF-8 Byte Sequences" in the Unicode Standard 6.2.

Code Points            First Byte   Second Byte   Third Byte   Fourth Byte
U+0000 - U+007F        00 - 7F
U+0080 - U+07FF        C2 - DF      80 - BF
U+0800 - U+0FFF        E0           A0 - BF       80 - BF
U+1000 - U+CFFF        E1 - EC      80 - BF       80 - BF
U+D000 - U+D7FF        ED           80 - 9F       80 - BF
U+E000 - U+FFFF        EE - EF      80 - BF       80 - BF
U+10000 - U+3FFFF      F0           90 - BF       80 - BF      80 - BF
U+40000 - U+FFFFF      F1 - F3      80 - BF       80 - BF      80 - BF
U+100000 - U+10FFFF    F4           80 - 8F       80 - BF      80 - BF
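The table translates almost directly into a validator: anything that does not match one of its rows is ill-formed. The following Python sketch is one way to encode it; the names `TABLE_3_7` and `is_well_formed` are made up for illustration and do not come from any library.

```python
# Each row: (first-byte range, [range for each trailing byte]),
# transcribed from Table 3-7 above.
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed(data: bytes) -> bool:
    """Return True if every sequence in `data` matches a row of the table."""
    i = 0
    while i < len(data):
        # Find the row whose first-byte range contains the current byte.
        for (lo, hi), trailing in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # e.g. 80-BF, C0, C1, F5-FF as a first byte
        tail = data[i + 1 : i + 1 + len(trailing)]
        if len(tail) != len(trailing):
            return False  # sequence is truncated
        for byte, (tlo, thi) in zip(tail, trailing):
            if not tlo <= byte <= thi:
                return False
        i += 1 + len(trailing)
    return True


if __name__ == "__main__":
    print(is_well_formed(b"\xF0\xA4\xAD\xA2"))  # U+24B62
    print(is_well_formed(b"\xE0\x80\x80"))      # second byte below A0
```

Flipping any single byte of a well-formed sequence out of its row's range is a systematic way to produce invalid test inputs.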

Here are examples generated from U+24B62. I used them in a bug report: Bug #65045 mb_convert_encoding breaks well-formed character

// U+24B62: "\xF0\xA4\xAD\xA2" 
"\xF0\xA4\xAD" ."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2" 
"\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD\xA2"."\xF0\xA4\xAD" 

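To illustrate what these two inputs probe, the sketch below decodes them with Python's built-in codec in "replace" mode. This is an assumption about how a converter would ideally behave, not a reproduction of the PHP bug: only the truncated sequence should be substituted, and both well-formed copies of U+24B62 next to it should survive.

```python
WELL_FORMED = b"\xF0\xA4\xAD\xA2"  # U+24B62
TRUNCATED = b"\xF0\xA4\xAD"        # same character with the last byte missing

cases = {
    "truncated first": TRUNCATED + WELL_FORMED + WELL_FORMED,
    "truncated last": WELL_FORMED + WELL_FORMED + TRUNCATED,
}


def lenient_decode(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for ill-formed input."""
    return data.decode("utf-8", errors="replace")


decoded = {name: lenient_decode(data) for name, data in cases.items()}

if __name__ == "__main__":
    for name, text in decoded.items():
        # Print code points so the replacement characters are visible.
        print(name, [f"U+{ord(ch):04X}" for ch in text])
```
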
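Various libraries oversimplify the range of trailing bytes to [0x80, 0xBF] in every position, which makes them accept the following ill-formed sequences.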

// U+0800 - U+0FFF 
\xE0\x80\x80 

// U+D000 - U+D7FF 
\xED\xBF\xBF 

// U+10000 - U+3FFFF 
\xF0\x80\x80\x80 

// U+100000 - U+10FFFF 
\xF4\xBF\xBF\xBF 
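
The oversimplification can be reproduced with a deliberately naive checker that only looks at the length implied by the first byte and accepts any trailing byte in 80-BF. The sketch below is a hypothetical example of such a checker, not code taken from any particular library, compared against Python's strict decoder.

```python
def naive_is_utf8(data: bytes) -> bool:
    """Oversimplified check: every trailing byte may be anything in 80-BF."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            n = 0
        elif 0xC0 <= first <= 0xDF:
            n = 1
        elif 0xE0 <= first <= 0xEF:
            n = 2
        elif 0xF0 <= first <= 0xF7:
            n = 3
        else:
            return False
        tail = data[i + 1 : i + 1 + n]
        if len(tail) != n or any(not 0x80 <= b <= 0xBF for b in tail):
            return False
        i += 1 + n
    return True


def strict_is_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The four sequences listed above, with the reason each is ill-formed.
tricky = {
    b"\xE0\x80\x80": "overlong 3-byte form (second byte below A0)",
    b"\xED\xBF\xBF": "encodes a surrogate (second byte above 9F)",
    b"\xF0\x80\x80\x80": "overlong 4-byte form (second byte below 90)",
    b"\xF4\xBF\xBF\xBF": "beyond U+10FFFF (second byte above 8F)",
}

if __name__ == "__main__":
    for seq, reason in tricky.items():
        print(seq.hex(" "), naive_is_utf8(seq), strict_is_utf8(seq), reason)
```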

1

,̆ is particularly evil. I saw it rendered as a combined character on Ubuntu.

comma breve