Ruby: Limiting a UTF-8 string by byte-length

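This RabbitMQ page states:

Queue names may be up to 255 bytes of UTF-8 characters.

In Ruby (1.9.3), how would I truncate a UTF-8 string by byte count without breaking in the middle of a character? The resulting string should be the longest possible valid UTF-8 string that fits in the byte limit.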

2012-09-21 Kelvin

I think I found something that works.

def limit_bytesize(str, size)
    str.encoding.name == 'UTF-8' or raise ArgumentError, "str must have UTF-8 encoding"

    # Change to canonical unicode form (compose any decomposed characters).
    # Works only if you're using active_support
    str = str.mb_chars.compose.to_s if str.respond_to?(:mb_chars)

    # Start with a string of the correct byte size, but
    # with a possibly incomplete char at the end.
    new_str = str.byteslice(0, size)

    # We need to force_encoding from utf-8 to utf-8 so ruby will re-validate
    # (idea from halfelf). Stop once nothing is left, because ""[-1] is nil.
    until new_str.empty? || new_str[-1].force_encoding('utf-8').valid_encoding?
        # remove the invalid char
        new_str = new_str.slice(0..-2)
    end
    new_str
end

Usage:

>> limit_bytesize("abc\u2014d", 4) 
=> "abc" 
>> limit_bytesize("abc\u2014d", 5) 
=> "abc" 
>> limit_bytesize("abc\u2014d", 6) 
=> "abc—" 
>> limit_bytesize("abc\u2014d", 7) 
=> "abc—d"

Update...

Behavior for decomposed characters without active_support:

>> limit_bytesize("abc\u0065\u0301d", 4) 
=> "abce" 
>> limit_bytesize("abc\u0065\u0301d", 5) 
=> "abce" 
>> limit_bytesize("abc\u0065\u0301d", 6) 
=> "abcé" 
>> limit_bytesize("abc\u0065\u0301d", 7) 
=> "abcéd"

Behavior for decomposed characters with active_support:

>> limit_bytesize("abc\u0065\u0301d", 4) 
=> "abc" 
>> limit_bytesize("abc\u0065\u0301d", 5) 
=> "abcé" 
>> limit_bytesize("abc\u0065\u0301d", 6) 
=> "abcéd"
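
On newer Rubies the same result can probably be had without the loop. The sketch below is not part of the answer above: it assumes Ruby >= 2.1 (for String#scrub, which the question's 1.9.3 does not have), and the name limit_bytesize_scrub is made up for illustration. Note that scrub('') also deletes any invalid bytes that were already in the input, not only the cut-off tail.

```ruby
# Hypothetical alternative to limit_bytesize; assumes Ruby >= 2.1 for String#scrub.
def limit_bytesize_scrub(str, size)
  unless str.encoding == Encoding::UTF_8
    raise ArgumentError, "str must have UTF-8 encoding"
  end

  # byteslice may cut a multi-byte character in half, leaving an invalid tail.
  # scrub('') replaces every invalid byte sequence with the empty string.
  str.byteslice(0, size).scrub('')
end

puts limit_bytesize_scrub("abc\u2014d", 5)
puts limit_bytesize_scrub("abc\u2014d", 6)
```

The first call prints "abc" and the second prints "abc" followed by the dash, matching the usage shown above.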

2012-09-21 18:42:16 Kelvin

How about:

s = "δogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδogδog" 
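count = 0
while true
    # "a<N>" unpacks the first N bytes as a binary (ASCII-8BIT) string.
    more_truncate = "a" + (255 - count).to_s
    s2 = s.unpack(more_truncate)[0]
    s2.force_encoding 'utf-8'

    # A cut inside a multi-byte character leaves an invalid last character;
    # retry with one byte fewer until the tail is valid.
    if s2[-1].valid_encoding?
        break
    else
        count += 1
    end
end

puts s2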

2012-09-21 18:34:31 halfelf

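It works, but what if the string is huge? Removing one UTF-8 character at a time could be very inefficient. – Kelvin

@Kelvin Answer edited. It should be much better now. Since a UTF-8 character is never more than 6 bytes, the loop will end quickly. – halfelf

Seems incomplete - 's' isn't changed. Do you need to pack 's2' to get the new string? Keep in mind that the output must also be UTF-8. – Kelvin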
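
The efficiency concern raised in these comments can also be addressed by looking at the bytes directly. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so the cut point only ever needs to move back a few bytes to reach a character boundary. The following is a rough sketch, not code from any of the answers: the method name is hypothetical, and it assumes the input is already valid UTF-8 and that limit is not negative.

```ruby
# Hypothetical helper: truncate at the nearest character boundary at or
# before `limit` bytes. Assumes `str` is valid UTF-8 and limit >= 0.
def truncate_at_char_boundary(str, limit)
  return str.dup if str.bytesize <= limit

  cut = limit
  # str.getbyte(cut) is the first byte that would be dropped. If it is a
  # continuation byte (10xxxxxx), the cut falls inside a character, so
  # move back until the first dropped byte starts a character.
  cut -= 1 while cut > 0 && (str.getbyte(cut) & 0xC0) == 0x80

  str.byteslice(0, cut)
end

puts truncate_at_char_boundary("abc\u2014d", 5)
```

This prints "abc". The loop runs at most three times for valid input, regardless of how long the string is.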

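bytesize will give you the length of the string in bytes, while (as long as the string's encoding is set correctly) operations such as slice won't mangle the string.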

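A simple process would be to just iterate through the string:

s.each_char.each_with_object('') do |char, result|
    if result.bytesize + char.bytesize > 255
        break result
    else
        result << char
    end
end

If you were being crafty, you would copy the first 63 characters directly, since any Unicode character is at most 4 bytes in UTF-8.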
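
The 63-character shortcut could look roughly like the sketch below. It is an illustration only: the method name and the max_bytes parameter are invented here. For a 255-byte limit, max_bytes / 4 is 63, and 63 characters can never exceed 252 bytes, so they are copied without any per-character check.

```ruby
# Hypothetical helper illustrating the "copy the first 63 characters" idea.
def truncate_with_head_start(s, max_bytes = 255)
  # Each character is at most 4 bytes, so this many characters always fit.
  head_chars = max_bytes / 4

  result = s[0, head_chars].dup
  # s[n..-1] is nil when n is past the end of the string.
  rest = s[head_chars..-1] || ''

  # Only the remaining characters need the per-character size check.
  rest.each_char do |char|
    break if result.bytesize + char.bytesize > max_bytes
    result << char
  end
  result
end

puts truncate_with_head_start("\u03B4og" * 100).bytesize
```

Whether this is noticeably faster than the plain loop depends on the input; for short queue names the difference is likely negligible.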

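Note that this still isn't perfect. For example, suppose the last 3 bytes of the string are the character 'e' followed by a combining acute accent (U+0301, which takes 2 bytes in UTF-8). Slicing off the last 2 bytes produces a string that is still valid UTF-8, but what the user sees changes from 'é' to 'e', which might change the meaning of the text. This is probably not a big deal when you are just naming RabbitMQ queues, but it could matter in other circumstances. For example, in French the newspaper headline "Un policier tué" means "A policeman was killed", whereas "Un policier tue" means "A policeman kills".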

2012-09-21 18:50:53

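+1 just for the policeman example :). Google Translate confirms it. The pronunciation sounds different, though. – Kelvin

Just so everyone knows, the "combining character" issue only occurs with [decomposed characters](http://en.wikipedia.org/wiki/Precomposed_character). There is no problem if e-acute is a single character. – Kelvin

You can avoid it by converting to normalization form C first –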
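
To make the normalization point concrete, here is a small sketch using String#unicode_normalize. That method is an assumption about the environment: it exists in Ruby >= 2.2, not in the 1.9.3 the question asks about. After conversion to form C, a byte-level cut can no longer strip the accent while leaving valid UTF-8 behind; it either keeps the whole 'é' or splits it into an invalid sequence that a truncation routine will drop entirely.

```ruby
# "tué" written as 'e' + combining acute accent (decomposed form).
decomposed = "tue\u0301"
# Normalization form C merges the pair into the single character U+00E9.
composed = decomposed.unicode_normalize(:nfc)

puts decomposed.bytesize                         # 5
puts composed.bytesize                           # 4

# Cutting the decomposed form at 3 bytes silently loses the accent.
puts decomposed.byteslice(0, 3)                  # tue
# Cutting the composed form at 3 bytes splits U+00E9, which is detectable.
puts composed.byteslice(0, 3).valid_encoding?    # false
```

Form C only helps where a precomposed character exists; base and combining-mark pairs with no precomposed equivalent stay as separate code points.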

For Rails >= 3.0, you have the ActiveSupport::Multibyte::Chars limit method.

From the API docs:

- (Object) limit(limit)

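Limits the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.

Example: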

'こんにちは'.mb_chars.limit(7).to_s # => "こん"

2014-09-22 12:37:56 jogaco

Nice, this seems to be the best solution if you are using ActiveSupport >= 3.0. If there are decomposed characters, you would still need to use 'mb_chars.compose.limit' (see my answer). – Kelvin
