2012-10-24 125 views

C++ string to a valid UTF-8 string using utf8proc

I have a std::string (the result of output()). Using utf8proc, I would like to convert it into a valid UTF-8 string. http://www.public-software-group.org/utf8proc-documentation

typedef int int32_t; 
#define ssize_t int 
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options) 
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned. 
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash! 

So, first of all, how do I add an extra byte at the end? And then how do I convert a std::string to an int32_t *buffer?

This does not work:

std::string g = output(); 
fprintf(stdout,"str: %s\n",g.c_str()); 
g += " "; //add an extra byte?? 
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0); 
fprintf(stdout,"strutf8: %s\n",g.c_str()); 

A 'std::string' is just a sequence of bytes. What encoding is your source 'std::string' in?


I cringe whenever I see 'printf' in a C++ program, especially when it is used to output strings.


@Charles Bailey: The output is not always in the same encoding. Usually it is UTF-8, but sometimes it is some encoding that I don't know.

Answers


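You most likely don't really want utf8proc_reencode(). That function takes a valid UTF-32 buffer (one 32-bit code point per element) and turns it into a valid UTF-8 buffer. Since you say you don't know what encoding your data is in, you can't use that function.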
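To make the UTF-32 to UTF-8 step concrete, here is a minimal hand-written sketch of that kind of re-encoding. It is not utf8proc's implementation, and encode_utf8 is a hypothetical helper name chosen for illustration. It also shows why casting g.c_str() to int* cannot work: the input has to be an array of code points, not the raw bytes of a std::string.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of utf8proc): encode UTF-32 code points as UTF-8.
std::string encode_utf8(const std::vector<std::uint32_t>& code_points) {
    std::string out;
    for (std::uint32_t cp : code_points) {
        // Surrogates and values above U+10FFFF are not Unicode scalar values.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x80) {            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For instance, passing the single code point 0x20AC (the euro sign) yields a three-byte string. The function only helps once you already have code points, which is exactly what is missing when the source encoding is unknown.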

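So first you need to determine how your data is actually encoded. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8, with utf8::is_valid(g.begin(), g.end()). If that returns true, you're done!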
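If pulling in a library is not an option, the check can be sketched by hand. The function below is a hypothetical stand-in, not utfcpp's implementation, and looks_like_valid_utf8 is a name made up for illustration. It walks the bytes and rejects truncated sequences, bad continuation bytes, overlong forms, surrogates, and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a library validator such as utf8::is_valid.
bool looks_like_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;   // total bytes in this sequence
        unsigned long cp;  // code point decoded so far
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (len > n - i) return false;  // sequence is cut off at the end

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Smallest code point that legitimately needs `len` bytes.
        static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        i += len;
    }
    return true;
}
```

Keep in mind that a true result only says the bytes form well-formed UTF-8. Pure ASCII text passes too, whatever 8-bit encoding it was nominally produced in.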

If it returns false, things get complicated... but ICU (http://icu-project.org/) can help you; see http://userguide.icu-project.org/conversion/detection

Once you reliably know what encoding your data is in, ICU can help you again to get it into UTF-8. For example, assuming your source data g is in ISO-8859-1:

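#include <string>
#include <vector>
#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR; // check U_FAILURE(err) after every call...
// CONVERT FROM ISO-8859-1 TO UChar (UTF-16 code units)
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size()*2 + 1); // *2 is usually more than enough; +1 keeps the buffer non-empty
int32_t conv_len = ucnv_toUChars(conv_from, converted.data(), static_cast<int32_t>(converted.size()), g.c_str(), static_cast<int32_t>(g.size()), &err);
converted.resize(conv_len);
ucnv_close(conv_from);
// CONVERT FROM UChar TO UTF-8
g.resize(converted.size()*4 + 1); // generous upper bound for the UTF-8 output
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], static_cast<int32_t>(g.size()), converted.data(), static_cast<int32_t>(converted.size()), &err);
g.resize(u8_len); // trim to the number of bytes actually written
ucnv_close(conv_u8);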
After that, your g holds UTF-8 data.
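For the specific ISO-8859-1 case, the conversion can also be sketched without ICU, because every ISO-8859-1 byte value is numerically equal to its Unicode code point (U+0000 to U+00FF). The helper name latin1_to_utf8 is hypothetical, and the shortcut does not carry over to other source encodings, which need real mapping tables such as the ones ICU ships.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 without ICU.
// Valid only for ISO-8859-1, where byte value == Unicode code point.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // each input byte yields at most 2 bytes
    for (char ch : latin1) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += static_cast<char>(c);  // ASCII is unchanged in UTF-8
        } else {
            // U+0080..U+00FF becomes two bytes: 110000xx 10xxxxxx
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

With this, g = latin1_to_utf8(g); would replace the two ICU conversions above, but only when the input is known to be ISO-8859-1.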