libxml2 xmlChar * to std::wstring

libxml2 seems to store all of its strings in UTF-8, as xmlChar *.

/** 
* xmlChar: 
* 
* This is a basic byte in an UTF-8 encoded string. 
* It's unsigned allowing to pinpoint case where char * are assigned 
* to xmlChar * (possibly making serialization back impossible). 
*/ 
typedef unsigned char xmlChar;

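Since libxml2 is a C library, it provides no routine for getting a std::wstring out of an xmlChar *. I'm wondering whether the prudent way to convert an xmlChar * to a std::wstring in C++11 is to use the mbstowcs C function, via something like this (work in progress):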

std::wstring xmlCharToWideString(const xmlChar *xmlString) { 
    if(!xmlString){abort();} //provided string was null 
    int charLength = xmlStrlen(xmlString); //excludes null terminator 
    wchar_t *wideBuffer = new wchar_t[charLength]; 
    size_t wcharLength = mbstowcs(wideBuffer, (const char *)xmlString, charLength); 
    if(wcharLength == (size_t)(-1)){abort();} //mbstowcs failed 
    std::wstring wideString(wideBuffer, wcharLength); 
    delete[] wideBuffer; 
    return wideString; 
}

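Edit: Just as an FYI, I'm well aware of what xmlStrlen returns: it's the number of xmlChar used to store the string. I know that this is the number of unsigned char, not the number of characters. It would have been less confusing if I had named it byteLength, but I thought charLength was clearer since I have both charLength and wcharLength. As for the correctness of the code, wideBuffer will always be larger than or equal to the size required to hold the result (I believe), since characters that need more space than a wchar_t will be truncated (I think).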

Source

2013-01-01 Mr. Smith

If you want to talk about the most prudent way to do this, avoid 'wchar_t' and 'wstring'. When working with Unicode, they are more trouble than they are worth.

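xmlStrlen() returns the number of UTF-8 code units in an xmlChar* string. That will not be the same as the number of wchar_t code units needed once the data is converted, so do not use xmlStrlen() to size your wchar_t string. You need to call std::mbstowcs() once to get the required length, then allocate the memory, and call mbstowcs() again to fill it. You will also have to use std::setlocale() to tell mbstowcs() to use UTF-8 (messing with the locale may not be a good idea, especially when multiple threads are involved). For example:

// needs <clocale>, <cstdlib>, <string> and <libxml/xmlstring.h>
std::wstring xmlCharToWideString(const xmlChar *xmlString)
{
    if (!xmlString) { abort(); } //provided string was null

    std::wstring wideString;

    int charLength = xmlStrlen(xmlString);
    if (charLength > 0)
    {
        // Copy the current locale name: the pointer returned by setlocale()
        // may be invalidated by the next setlocale() call.
        const char *current = setlocale(LC_CTYPE, NULL);
        std::string origLocale(current ? current : "C");
        if (!setlocale(LC_CTYPE, "en_US.UTF-8")) { abort(); } //UTF-8 locale not available

        const char *src = (const char*) xmlString;
        // With a NULL destination, mbstowcs() only reports the required length
        // (a POSIX extension that common C libraries support).
        size_t wcharLength = mbstowcs(NULL, src, 0); //excludes null terminator
        if (wcharLength != (size_t)(-1))
        {
            wideString.resize(wcharLength);
            mbstowcs(&wideString[0], src, wcharLength);
        }

        setlocale(LC_CTYPE, origLocale.c_str());
        if (wcharLength == (size_t)(-1)) { abort(); } //mbstowcs failed
    }

    return wideString;
}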

A better option, since you mention C++11, is to use std::codecvt_utf8 with std::wstring_convert instead, so that you do not have to deal with locales:

std::wstring xmlCharToWideString(const xmlChar *xmlString) 
{  
    if (!xmlString) { abort(); } //provided string was null 
    try 
    { 
     std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> conv; 
     return conv.from_bytes((const char*)xmlString); 
    } 
    catch(const std::range_error& e) 
    { 
     abort(); //wstring_convert failed 
    } 
}

Another option is to use an actual Unicode library, such as ICU or iconv, to handle the Unicode conversion.

Source

2013-01-01 02:13:02

'std::codecvt_utf8' with 'std::wstring_convert' looks great. Thanks!

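There are a few problems in this code, besides the fact that you are using wchar_t and std::wstring, which is a bad idea unless you are making calls to the Windows API.

xmlStrlen() does not do what you think it does. It counts the number of UTF-8 code units (a.k.a. bytes) in a string. It does not count the number of characters. This is all in the documentation.
Counting characters would not portably give you the correct size for a wchar_t array anyway. So not only does xmlStrlen() not do what you think it does, what you wanted is not the right thing either. The problem is that the encoding of wchar_t varies from platform to platform, which makes it 100% useless for portable code.
The mbstowcs() function is locale-dependent. It only converts from UTF-8 if the locale is a UTF-8 locale!
This code will leak memory if the std::wstring constructor throws an exception.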
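
To make the first point concrete, here is a minimal sketch comparing the byte count (what xmlStrlen() reports) with the code point count of the same UTF-8 data. The helper names countCodePoints and countBytes are made up for this illustration and are not part of libxml2; the sketch also assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
// by skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is valid UTF-8; it does no validation.
std::size_t countCodePoints(const unsigned char *utf8)
{
    assert(utf8 != nullptr);
    std::size_t count = 0;
    for (; *utf8 != 0; ++utf8) {
        if ((*utf8 & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: byte length of the same data, which is the quantity
// xmlStrlen() reports.
std::size_t countBytes(const unsigned char *utf8)
{
    assert(utf8 != nullptr);
    return std::strlen(reinterpret_cast<const char *>(utf8));
}
```

Neither number is, in general, the number of wchar_t units needed: a code point outside the Basic Multilingual Plane takes two wchar_t units where wchar_t is 16 bits (Windows) and one where it is 32 bits (Linux, OS X).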

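My recommendations:

Use UTF-8 if at all possible. The wchar_t rabbit hole is a lot of extra work for no benefit (except the ability to make Windows API calls).
If you need UTF-32, use std::u32string. Remember that wstring has a platform-dependent encoding: it can be a variable-width encoding (Windows) or a fixed-width one (Linux, OS X).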
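
If UTF-32 is what you need, the conversion can be done without touching locales at all. The following is only a sketch of a hand-written decoder, not something provided by libxml2 or the standard library; the function name utf8ToUtf32 is an assumption made for this example, and it expects a NUL-terminated buffer such as the one behind an xmlChar *.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes a NUL-terminated UTF-8 string into UTF-32.
// Throws std::range_error on malformed input.
std::u32string utf8ToUtf32(const unsigned char *utf8)
{
    assert(utf8 != nullptr);
    std::u32string out;
    while (*utf8 != 0) {
        unsigned char lead = *utf8++;
        char32_t cp;
        int extra;         // number of continuation bytes expected
        char32_t minimum;  // smallest code point allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; minimum = 0x10000; }
        else throw std::range_error("invalid UTF-8 lead byte");

        for (int i = 0; i < extra; ++i) {
            unsigned char next = *utf8;
            // A NUL terminator fails this test too, so we never read past the end.
            if ((next & 0xC0) != 0x80)
                throw std::range_error("truncated UTF-8 sequence");
            cp = (cp << 6) | (next & 0x3F);
            ++utf8;
        }
        // Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if (cp < minimum || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::range_error("invalid code point");
        out.push_back(cp);
    }
    return out;
}
```

With libxml2 data you would call it as utf8ToUtf32(xmlString), since xmlChar is a typedef for unsigned char. Every element of the result is exactly one code point on every platform, which is the property wstring cannot promise.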

If you absolutely must have wchar_t, then chances are good that you are on Windows. Here is how you do it on Windows:

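// needs <windows.h>, <cstring> and <string>
std::wstring utf8_to_wstring(const char *utf8)
{
    // MultiByteToWideChar takes the length as an int
    int utf8len = (int)std::strlen(utf8);
    // First call: ask how many wchar_t code units are needed
    int wclen = MultiByteToWideChar(
     CP_UTF8, 0, utf8, utf8len, NULL, 0);
    wchar_t *wc = NULL;
    try {
     wc = new wchar_t[wclen];
     // Second call: do the actual conversion
     MultiByteToWideChar(
      CP_UTF8, 0, utf8, utf8len, wc, wclen);
     std::wstring wstr(wc, wclen);
     delete[] wc;
     wc = NULL;
     return wstr;
    } catch (...) {
     // Free the buffer, then let the caller see the exception
     delete[] wc;
     throw;
    }
}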

If you absolutely must have wchar_t and you are not on Windows, use iconv() (see man 3 iconv, man 3 iconv_open and man 3 iconv_close for the manual). You can specify "WCHAR_T" as one of the encodings for iconv().
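
A rough sketch of the iconv() route might look like the following. It assumes that your platform's iconv accepts the encoding names "WCHAR_T" and "UTF-8" (glibc and GNU libiconv do) and that iconv() takes a char ** input argument, as it does on glibc. The function name utf8ToWideIconv is invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a NUL-terminated UTF-8 string to std::wstring
// via iconv(). Assumes "WCHAR_T" and "UTF-8" are accepted encoding names.
std::wstring utf8ToWideIconv(const char *utf8)
{
    assert(utf8 != nullptr);
    std::size_t inLeft = std::strlen(utf8);

    // One wchar_t per input byte is always enough: a UTF-8 sequence of
    // n bytes never produces more than n wide code units.
    std::wstring out(inLeft, L'\0');

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8"); // arguments are (to, from)
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    char *inPtr = const_cast<char *>(utf8); // iconv() does not modify the input
    char *outPtr = reinterpret_cast<char *>(&out[0]);
    std::size_t outLeft = out.size() * sizeof(wchar_t);

    std::size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    if (rc == (std::size_t)-1)
        throw std::runtime_error("iconv failed: invalid or incomplete UTF-8");

    // Trim the part of the buffer that iconv() did not fill.
    out.resize(out.size() - outLeft / sizeof(wchar_t));
    return out;
}
```

The string is allocated before iconv_open() so that an allocation failure cannot leak the conversion descriptor. On systems where iconv lives in a separate library you may need to link with -liconv.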

Remember: you probably do not want wchar_t or std::wstring. What wchar_t does portably is not useful, and making it useful is not portable. That's life.

Source

2013-01-01 02:04:24
