How to count Unicode characters with C++

I wrote a simple program to count the occurrences of the different characters in a text file. This is the code:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <map>
using namespace std;
const char* filename="text.txt";
int main()
{
    map<char,int> dict;   // occurrences of each char, i.e. of each byte
    ifstream f(filename);
    char ch;
    while (f.get(ch))     // get() fails at end of file, so no extra eof() test is needed
    {
     cout<<ch;
     dict[ch]++;          // operator[] value-initializes a missing entry to 0
    }
    f.close();
    cout<<endl;
    for (auto it=dict.begin();it!=dict.end();it++)
    {
     cout<<(*it).first<<":\t"<<(*it).second<<endl;
    }
    system("pause");      // Windows-only: keeps the console window open
}

The program counts ASCII characters well, but it does not work on Unicode characters such as Chinese characters. How can I make it work with Unicode characters?
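
To make the failure concrete: `f.get(ch)` reads one byte at a time, and in UTF-8 a Chinese character takes three bytes, so each of those bytes gets its own map entry. A minimal sketch, assuming the file is UTF-8 (the question does not say which encoding is used); `countBytes` is a hypothetical helper that mirrors the loop above on an in-memory string:

```cpp
#include <map>
#include <string>

// Hypothetical helper: counts occurrences of each byte (char) in the input,
// the same way the map<char,int> loop in the question does.
std::map<char, int> countBytes(const std::string& text)
{
    std::map<char, int> dict;
    for (char ch : text)
        ++dict[ch];
    return dict;
}
```

For the UTF-8 string "a\xE6\xB1\x89" (the letter a followed by 汉, U+6C49) the map ends up with four keys: one for 'a' and one for each of the three bytes of 汉, none of which is printable on its own.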


2013-05-20 罗泽轩

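First, you will need to settle on an encoding. Do you know which encoding you intend to use? Then you need to figure out what exactly you mean by "character". –

There is no such thing as a "Unicode character". See utf8everywhere.org for the differences between the various notions of a character in Unicode, or the "how Twitter counts characters" article to compare the different approaches. In either case, counting code points makes little sense. –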

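You need a Unicode library to handle Unicode characters. Decoding, say, UTF-8 yourself would be a hard task and would mean reinventing the wheel.

A good one is mentioned in this Q/A from SO, and you will find further suggestions in the other answers there.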


2013-05-20 16:18:59

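In addition to ring0's references, there is also a good explanation at http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring –

C++11 handles Unicode, doesn't it? – cubuspl42

For something this simple, interpreting UTF-8 yourself is simple and straightforward, and it avoids having to go through all the conversion work. –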
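
As an illustration of how little is needed when only the number of code points matters: every UTF-8 sequence has exactly one byte that is not a continuation byte, so counting those bytes counts the code points. This is a sketch that assumes the input is already valid UTF-8; `countCodePoints` is a name made up for the example.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is valid UTF-8; malformed input is not detected.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}
```

This gives a total only. Counting how often each individual code point occurs needs the sequence boundaries as well.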

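There are wide-character versions of everything, but if you want to do something very similar to what you have now, using the 16-bit version of Unicode: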

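map<unsigned short,int> dict;
ifstream f(filename, ios::binary);
char lo, hi;
// Read one 16-bit unit (two bytes) per iteration; stop at end of file
// or when only a single trailing byte is left.
while (f.get(lo) && f.get(hi))
{
    // Beware endian issues here - should work either way for char counting though.
    // Go through unsigned char so that bytes >= 0x80 are not sign-extended.
    unsigned short val = static_cast<unsigned short>(
        static_cast<unsigned char>(lo) | (static_cast<unsigned char>(hi) << 8));
    dict[val]++;
}
f.close();
for (auto it=dict.begin();it!=dict.end();it++)
{
    // The key is printed as a number (the value of the 16-bit unit).
    cout<<(*it).first<<":\t"<<(*it).second<<endl;
}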

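The code above makes a lot of assumptions (all characters are 16 bits, an even number of bytes in the file, etc.), but it should do what you want, or at least give you a quick idea of how it could work with wide characters.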


2013-05-20 16:23:40

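Unfortunately, some characters are not 16 bits. The code only prints numbers to the screen (even though I used static_cast to change the type). I don't know how to map the numbers to real characters. –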
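
One way to turn such a number back into something printable is to encode the code point as UTF-8 and write those bytes to the output. This is only a sketch: it assumes the terminal expects UTF-8 and that the number is a complete code point (not half of a surrogate pair). `toUtf8` is a hypothetical helper, not a standard function.

```cpp
#include <string>

// Encodes one Unicode code point as a UTF-8 byte string.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate);
// no validation is performed.
std::string toUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `std::cout << toUtf8(0x6C49)` writes the three bytes E6 B1 89, which a UTF-8 terminal shows as 汉.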

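First, what do you want to count? Unicode code points or grapheme clusters, i.e. characters in the encoding sense, or characters as perceived by the reader? Also note that "wide characters" (16-bit characters) are not Unicode characters (UTF-16 is variable-length, just like UTF-8!).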
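
To illustrate the variable length of UTF-16: a code point above U+FFFF is stored as a surrogate pair, i.e. two 16-bit units, so the number of units is not the number of code points. A small sketch using the C++11 `std::u16string` type; `countCodePointsUtf16` is a made-up name, and well-formed input is assumed.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-16 string. A code point above U+FFFF is
// stored as two code units: a high surrogate (0xD800-0xDBFF) followed by
// a low surrogate (0xDC00-0xDFFF). Assumes well-formed UTF-16.
std::size_t countCodePointsUtf16(const std::u16string& s)
{
    std::size_t n = 0;
    for (char16_t unit : s)
        if (unit < 0xDC00 || unit > 0xDFFF)  // skip low surrogates
            ++n;
    return n;
}
```

For example, `std::u16string(u"\U0001F600").size()` is 2, while the function above returns 1 for the same string.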

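In any case, get a library (such as ICU) to do the actual code point/cluster iteration. For counting, you need a suitable type to replace the char key type of the map (a 32-bit unsigned int for code points, or a normalized string for grapheme clusters; normalization should, again, be taken care of by the library).

ICU: http://icu-project.org

Grapheme clusters: http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

Normalization: http://unicode.org/reports/tr15/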
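
To show what the change of key type looks like for the code point case, here is a sketch with a `std::map<char32_t, int>`. The decoder is a hand-rolled stand-in for the library iteration recommended above, not ICU; it assumes valid UTF-8 input, and `countByCodePoint` is a name invented for the example.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hand-rolled stand-in for a library iterator: decodes UTF-8 into code
// points and counts each one. Assumes valid UTF-8; does no normalization
// and knows nothing about grapheme clusters.
std::map<char32_t, int> countByCodePoint(const std::string& utf8)
{
    std::map<char32_t, int> dict;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        // Sequence length and payload bits are given by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        ++dict[cp];
        i += len;
    }
    return dict;
}
```

Without normalization, é written as the single code point U+00E9 and é written as e followed by U+0301 end up under different keys, which is why the normalization step matters.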


2013-05-20 16:31:01 Joe

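Yes. If you want to go beyond code points and deal with what a reader would consider a single character, that will be more work. You might also consider that most readers would regard 'A' and 'a' as the same character, or that 'a' and '' are the same character in French but different characters in Swedish. –

You have this in English too. Although the diaeresis is no longer fashionable, it is still sometimes used in words such as coöperate or naïve. – Joe

In German, you could even argue that ö should count as o and e, since technically it is a contraction of those two letters (rather than a single letter, as in Swedish). – Joe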

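If you can compromise and just count code points, it is fairly simple and straightforward to work with UTF-8 directly. Your dictionary will have to be a std::map<std::string, int>, however. Once you have the first char of a UTF-8 sequence: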

while (f.get(ch)) { 
    static size_t const charLen[] = 
    { 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
      2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
      4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 0, 0, 
    } ; 
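    int chLen = charLen[ static_cast<unsigned char>(ch) ];
    if (chLen <= 0) {
     // error: impossible first character for UTF-8
     continue;      // one possible policy: skip the byte
    }
    std::string codepoint(1, ch);
    -- chLen;
    bool valid = true;
    while (valid && chLen != 0) {
     if (!f.get(ch)) {
      // error: file ends in middle of a UTF-8 code point.
      valid = false;
     } else if ((ch & 0xC0) != 0x80) {
      // error: illegal following character in UTF-8
      f.unget();    // let the outer loop treat it as a first byte
      valid = false;
     } else {
      codepoint += ch;
      -- chLen;     // one continuation byte consumed
     }
    }
    if (valid) {
     ++ dict[codepoint];
    }
}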

You'll notice that most of the code is involved in error handling.


2013-05-20 16:39:45
