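Comparing 2 lists of strings in C++

In Python, comparing 2 lists of strings is very convenient (see link). I would like to know whether there is a good C++ solution in terms of performance. Each list contains more than a million strings.

The matching is case-sensitive.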

Source

2012-09-12 Stan

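Use the C++ analogue of Python's set: std::set – Yuushi

Are the lists sorted? –

The lists are not sorted. The goal is to find the strings that occur in both lists (the intersection). – Stan

The data types std::set<> (usually implemented as a balanced tree) and std::unordered_set<> (available since C++11, implemented as a hash table) are available. There is also a convenience algorithm called std::set_intersection that computes the actual intersection.

Here is an example.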

#include <iostream>
#include <vector>
#include <string>
#include <set>        // for std::set
#include <algorithm>  // for std::set_intersection
#include <iterator>   // for std::back_inserter

int main()
{
    std::set<std::string> s1 { "red", "green", "blue" };
    std::set<std::string> s2 { "black", "blue", "white", "green" };

    /* Collecting the results in a vector. The vector may grow quite
       large -- it may be more efficient to print the elements directly. */
    std::vector<std::string> s_both {};

    std::set_intersection(s1.begin(), s1.end(),
                          s2.begin(), s2.end(),
                          std::back_inserter(s_both));

    /* Printing the elements collected by the vector, just to show that
       the result is correct. */
    for (const std::string &s : s_both)
        std::cout << s << ' ';
    std::cout << std::endl;

    return 0;
}

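Note. If you want to use std::unordered_set<>, you cannot use std::set_intersection like this, because it requires its input ranges to be sorted. You have to use the usual for-loop technique instead: iterate over the smaller set and look up each element in the larger set to determine the intersection. Nevertheless, for large numbers of elements (especially strings), the hash-based std::unordered_set<> may be faster. There are also STL-compatible implementations, such as the one from Boost (boost::unordered_set) and the ones created by Google (sparse_hash_set and dense_hash_set). For various other implementations and benchmarks (including one for strings), see here.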

Source

2012-09-12 07:09:24 jogojapan

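"A for-loop iterating over the smaller set and looking up elements in the larger set" - that assumes one set contains all the elements of the other... more typically, you would want/need to mark/record the elements seen, using a second loop over the other set. Also worth noting that if the goal is to write the results out, creating a temporary 's_both' collection wastes memory, but it is a good example. –

@TonyDelroy Yes, putting the results into a vector may be wasteful. I will add a comment to the post saying that this is for illustration purposes. Regarding the other remark: I think the goal is to find the intersection (i.e. the elements the two sets have in common), because that is what the Python script the OP linked to does. For a set intersection, iterating over one list and searching the other is sufficient, even if the other list is not a superset of the first. (Of course this assumes that searching the other list is efficient, which is the case if the list is a hash set.) – jogojapan

@jogojapan: Sorry mate - I just read "compare" and did not follow the link to see that the Python is just an intersection (who wants to read Python? ;-P). Fair point then, +1 from me. –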

If you do not need much performance, I suggest using map/set from the STL:

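// Assumes <list>, <set> and <string> are included, and using namespace std.
list<string> list1, list2;  // renamed: a variable called "list" would hide the type
...
set<string> sndList;
list<string> result;

// Build a lookup set from the second list.
for(list<string>::iterator it = list2.begin(); it != list2.end(); ++it)
    sndList.insert(*it);

// Keep every element of the first list that also occurs in the second.
for(list<string>::iterator it = list1.begin(); it != list1.end(); ++it)
    if(sndList.count(*it) > 0)
        result.push_back(*it);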

Otherwise I suggest using some hash function for the comparison.

Source

2012-09-12 07:07:11 Ari

If what you have really is a std::list, sort both lists and use set_intersection:

list<string> words1; 
list<string> words2; 
list<string> common_words; 

words1.sort(); 
words2.sort(); 

set_intersection(words1.begin(), words1.end(),
                 words2.begin(), words2.end(),
                 back_inserter(common_words));

Source

2012-09-12 13:42:28
