比较python中的两个大列表

我有一个包含大约400个单词的列表。还有另一个列表，其中每个列表包含大约150,000个单词。这个清单有20个这样的清单。比较python中的两个大列表

现在我想看看这1500个单词列表中所有这400个单词中有多少单词出现。我也想从这400个单词中知道一个单词，在150k单词列表中出现多少次，其中哪些单词出现次数最多，次数多少等。

唯一的解决方案我能想到的是多项式时间解决方案。这是一个非常糟糕的解决方案，将是地狱很多慢：

for one_list in list_of_150kwords: 
    for key in 400_words: 
     for word in one_list: 
      if key == word: 
       # count this word 
       # do other stuff

这是一个非常丑陋和坏的解决方案，但我想不出什么更好的。我试图通过将这些列表转换成NumPy数组来尝试：

list_of_150kwords = numpy.array(list_of_150kwords) 
...

但我仍然觉得它很慢。其他解决方案？或者任何图书馆？

来源

2014-02-15 avi

这听起来像使用set的好机会：

set_of_150kwords = set(list_of_150kwords) 
one_set = set(one_list) 

len(one_set & set_of_150kwords) # set intersection is & 
=> number of elements common to both sets

按集理论，两个集合的交集给出了出现在都集合中的元素，那么它是采取一个简单的问题它的长度。对于第二部分（这些词中哪些最常出现，多少次等）创建一个Counter与list_of_150kwords，这将告诉你每个单词出现在列表中的次数。交集将告诉你哪些是常用词，解决你的两个要求。

来源

2014-02-15 18:17:11

哦，我没试过集。他们比NumPy更快吗？让我跑步，看看 – avi

我相信'set'和'Counter'是这里工作的正确工具，不仅仅是'numpy'数组。 –

但是我如何计算'one_list'中的单词出现在'set_of_150kwords'多少次？ – avi

from collections import Counter 

search_data = [ 
    ["list", "of", "150k", "words"], 
    ["another", "list", "of", "150k", "words"], 
    ["yet", "another", "list", "of", "150k", "words"] 
    # ... 17 more of these 
] 

search_words = ["four", "hundred", "words", "to", "search", "for"] 

def word_finder(words_to_find): 
    lookfor = set(word.lower() for word in words_to_find) 
    def get_word_count(text): 
     return Counter(word for word in (wd.lower() for wd in text) if word in lookfor) 
    return get_word_count 

def get_words_in_common(counters): 
    # Maybe use c.viewkeys() instead of set(c)? Which is faster? 
    return reduce(operator.and_, (set(c) for c in counters)) 

def main(): 
    wordcount = word_finder(search_words) 
    counters = [wordcount(wordlst) for wordlst in search_data] 
    common_to_all = get_words_in_common(counters) 
    print(common_to_all) 

if __name__=="__main__": 
    main()

来源

2014-02-15 20:57:19

“也许使用c.viewkeys（）而不是set（c）？哪个更快？“< - 不要怀疑，也不在乎，如果代码运行速度太慢，配置文件，如果这是缓慢的部分，* test *。在此之前，编写尽可能*可读*的代码。 –

这是一个Trie将是有用的地方的典型例子。你需要为你的每个150K列表创建一个Trie。然后你可以检查给定的单词是否存在于O（W）时间的列表中。其中W是单词的最大长度。

然后，您可以遍历400个单词列表并检查每个作品是否在150K字词列表中。

假设L即150K列表的数量远小于150K并且W远小于150K，那么集合连接将永远不会像Trie比较那样快。

的最终运行的复杂性：

N = 400 //Size of small list 
W = 10 // Max word Length 
M = 150K // Max size of the 150K lists 
P = 4 // Number of 150K lists 

P * M // To construct Trie 
N * P * W // To find each word in the 150k lists 
MP + NPW // Total complexit

来源

2014-02-16 21:47:31

经典的Map Reduce问题.... http://sist.sysu.edu.cn/~wuweig/DSCC/Inverted%20Indexing%20by%20MapReduce.pdf

来源

2014-02-17 07:22:24 roshan