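Find words that appear only once

I only want to retrieve the unique words in a file. Here is what I have so far, but is there a better way to do this in terms of big-O notation? Right now this is n squared.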

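def retHapax():
    file = open("myfile.txt")
    myMap = {}      # every word seen so far
    uniqueMap = {}  # words seen exactly once so far
    for i in file:
        myList = i.split(' ')
        for j in myList:
            j = j.rstrip()
            if j in myMap:
                # pop() with a default avoids a KeyError on the third and later occurrences
                uniqueMap.pop(j, None)
            else:
                myMap[j] = 1
                uniqueMap[j] = 1
    file.close()
    print(uniqueMap)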

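Source

2015-04-02 godzilla

Do you mean unique in the sense that they appear only once? – 2015-04-02 12:13:16

Yes, words that appear only once – godzilla 2015-04-02 12:16:04

This is O(n), not O(n^2), because Python dict/set lookups are O(1), unless you have weird keys that cause _lots_ of hash collisions. Your code would be slightly more memory-efficient if it used sets instead of dicts, but both are implemented as hash tables. Still, using Counter is a better plan: it makes the code easier to read, and it delegates more of the work to code that runs at C speed rather than doing the tests at Python speed. – 2015-04-02 12:31:26

Try this to get the unique words of the file, using Counter: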

from collections import Counter

with open("myfile.txt") as input_file:
    # Count every whitespace-separated word in the file
    word_counts = Counter(word for line in input_file for word in line.split())

# List of unique words, i.e. words that appear exactly once
# (Python 2; on Python 3 use .items() instead of .iteritems())
unique_words = [word for (word, count) in word_counts.iteritems() if count == 1]

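Source

2015-04-02 12:13:40 itzMEonTV

Can this be done using a set? – godzilla 2015-04-02 12:16:18

How does 'set(f)' find the unique words? – 2015-04-02 12:18:48

Updated, I think it can :) – itzMEonTV 2015-04-02 12:19:47

If you want to find all the unique words and treat foo the same as foo., you need to strip the punctuation.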

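from collections import Counter
from string import punctuation

with open("myfile.txt") as f:
    # Strip leading/trailing punctuation so that "foo" and "foo." count as the same word
    word_counts = Counter(word.strip(punctuation) for line in f for word in line.split())

# Python 2; on Python 3 use .items() instead of .iteritems()
print([word for word, count in word_counts.iteritems() if count == 1])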

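If you want to ignore case, you also need to use line.lower(). If you want to get the unique words accurately, there is more involved than just splitting the lines on whitespace.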
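As a rough sketch of what going beyond whitespace splitting might look like, the version below lowercases each line and pulls words out with a regular expression instead of `split()`. The function name `hapax_words` and the pattern `[a-z0-9']+` are assumptions made for illustration and are not part of the answer above; what counts as a "word" depends on your data. It is written for Python 3, so it uses `.items()`.

```python
import re
from collections import Counter

# Hypothetical word pattern: runs of ASCII letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")

def hapax_words(lines):
    """Return the set of words that occur exactly once, ignoring case and punctuation."""
    counts = Counter(
        word
        for line in lines
        for word in WORD_RE.findall(line.lower())
    )
    return {word for word, count in counts.items() if count == 1}

sample = ["Foo bar, foo.", "Baz qux bar!"]
print(sorted(hapax_words(sample)))  # ['baz', 'qux']
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, for example `with open("myfile.txt") as f: print(hapax_words(f))`.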

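Source

2015-04-02 12:16:15

Using 'print([k for k, v in c.items() if v == 1])' instead of '__getitem__' calls would be more efficient... – 2015-04-02 12:19:28

@JonClements, yes, it just took less time to write it the other way ;) – 2015-04-02 12:22:02

Using '.iteritems()' would be more efficient: smaller memory footprint. – EOL 2015-04-02 12:25:59

You can modify your logic slightly so that a word is moved out of the unique set on its second occurrence (this example uses sets instead of dicts):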

words = set()         # words seen two or more times
unique_words = set()  # words seen exactly once so far
with open("myfile.txt") as f:
    for w in (word.strip() for line in f for word in line.split(' ')):
        if w in words:
            continue
        if w in unique_words:
            unique_words.remove(w)
            words.add(w)
        else:
            unique_words.add(w)
print(unique_words)

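Source

2015-04-02 12:18:32 AChampion

I think the OP is trying to find the words that appear only once in the file. – hitzg 2015-04-02 12:20:31

@hitzg, the edit makes this answer correct as well. – 2015-04-02 13:49:16

If you just do 'line.split()' (with no arguments), there is no need for 'word.strip()'. – EOL 2015-04-03 03:56:03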
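To illustrate the point in the comment above, here is a small sketch comparing the two ways of splitting. The sample string and variable names are made up for the demonstration.

```python
line = "one  two three \n"

# split(' ') splits on every single space, so an empty string and the
# trailing newline survive as "words".
by_space = line.split(' ')

# split() with no argument splits on runs of any whitespace and ignores
# leading/trailing whitespace, so no strip() is needed afterwards.
by_whitespace = line.split()

print(by_space)       # ['one', '', 'two', 'three', '\n']
print(by_whitespace)  # ['one', 'two', 'three']
```

With `split(' ')`, stripping each piece turns the newline into an empty string, so the empty string can itself end up counted as a word.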

I would go with the collections.Counter approach, but if you only want to use sets, you can do it like this:

with open('myfile.txt') as input_file:
    all_words = set()
    dupes = set()
    for word in (word for line in input_file for word in line.split()):
        if word in all_words:
            dupes.add(word)
        all_words.add(word)

    unique = all_words - dupes

Given the input:

one two three 
two three four 
four five six

the output is:

{'five', 'one', 'six'}

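Source

2015-04-02 12:36:04

This is the most efficient solution – 2015-04-02 12:58:05

@Padraic, unless you have done some 'timeit's, I doubt it is... The Counter approach is more intuitive and more efficient – 2015-04-02 13:01:29

I just timed it: 1.16 ms versus 1.68 ms for 2000 words – 2015-04-02 13:01:43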
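For anyone who wants to measure this themselves, here is a minimal `timeit` sketch under stated assumptions. The function names, the synthetic 2000-word list and the repeat count are all made up for illustration; the comments above do not say which approach each figure belongs to, and this sketch is not expected to reproduce those numbers. It is written for Python 3.

```python
import random
import timeit
from collections import Counter

def once_with_counter(words):
    """Counter approach: count everything, keep words with a count of 1."""
    counts = Counter(words)
    return {w for w, c in counts.items() if c == 1}

def once_with_sets(words):
    """Two-set approach: all words seen minus words seen more than once."""
    seen, dupes = set(), set()
    for w in words:
        if w in seen:
            dupes.add(w)
        seen.add(w)
    return seen - dupes

# Hypothetical test data: 2000 words drawn from a 500-word vocabulary.
random.seed(0)
words = ["w%d" % random.randrange(500) for _ in range(2000)]

# Both approaches should agree before their timings are compared.
assert once_with_counter(words) == once_with_sets(words)

runs = 200
for func in (once_with_counter, once_with_sets):
    seconds = timeit.timeit(lambda: func(words), number=runs)
    print("%s: %.3f ms per call" % (func.__name__, seconds / runs * 1000))
```

Results will vary with the Python version, the vocabulary size and how often words repeat, so it is worth running this on data that resembles the real file.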
