Element-wise list comparison with nested for loops

As a new approach to the challenge I described here, I have put together the following:

from difflib import SequenceMatcher 

def similar(a, b): 
    return SequenceMatcher(None, a, b).ratio() 

diffs =[ 
"""- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""", 
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""", 
"""+ Here's a new paragraph I added for testing."""] 

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

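The idea is to iterate over the strings and compare each one with every other string. If a given string is judged similar to another string, the other string is removed; otherwise the given string is treated as unique in the list.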

Here is the output:

"- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence 


"- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence 
"+ Here's a new paragraph I added for testing." is a new sentence

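So it correctly detects that the first two sentences are similar and that the last one is unique. The problem is that it then goes back and reports the first sentence as unique (it is not, and the loop should not return to that sentence at all).

Where is the flaw in my loop logic? Can this be achieved without nesting for loops and removing elements?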

Source

2016-02-19 Pyderman

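**Do not** modify a list while iterating over it – spicavigo

@spicavigo Well, that's obvious, hence the question. – Pyderman

You can't remove items from 'diffs' while you are still iterating over it; that will mess up the iteration. Instead, accumulate a list of diffs to remove and remove them at the end. Also, you could use 'itertools.combinations' instead of the nested for loops to speed up the code. – BrenBarn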
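The suggestion in the comments above could be sketched roughly as follows. This is a Python 3 sketch, not code from the thread: the function name split_similar, the threshold parameter, and the shortened sample sentences are assumptions made for illustration. It visits each pair once via itertools.combinations and only filters the list after the loop has finished, so nothing is removed during iteration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def split_similar(diffs, threshold=0.7):
    """Return (pairs, unique) without mutating diffs while looping (hypothetical helper)."""
    pairs = []
    matched = set()  # indices that took part in at least one similar pair
    # combinations() yields each unordered index pair exactly once.
    for i, j in combinations(range(len(diffs)), 2):
        if similar(diffs[i], diffs[j]) > threshold:
            pairs.append((diffs[i], diffs[j]))
            matched.update((i, j))
    # Filtering happens only after the loop is done.
    unique = [d for k, d in enumerate(diffs) if k not in matched]
    return pairs, unique


# Shortened stand-ins for the sentences in the question (assumed sample data).
diffs = [
    "- The offset ends at age 65 for disability beneficiaries.",
    "+ The offset ends at age 68 for disability beneficiaries.",
    "+ Here's a new paragraph I added for testing.",
]
pairs, unique = split_similar(diffs)
```

Working with indices rather than the strings themselves also avoids the 'i != s' filter, which would drop exact duplicate strings from the comparison.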

from difflib import SequenceMatcher 
from collections import defaultdict 

def similar(a, b): 
    return SequenceMatcher(None, a, b).ratio() 

diffs =[ 
"""- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""", 
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""", 
"""+ Here's a new paragraph I added for testing."""] 


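sims = set()                 # indices already matched to an earlier string
simdict = defaultdict(list)  # index -> indices of the strings similar to it
for i in range(len(diffs)):
    if i in sims:
        continue             # already reported as similar to an earlier string
    s = diffs[i]

    # Compare only with later strings, so each pair is checked once.
    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)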

Source

2016-02-19 21:44:44 spicavigo

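Thanks. 'remaining = diff[:]' should read 'remaining = diffs[:]'. Even with that change, the output shows that the logic is not doing what it is supposed to do: http://pastebin.com/xQSRjEV5 – Pyderman

It should be 'if not flag' – spicavigo

Since you take a copy of 'diffs' as you have, I think keeping '.remove()' is fine. But you still have the typo ('diff' / 'diffs'), and your code still doesn't work once the typo is fixed. – Pyderman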

You can see clearly when it decides that the first sentence is unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you see clearly where your loop fails.
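With that change the output shows the "new sentence" message firing for a single dissimilar pair, before the string has been compared with everything else. One possible way to defer the decision is a for/else, sketched below in Python 3. The function name classify, the tuple-based report, and the shortened sample sentences are assumptions for illustration, not code from the thread.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def classify(diffs, threshold=0.7):
    """Return ('same', i, j) or ('new', i) entries (hypothetical helper)."""
    report = []
    consumed = set()  # indices already reported as the second half of a pair
    for i, s in enumerate(diffs):
        if i in consumed:
            continue  # never revisit a sentence that was already matched
        for j in range(i + 1, len(diffs)):
            if j not in consumed and similar(s, diffs[j]) > threshold:
                consumed.add(j)
                report.append(('same', i, j))
                break  # this sketch pairs s with its first match only
        else:
            # Runs only if the inner loop ended without a break,
            # i.e. s matched none of the later sentences.
            report.append(('new', i))
    return report


# Shortened stand-ins for the sentences in the question (assumed sample data).
diffs = [
    "- The offset ends at age 65 for disability beneficiaries.",
    "+ The offset ends at age 68 for disability beneficiaries.",
    "+ Here's a new paragraph I added for testing.",
]
report = classify(diffs)
```

Here the first sentence is reported once, as part of a pair, and the list itself is never modified.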

Source

2016-02-19 21:59:22 charfellow

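Since modified strings will always appear back to back (one preceded by "-", the other by "+"), the following can be done (I believe it will work in all cases).

When the list has an odd number of elements, the last one must be a new sentence.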

def extract_modified_and_new(diffs):
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, ' and ', z2, 'are new'
            print
    if len(diffs) % 2 != 0:
        print diffs[-1], ' is new'
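Pairing by even and odd positions relies on every modified pair starting at an even index; a standalone line ahead of a pair shifts the alignment. A variation that keeps the back-to-back assumption but walks the list one step at a time might look like the Python 3 sketch below. The name pair_adjacent, the returned lists, and the shortened sample sentences are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def pair_adjacent(diffs, threshold=0.7):
    """Return (modified, unpaired); modified holds (old, new) tuples (hypothetical helper)."""
    modified, unpaired = [], []
    i = 0
    while i < len(diffs):
        current = diffs[i]
        nxt = diffs[i + 1] if i + 1 < len(diffs) else None
        if (nxt is not None and current.startswith('-')
                and nxt.startswith('+') and similar(current, nxt) > threshold):
            modified.append((current, nxt))
            i += 2  # both lines are consumed by the pair
        else:
            unpaired.append(current)
            i += 1  # only this line is consumed; the next gets its own turn
    return modified, unpaired


# Shortened stand-ins, with the standalone line placed first (assumed sample data).
diffs = [
    "+ Here's a new paragraph I added for testing.",
    "- The offset ends at age 65 for disability beneficiaries.",
    "+ The offset ends at age 68 for disability beneficiaries.",
]
modified, unpaired = pair_adjacent(diffs)
```

An unpaired "+" line is an added sentence, while an unpaired "-" line would be a removed one, so the caller can tell them apart by the prefix.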

Source

2016-02-20 04:22:13 Pyderman
