2016-02-19 85 views
0

Element-wise list comparison with nested for loops

As a new approach to the challenge I described here, I have put together the following:

from difflib import SequenceMatcher 

def similar(a, b): 
    return SequenceMatcher(None, a, b).ratio() 

diffs =[ 
"""- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""", 
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""", 
"""+ Here's a new paragraph I added for testing."""] 

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)

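The idea is to iterate over the strings and compare each one with every other. If a given string is deemed similar to another string, the other string is removed; otherwise the given string is treated as unique within the list.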

Here is the output:

"- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence 


"- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence 
"+ Here's a new paragraph I added for testing." is a new sentence 

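So it correctly detects that the first two sentences are similar and that the last one is unique. The problem is that it then goes back and reports the first sentence as unique (it is not, and the loop should not return to this sentence at all).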

Where is the flaw in my loop logic? Can this be achieved without nested for loops and without removing elements?

+3

**Do not** modify a list while iterating over it – spicavigo
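To make the comment above concrete, here is a minimal sketch of what goes wrong. The list `numbers` and the names `visited` and `kept` are illustrative and not part of the question's code; the snippet is written so that it also runs on Python 3.

```python
# Illustrative list (not from the question): try to remove every even number
# while looping over the same list.
numbers = [1, 2, 4, 5, 6]

visited = []
for n in numbers:
    visited.append(n)
    if n % 2 == 0:
        # Removing shifts the later items one slot to the left, so the
        # iterator steps over the item that slid into the current position.
        numbers.remove(n)

# 4 was never visited, so it was never removed.
# visited is now [1, 2, 5, 6] and numbers is [1, 4, 5].

# Safer: build a new list instead of mutating the one being iterated.
kept = [n for n in [1, 2, 4, 5, 6] if n % 2 != 0]
```

The same skipping can happen to the outer `for s in diffs` loop whenever `diffs.remove(j)` runs inside it.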

+0

@spicavigo OK, that much is obvious, hence the question. – Pyderman

+1

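You can't remove items from `diffs` while you are still iterating over it; that will mess up the iteration. Instead, accumulate a list of the diffs to remove and remove them at the end. Also, you could use `itertools.combinations` instead of the nested for loops to speed up the code. – BrenBarn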
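One way to read the suggestion above is sketched here: visit each pair once with `itertools.combinations`, record the indices that matched, and only filter the list afterwards. The function name `split_similar`, its `threshold` parameter, and the shortened sample strings are assumptions made for illustration; the snippet is written so that it also runs on Python 3.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def split_similar(diffs, threshold=0.7):
    """Return (pairs, unique): matched index pairs and the unmatched strings."""
    pairs = []
    matched = set()  # indices to drop, collected first and applied at the end
    for (i, a), (j, b) in combinations(enumerate(diffs), 2):
        if i in matched or j in matched:
            continue  # each string takes part in at most one pair
        if similar(a, b) > threshold:
            pairs.append((i, j))
            matched.update((i, j))
    # The original list is never mutated; the unmatched strings are filtered out here.
    unique = [s for k, s in enumerate(diffs) if k not in matched]
    return pairs, unique


# Shortened stand-ins for the question's strings (illustrative only).
diffs = [
    "- The offset ends for disability beneficiaries at age 65.",
    "+ The offset ends for disability beneficiaries at age 68.",
    "+ Here's a new paragraph I added for testing.",
]
pairs, unique = split_similar(diffs)
```

With these inputs, `pairs` holds the index pair of the two near-identical sentences and `unique` holds only the test paragraph.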

Answers

1
from difflib import SequenceMatcher 
from collections import defaultdict 

def similar(a, b): 
    return SequenceMatcher(None, a, b).ratio() 

diffs =[ 
"""- It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""", 
"""+ It contains a Title II provision that changes the age at which workers 
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""", 
"""+ Here's a new paragraph I added for testing."""] 


sims = set() 
simdict = defaultdict(list) 
for i in range(len(diffs)):
    if i in sims:
        # already matched to an earlier string, so skip it
        continue
    s = diffs[i]

    # compare only with later strings so each pair is checked once
    for j in range(i+1, len(diffs)):
        r = diffs[j]
        if similar(s, r) > 0.7:
            sims.add(j)
            simdict[i].append(j)


for k, v in simdict.iteritems():
    print diffs[k] + " is similar to:"
    print '\n'.join(diffs[e] for e in v)
+0

Thanks. `remaining = diff[:]` should read `remaining = diffs[:]`. Even with that change, the output shows that the logic is not doing what it is meant to do: http://pastebin.com/xQSRjEV5 – Pyderman

+0

It should be `if not flag` – spicavigo

+0

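Since you take a copy of `diffs` as you have, I think keeping `.remove()` is fine. But you still have the typo (`diff`/`diffs`), and your code still doesn't work once the typo is fixed. – Pyderman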

0

You can see clearly when it decides that the first sentence is unique by changing

print '"{}" is a new sentence'.format(s)

to

print '"{}" and "{}" are different sentences'.format(s,j)

This should help you see clearly where your loop fails.
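The changed message shows that the `else` branch runs once per dissimilar pair, not once per sentence, so a sentence that has already found its match is still reported as new when it meets an unrelated string. A minimal sketch of deciding once per sentence follows; the function name `classify`, the labels "modified" and "new", and the shortened sample strings are assumptions for illustration, and the snippet is written so that it also runs on Python 3.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def classify(diffs, threshold=0.7):
    """Label each string once, after comparing it with all the others."""
    labels = []
    for i, s in enumerate(diffs):
        # A string is "new" only if no other string in the list resembles it.
        has_match = any(
            similar(s, other) > threshold
            for j, other in enumerate(diffs)
            if j != i
        )
        labels.append("modified" if has_match else "new")
    return labels


# Shortened stand-ins for the question's strings (illustrative only).
diffs = [
    "- The offset ends for disability beneficiaries at age 65.",
    "+ The offset ends for disability beneficiaries at age 68.",
    "+ Here's a new paragraph I added for testing.",
]
labels = classify(diffs)
```

Because nothing is removed from `diffs`, the iteration order is never disturbed.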

0

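Since the modified strings will always appear back to back (one preceded by "-", the other by "+"), the following can be done (I believe it will work in all cases).

When the list has an odd number of elements, the last one is necessarily a new sentence.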

def extract_modified_and_new(diffs):
    for z1, z2 in zip(diffs[::2], diffs[1::2]):
        if similar(z1, z2) > 0.7:
            print z1, 'is similar to', z2
            print
        else:
            print z1, ' and ', z2, 'are new'
            print
    if len(diffs) % 2 != 0:
        print diffs[-1], ' is new'
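Pairing fixed slots (0 with 1, 2 with 3, and so on) depends on every pair starting at an even index, so a single new line ahead of a modified pair would shift the pairing. A possible variation, sketched under the assumption that a modified sentence always shows up as a "-" line immediately followed by its "+" line, walks the list and advances by two only when such a pair is found. The names `pair_adjacent`, `modified`, and `unpaired` and the shortened sample strings are hypothetical; the snippet is written so that it also runs on Python 3.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


def pair_adjacent(diffs, threshold=0.7):
    """Return (modified, unpaired) by scanning for adjacent "-"/"+" pairs."""
    modified, unpaired = [], []
    i = 0
    while i < len(diffs):
        nxt = i + 1
        if (
            nxt < len(diffs)
            and diffs[i].startswith("-")
            and diffs[nxt].startswith("+")
            and similar(diffs[i], diffs[nxt]) > threshold
        ):
            modified.append((diffs[i], diffs[nxt]))
            i += 2  # consume both halves of the pair
        else:
            unpaired.append(diffs[i])
            i += 1  # re-check the next line as a possible pair start
    return modified, unpaired


# Shortened stand-ins for the question's strings, with the new paragraph first.
old = "- The offset ends for disability beneficiaries at age 65."
new = "+ The offset ends for disability beneficiaries at age 68."
added = "+ Here's a new paragraph I added for testing."
modified, unpaired = pair_adjacent([added, old, new])
```

Here the new paragraph sits at index 0, yet the modified pair is still found because the scan does not assume that pairs start at even positions.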