As a new approach to solving the challenge I described here, I've put together the following element-wise list comparison using nested for loops:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

diffs = [
    """- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
    """+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
    """+ Here's a new paragraph I added for testing."""]

for s in diffs:
    others = [i for i in diffs if i != s]
    for j in others:
        if similar(s, j) > 0.7:
            print '"{}" and "{}" refer to the same sentence'.format(s, j)
            print
            diffs.remove(j)
        else:
            print '"{}" is a new sentence'.format(s)
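The idea is to iterate over the strings and compare each one against every other. If a given string is considered similar to another string, the other string is removed; otherwise the given string is treated as unique in the list.
Here is the output: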
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." and "+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA)." refer to the same sentence
"- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA)." is a new sentence
"+ Here's a new paragraph I added for testing." is a new sentence
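So it correctly detects that the first two sentences are similar and that the last one is unique. The problem is that it then goes back and reports the first sentence as unique (it isn't, and the loop shouldn't be returning to that sentence at all).
Where is the flaw in my loop logic? Can this be achieved without nested for loops and without removing elements?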
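**Do not** modify a list while iterating over it – spicavigo
@spicavigo Well, that's obvious, hence the question. – Pyderman
You can't remove items from 'diffs' while you're still iterating over it; that will mess up the iteration. Instead, accumulate a list of diffs to remove and remove them at the end. Also, you could probably use 'itertools.combinations' instead of the nested for loops to speed up the code. – BrenBarn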
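A minimal sketch of what the last comment seems to suggest, not a definitive fix. It assumes Python 3 (hence `print()` rather than the Python 2 print statement used in the question). The helper name `split_unique_and_matched` and its `threshold` parameter are hypothetical, introduced only for illustration. It visits each unordered pair once with `itertools.combinations`, records which entries matched, and filters only after the comparison loop has finished, so `diffs` is never modified while it is being iterated:

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b):
    # Same helper as in the question: similarity ratio between 0.0 and 1.0.
    return SequenceMatcher(None, a, b).ratio()


def split_unique_and_matched(diffs, threshold=0.7):
    """Hypothetical helper: return (pairs, unique) without mutating diffs."""
    pairs = []       # (i, j) index pairs judged to be the same sentence
    matched = set()  # indices that took part in at least one pair
    # combinations() yields each unordered index pair exactly once.
    for i, j in combinations(range(len(diffs)), 2):
        if similar(diffs[i], diffs[j]) > threshold:
            pairs.append((i, j))
            matched.update((i, j))
    # Filter only after the comparison loop has finished.
    unique = [s for k, s in enumerate(diffs) if k not in matched]
    return pairs, unique


diffs = [
    """- It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).""",
    """+ It contains a Title II provision that changes the age at which workers
compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).""",
    """+ Here's a new paragraph I added for testing."""]

pairs, unique = split_unique_and_matched(diffs)

for i, j in pairs:
    print('"{}" and "{}" refer to the same sentence'.format(diffs[i], diffs[j]))
for s in unique:
    print('"{}" is a new sentence'.format(s))
```

With the three strings from the question, this reports the first two as the same sentence and only the last one as new. Working with indices rather than the strings themselves also avoids the `i != s` filter, which would drop genuine duplicates from the comparison.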