Is there a way to remove duplicate and consecutive words/phrases in a string?

Is there a way to remove duplicate and consecutive words/phrases in a string? E.g.

[IN]: foo foo bar bar foo bar

[OUT]: foo bar foo bar

I have tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool' 
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j] 
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu'] 
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]]) 
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu'

What happens when it gets a little more complicated and I want to remove phrases (let's say a phrase can be made up of up to 5 words)? How can that be done? E.g.

[IN]: foo bar foo bar foo bar

[OUT]: foo bar

Another example:

[IN]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .

[OUT]: this is a sentence where phrases duplicate . sentence are not prhases .

Source

2014-02-27 alvas

You can use the re module.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s) 
'foo bar' 

>>> s = 'foo bar foo bar foo bar' 
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s) 
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar' 
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s) 
'foo bar'

Edit. An addition for your last example. To handle it, you have to keep calling re.sub for as long as there are repeated phrases. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate' 
>>> while re.search(r'\b(.+)(\s+\1\b)+', s): 
...     s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
... 
>>> s 
'this is a sentence where phrases duplicate'

Source

2014-02-27 10:19:04 sharcashmo

Clever answer! +1 But would there be performance problems if this were applied to a very large string? – ridgerunner

-1

txt1 = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
txt2 = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'

def remove_duplicate_words(txt):
    # Keeps only the first occurrence of each word, whether or not the repeats are adjacent
    result = []
    for word in txt.split():
        if word not in result:
            result.append(word)
    return ' '.join(result)

Output:

In [7]: remove_duplicate_words(txt1)
Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'

In [8]: remove_duplicate_words(txt2)
Out[8]: 'this is a sentence where phrases duplicate'

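Source

2014-02-27 10:45:52

-1

This should work for any number of adjacent duplicates, and it works with both of your examples. I convert the string to a list, fix it, then convert it back to a string for output: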

mywords = "foo foo bar bar foo bar"
words = mywords.split()  # renamed from `list` to avoid shadowing the built-in

def remove_adjacent_dups(alist):
    result = []
    most_recent_elem = None
    for e in alist:
        # Keep an element only when it differs from the one just before it
        if e != most_recent_elem:
            result.append(e)
            most_recent_elem = e
    to_string = ' '.join(result)
    return to_string

print(remove_adjacent_dups(words))

Output:

foo bar foo bar

Source

2014-03-12 15:57:33 James

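I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups repeated, sequential items from that list into tuples of (item_value, iterator_of_those_values). Use it like this:

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'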
>>> ' '.join(item[0] for item in groupby(s.split())) 
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
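
To see what groupby is handing back, here is a minimal sketch that lists each run of consecutive equal words together with its length. The sample sentence is the first one from the question; the names words and runs are only for illustration.

```python
from itertools import groupby

words = "foo foo bar bar foo bar".split()

# groupby starts a new group every time the value changes, so equal words
# that are not adjacent end up in separate groups.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the keys collapses each run to a single word.
print(" ".join(key for key, _ in groupby(words)))  # foo bar foo bar
```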

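Let's extend that a little with a function that returns a list with its consecutive repeated values removed (this is the dedupe function in the full listing further down).

That's great for one-word phrases, but it doesn't help with longer phrases. What to do? Well, first, we want to check for longer phrases by striding over our original phrase: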

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
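
As a quick illustration of what stride yields, the sketch below runs it on a made-up five-word list. Chunks of the requested length are cut starting at the offset, and any words before the offset come out first as their own chunk. The generator is repeated here only so the snippet runs on its own.

```python
def stride(lst, offset, length):
    # Same generator as above, repeated so this snippet is self-contained.
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = "a b c d e".split()  # sample data, not from the question

print(list(stride(words, 0, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
print(list(stride(words, 1, 2)))  # [['a'], ['b', 'c'], ['d', 'e']]
```

Shifting the offset is what lets a later pass catch a repeated two-word phrase that starts on an odd position.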

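Now we're cooking! OK. So our strategy is to first remove all the single-word duplicates. Next, we remove the two-word duplicates, starting from offset 0 and then offset 1. After that, the three-word duplicates starting from offsets 0, 1 and 2, and so on until we hit five-word duplicates: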

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    # Yield the words before the offset as one chunk, then chunks of `length` words
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    # Collapse runs of equal consecutive chunks, then flatten back into one list of words
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    # Try every phrase length up to the maximum, at every possible starting offset
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)

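Source

2014-03-12 16:37:06

verbose yet itertooly =) – alvas

heh! I'm sure this could be shortened, and parts of it could be one-liners, but I was aiming for a balance between brevity and readability. I hope I hit it. :-) –

Personally, I don't think we need any other modules for this (although I admit some of them are great). I just managed it with simple loops, after first converting the string to a list. I tried it on all the examples listed above. It works fine.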

sentence = input("Please enter your sentence:\n")

word_list = sentence.split()

def check_if_same(i, j):  # checks if two adjacent phrases (slices of word_list) are the same

    end = (2 * j) - i  # end point of the second of the two phrases to compare (essentially j + phrase_len)
    is_same = word_list[i:j] == word_list[j:end]
    # The line below is just for debugging. Prints the phrases being compared and whether they are equal
    # print("Comparing: " + ' '.join(word_list[i:j]) + " " + ' '.join(word_list[j:end]) + " " + str(is_same))

    return is_same

phrase_len = 1

while phrase_len <= len(word_list) // 2:  # checks the sentence for different phrase lengths

    curr_word_index = 0

    while curr_word_index < len(word_list):  # checks all the words of the sentence for the specified phrase length

        result = check_if_same(curr_word_index, curr_word_index + phrase_len)  # checks similarity

        if result:
            del word_list[curr_word_index : curr_word_index + phrase_len]  # deletes the repeated phrase
        else:
            curr_word_index += 1

    phrase_len += 1

print("Answer: " + ' '.join(word_list))
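
The same idea can be sketched as a function, which is easier to test than a script that reads from input. This is only a sketch under stated assumptions: the name remove_repeated_phrases and the max_phrase_len parameter (defaulting to the 5-word limit mentioned in the question) are hypothetical and not part of the answer above, which tries every phrase length up to half the sentence.

```python
def remove_repeated_phrases(sentence, max_phrase_len=5):
    """Collapse immediately repeated phrases of up to max_phrase_len words.

    Hypothetical helper for illustration only.
    """
    words = sentence.split()
    for phrase_len in range(1, max_phrase_len + 1):
        i = 0
        # Stop once there is no room left for two back-to-back phrases.
        while i + 2 * phrase_len <= len(words):
            first = words[i:i + phrase_len]
            second = words[i + phrase_len:i + 2 * phrase_len]
            if first == second:
                # Drop one copy and re-check the same position, so that
                # three or more repetitions collapse to a single copy.
                del words[i:i + phrase_len]
            else:
                i += 1
    return " ".join(words)


print(remove_repeated_phrases("foo bar foo bar foo bar"))  # foo bar
print(remove_repeated_phrases("foo foo bar bar foo bar", max_phrase_len=1))  # foo bar foo bar
```

Handling shorter phrases first matters here: collapsing "sentence sentence sentence" to one word is what makes the two copies of "this is a sentence" in the question's longer example line up for the four-word pass.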

Source

2014-03-17 11:06:36 sshashank124

With a pattern similar to sharcashmo's, you can use subn, which also returns the number of replacements made, in a while loop:

import re

txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'

pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
repl = r'\1'

res = txt

while True:
    res, nbr = pattern.subn(repl, res)  # subn returns (new_string, number_of_substitutions)
    if nbr == 0:
        break

print(res)

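The while loop stops when there are no more replacements to make.

With this method you can catch all overlapping matches (which is impossible in a single pass in a replacement context) without testing the same pattern twice.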
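
If phrases are capped at five words, as the question assumes, the same loop can be sketched with a bounded quantifier. The function name collapse_repeats and the {0,4} bound are assumptions made for illustration. Like the pattern above, it assumes words are runs of \w characters separated by single spaces.

```python
import re

# One word plus up to four more (so at most five words), followed by one
# or more immediate repetitions of that same phrase.
PATTERN = re.compile(r'\b(\w+(?: \w+){0,4})(?: \1)+\b')


def collapse_repeats(text):
    """Hypothetical helper: re-run the substitution until nothing changes."""
    while True:
        text, count = PATTERN.subn(r'\1', text)
        if count == 0:
            return text


print(collapse_repeats('foo bar foo bar foo bar'))  # foo bar
print(collapse_repeats('foo foo bar bar foo bar'))  # foo bar
```

With the bound in place, a repeated phrase longer than five words is left untouched, which may or may not be what you want.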

Source

2014-03-18 18:40:35
