Highlighting certain words that appear in sequence

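I am trying to print a text while highlighting certain words and word bigrams. This would be fairly straightforward if I did not also need to print the other tokens, such as punctuation.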

I have a list of words to highlight and another list of word bigrams to highlight.

Highlighting individual words is fairly easy, for example:

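import re
import string

# Split on punctuation, spaces and newlines. The capture group keeps the
# separators in the token list; re.escape keeps ']' and '\' literal.
regex_pattern = re.compile("([%s \n])" % re.escape(string.punctuation))

def highlighter(content, terms_to_hightlight):
    tokens = regex_pattern.split(content)
    for token in tokens:
        if token.lower() in terms_to_hightlight:
            # ANSI escape codes: black text on a green background, then reset
            print('\x1b[6;30;42m' + token + '\x1b[0m', end="")
        else:
            print(token, end="")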
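
To make the tokenization step concrete, here is a minimal sketch of what splitting with a capturing group returns. The sample string is made up for illustration; the point is that the separators stay in the list, so joining the tokens reproduces the original text.

```python
import re
import string

# re.escape makes sure characters such as ']' and '\' are taken
# literally inside the character class.
regex_pattern = re.compile("([%s \n])" % re.escape(string.punctuation))

tokens = regex_pattern.split("Hi, you!")
print(tokens)
# ['Hi', ',', '', ' ', 'you', '!', '']

# Empty strings appear between adjacent separators and at the end,
# but joining everything gives back the input unchanged.
print("".join(tokens) == "Hi, you!")
# True
```

Because every token, word or not, is printed back in order, punctuation and spacing survive the highlighting pass.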

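Highlighting only words that appear in sequence is more complicated. I have been playing around with iterators, but have not been able to come up with anything that is not overly complicated.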

Source

2017-05-23 Mountain_sheep

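Can you provide an example where your 'highlighter' function works as expected, and one where it does not? Hint: what do "words that appear in sequence" look like? – blacksite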

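You could first split the text into a list and then iterate over that list (similar to what you have already done). Then go through the list and check whether the current element and the next element form a valid bigram; if so, push the words into a separate list as "highlighted". Otherwise, push them into the list as "unhighlighted". Make sure you always check whether the previous bigram has already highlighted the current item (in the new list). –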
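
One possible reading of the comment above, as a rough sketch: keep one boolean flag per word and set it when the word belongs to any listed bigram. The function name `mark_bigram_words` is hypothetical, and the sketch assumes the text has already been reduced to a list of words, so it does not deal with punctuation tokens.

```python
def mark_bigram_words(words, bigrams):
    """Return one flag per word: True if the word is part of a listed bigram."""
    pairs = {(a.lower(), b.lower()) for a, b in bigrams}
    marked = [False] * len(words)
    for i in range(len(words) - 1):
        if (words[i].lower(), words[i + 1].lower()) in pairs:
            # A word already flagged by the previous bigram simply stays flagged.
            marked[i] = True
            marked[i + 1] = True
    return marked

words = ["yellow", "banana", "boat", "sinks"]
bigrams = [("yellow", "banana"), ("banana", "boat")]
print(mark_bigram_words(words, bigrams))
# [True, True, True, False]
```

With flags, overlapping bigrams such as "yellow banana" and "banana boat" merge into one highlighted run instead of competing with each other.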

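@not_a_robot He is probably looking for word bigrams, meaning two consecutive words. He is trying to highlight pairs of words if they are in a list of bigrams. This leads to overlap problems. –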

If I understand the question correctly, one solution is to look ahead to the next word token and check whether the bigram is in the list.

import re
import string

# The capture group keeps the separators (punctuation, space, newline) as tokens.
regex_pattern = re.compile("([%s \n])" % re.escape(string.punctuation))

def find_next_word(tokens, idx):
    """Return (word, index) of the first word token after idx, or (None, -1)."""
    nonword = string.punctuation + " \n"
    for i in range(idx + 1, len(tokens)):
        # Separators and empty strings are both "in" nonword, so they are skipped.
        if tokens[i] not in nonword:
            return (tokens[i], i)
    return (None, -1)

def highlighter(content, terms, bigrams):
    tokens = regex_pattern.split(content)
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        (next_word, nw_idx) = find_next_word(tokens, idx)
        if token.lower() in terms:
            # Single term: wrap it in asterisks.
            print('*' + token + '*', end="")
            idx += 1
        elif next_word and (token.lower(), next_word.lower()) in bigrams:
            # Bigram: wrap both words and everything between them in dashes.
            concat = "".join(tokens[idx:nw_idx + 1])
            print('-' + concat + '-', end="")
            idx = nw_idx + 1
        else:
            print(token, end="")
            idx += 1

terms = ['man', 'the']
bigrams = [('once', 'upon'), ('i', 'was')]
text = 'Once upon a time, as I was walking to the city, I met a man. As I was tired, I did not look once... upon this man.'
highlighter(text, terms, bigrams)

When called, this gives:

-Once upon- a time, as -I was- walking to *the* city, I met a *man*. As -I was- tired, I did not look -once... upon- this *man*.

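Please note:

This is a greedy algorithm: it matches the first bigram it finds. So, for example, if you check for yellow banana and banana boat, yellow banana boat is always highlighted as -yellow banana- boat. If you want different behavior, you should update the test logic.
You probably also want to update the logic to handle the case where a word is both in terms and the first part of a bigram. As written, the terms check runs first, so such a word is highlighted on its own and the bigram is never matched.
I have not tested all the edge cases; some things may break, and there may be fence-post errors.
If necessary, you can optimize performance by:
- building a list of the first words of the bigrams and checking whether a word is in it before doing the look-ahead to the next word
- and/or using the result of the look-ahead to process, in one step, all the non-word tokens between two words (implementing this step should be enough to ensure linear performance)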
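
The notes above could be sketched roughly as follows. This is one possible variant, not a tested replacement: the name `highlight`, the choice to give bigrams priority over single terms, and returning a string instead of printing are all assumptions made for illustration.

```python
import re
import string

SEPARATORS = string.punctuation + " \n"
regex_pattern = re.compile("([%s \n])" % re.escape(string.punctuation))

def highlight(content, terms, bigrams):
    """Return content with terms wrapped in *...* and bigrams in -...-."""
    terms = {t.lower() for t in terms}
    pairs = {(a.lower(), b.lower()) for a, b in bigrams}
    first_words = {a for a, _ in pairs}
    # Drop the empty strings re.split leaves between adjacent separators.
    tokens = [t for t in regex_pattern.split(content) if t]
    out = []
    idx = 0
    while idx < len(tokens):
        token = tokens[idx]
        low = token.lower()
        if low in first_words:
            # Look ahead only when this word can start a bigram.
            nxt = idx + 1
            while nxt < len(tokens) and tokens[nxt] in SEPARATORS:
                nxt += 1
            if nxt < len(tokens) and (low, tokens[nxt].lower()) in pairs:
                out.append("-" + "".join(tokens[idx:nxt + 1]) + "-")
                idx = nxt + 1
                continue
        # Assumed priority: a word is treated as a term only if no bigram matched.
        if low in terms:
            out.append("*" + token + "*")
        else:
            out.append(token)
        idx += 1
    return "".join(out)

print(highlight("yellow banana, yellow car", ["yellow"], [("yellow", "banana")]))
# -yellow banana-, *yellow* car
```

Using sets makes each membership test cheap, and the look-ahead runs only for words that can actually start a bigram. The matching is still greedy, so yellow banana boat would still come out as -yellow banana- boat.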

Hope this helps.

Source

2017-05-23 15:38:50 pills
