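Position of a substring in a string

I need to know all positions of a word in a text, i.e. of a substring in a string. My solution so far uses a regular expression, but I am not sure whether there is a better approach, possibly one built into the standard library. Any ideas?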

import re

text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = {'fox': [], 'dog': []}
# a word only counts if it is delimited by the start/end of the text
# or by a character other than a word character, '-' or '/'
re_capture = r"(^|[^\w\-/])(%s)([^\w\-/]|$)" % "|".join(links.keys())

for match in re.finditer(re_capture, text):
    # fix position by context: groups are e.g. (' ', 'fox', ' ')
    m_groups = match.groups()
    start, end = match.span()
    start = start + len(m_groups[0])
    end = end - len(m_groups[2])

    key = m_groups[1]
    links[key].append((start, end))

print(links)

{'fox': [(16, 19), (45, 48)], 'dog': [(40, 43)]}

Edit: partial words must not match; note that the fox in Redfox is not in links.

Thanks.

Source

2015-10-02 rebeling

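Duplicate of http://stackoverflow.com/questions/3437059/does-python-have-a-string-contains-substring-method

@RNar this is not a duplicate, because the OP is looking for *all* occurrences. – alfasin

Why is your regex so complicated? Also, re is part of the standard library, isn't it?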

If you want to match actual words and your strings contain only ASCII:

text = "fox The quick brown fox jumps over the fox! lazy dog. fox!."
links = {'fox': [], 'dog': []}

from string import punctuation
def yield_words(s, d):
    i = 0  # offset of the current token in s
    for ele in s.split(" "):
        tot = len(ele) + 1  # token length plus the separating space
        ele = ele.rstrip(punctuation)  # drops trailing punctuation only
        ln = len(ele)
        if ele in d:
            d[ele].append((i, ln + i))
        i += tot
    return d

Unlike the find-based solution, this does not match partial words, and it runs in O(n) time:

In [2]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox." 

In [3]: links = {'fox': [], 'dog': []} 

In [4]: yield_words(text,links) 
Out[4]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
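
One caveat with the split version: it strips trailing punctuation only, so a token such as "fox" in quotes or (dog) in parentheses is skipped. Below is a minimal sketch of a variant that strips both sides and shifts the start offset accordingly. The name word_positions and its signature are my own, not part of the answer, and it still assumes single-space separators and ASCII punctuation:

```python
from string import punctuation


def word_positions(text, words):
    """Return {word: [(start, end), ...]} for whole-token matches."""
    positions = {w: [] for w in words}
    offset = 0  # index of the current token within text
    for token in text.split(" "):
        left_stripped = token.lstrip(punctuation)
        lead = len(token) - len(left_stripped)  # punctuation removed on the left
        core = left_stripped.rstrip(punctuation)
        if core in positions:
            start = offset + lead
            positions[core].append((start, start + len(core)))
        offset += len(token) + 1  # +1 for the space consumed by split
    return positions


result = word_positions('The "fox" and (dog).', ["fox", "dog"])
```

Like the original, this is a single pass over the tokens, so it stays linear in the length of the text.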

This is probably a case where a regex is a good approach; it can just be a lot simpler:

import re

def reg_iter(s, d):
    r = re.compile("|".join([r"\b{}\b".format(w) for w in d]))
    for match in r.finditer(s):
        # append to the dict that was passed in, not to the global links
        d[match.group()].append((match.start(), match.end()))
    return d

Output:

In [6]: links = {'fox': [], 'dog': []} 

In [7]: text = "The quick brown fox jumps over the lazy dog. fox. Redfox." 

In [8]: reg_iter(text, links) 
Out[8]: {'dog': [(40, 43)], 'fox': [(16, 19), (45, 48)]}
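
Two things to watch if the keys come from arbitrary data: \b only treats \w characters as part of a word, so fox-trot and a/fox would match (the pattern in the question excludes them), and keys containing regex metacharacters need escaping. Here is a hedged sketch that keeps the question's separator set via lookarounds, so no position fix-up is needed. find_word_spans is a hypothetical name, not part of the answer above:

```python
import re


def find_word_spans(text, words):
    """Return {word: [(start, end), ...]}, rejecting partial words."""
    # Longest keys first, so that e.g. "foxes" is tried before "fox".
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    # The lookarounds assert the neighbours without consuming them, so
    # match.span() is already the span of the word itself.
    pattern = re.compile(r"(?<![\w\-/])(?:%s)(?![\w\-/])" % alternation)
    spans = {w: [] for w in words}
    for match in pattern.finditer(text):
        spans[match.group()].append(match.span())
    return spans


text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = find_word_spans(text, ["fox", "dog"])
```

Because the neighbours are not consumed, two adjacent occurrences such as "fox fox" are both found.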

Source

2015-10-02 23:21:41

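Your answer is my favorite so far. reg_iter is shorter and faster, and it handles an edge case I did not even mention in my question: when I processed a large amount of text containing German umlauts, your code just worked there as well. – rebeling

Rating and explanation will be added soon; maybe there is something else on the table that none of us has thought of. Thanks for your answer ;) – rebeling

@rebeling, no worries, glad it helped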
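
On the umlaut remark: assuming Python 3 and str patterns, \w and \b are Unicode-aware by default, so ü counts as a word character. A small sketch with made-up sample text that contrasts this with the re.ASCII flag:

```python
import re

text = "grün ist schön, hellgrün auch. grün!"

# Default (Unicode) matching: "ü" is a word character, so only the
# standalone occurrences of "grün" are reported.
spans = [m.span() for m in re.finditer(r"\bgrün\b", text)]

# With re.ASCII, "ü" is not a word character, so a boundary appears
# between "r" and "ü" and the fragment "gr" matches inside "grün".
unicode_hits = re.findall(r"\bgr\b", "grün")
ascii_hits = re.findall(r"\bgr\b", "grün", flags=re.ASCII)
```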

Not Pythonic, and without regex:

text = "The quick brown fox jumps over the lazy dog. fox."
links = {'fox': [], 'dog': []}

for key in links:
    pos = 0
    while True:
        pos = text.find(key, pos)
        if pos < 0:
            break
        links[key].append((pos, pos + len(key)))
        pos = pos + 1
print(links)
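
Note that the loop above also reports the fox inside Redfox. Here is a sketch of how the same str.find approach could reject partial words by checking the neighbouring characters. find_whole_words and the extra_word_chars parameter are illustrative names I made up, and the "-/" default simply mirrors the separator set from the question's regex:

```python
def find_whole_words(text, words, extra_word_chars="-/"):
    """Return {word: [(start, end), ...]} using str.find only."""

    def is_word_char(ch):
        # Characters that would make a hit part of a longer word.
        return ch.isalnum() or ch == "_" or ch in extra_word_chars

    spans = {w: [] for w in words}
    for word in words:
        pos = text.find(word)
        while pos != -1:
            end = pos + len(word)
            before_ok = pos == 0 or not is_word_char(text[pos - 1])
            after_ok = end == len(text) or not is_word_char(text[end])
            if before_ok and after_ok:
                spans[word].append((pos, end))
            pos = text.find(word, pos + 1)
    return spans


text = "The quick brown fox jumps over the lazy dog. fox. Redfox."
links = find_whole_words(text, ["fox", "dog"])
```

This scans the text once per key, so for many keys a single compiled regex is likely the better fit.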

Source

2015-10-02 23:00:49

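I like your code. Could you edit it so that the whole code block is indented by four spaces? Also, it would be great if you changed the loop over links to match normal dictionary handling.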

Partial words must not match; see Redfox. – rebeling

Your code does not work in my case; many conditions apply to a match. Thanks for your effort. – rebeling
