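Efficiently split a string using multiple separators and retain each separator?

I need to split a string of data using each character from string.punctuation and string.whitespace as a separator.

Furthermore, I need the separators to remain in the output list, in between the items they separated in the string.

For instance,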

"Now is the winter of our discontent"

should output:

['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

I don't see how to do this without resorting to an orgy of nested loops, which is unacceptably slow. How can I do it?

Source

2012-11-01 blz

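I'm guessing that, since you accepted DSM's answer, you intended consecutive punctuation to stay grouped together? – John

@johnthexiii, I accepted it because it didn't use 're'. The option to group consecutive separators is a nice bonus, but I'm sure it could just as easily be done with a regex. – blz

A different non-regex approach from the others: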

>>> import string 
>>> from itertools import groupby 
>>> 
>>> special = set(string.punctuation + string.whitespace) 
>>> s = "One two three tab\ttabandspace\t end" 
>>> 
>>> split_combined = [''.join(g) for k, g in groupby(s, lambda c: c in special)] 
>>> split_combined 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end'] 
>>> split_separated = [''.join(g) for k, g in groupby(s, lambda c: c if c in special else False)] 
>>> split_separated 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t', ' ', 'end']

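I guess you could use dict.fromkeys and .get instead of the lambda.

[Edit]

Some explanation:

groupby accepts two parameters, an iterable and an (optional) keyfunction. It loops through the iterable and groups the elements by the value of the keyfunction: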

>>> groupby("sentence", lambda c: c in 'nt') 
<itertools.groupby object at 0x9805af4> 
>>> [(k, list(g)) for k,g in groupby("sentence", lambda c: c in 'nt')] 
[(False, ['s', 'e']), (True, ['n', 't']), (False, ['e']), (True, ['n']), (False, ['c', 'e'])]

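where terms with contiguous values of the keyfunction are grouped together. (This is actually a common source of bugs: people forget that they have to sort by the keyfunc first if they want to group terms that might not be sequential.)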
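To illustrate the pitfall described above, here is a small sketch. The word list and the first_letter helper are made up for this example; the point is only that groupby starts a new group whenever the key changes, so equal keys that are not adjacent end up in separate groups unless the input is sorted by that key first.

```python
from itertools import groupby

# Hypothetical sample data, chosen so that equal keys are not adjacent.
words = ["apple", "bob", "avocado", "banana"]

def first_letter(word):
    """Key function for the example: group words by their first letter."""
    return word[0]

# Without sorting, every run of equal keys becomes its own group.
unsorted_groups = [(k, list(g)) for k, g in groupby(words, first_letter)]

# Sorting by the same key first makes equal keys adjacent (sorted() is stable).
sorted_words = sorted(words, key=first_letter)
sorted_groups = [(k, list(g)) for k, g in groupby(sorted_words, first_letter)]

print(unsorted_groups)
print(sorted_groups)
```

For the splitting problem in this question the contiguous behaviour is exactly what is wanted, so no sorting is needed there.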

As @JonClements guessed, what I had in mind was

>>> special = dict.fromkeys(string.punctuation + string.whitespace, True) 
>>> s = "One two three tab\ttabandspace\t end" 
>>> [''.join(g) for k,g in groupby(s, special.get)] 
['One', ' ', 'two', ' ', 'three', ' ', 'tab', '\t', 'tabandspace', '\t ', 'end']

for the case where we're combining the separators. .get returns None if the value isn't in the dict.

Source

2012-11-01 22:08:24 DSM

Or another option instead of the lambda (ugly though it is): 'groupby(s, special.__contains__)'... –

@JonClements: yeah, I think I'd use a dictionary before I'd use a special method. :^) – DSM

'partial(contains, special)' then? ;) –
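The two alternatives floated in these comments can be sketched as follows. This assumes that 'contains' in the last comment refers to operator.contains and 'partial' to functools.partial; the commenter did not spell that out.

```python
import string
from functools import partial
from itertools import groupby
from operator import contains

special = set(string.punctuation + string.whitespace)
s = "One two three tab\ttabandspace\t end"

# Bound method: special.__contains__(c) performs the same test as `c in special`.
via_dunder = [''.join(g) for k, g in groupby(s, special.__contains__)]

# operator.contains(a, b) evaluates `b in a`, so partial() fixes `special`
# as the container and leaves the character as the remaining argument.
via_partial = [''.join(g) for k, g in groupby(s, partial(contains, special))]

print(via_dunder)
print(via_partial)
```

Both produce the same combined-separator grouping as the lambda version in the answer.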

import re 
import string 

p = re.compile("[^{0}]+|[{0}]+".format(re.escape(
    string.punctuation + string.whitespace))) 

print(p.findall("Now is the winter of our discontent"))

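I'm not a big fan of using regex for every problem, but I don't think you have much choice here if you want it fast and short.

Since you're not familiar with regex, I'll explain it:

[...] means any one of the characters inside the square brackets
[^...] means any one character that is not inside the square brackets
+ after something means one or more of the preceding thing
x|y means match either x or y

So the regex matches one or more characters that are either all punctuation and whitespace or all neither. The findall method finds all non-overlapping matches of the pattern.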
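As a quick check of that explanation, here is a small sketch that runs the same pattern on a string with consecutive separators. The sample sentence is made up for the example, and the code is written for Python 3.

```python
import re
import string

# Escape the separators so characters such as ']' and '^' are safe inside [...].
separators = re.escape(string.punctuation + string.whitespace)

# Either a run of non-separators or a run of separators.
pattern = re.compile("[^{0}]+|[{0}]+".format(separators))

text = "Wait... what, now?"
tokens = pattern.findall(text)
print(tokens)

# Because every character falls into exactly one of the two alternatives,
# joining the tokens gives back the original string.
assert ''.join(tokens) == text
```

Note that with the trailing + on the second alternative, adjacent separators such as '... ' come back as a single token.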

Source

2012-11-01 21:56:33

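You may want to use 're.escape(string.punctuation + string.whitespace)'; otherwise I think your character class will end early at the ']'. –

I don't think it works for "..Now is the winter of our discontent" – John

@F.J Fixed. And "Now is the winter of our discontent" works for me. –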

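from functools import reduce  # reduce is not a builtin in Python 3
from string import punctuation, whitespace

s = "..test. and stuff"

# pad every punctuation character with spaces so that split() isolates it
f = lambda s, c: s + ' ' + c + ' ' if c in punctuation else s + c
# start from '' so that the first character of each word is handled by f too
l = sum([reduce(f, word, '').split() for word in s.split()], [])

print(l)

Source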

2012-11-01 21:57:07 John

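Try this:

import re
import string
re.split('(['+re.escape(string.punctuation + string.whitespace)+']+)',"Now is the winter of our discontent")

Explanation from the Python documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.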
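To make the quoted sentence concrete, here is a small sketch comparing the same split with and without the capturing parentheses. The sample text is made up for the example.

```python
import re
import string

# One or more consecutive separator characters.
separator_class = '[' + re.escape(string.punctuation + string.whitespace) + ']+'
text = "Now, is the winter"

# No capturing group: the separators are discarded.
without_group = re.split(separator_class, text)

# Capturing group around the same pattern: the separators are kept in the list.
with_group = re.split('(' + separator_class + ')', text)

print(without_group)
print(with_group)
```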

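Source

2012-11-01 21:58:23 Bula

Ugly behavior with consecutive whitespace: 're.split(r'( )', ' ' * 2)' yields '['', ' ', '', ' ', '']'. –

@F.J Consecutive whitespace/separators should be handled better now. – Bula

Depending on the text you are dealing with, you may be able to simplify your concept of separators to "anything other than letters and numbers". If that works for you, you can use the following regex solution: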

re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

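This assumes that you want to split on each individual separator character even when they occur consecutively, so 'foo..bar' would become ['foo', '.', '.', 'bar']. If you expect ['foo', '..', 'bar'] instead, use [a-zA-Z\d]+|[^a-zA-Z\d]+ (the only difference is the + added at the very end).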
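The difference between the two patterns can be seen side by side in this short sketch (the sample string extends the 'foo..bar' example with one more word):

```python
import re

text = "foo..bar baz"

# Second alternative matches exactly one non-alphanumeric character.
one_per_separator = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

# Trailing + lets the second alternative swallow a whole run of them.
grouped = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]+', text)

print(one_per_separator)
print(grouped)
```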

Source

2012-11-01 22:02:09

This won't work for characters outside the ASCII range. – DzinX
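One possible way around that limitation, as a sketch rather than anything proposed in the thread: in Python 3, \w in a str pattern is Unicode-aware by default, so [^\W_] can stand in for "letters and digits" (it is \w minus the underscore). This assumes Python 3 and that treating every other character, including the underscore, as a separator is acceptable.

```python
import re

text = "na\u00efve caf\u00e9"  # "naive cafe" written with accented characters

# The ASCII-only pattern breaks words apart at the accented letters.
ascii_only = re.findall(r'[a-zA-Z\d]+|[^a-zA-Z\d]', text)

# [^\W_] matches any Unicode letter or digit; [\W_] matches everything else.
unicode_aware = re.findall(r'[^\W_]+|[\W_]', text)

print(ascii_only)
print(unicode_aware)
```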

A solution in linear (O(n)) time:

Let's say you have a string:

original = "a, b...c d"

First, convert all the separators to spaces:

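import string

splitters = string.punctuation + string.whitespace
# str.maketrans is the Python 3 spelling; Python 2 used string.maketrans
trans = str.maketrans(splitters, ' ' * len(splitters))
s = original.translate(trans)

Now s == 'a  b   c d' (each separator has been replaced by one space, so the length is unchanged). Now you can use itertools.groupby to alternate between runs of spaces and runs of non-spaces: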

import itertools

result = []
position = 0
for _, letters in itertools.groupby(s, lambda c: c == ' '):
    # the run length tells us how far to advance in the original string
    letter_count = len(list(letters))
    result.append(original[position:position + letter_count])
    position += letter_count

Now result == ['a', ', ', 'b', '...', 'c', ' ', 'd'], which is what you need.
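Putting the two steps above together, a minimal Python 3 sketch might look like this. The function name split_keep_separators is made up for the example, and str.maketrans is used in place of the Python 2 string.maketrans.

```python
import itertools
import string

def split_keep_separators(original):
    """Split on punctuation/whitespace, keeping runs of separators as items."""
    splitters = string.punctuation + string.whitespace
    # Map every separator to a space; the translated string keeps its length,
    # so positions in `s` line up with positions in `original`.
    trans = str.maketrans(splitters, ' ' * len(splitters))
    s = original.translate(trans)

    result = []
    position = 0
    # groupby alternates between runs of spaces and runs of non-spaces.
    for _, run in itertools.groupby(s, lambda c: c == ' '):
        run_length = len(list(run))
        result.append(original[position:position + run_length])
        position += run_length
    return result

print(split_keep_separators("a, b...c d"))
```

Slicing from the original string rather than from the translated one is what preserves the actual separator characters.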

Source

2012-11-01 22:04:30 DzinX

My take:

from string import whitespace, punctuation 
import re 

pattern = re.escape(whitespace + punctuation) 
print(re.split('([' + pattern + '])', 'now is the winter of'))

Source

2012-11-01 22:07:01

+1, wrote exactly the same thing a minute later ;) – DzinX

Ugly behavior with consecutive separators: 're.split('([' + pattern + '])', '..')' results in '['', '.', '', '.', '']'. –
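If the empty strings noted in that comment are unwanted, one simple option (a sketch, not something the answer's author proposed) is to filter them out after splitting. The sample input is made up for the example.

```python
import re
from string import punctuation, whitespace

pattern = re.escape(whitespace + punctuation)

# Splitting on single separators yields '' between adjacent separators
# and at the start of the string when it begins with a separator.
raw = re.split('([' + pattern + '])', '..a. b')

# Keep only the non-empty pieces; separators themselves are never empty.
tokens = [piece for piece in raw if piece]

print(raw)
print(tokens)
```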

-1

from itertools import chain, cycle

s = "Now is the winter of our discontent"
words = s.split()

# zip stops when words runs out, so the endless cycle of spaces is safe here
wordsWithWhitespace = list(chain.from_iterable(zip(words, cycle([" "]))))
# result : ['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent', ' ']

Source

2012-11-01 22:07:39 lucasg

-1: only works with whitespace as the separator. – DzinX

For an arbitrary collection of separators:

def separate(myStr, seps):
    answer = []
    temp = []
    for char in myStr:
        if char in seps:
            # flush the text collected so far, then the separator itself
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer

In [4]: print(separate("Now is the winter of our discontent", set(' ')))
['Now', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

In [5]: print(separate("Now, really - it is the winter of our discontent", set(' ,-')))
['Now', ',', '', ' ', 'really', ' ', '', '-', '', ' ', 'it', ' ', 'is', ' ', 'the', ' ', 'winter', ' ', 'of', ' ', 'our', ' ', 'discontent']

Hope this helps

Source

2012-11-01 22:18:04 inspectorG4dget

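This may start getting slow when you use 'string.punctuation + string.whitespace' as the 'seps' argument: for every character, you are searching the separator list in linear time. – DzinX

Not if you pass them in as a 'set' – inspectorG4dget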
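One way to make that independent of what the caller passes in is to convert the separators to a set inside the function. The sketch below is a hypothetical variant of the answer's function (the name separate_with_set is made up), not the author's code.

```python
import string

def separate_with_set(text, seps):
    """Like separate(), but always tests membership against a frozenset."""
    # Average O(1) lookups, whatever iterable the caller passed in.
    seps = frozenset(seps)
    answer = []
    temp = []
    for char in text:
        if char in seps:
            # Flush the text collected so far, then the separator itself.
            answer.append(''.join(temp))
            answer.append(char)
            temp = []
        else:
            temp.append(char)
    answer.append(''.join(temp))
    return answer

# A plain string of separators is fine here; it is converted once up front.
tokens = separate_with_set("Now, is", string.punctuation + string.whitespace)
print(tokens)
```

As in the original function, adjacent separators still produce an empty string between them.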
