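I want to split a string on whitespace and punctuation, but the whitespace and punctuation should still be present in the result. In other words: split the string at whitespace, but do not remove it.
For example: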
Input: text = "This is a text; this is another text.,."
Output: ['This', ' ', 'is', ' ', 'a', ' ', 'text', '; ', 'this', ' ', 'is', ' ', 'another', ' ', 'text', '.,.']
This is what I am currently doing:
import string

def classify(b):
    """
    Classify a character.
    """
    separators = string.whitespace + string.punctuation
    if b in separators:
        return "separator"
    else:
        return "letter"

def tokenize(text):
    """
    Split strings to words, but do not remove white space.
    The input must be of type str, not bytes
    """
    if len(text) == 0:
        return []
    current_word = "" + text[0]
    # Classify the first character, not the whole string.
    previous_mode = classify(text[0])
    offset = 1
    results = []
    while offset < len(text):
        current_mode = classify(text[offset])
        if current_mode == previous_mode:
            # Same kind of character: extend the current token.
            current_word += text[offset]
        else:
            # Kind changed: close the current token and start a new one.
            results.append(current_word)
            current_word = text[offset]
            previous_mode = current_mode
        offset += 1
    results.append(current_word)
    return results
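It works, but it is so C-style. Is there a better way in Python?
@TigerhawkT3: This question is slightly more involved, because it splits on more than just whitespace. But at the same time it is only a variation, and I had completely forgotten about that answer. :-)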
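For what it's worth, the hand-written loop in the question is essentially grouping consecutive characters by their class, which is what `itertools.groupby` does. Below is a minimal sketch under that assumption; the names `SEPARATORS` and `tokenize_groupby` are made up for illustration and are not from the question, and it keeps the same separator definition (`string.whitespace + string.punctuation`).

```python
import string
from itertools import groupby

# Hypothetical helper names; the separator set mirrors the question's classify().
SEPARATORS = set(string.whitespace + string.punctuation)

def tokenize_groupby(text):
    """Split text into alternating runs of separators and non-separators.

    Nothing is dropped: joining the returned pieces gives back the input.
    """
    # groupby starts a new group each time the key (separator or not) changes.
    return ["".join(run) for _, run in groupby(text, key=lambda ch: ch in SEPARATORS)]

text = "This is a text; this is another text.,."
print(tokenize_groupby(text))
```

On the example input this should produce the same list as the output shown in the question, and an empty string yields an empty list without a special case. A regular expression with a capturing group passed to `re.split` is another common option, but note that `\W` and `string.punctuation` do not agree on every character (the underscore, for instance), so the results can differ slightly.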