2017-05-30 134 views
0

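re.split() special case for splitting comma-separated strings

I want to use Python's re.split() to split a sentence into multiple strings at the commas, but I don't want the split applied to a single word that is followed by a comma. For example: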

s = "Yes, alcohol can have a place in a healthy diet." 
desired result = ["Yes, alcohol can have a place in a healthy diet."] 

Another example:

s = "But, of course, excess alcohol is terribly harmful to health in a variety of ways, and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer." 
desired output = ["But, of course" , "excess alcohol is terribly harmful to health in a variety of ways" , "and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer."] 

Any pointers, please?

+0

What have you tried so far? – depperm

+1

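Maybe you should split on the commas and then recombine single words with the phrase that follows. Also, what if there are several such words in a row, e.g. "Hey, hey, hey, of course, yes..."? –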
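The split-then-recombine idea from this comment could be sketched roughly as follows. This is only an illustration, not code from the thread: the function name `split_then_merge` is hypothetical, a "single word" is assumed to mean a piece matching `\w+`, and merged pieces are rejoined with ", " (so the original spacing is normalized).

```python
import re

def split_then_merge(sentence):
    """Split on commas, then attach lone words to the phrase that follows.

    Hypothetical helper: a "single word" is assumed to be a piece that
    consists only of word characters (\\w+).
    """
    # Split on every comma, trim whitespace, and drop empty pieces.
    pieces = [p.strip() for p in sentence.split(",") if p.strip()]
    result = []
    pending = []  # single words waiting for the next multi-word phrase
    for piece in pieces:
        if re.fullmatch(r"\w+", piece):
            pending.append(piece)
        else:
            result.append(", ".join(pending + [piece]))
            pending = []
    if pending:  # trailing single words with no phrase after them
        result.append(", ".join(pending))
    return result

print(split_then_merge("Yes, alcohol can have a place in a healthy diet."))
print(split_then_merge("Hey, hey, hey, of course, yes..."))
```

Because lone words accumulate until a longer phrase arrives, a run such as "Hey, hey, hey, of course" stays together in one string.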

+0

@depperm, I tried things like sep = re.split('(?<!\d)[,](?!\d)', string) and a few others, but none of them seems to be bulletproof. –
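To see why that attempt falls short, here is a small check, assuming the intended pattern was `(?<!\d)[,](?!\d)` (a comma with no digit on either side). The variable names and the second sample string are made up for illustration.

```python
import re

# Assumed form of the attempted pattern: a comma that is neither
# preceded nor followed by a digit.
pattern = r"(?<!\d)[,](?!\d)"

s = "Yes, alcohol can have a place in a healthy diet."
attempt = re.split(pattern, s)
print(attempt)
# ['Yes', ' alcohol can have a place in a healthy diet.']

# The lookarounds only protect commas inside numbers such as "1,000".
numeric = re.split(pattern, "1,000 people, mostly adults")
print(numeric)
# ['1,000 people', ' mostly adults']
```

The pattern guards digit groupings but knows nothing about single words, so "Yes" is still split off on its own.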

Answers

1

Since Python's re module doesn't support variable-length lookbehind assertions in regular expressions, I would use re.findall() instead:

In [3]: re.findall(r"\s*((?:\w+,)?[^,]+)",s) 
Out[3]: 
['But, of course', 
'excess alcohol is terribly harmful to health in a variety of ways', 
'and even moderatealcohol intake is associated with an increase in the number two cause of premature death: cancer.'] 

Explanation:

\s*  # Match optional leading whitespace, don't capture that 
(   # Capture in group 1: 
(?:\w+,)? # optionally: A single "word", followed by a comma 
[^,]+  # and/or one or more characters except commas 
)   # End of group 1 
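For convenience, the same pattern can be written as a commented `re.VERBOSE` regex and run against both sentences from the question. This is a minimal sketch; the names `CLAUSE` and `find_clauses` are illustrative and not part of the answer above.

```python
import re

# Same pattern as in the answer, laid out with re.VERBOSE so the
# explanation lives next to each part.
CLAUSE = re.compile(
    r"""
    \s*            # optional leading whitespace, not captured
    (              # group 1: the text that findall() returns
      (?:\w+,)?    # optionally a single word followed by a comma
      [^,]+        # one or more characters other than commas
    )
    """,
    re.VERBOSE,
)

def find_clauses(sentence):
    """Return the comma-separated clauses, keeping one leading word attached."""
    return CLAUSE.findall(sentence)

s1 = "Yes, alcohol can have a place in a healthy diet."
s2 = ("But, of course, excess alcohol is terribly harmful to health in a "
      "variety of ways, and even moderate alcohol intake is associated with "
      "an increase in the number two cause of premature death: cancer.")

print(find_clauses(s1))
print(find_clauses(s2))
```

Note that the pattern attaches at most one leading word per clause, so a run of several single words ("Hey, hey, hey, of course") is paired up rather than merged into one string.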
+0

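One additional request: can we modify the regex so that it also meets the following requirement? ["Head and neck cancer, esophageal cancer, liver cancer, colon cancer, rectal cancer and breast cancer are all linked to alcohol consumption."] –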

+0

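Difficult. Does your input contain only a single sentence, or could there be several? If the latter, you should first use an NLP tool to split the input into individual sentences. Then I think this can be done. –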

+0

Yes, my input contains single sentences, because I'm already using NLP to split the large string into individual sentences. :) –