2017-09-16

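Split a string by commas, but with a condition (ignore single words separated by commas)

With the code below (a bit messy, I admit) I split a string on commas, but with a condition: it should not split where the commas only separate single words. For example, it does not split "Yup, there's a reason why you want to hit the sack just minutes after climax", but it does split "The increase in heart rate, which you get from masturbating, is directly beneficial to the circulation, and can reduce the likelihood of a heart attack" into ['The increase in heart rate', 'which you get from masturbating', 'is directly beneficial to the circulation', 'and can reduce the likelihood of a heart attack'].

The problem is that the code fails at its purpose when it meets a string like this one: "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow." I don't want it to split after oxytocin, but I do want it to split after prolactin. I need a regular expression that does this.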

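import re
from textblob import TextBlob

# Example input (the problem sentence quoted above); replace with your own text.
input_string = "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."

string = str(input_string)

listy = [x.strip() for x in string.split(',')]
listy = [x.replace('\n', '') for x in listy]
# Drop periods that are not decimal points.
listy = [re.sub(r'(?<!\d)\.(?!\d)', '', x) for x in listy]
# Remove any empty strings (a real list, so it can still be indexed in Python 3).
listy = [x for x in listy if x]

newstring = []

for segment in listy:
    wc = TextBlob(segment).word_counts

    if segment != listy[-1]:
        if len(wc) > 3:  # len(segment.split(' ')) > 7:
            # Long segment: mark a real split point.
            newstring.append(segment + "&&")
        else:
            # Short segment: keep the comma, do not split here.
            newstring.append(segment + ",")
    else:
        newstring.append(segment)

sep = [x.strip() for x in ' '.join(newstring).split('&&')]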

Answers

Consider the following:

import re

mystr = "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."

rExp = r",(?!\s+(?:and\s+)?\w+,)"
mylst = re.compile(rExp).split(mystr)
print(mylst)

This should give the following output:

['When men ejaculate', ' it releases a slew of chemicals including oxytocin, vasopressin, and prolactin', ' all of which naturally help you hit the pillow.'] 

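Let's look at how the string gets split.

,(?!\s+\w+,)

This matches every comma that is not followed ((?! starts a negative lookahead) by \s+\w+, that is, by whitespace, a single word, and another comma.
The pattern above fails on "vasopressin, and" because "and" is not followed by a comma. So an optional and\s+ is added inside the lookahead: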

,(?!\s+(?:and\s+)?\w+,) 
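
To make the difference between the two patterns concrete, here is a small sketch that runs both on the sentence from the question. The variable names (basic, with_and, and so on) are only illustrative, and the pieces are stripped for readability.

```python
import re

sentence = ("When men ejaculate, it releases a slew of chemicals including "
            "oxytocin, vasopressin, and prolactin, all of which naturally "
            "help you hit the pillow.")

# Split on a comma unless it is followed by whitespace, one word, and a comma.
basic = re.compile(r",(?!\s+\w+,)")
# Same idea, but an optional "and " may precede the single word.
with_and = re.compile(r",(?!\s+(?:and\s+)?\w+,)")

basic_parts = [p.strip() for p in basic.split(sentence)]
and_parts = [p.strip() for p in with_and.split(sentence)]

print(basic_parts)  # the list is still broken before "and prolactin"
print(and_parts)    # the list of three items stays in one piece
```

With the basic pattern the comma after "vasopressin" is still a split point, because the next word "and" is followed by a space rather than a comma. The optional group lets the lookahead skip over "and" before checking for the single word.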

Though I would probably use the following, which also allows "or" before the last item:

,(?!\s+(?:(?:and|or)\s+)?\w+,) 
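
As a rough illustration of why the (?:and|or) alternative can matter, the sketch below compares the "and"-only pattern with the "and|or" pattern on a made-up sentence whose list ends in "or". The sentence and the variable names are assumptions chosen for the example, not part of the question.

```python
import re

# Hypothetical sentence with an "or" list, used only for illustration.
text = "If you like, bring apples, pears, or plums, whichever you prefer."

and_only = re.compile(r",(?!\s+(?:and\s+)?\w+,)")
and_or = re.compile(r",(?!\s+(?:(?:and|or)\s+)?\w+,)")

and_only_parts = [p.strip() for p in and_only.split(text)]
and_or_parts = [p.strip() for p in and_or.split(text)]

print(and_only_parts)  # splits before "or plums"
print(and_or_parts)    # keeps "apples, pears, or plums" together
```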


In essence, consider replacing your line

listy = [x.strip() for x in string.split(',')]

with

listy = [x.strip() for x in re.split(r",(?!\s+(?:and\s+)?\w+,)", string)]
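
If the split is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the function name split_clauses and the test sentence are hypothetical, and it uses the "and|or" form of the pattern with a precompiled regex.

```python
import re

# Comma that is NOT followed by: whitespace, optional "and"/"or", one word, comma.
CLAUSE_COMMA = re.compile(r",(?!\s+(?:(?:and|or)\s+)?\w+,)")

def split_clauses(text):
    """Split text on clause-level commas, keeping comma-separated
    single-word lists intact. Empty pieces are dropped."""
    pieces = (piece.strip() for piece in CLAUSE_COMMA.split(text))
    return [piece for piece in pieces if piece]

# Hypothetical usage example.
print(split_clauses("Before the meeting, print the agenda, notes, and slides, then book a room."))
```

Note that the pattern only protects a comma when the word after it is itself immediately followed by a comma, so a single leading word such as "Yup," at the start of a sentence would still be split off.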
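
Although I believe the correct English usage is 'a, b and c' rather than 'a, b, and c'. So, if the text follows proper English, then just ',(?!\s+\w+,)' would work. – kaza

Sure, thank you for the detailed answer. Upvoting you.

Excellent answer.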