2012-03-10 22 views
3

Python: split on spaces, except between certain characters

I am parsing a file that has lines such as

 
type("book") title("golden apples") pages(10-35 70 200-234) comments("good read") 

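and I want to split each line into separate fields.

In my example there are four fields: type, title, pages, and comments.

The expected result after splitting is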

 
['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")'] 

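Obviously a simple string split won't work, because it splits at every space. I want to split on spaces but keep anything between parentheses and quotes intact.

How can I split this?

Answers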

8

This regex should work for you: \s+(?=[^()]*(?:\(|$))

result = re.split(r"\s+(?=[^()]*(?:\(|$))", subject) 
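To make the one-liner above concrete, here is a minimal, self-contained sketch that applies it to the sample line from the question. The names `FIELD_SEP`, `line`, and `fields` are illustrative choices and do not come from the original answer.

```python
import re

# Split on whitespace only when the text that follows reaches a "(" (or the
# end of the string) before hitting any other parenthesis, i.e. when we are
# not currently inside a (...) group.
FIELD_SEP = re.compile(r"\s+(?=[^()]*(?:\(|$))")

# Sample line taken from the question.
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

fields = FIELD_SEP.split(line)
for field in fields:
    print(field)
```

Note that this approach relies on the parentheses being balanced and not nested, which matches the format shown in the question.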

Explanation

r""" 
\s    # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks) 
    +    # Between one and unlimited times, as many times as possible, giving back as needed (greedy) 
(?=   # Assert that the regex below can be matched, starting at this position (positive lookahead) 
    [^()]   # Match a single character NOT present in the list “()” 
     *    # Between zero and unlimited times, as many times as possible, giving back as needed (greedy) 
    (?:    # Match the regular expression below 
        # Match either the regular expression below (attempting the next alternative only if this one fails) 
     \(   # Match the character “(” literally 
     |    # Or match regular expression number 2 below (the entire group fails if this one fails to match) 
     $    # Assert position at the end of a line (at the end of the string or before a line break character) 
    ) 
) 
""" 
+0

Nice, although it seems to add some extra parentheses to the returned list (I don't know where they come from). I'm using py3. – MxyL 2012-03-10 07:48:20

+2

Try this: 're.split(r"\s+(?=[^()]*(?:\(|$))", subject)' – San4ez 2012-03-10 07:50:06

+1

@Keikoku Fixed it; it was because of the capturing group. – 2012-03-10 07:51:13
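The capturing-group issue mentioned in these comments can be reproduced with a short sketch. `re.split` also returns the text of any capturing groups in the pattern, so a capturing `(...)` inside the lookahead leaks extra items into the result. The capturing variant below is an assumption about what the pre-fix pattern may have looked like; the thread does not show it.

```python
import re

line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# Hypothetical earlier version: the alternation is a capturing group, so
# re.split() adds whatever the group matched to the returned list.
with_group = re.split(r"\s+(?=[^()]*(\(|$))", line)

# Fixed version from the answer: (?:...) groups without capturing, so only
# the fields themselves are returned.
without_group = re.split(r"\s+(?=[^()]*(?:\(|$))", line)

print(with_group)
print(without_group)
```

Using `(?:...)` whenever a group is needed only for alternation or repetition avoids this side effect of `re.split`.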

1

I would try using a positive lookbehind assertion.

r'(?<=\))\s+' 

Example:

>>> import re 
>>> result = re.split(r'(?<=\))\s+', 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")') 
>>> result 
['type("book")', 'title("golden apples")', 'pages(10-35 70 200-234)', 'comments("good read")'] 
+1

This won't work if the input text contains no parentheses, e.g. 'test test test'. – 2012-03-10 07:58:53

+1

The question already defines the format; 'test test test' is not a possible input. – dave 2012-03-10 14:33:10

1

Split on ") " and add the ) back to every element except the last.
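A minimal sketch of that idea, with no regex involved. It assumes the two-character sequence ") " never occurs inside a field value (for example inside a quoted comment); the variable names are illustrative.

```python
# Sample line taken from the question.
line = 'type("book") title("golden apples") pages(10-35 70 200-234) comments("good read")'

# Splitting on ") " consumes the closing parenthesis of every field
# except the last one, which still ends with ")".
parts = line.split(") ")

# Restore the ")" on all pieces but the last.
fields = [part + ")" for part in parts[:-1]] + parts[-1:]

print(fields)
```

This is simpler than the regex answers, but more fragile: a comment such as comments("fine :) really") would be split in the wrong place.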