2017-08-17 80 views

I have wiki-like text such as the sample below. How can I keep the whitespace when using nestedExpr in pyparsing?

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 

"""

I want to extract the macros inside it. A macro is the text enclosed between {{ and }}.

So I tried using nestedExpr:

from pyparsing import * 
import pprint 

def getMacroCandidates(txt):

    candidates = []

    # Trimmed-down local copy of pyparsing's nestedExpr
    def nestedExpr(opener="(", closer=")", content=None, ignoreExpr=quotedString.copy()):
        if opener == closer:
            raise ValueError("opening and closing strings cannot be the same")
        if content is None:
            if isinstance(opener, str) and isinstance(closer, str):
                if ignoreExpr is not None:
                    content = (Combine(OneOrMore(~ignoreExpr +
                                                 ~Literal(opener) + ~Literal(closer) +
                                                 CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS, exact=1))
                                       ).setParseAction(lambda t: t[0]))
        ret = Forward()
        ret <<= Group(opener + ZeroOrMore(ignoreExpr | ret | content) + closer)

        ret.setName('nested %s%s expression' % (opener, closer))
        return ret

    # use {{ and }} as the macro delimiters
    macro = nestedExpr("{{", "}}")
    # print(((nestedItems+stringEnd).parseString(data).asList()))
    for toks, preloc, nextloc in macro.scanString(txt):
        print(toks)
    return candidates

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 
""" 

getMacroCandidates(data) 

This gives me the tokens, but with the whitespace stripped:

[['{{', 'hello', '}}']] 
[['{{', 'hello', 'world', '}}']] 
[['{{', 'hello', 'much', '{', '}}']] 
[['{{', 'a', ['{{', 'b', '}}'], '}}']] 
[['{{', 'a', 'td', '{', '}', ['{{', 'inner', '}}'], '}}']] 

Thank you in advance.


To get the original text of a parsed expression, you can use the 'originalTextFor' helper: 'macro = originalTextFor(nestedExpr("{{", "}}"))'. This preserves all whitespace, newlines, and so on. – PaulMcG
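The suggestion in the comment above can be sketched as follows. This is a minimal example, assuming pyparsing is installed and that its camelCase names (`nestedExpr`, `originalTextFor`, `scanString`) are available, as they are in the 2.x and 3.x releases. The variable name `macros` is only illustrative.

```python
from pyparsing import nestedExpr, originalTextFor

data = """
{{hello}}

{{hello world}}
{{hello much { }}
{{a {{b}}}}

{{a

td {

}
{{inner}}
}}
"""

# Wrap the nested expression so each match yields the matched source text
# (whitespace and newlines included) instead of a nested token list.
macro = originalTextFor(nestedExpr("{{", "}}"))

# scanString yields (tokens, start, end); tokens[0] is the original text.
macros = [tokens[0] for tokens, start, end in macro.scanString(data)]

for text in macros:
    print(repr(text))
```

Each top-level macro comes back as one string, so nested macros such as {{b}} stay embedded in their parent's text rather than being reported separately.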

Answer


You can use str.replace:

data = """ 
{{hello}} 

{{hello world}} 
{{hello much { }} 
{{a {{b}}}} 

{{a 

td { 

} 
{{inner}} 
}} 
""" 

import shlex 
data1= data.replace("{{",'"') 
data2 = data1.replace("}}",'"') 
data3= data2.replace("}"," ") 
data4= data3.replace("{"," ") 
data5= ' '.join(data4.split()) 
print(shlex.split(data5.replace("\n"," "))) 

Output

This returns all the tokens with the braces removed; the extra whitespace from line breaks and blank lines is removed as well.

['hello', 'hello world', 'hello much ', 'a b', 'a td inner '] 

PS: This could be written as a single expression; it is split into several statements for readability.
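For reference, the single-expression form mentioned in the PS might look like the sketch below. It uses only the standard library. The name `tokens` is illustrative, and the final replace of "\n" from the step-by-step version is dropped because split() followed by join() already collapses newlines.

```python
import shlex

data = """
{{hello}}

{{hello world}}
{{hello much { }}
{{a {{b}}}}

{{a

td {

}
{{inner}}
}}
"""

# Turn {{ and }} into double quotes, blank out stray single braces,
# collapse all runs of whitespace, then let shlex split on the quotes.
tokens = shlex.split(
    " ".join(
        data.replace("{{", '"')
            .replace("}}", '"')
            .replace("}", " ")
            .replace("{", " ")
            .split()
    )
)

print(tokens)
```

Note that this approach flattens nested macros into their parent and collapses the whitespace, so it does not keep the original text of each macro.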