If you have recursively nested expressions, you can split on the commas and validate that the expressions are balanced with pyparsing:
import pyparsing as pp

def CommaSplit(txt):
    ''' Replicate the function of str.split(',') but do not split on nested expressions or in quoted strings'''
    com_lok=[]
    comma = pp.Suppress(',')
    # note the location of each comma outside an ignored expression:
    comma.setParseAction(lambda s, lok, toks: com_lok.append(lok))
    ident = pp.Word(pp.alphas+"_", pp.alphanums+"_") # python identifier
    ex1=(ident+pp.nestedExpr(opener='<', closer='>')) # Ignore everything inside nested '< >'
    ex2=(ident+pp.nestedExpr()) # Ignore everything inside nested '( )'
    ex3=pp.Regex(r'("|\').*?\1') # Ignore everything inside "'" or '"'
    atom = ex1 | ex2 | ex3 | comma
    expr = pp.OneOrMore(atom) + pp.ZeroOrMore(comma + atom)
    try:
        result=expr.parseString(txt)
    except pp.ParseException:
        # unbalanced or unparsable input: return it unsplit
        return [txt]
    else:
        # slice the original text between the recorded comma positions
        return [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
tests='''\
obj<1, 2, 3>, x(4, 5), "msg, with comma"
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma"
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3>
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3>
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)
'''
for te in tests.splitlines():
    result=CommaSplit(te)
    print(te,'==>\n\t',result)
Prints:
obj<1, 2, 3>, x(4, 5), "msg, with comma" ==>
['obj<1, 2, 3>', ' x(4, 5)', ' "msg, with comma"']
nesteobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), "msg, with comma" ==>
['nesteobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', ' "msg, with comma"']
nestedobj<1, sub<6, 7>, 3>, nestedx(4, y(8, 9), 5), 'msg, with comma', additional<1, sub<6, 7>, 3> ==>
['nestedobj<1, sub<6, 7>, 3>', ' nestedx(4, y(8, 9), 5)', " 'msg, with comma'", ' additional<1, sub<6, 7>, 3>']
bare_comma<1, sub(6, 7), 3>, x(4, y(8, 9), 5), , 'msg, with comma', obj<1, sub<6, 7>, 3> ==>
['bare_comma<1, sub(6, 7), 3>', ' x(4, y(8, 9), 5)', ' ', " 'msg, with comma'", ' obj<1, sub<6, 7>, 3>']
bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3) ==>
["bad_close<1, sub<6, 7>, 3), x(4, y(8, 9), 5), 'msg, with comma', obj<1, sub<6, 7>, 3)"]
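The current behavior is just like '(something does not split), b, "in quotes", c'.split(','),
including keeping the leading spaces and the quotes. It is trivial to strip the quotes and leading spaces from the fields.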
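Change the else under try to the following (this assumes a boolean strip_fields argument has been added to the function signature):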
    else:
        rtr = [txt[st:end] for st,end in zip([0]+[e+1 for e in com_lok],com_lok+[len(txt)])]
        if strip_fields:
            rtr=[e.strip().strip('\'"') for e in rtr]
        return rtr
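Regular expressions won't help you in this case, since the language (i.e. the set of strings) you are trying to parse is not regular. Given that you allow arbitrarily nested tags, there is no easy way to regex your way out of this. –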
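To illustrate why nesting calls for a stack rather than a plain regular expression, here is a minimal sketch of a hand-written splitter in pure Python. The name split_top_level and its parameters are hypothetical; they come neither from the answer nor from pyparsing. Unlike CommaSplit, it raises ValueError on unbalanced input instead of returning the whole string, and it does not handle escaped quotes.

```python
def split_top_level(text, pairs=(("<", ">"), ("(", ")")), quotes="'\""):
    """Split text on commas that sit outside brackets and quoted strings.

    Hypothetical helper for illustration only.
    Raises ValueError on mismatched or unclosed brackets or quotes.
    """
    closer_for = dict(pairs)            # opener -> the closer it expects
    closers = set(closer_for.values())
    stack = []                          # closers we are still waiting for
    quote = None                        # quote character we are inside, if any
    fields, start = [], 0
    for i, ch in enumerate(text):
        if quote:                       # inside a string: only look for its end
            if ch == quote:
                quote = None
        elif ch in quotes:
            quote = ch
        elif ch in closer_for:
            stack.append(closer_for[ch])
        elif ch in closers:
            # a closer must match the most recently opened bracket
            if not stack or stack.pop() != ch:
                raise ValueError(f"unbalanced {ch!r} at position {i}")
        elif ch == "," and not stack:   # a comma at nesting depth zero
            fields.append(text[start:i])
            start = i + 1
    if stack or quote:
        raise ValueError("unclosed bracket or quote")
    fields.append(text[start:])
    return fields


print(split_top_level('obj<1, 2, 3>, x(4, 5), "msg, with comma"'))
```

The stack is the part a classic regular expression cannot express: the splitter has to remember an unbounded number of open brackets to know whether a given comma is at the top level.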
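Regex really can't handle this, and you wouldn't want it to. The complexity is at least linear, so you are bound to get better performance with a parser. You don't have to build it yourself, though. Python's 'csv' module does much of the work. –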
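As a rough check of how far the 'csv' module gets on this kind of input, the sketch below feeds single lines to csv.reader. The helper name csv_split is hypothetical. With the default dialect only double quotes protect commas, so nested '<>' and '()' groups and single-quoted strings are still split.

```python
import csv


def csv_split(line):
    """Split one line with the stdlib csv reader (hypothetical helper)."""
    # skipinitialspace=True lets a quote that follows ", " open a quoted field
    return next(csv.reader([line], skipinitialspace=True))


# Double-quoted fields survive, and the quotes and leading spaces are dropped.
print(csv_split('a, "msg, with comma", b'))

# Brackets mean nothing to csv, so the nested group is cut apart.
print(csv_split('obj<1, 2, 3>, x'))
```

So csv covers the quoted-string part of the problem, but the nesting still needs something like the pyparsing grammar in the answer.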
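Ah, don't say that regex can't handle it! Maybe the Python flavor can't, but other flavors like PCRE can do it! Here is a [proof](http://regex101.com/r/wU7lC9), and we could even get fancy and use recursive patterns to account for nested '<>()' – HamZa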