
This is my first attempt at using pyparsing, and I'm having a hard time setting it up. I want to use pyparsing to parse lexc files. The lexc format is used to declare lexicons that are compiled into finite-state transducers.

Special characters:

: divides 'upper' and 'lower' sides of a 'data' declaration 
; terminates entry 
# reserved LEXICON name. end-of-word or final state 
' ' (space) universal delimiter 
! introduces comment to the end of the line 
< introduces xfst-style regex 
> closes xfst-style regex 
% escape character: %: %; %# % %! %< %> %% 

There are multiple levels at which to parse.

Universally, anything from an unescaped ! to the end of the line is a comment. This can be handled separately at each level.

At the document level, there are three distinct sections:

Multichar_Symbols Optional one-time declaration 
LEXICON    Usually many of these 
END     Anything after this is ignored 

At the Multichar_Symbols level, anything delimited by whitespace is a declaration. The section ends at the first LEXICON declaration.

Multichar_Symbols the+first-one thesecond_one 
third_one ! comment that this one is special 
+Pl  ! plural 

At the LEXICON level, the name of the LEXICON is declared as:

LEXICON the_name ! whitespace delimited 

After the name declaration, a lexicon entry consists of: data continuation ;. The semicolon terminates the entry. data is optional.

At the data level, there are three possible forms:

  1. upper:lower

  2. simple (which is exploded into upper and lower as simple:simple)

  3. <xfst-style regex>

Examples:

! # is a reserved continuation that means "end of word". 
dog+Pl:dogs # ; ! upper:lower continuation ; 
cat # ;   ! automatically exploded to "cat:cat # ;" by interpreter 
Num ;   ! no data, only a continuation to LEXICON named "Num" 
<[1|2|3]+> # ; ! xfst-style regex enclosed in <> 

Everything after END is ignored.

A complete lexc file might look like this:

! Comments begin with ! 

! Multichar_Symbols (separated by whitespace, terminated by first declared LEXICON) 
Multichar_Symbols +A +N +V ! +A is adjectives, +N is nouns, +V is verbs 
+Adv ! This one is for adverbs 
+Punc ! punctuation 
! +Cmpar ! This is broken for now, so I commented it out. 

! The bulk of lexc is made of up LEXICONs, which contain entries that point to 
! other LEXICONs. "Root" is a reserved lexicon name, and the start state. 
! "#" is also a reserved lexicon name, and the end state. 

LEXICON Root ! Root is a reserved lexicon name, if it is not declared, then the first LEXICON is assumed to be the root 
big Adj ; ! This 
bigly Adv ; ! Not sure if this is a real word... 
dog Noun ; 
cat Noun ; 
crow Noun ; 
crow Verb ; 
Num ;  ! This continuation class generates numbers using xfst-style regex 

! NB all the following are reserved characters 

sour% cream Noun ; ! escaped space 
%: Punctuation ; ! escaped : 
%; Punctuation ; ! escaped ; 
%# Punctuation ; ! escaped # 
%! Punctuation ; ! escaped ! 
%% Punctuation ; ! escaped % 
%< Punctuation ; ! escaped < 
%> Punctuation ; ! escaped > 

%:%:%::%: # ; ! Should map ::: to : 

LEXICON Adj 
+A: # ;  ! # is a reserved lexicon name which means end-of-word (final state). 
! +Cmpar:er # ; ! Broken, so I commented it out. 

LEXICON Adv 
+Adv: # ; 

LEXICON Noun 
+N+Sg: # ; 
+N+Pl:s # ; 

LEXICON Num 
<[0|1|2|3|4|5|6|7|8|9]> Num ; ! This is an xfst regular expression and a cyclic continuation 
# ; ! After the first cycle, this makes sense, but as it is, this is bad. 

LEXICON Verb 
+V+Inf: # ; 
+V+Pres:s # ; 

LEXICON Punctuation 
+Punc: # ; 

END 

This text is ignored because it is after END 

So there are multiple distinct levels at which to parse. What is the best way to set this up in pyparsing? Are there examples of layered languages like this that I could follow as a model?

Answer


The strategy when using pyparsing is to break the parsing problem up into small pieces, and then compose them into larger ones.

Start with your first high-level structure definition:

Multichar_Symbols Optional one-time declaration 
LEXICON    Usually many of these 
END     Anything after this is ignored 

Your eventual overall parser will look like:

parser = (Optional(multichar_symbols_section)('multichar_symbols')
          + Group(OneOrMore(lexicon_section))('lexicons')
          + END)

The names in parentheses after each part attach results names, which make it easy to access the different parts of the parsed results.
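
As a quick aside (my own illustration, not part of the original answer), here is how results names behave on a toy expression:

from pyparsing import Word, alphas, nums

# attach results names with ('label'); they allow attribute-style access
expr = Word(alphas)('word') + Word(nums)('num')
result = expr.parseString('abc 123')
print(result.word, result.num)   # -> abc 123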

Digging into the details, let's look at how to define the parser for lexicon_section.

First, define the punctuation and special keywords:

COLON,SEMI = map(Suppress, ":;") 
HASH = Literal('#') 
LEXICON, END = map(Keyword, "LEXICON END".split()) 
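
A note on the design choice here, with a tiny sketch of my own: Suppress matches a token but drops it from the results, while Literal keeps it. That is why '#' (which we want to see in the output) is a Literal:

from pyparsing import Literal, Suppress, Word, alphas

entry = Word(alphas) + Suppress(':') + Word(alphas) + Literal('#')
print(entry.parseString('dog:dogs #'))   # -> ['dog', 'dogs', '#'] -- the ':' is dropped, the '#' is kept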

Your identifiers and values can contain '%'-escaped characters, so we need to build them up from pieces:

# use regex and Combine to handle % escapes
escaped_char = Regex(r'%.').setParseAction(lambda t: t[0][1])
ident_lit_part = Word(printables, excludeChars=':%;')
xfst_regex = Regex(r'<.*?>')
ident = Combine(OneOrMore(escaped_char | ident_lit_part | xfst_regex))
value_expr = ident()
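
A quick sanity check of these pieces, assuming the definitions above have been run (this demo is mine, not from the original answer):

print(ident.parseString('sour% cream'))  # -> ['sour cream'] -- the escaped space survives inside one identifier
print(ident.parseString('%:%:%:'))       # -> [':::'] -- each %: unescapes to :
print(ident.parseString('<[1|2|3]+>'))   # -> ['<[1|2|3]+>'] -- xfst-style regex passes through intact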

With these pieces, we can now define an individual lexicon declaration:

# handle the following lexicon declarations: 
# name ; 
# name:value ; 
# name value ; 
# name value # ; 
lexicon_decl = Group(ident("name")
                     + Optional(Optional(COLON)
                                + value_expr("value")
                                + Optional(HASH)('hash'))
                     + SEMI)

This part is a little messy. It turns out that value can come back as a string, as a result structure (a pyparsing ParseResults), or it may be missing entirely. We can use a parse action to normalize all of these forms to a single string form.

# use a parse action to normalize the parsed values 
def fixup_value(tokens):
    if 'value' in tokens[0]:
        # pyparsing makes this a nested element, just take the zero'th value
        if isinstance(tokens[0].value, ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if we had parsed 'name:name'
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

Now the value will be cleaned up at parse time, so no extra post-processing code is needed after calling parseString.
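
To see the normalization in action, here is a small check of my own against the three plain entry forms (assuming the definitions above):

for s in ['dog+Pl:dogs # ;', 'big Adj ;', 'Num ;']:
    decl = lexicon_decl.parseString(s)[0]
    print('%-16s -> name=%r value=%r hash=%r' % (s, decl.name, decl.value, decl.hash))

which should print:

dog+Pl:dogs # ;  -> name='dog+Pl' value='dogs' hash='#'
big Adj ;        -> name='big' value='Adj' hash=''
Num ;            -> name='Num' value='Num' hash=''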

We are finally ready to define a whole LEXICON section:

# TBD - make name optional, define as 'Root' 
lexicon_section = Group(LEXICON
                        + ident("name")
                        + ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))
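
As a quick check (mine, not the answer's), lexicon_section can already handle a small comment-free lexicon; comment handling comes next:

sample = """LEXICON Noun
+N+Sg: # ;
+N+Pl:s # ;"""
sec = lexicon_section.parseString(sample)[0]
print(sec.name)                        # -> Noun
for decl in sec.declarations:
    print(decl.name, '->', decl.value)  # -> '+N+Sg -> #' then '+N+Pl -> s'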

One last bit of housekeeping: we need to ignore comments. We can call ignore on the topmost parser expression, and comments will be ignored throughout the entire parser:

# ignore comments anywhere in our parser 
comment = '!' + Optional(restOfLine) 
parser.ignore(comment) 

Here is all the code in a single copy-pasteable section:

import pyparsing as pp 

# define punctuation and special words 
COLON,SEMI = map(pp.Suppress, ":;") 
HASH = pp.Literal('#') 
LEXICON, END = map(pp.Keyword, "LEXICON END".split()) 

# use regex and Combine to handle % escapes 
escaped_char = pp.Regex(r'%.').setParseAction(lambda t: t[0][1]) 
ident_lit_part = pp.Word(pp.printables, excludeChars=':%;') 
xfst_regex = pp.Regex(r'<.*?>') 
ident = pp.Combine(pp.OneOrMore(escaped_char | ident_lit_part | xfst_regex)) 
value_expr = ident() 


# handle the following lexicon declarations: 
# name ; 
# name:value ; 
# name value ; 
# name value # ; 
lexicon_decl = pp.Group(ident("name")
                        + pp.Optional(pp.Optional(COLON)
                                      + value_expr("value")
                                      + pp.Optional(HASH)('hash'))
                        + SEMI)

# use a parse action to normalize the parsed values 
def fixup_value(tokens):
    if 'value' in tokens[0]:
        # pyparsing makes this a nested element, just take the zero'th value
        if isinstance(tokens[0].value, pp.ParseResults):
            tokens[0]['value'] = tokens[0].value[0]
    else:
        # no value was given, expand 'name' as if we had parsed 'name:name'
        tokens[0]['value'] = tokens[0].name
lexicon_decl.setParseAction(fixup_value)

# define a whole LEXICON section 
# TBD - make name optional, define as 'Root' 
lexicon_section = pp.Group(LEXICON
                           + ident("name")
                           + pp.ZeroOrMore(lexicon_decl, stopOn=LEXICON | END)("declarations"))

# this part still TBD - just put in a placeholder for now 
multichar_symbols_section = pp.empty() 

# tie it all together 
parser = (pp.Optional(multichar_symbols_section)('multichar_symbols')
          + pp.Group(pp.OneOrMore(lexicon_section))('lexicons')
          + END)

# ignore comments anywhere in our parser 
comment = '!' + pp.Optional(pp.restOfLine) 
parser.ignore(comment) 
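
The Multichar_Symbols section is left as a placeholder above. As a rough sketch of one way it might be filled in (my own guess, not part of the original answer): the section is the Multichar_Symbols keyword followed by whitespace-delimited symbols, ending at the first LEXICON keyword. It would need to be defined before parser is built so that parser picks it up:

# hypothetical sketch, not from the original answer: whitespace-delimited
# symbols, terminated by the first LEXICON keyword
multichar_symbols_section = (pp.Keyword('Multichar_Symbols')
                             + pp.Group(pp.ZeroOrMore(ident, stopOn=LEXICON))('symbols'))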

Parsing the 'Root' sample you posted, we can use dump() to show the results:

# lexicon_sample is assumed to hold the "LEXICON Root ..." text from the question
result = lexicon_section.parseString(lexicon_sample)[0]
print(result.dump())

which gives this dump:

['LEXICON', 'Root', ['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']] 
- declarations: [['big', 'Adj'], ['bigly', 'Adv'], ['dog', 'Noun'], ['cat', 'Noun'], ['crow', 'Noun'], ['crow', 'Verb'], ['Num'], ['sour cream', 'Noun'], [':', 'Punctuation'], [';', 'Punctuation'], ['#', 'Punctuation'], ['!', 'Punctuation'], ['%', 'Punctuation'], ['<', 'Punctuation'], ['>', 'Punctuation'], [':::', ':', '#']] 
    [0]: 
    ['big', 'Adj'] 
    - name: 'big' 
    - value: 'Adj' 
    [1]: 
    ['bigly', 'Adv'] 
    - name: 'bigly' 
    - value: 'Adv' 
    [2]: 
    ['dog', 'Noun'] 
    - name: 'dog' 
    - value: 'Noun' 
    ... 
    [13]: 
    ['<', 'Punctuation'] 
    - name: '<' 
    - value: 'Punctuation' 
    [14]: 
    ['>', 'Punctuation'] 
    - name: '>' 
    - value: 'Punctuation' 
    [15]: 
    [':::', ':', '#'] 
    - hash: '#' 
    - name: ':::' 
    - value: ':' 
- name: 'Root' 

This code demonstrates how to iterate over the parts of the section and access the named results:

# try out a lexicon against the posted sample 
result = lexicon_section.parseString(lexicon_sample)[0] 
print(result.dump()) 

print('Name:', result.name) 
print('\nDeclarations') 
for decl in result.declarations: 
    print("{name} -> {value}".format_map(decl), "(END)" if decl.hash else '') 

and prints:

Name: Root 

Declarations 
big -> Adj 
bigly -> Adv 
dog -> Noun 
cat -> Noun 
crow -> Noun 
crow -> Verb 
Num -> Num 
sour cream -> Noun 
: -> Punctuation 
; -> Punctuation 
# -> Punctuation 
! -> Punctuation 
% -> Punctuation 
< -> Punctuation 
> -> Punctuation 
::: -> : (END) 

Hopefully this gives you enough to take it from here.


Wow! This is a more thorough answer than I expected! I'll have more time to look at it on Monday. Thank you! – reynoldsnlp


I don't understand what 'value_expr = ident()' is doing. What's the difference b/w 'ident' and 'value_expr'? They both seem to be objects of the same type. – reynoldsnlp


That's a fine distinction: 'value_expr = ident' would also work. The difference is that 'ident()' returns a copy of 'ident' (it is shorthand for 'value_expr = ident.copy()'), so if you later want to attach a parse action or other behavior to the copied expression, you can safely do so on 'value_expr' without affecting 'ident'. – PaulMcG
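
To illustrate that last point, a tiny sketch of my own:

import pyparsing as pp

base = pp.Word(pp.alphas)
shouting = base().setParseAction(lambda t: t[0].upper())   # base() is base.copy()
print(base.parseString('dog'))       # -> ['dog']  (the original is unchanged)
print(shouting.parseString('dog'))   # -> ['DOG']  (only the copy has the parse action)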