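How do I omit duplicates in pyparsing?

OK, I finally got my grammar to capture all of my test cases, but I have one duplicate (case 3) and one false positive (case 6, "PATTERN 5"). Here are my test cases and my desired output. I'm still pretty new to Python (though able to teach my kids! Scary!), so I'm sure there are obvious ways to solve this; I'm not even sure this is a pyparsing problem. Here is what my output looks like now: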

['01/01/01','S01-12345','20/111-22-1001',['GLEASON', ['5', '+', '4'], '=', '9']] 
['02/02/02','S02-1234','20/111-22-1002',['GLEASON', 'SCORE', ':', ['3', '+', '3'], '=', '6']] 
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'GRADE', ['4', '+', '3'], '=', '7']] 
['03/02/03','S03-1234','31/111-22-1003',['GLEASON', 'SCORE', ':', '7', '=', ['4', '+', '3']]] 
['04/17/04','S04-123','30/111-22-1004',['GLEASON', 'SCORE', ':', ['3', '+', '4', '-', '7']]] 
['05/28/05','S05-1234','20/111-22-1005',['GLEASON', 'SCORE', '7', '[', ['3', '+', '4'], ']']] 
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', ['4', '+', '3']]] 
['06/18/06','S06-10686','20/111-22-1006',['GLEASON', 'PATTERN', '5']] 
['07/22/07','S07-2749','20/111-22-1007',['GLEASON', 'SCORE', '6', '(', ['3', '+', '3'], ')']]

Here's the grammar:

from pyparsing import (Word, nums, Combine, Group, Optional, oneOf,
                       operatorPrecedence, opAssoc)

num = Word(nums)
arith_expr = operatorPrecedence(num,
    [
    (oneOf('-'), 1, opAssoc.RIGHT),
    (oneOf('* /'), 2, opAssoc.LEFT),
    (oneOf('+ -'), 2, opAssoc.LEFT),
    ])
accessionDate = Combine(num + "/" + num + "/" + num)("accDate")
accessionNumber = Combine("S" + num + "-" + num)("accNum")
patMedicalRecordNum = Combine(num + "/" + num + "-" + num + "-" + num)("patientNum")
# oneOf splits its argument on whitespace, so alternatives must be space-separated
score = (Optional(oneOf('( [')) +
         arith_expr('lhs') +
         Optional(oneOf(') ]')) +
         Optional(oneOf('= -')) +
         Optional(oneOf('( [')) +
         Optional(arith_expr('rhs')) +
         Optional(oneOf(') ]')))
gleason = Group("GLEASON" + Optional("SCORE") + Optional("GRADE") + Optional("PATTERN") + Optional(":") + score)
patientData = Group(accessionDate + accessionNumber + patMedicalRecordNum)
partMatch = patientData("patientData") | gleason("gleason")

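And the output function. As you can see, the output isn't as good as it looks; I'm just writing to a file and faking some of the syntax. I've been struggling with how to get at pyparsing's intermediate results so I can work with them. Should I just write this out and run a second script that finds the duplicates?

Update, based on Paul McGuire's answer. The output of this function gets me down to one row per entry, but now I'm losing pieces of the scores (each Gleason score is, conceptually, of the form primary + secondary = total. This is headed for a database, so pri, sec, and tot are separate PostgreSQL columns or, for the parser's output, comma-separated values).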

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
            writeOut(accumPatientData)
        accumPatientData = (match.patientData, [])
    elif match.gleason:
        accumPatientData[1].append(match.gleason)
if accumPatientData is not None:
    writeOut(accumPatientData)

So now the output looks like this:

01/01/01,S01-12345,20/111-22-1001,9 
02/02/02,S02-1234,20/111-22-1002,6 
03/02/03,S03-1234,31/111-22-1003,7,4+3 
04/17/04,S04-123,30/111-22-1004, 
05/28/05,S05-1234,20/111-22-1005,3+4 
06/18/06,S06-10686,20/111-22-1006,, 
07/22/07,S07-2749,20/111-22-1007,3+3

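I'd like to reach in there and grab some of those missing elements, rearrange them, figure out the ones that are missing, and put them all back. Something like this pseudocode: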

def diceGleason(glrhs,gllhs)
    if glrhs.len() == 0:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = pri + sec
        return [pri, sec, tot]
    elif glrhs.len() == 1:
        pri = gllhs[0]
        sec = gllhs[2]
        tot = glrhs
        return [pri, sec, tot]
    else:
        pri = glrhs[0]
        sec = glrhs[2]
        tot = gllhs
        return [pri, sec, tot]

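Update 2: OK, Paul is awesome, but I'm dumb. Having tried exactly what he said, I've tried several ways to get at pri, sec, and tot, but I keep failing. I keep getting errors like this: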

Traceback (most recent call last):
  File "Stage1.py", line 81, in <module>
    writeOut(accumPatientData)
  File "Stage1.py", line 47, in writeOut
    FOUT.write("{0.accDate},{0.accNum},{0.patientNum},{1.pri},{1.sec},{1.tot}\n".format(pd, gleasonList))
AttributeError: 'list' object has no attribute 'pri'

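These AttributeErrors are what I keep getting. Clearly, I don't understand what's going on in between (Paul, I have the book, I swear it is open in front of me, and I don't understand). Here is my script. Is something in the wrong place? Am I calling the results wrong?

Source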

2013-08-27 Niels

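The link to the "desired output" file seems to be broken. – Michael0x2a

Thanks, fixed! – Niels

If multiple Gleason scores are defined for a single patient data record, I don't see what you do with them. Do you just take the first one? The last one? Or are they redundant, so it doesn't matter which you choose? – PaulMcG

I didn't make a single change to your parser, but I made a few changes to your post-parsing code.

You aren't really getting "duplicates"; the problem is that you print out the current patient data every time you see a Gleason score, and some of your patient data records include multiple Gleason score entries. If I understand what you are trying to do, here is the pseudocode I would follow: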

accumulator = None
foreach match in (patientDataExpr | gleasonScoreExpr).searchString(source):

    if it's a patientDataExpr:
        if accumulator is not None:
            # we are starting a new patient data record, print out the previous one
            print out accumulated data
        initialize new accumulator with current match and empty list for gleason data

    else if it's a gleasonScoreExpr:
        add this expression into the current accumulator

# done with the for loop, do one last printout of the accumulated data
if accumulator is not None:
    print out accumulated data

This converts to Python pretty easily:

def printOut(patientDataTuple):
    pd, gleasonList = patientDataTuple
    print("['{0.accDate}','{0.accNum}','{0.patientNum}',{1}]".format(
        pd, ','.join(''.join(gl.rhs) for gl in gleasonList)))

accumPatientData = None
for match in partMatch.searchString(TEXT):
    if match.patientData:
        if accumPatientData is not None:
            # this is a new patient data, print out the accumulated
            # Gleason scores for the previous one
            printOut(accumPatientData)

        # start accumulating for a new patient data entry
        accumPatientData = (match.patientData, [])

    elif match.gleason:
        accumPatientData[1].append(match.gleason)
    #~ print match.dump()

if accumPatientData is not None:
    printOut(accumPatientData)

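I don't think I'm dumping out the Gleason data correctly, but you can tune it from here, I think.

EDIT:

You can attach diceGleason as a parse action to gleason and get this behavior: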
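The accumulate-then-flush loop does not depend on pyparsing, so it can be tried in isolation. Here is a minimal sketch that uses plain tuples in place of parse results. The `("patient", ...)` / `("gleason", ...)` tagging and the name `group_by_patient` are assumptions made for illustration, not something pyparsing produces:

```python
def group_by_patient(records):
    """Collapse a flat stream of tagged records into one entry per patient."""
    grouped = []
    current = None  # (patient, [scores]) for the patient being accumulated
    for kind, value in records:
        if kind == "patient":
            # A new patient starts: flush the previous one, if any.
            if current is not None:
                grouped.append(current)
            current = (value, [])
        elif kind == "gleason" and current is not None:
            # Scores seen before any patient record are skipped.
            current[1].append(value)
    # Flush the last patient after the loop ends.
    if current is not None:
        grouped.append(current)
    return grouped


# Hypothetical stand-in data: case 3 has two score mentions, case 4 has one.
records = [
    ("patient", "S03-1234"),
    ("gleason", "4+3=7"),
    ("gleason", "7=4+3"),
    ("patient", "S04-123"),
    ("gleason", "3+4-7"),
]
result = group_by_patient(records)
```

Each patient then appears once, with every score found for it collected in a list, so what to do with several scores (first, last, or all) becomes a decision for the output step.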

def diceGleasonParseAction(tokens):
    def diceGleason(glrhs, gllhs):
        if len(glrhs) == 0:
            pri = gllhs[0]
            sec = gllhs[2]
            #~ tot = pri + sec
            tot = str(int(pri)+int(sec))
            return [pri, sec, tot]
        elif len(glrhs) == 1:
            pri = gllhs[0]
            sec = gllhs[2]
            tot = glrhs
            return [pri, sec, tot]
        else:
            pri = glrhs[0]
            sec = glrhs[2]
            tot = gllhs
            return [pri, sec, tot]

    pri, sec, tot = diceGleason(tokens.gleason.rhs, tokens.gleason.lhs)

    # assign results names for later use
    tokens.gleason['pri'] = pri
    tokens.gleason['sec'] = sec
    tokens.gleason['tot'] = tot

gleason.setParseAction(diceGleasonParseAction)

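You just had a typo where you summed pri and sec to get tot: these are all strings, so you were adding '3' and '4' and getting '34'. Converting to ints was all that was needed. Otherwise, I kept diceGleason verbatim, internal to diceGleasonParseAction, to isolate your logic for inferring pri, sec, and tot from the mechanics of decorating the parsed tokens with new results names. Since the parse action does not return anything new, the tokens are updated in place and then carried along to be used later in your output method.

Source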
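The three-way inference can also be tried outside the parser. The sketch below is one possible variant and makes two assumptions: each side arrives as an empty value, a bare string such as '7', or a token sequence such as ['4', '+', '3'], and a mention with no "a + b" pair (like "PATTERN 5") should yield None. The names `dice_gleason` and `_as_tokens` are made up for this example:

```python
def _as_tokens(value):
    """Normalize a missing value, a bare string, or a token sequence to a list."""
    if not value:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def dice_gleason(lhs, rhs):
    """Return (pri, sec, tot) as strings, or None if no 'a + b' pair is present."""
    lhs, rhs = _as_tokens(lhs), _as_tokens(rhs)
    # Whichever side looks like [a, '+', b, ...] carries primary and secondary.
    for pair_side, other in ((lhs, rhs), (rhs, lhs)):
        if len(pair_side) >= 3 and pair_side[1] == "+":
            pri, sec = pair_side[0], pair_side[2]
            # Use an explicitly stated total if there is one, else compute it.
            # int() is required: '3' + '4' would concatenate to '34'.
            tot = other[0] if len(other) == 1 else str(int(pri) + int(sec))
            return pri, sec, tot
    return None


# Hypothetical inputs mirroring cases 1, 3, 4 and the "PATTERN 5" false positive.
scores = [
    dice_gleason(["5", "+", "4"], "9"),
    dice_gleason("7", ["4", "+", "3"]),
    dice_gleason(["3", "+", "4", "-", "7"], ""),
    dice_gleason("5", ""),
]
# Format one element at a time; a list itself has no pri/sec/tot attributes.
rows = ["{0},{1},{2}".format(*s) for s in scores if s is not None]
```

Formatting per element matters because a replacement field such as `{1.pri}` performs an attribute lookup on whatever object is passed in that position. If that object is a list of scores, the lookup raises AttributeError, so the output code has to pick or loop over individual scores first.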

2013-08-28 02:30:58 PaulMcG

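Paul, thank you again. You have probably spent more time teaching me than it would have taken you to do this yourself, so I doubly appreciate it. I tweaked it, but I'm still losing scores. == A == I think I'm sticking with lists instead of tuples: if you read the test cases, case 6 implies 4 + 3 = 7 and then says there is a tertiary (usually not reported) score of 5. So the 7 is never stated explicitly, which means the tuple would be one element short. == B == I still can't rearrange the ones where the primary and secondary scores appear on the right-hand side. – Niels

I'm marking this as accepted, but I hope you're willing to look at my last bit: I keep getting attribute errors. The files are at http://nielsolson.us/pastebin/ – Niels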
