2009-09-01

Matching multiple patterns in a multi-line string

I have some data that looks like this:

PMID- 19587274 
OWN - NLM 
DP - 2009 Jul 8 
TI - Domain general mechanisms of perceptual decision making in human cortex. 
PG - 8675-87 
AB - To successfully interact with objects in the environment, sensory evidence must 
     be continuously acquired, interpreted, and used to guide appropriate motor 
     responses. For example, when driving, a red 
AD - Perception and Cognition Laboratory, Department of Psychology, University of 
     California, San Diego, La Jolla, California 92093, USA. 

PMID- 19583148 
OWN - NLM 
DP - 2009 Jun 
TI - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic 
     amyloidosis. 
PG - 482-6 
AB - BACKGROUND: Amyloidosis represents a group of different diseases characterized by 
     extracellular accumulation of pathologic fibrillar proteins in various tissues 
AD - Asklepios Hospital, Department of Medicine, Langen, Germany. 
     [email protected] 

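I want to write a regex that matches the text following PMID, TI, and AB.

Is it possible to get all of these with a single regex?

I have spent nearly a whole day trying to work out a regex, and the closest I could get is: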

import re
reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

This returns a match only for the second "set" of data, not for all of them.

Any ideas? Thanks!

Answers

2

How about:

import re 
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI - (?P<title>.*?)^PG|AB - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL) 
for i in reg4.finditer(data): 
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': None} 
{'pmid': '19583148', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None} 

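Edit

Here it is as a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest expressions, but that's just my opinion!):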

#!/usr/bin/python 
import re 
reg4 = re.compile(r''' 
     ^     # Start of a line (due to re.MULTILINE, this may match at the start of any line) 
     (?:     # Non capturing group with multiple options, first option: 
      PMID-\s   # Literal "PMID-" followed by a space 
      (?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid' 
     |      # Next option: 
      TI\s{2}-\s  # "TI", two spaces, a hyphen and a space 
      (?P<title>.*?) # The title, a non greedy match that will capture everything up to... 
      ^PG    # The characters PG at the start of a line 
     |      # Next option 
      AB\s{2}-\s  # "AB - " 
      (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to... 
      ^AD    # "AD" at the start of a line 
     ) 
     ''', re.MULTILINE | re.DOTALL | re.VERBOSE) 
for i in reg4.finditer(data): 
    print(i.groupdict())

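Note that you can replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first non-space character at the start of a line).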
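A minimal sketch of that ^\S generalization, for illustration only. It uses a lookahead, (?=^\S), instead of consuming the next tag; that is a variation assumed here, not part of the pattern above, chosen so that a TI line directly followed by an AB line is not skipped. The names SAMPLE, FIELD_RE and extract_fields are hypothetical, the sample is abridged from the question's data, and \s+ is used so that either one or two spaces before the hyphen are accepted.

```python
import re

# Abridged sample; two-letter tags are padded with two spaces before the hyphen.
SAMPLE = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment, sensory evidence must\n"
    "      be continuously acquired, interpreted, and used to guide appropriate motor\n"
    "      responses.\n"
    "AD  - Perception and Cognition Laboratory\n"
)

FIELD_RE = re.compile(r"""
    ^(?:
        PMID-\s  (?P<pmid>[0-9]+)            # digits after "PMID- "
      | TI\s+-\s (?P<title>.*?)    (?=^\S)   # up to the next line that starts with a non-space
      | AB\s+-\s (?P<abstract>.*?) (?=^\S)   # same terminator for the abstract
    )
""", re.MULTILINE | re.DOTALL | re.VERBOSE)


def extract_fields(text):
    """Return (field_name, raw_value) pairs in the order they appear."""
    # Exactly one named group takes part in each match, so lastgroup names it.
    return [(m.lastgroup, m.group(m.lastgroup)) for m in FIELD_RE.finditer(text)]


pairs = extract_fields(SAMPLE)
for name, value in pairs:
    print(name, repr(value))
```

Because the terminator is "any line that does not start with whitespace", the last captured field must be followed by such a line (here the AD line); a blank line alone does not end it.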

Edit 2

If you want to catch the whole thing in one regex, get rid of the opening (?: and the closing ), and change the | characters to .*?:

#!/usr/bin/python 
import re 
reg4 = re.compile(r''' 
     ^    # Start of a line (due to re.MULTILINE, this may match at the start of any line) 
     PMID-\s   # Literal "PMID-" followed by a space 
     (?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid' 
     .*?    # Next part: 
     TI\s{2}-\s  # "TI", two spaces, a hyphen and a space 
     (?P<title>.*?) # The title, a non greedy match that will capture everything up to... 
     ^PG    # The characters PG at the start of a line 
     .*?    # Next part:
     AB\s{2}-\s  # "AB - " 
     (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to... 
     ^AD    # "AD" at the start of a line 
     ''', re.MULTILINE | re.DOTALL | re.VERBOSE) 
for i in reg4.finditer(data): 
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 
+0

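Just to add: one problem with your original regex was probably that the greedy '.*' patterns were too greedy, e-Jah. They matched too much and "ate" all of the records up to the last one as part of the greedy match, so you actually got the PMID of the first entry paired with the abstract/title of the last entry (all the other entries were swallowed by the first '.*' pattern in that first match). – Amber 2009-09-01 09:18:13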
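The effect described in this comment can be reproduced on a toy input. This is only a sketch: the two-record string and the names greedy and lazy are made up for illustration, and the patterns are cut down to two fields.

```python
import re

# Toy input with two records (hypothetical data, much shorter than the real thing).
text = "PMID- 1\nTI  - first\nPMID- 2\nTI  - second\n"

# Greedy ".*" runs to the end of the string, then backtracks to the LAST "TI".
greedy = re.compile(r"PMID- (?P<pmid>\d+).*TI\s+- (?P<title>[^\n]*)", re.S)
# Lazy ".*?" stops at the FIRST "TI" after each PMID.
lazy = re.compile(r"PMID- (?P<pmid>\d+).*?TI\s+- (?P<title>[^\n]*)", re.S)

greedy_hits = [m.groupdict() for m in greedy.finditer(text)]
lazy_hits = [m.groupdict() for m in lazy.finditer(text)]

print(greedy_hits)  # one match: first PMID paired with the last title
print(lazy_hits)    # two matches, one per record
```

The greedy version yields a single match that pairs the first PMID with the last title, which is the symptom reported in the question.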

0

The problem is the greedy quantifiers. Here is a regex that is more specific and non-greedy:

#!/usr/bin/python 
import re 
from pprint import pprint 
data = open("testdata.txt").read() 

reg4 = r'''
    ^PMID    # Start matching at the string PMID at the start of a line
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<pmid>[0-9]+) # Capture the field "pmid", accepting only numeric characters
    .*?TI    # Next, match any character up to the first occurrence of 'TI'
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<title>.*?)^PG # Capture the field "title", accepting any character up to the next 'PG' at the start of a line
    .*?AB    # Match any character up to the following occurrence of 'AB'
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<abstract>.*?)^AD # Capture the field "abstract", accepting any character up to the next 'AD' at the start of a line
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())

Output:

------------------------------------------------------------------------------ 
{'abstract': ' To successfully interact with objects in the environment, 
    sensory evidence must\n  be continuously acquired, interpreted, and 
    used to guide appropriate motor\n  responses. For example, when 
    driving, a red \n', 
'pmid': '19587274', 
'title': ' Domain general mechanisms of perceptual decision making in 
    human cortex.\n'} 
------------------------------------------------------------------------------ 
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different 
    diseases characterized by\n  extracellular accumulation of pathologic 
    fibrillar proteins in various tissues\n', 
'pmid': '19583148', 
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients 
    with hepatic\n  amyloidosis.\n'} 

You may want to strip the whitespace from each field after scanning.
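One possible way to do that clean-up, as a sketch. It goes a little further than a plain strip and also collapses the line breaks and indentation left inside wrapped fields; whether that is wanted is an assumption. The helper name clean_field is hypothetical, and the raw values are copied from the output above.

```python
def clean_field(value):
    """Trim the ends and collapse every internal run of whitespace to one space."""
    return " ".join(value.split())


# Raw values as captured by the regex (copied from the second record above).
raw = {
    "pmid": "19583148",
    "title": " Ursodeoxycholic acid for treatment of cholestasis in patients"
             " with hepatic\n  amyloidosis.\n",
}

cleaned = {key: clean_field(val) for key, val in raw.items()}
print(cleaned["title"])
```

If the original line breaks matter, value.strip() alone removes only the leading space and the trailing newline.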

+0

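Just one point: if the text "PG" appears in the title, or "AD" in the abstract, this regex will run into problems. Adding the '^' start-of-line anchor will solve this. – DrAl 2009-09-01 09:27:42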

+0

Thanks @Al. Fixed. – exhuma 2009-09-01 10:14:16

0

Another regex:

reg4 = r'(?<=PMID-)(?P<pmid>.*?)(?=OWN -).*?(?<=TI -)(?P<title>.*?)(?=PG -).*?(?<=AB -)(?P<abstract>.*?)(?=AD -)'

2

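How about not using a regex for this task at all, and instead writing procedural code that splits on newlines and looks at the prefix codes with .startswith() and the like? The code will be longer, but everyone will be able to understand it without reaching for the help.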
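A rough sketch of what such procedural code could look like. It assumes that a record starts at a "PMID-" line and ends at a blank line, and that wrapped lines begin with whitespace; the function name parse_records, the wanted parameter and the abridged SAMPLE are hypothetical.

```python
SAMPLE = """\
PMID- 19587274
OWN - NLM
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor
      responses. For example, when driving, a red

PMID- 19583148
OWN - NLM
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
PG  - 482-6
"""


def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Collect the wanted fields of each record into a dict, line by line."""
    records = []
    current = None  # record being filled, or None when outside a record
    tag = None      # tag of the field read most recently
    for line in text.splitlines():
        if not line.strip():
            current, tag = None, None  # a blank line ends the record
            continue
        if line[0].isspace():
            # Wrapped line: append it to the previous field if that field is wanted.
            if current is not None and tag in wanted:
                current[tag] += " " + line.strip()
            continue
        head, _, value = line.partition("-")  # split at the first hyphen
        tag = head.strip()
        if line.startswith("PMID-"):
            current = {}
            records.append(current)
        if current is not None and tag in wanted:
            current[tag] = value.strip()
    return records


records = parse_records(SAMPLE)
for record in records:
    print(record)
```

A field that is missing from a record is simply absent from its dict, so callers may prefer record.get("AB") over indexing.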

+0

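Having already answered this question with a very long regex, I have to agree with Pēteris Caune: '.startswith()'-style code may end up a little messy, but compared with the complexity the regex needs, it would be better. It is also easy to understand. You could also find some ready-made parsers online to do this job for you... – DrAl 2009-09-01 10:19:18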

0

If the order of the lines can change, you can use this pattern:

reg4 = re.compile(r""" 
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n 
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n 
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n 
    | .+\n 
    )+ 
""", re.MULTILINE | re.VERBOSE) 

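It matches a run of consecutive non-empty lines and captures the PMID, TI, and AB items.

An item's value can span multiple lines, as long as every line after the first starts with a whitespace character.

  • "[^\S\n]" matches any whitespace character (\s) except a newline (\n).
  • ".* (?:\n[^\S\n].*)*" matches the rest of a line plus any following lines that start with a whitespace character.
  • ".+\n" matches any other non-empty line.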
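A usage sketch for this pattern, under a few assumptions: the sample below is abridged from the question's data, the second record has AB placed before TI to exercise the order independence, and the names RECORD_RE, parse and SAMPLE are hypothetical. Wrapped values are joined into one line with " ".join(value.split()).

```python
import re

RECORD_RE = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n
    | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)

SAMPLE = """\
PMID- 19587274
OWN - NLM
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
      be continuously acquired, interpreted, and used to guide appropriate motor

PMID- 19583148
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
      extracellular accumulation of pathologic fibrillar proteins in various tissues
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
      amyloidosis.
"""


def parse(text):
    """One dict per block of non-empty lines, keeping only the fields that matched."""
    records = []
    for m in RECORD_RE.finditer(text):
        fields = {key: " ".join(val.split())
                  for key, val in m.groupdict().items() if val is not None}
        if fields:  # skip blocks that contain none of the three tags
            records.append(fields)
    return records


records = parse(SAMPLE)
for record in records:
    print(record)
```

Two details follow from the pattern as written: every line, including the last one, has to end with a newline, and a PMID line with trailing whitespace after the digits falls through to the generic ".+\n" branch, so its pmid is not captured.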