Matching multiple patterns in a multi-line string

I have some data that looks like this:

PMID- 19587274
OWN - NLM
DP  - 2009 Jul 8
TI  - Domain general mechanisms of perceptual decision making in human cortex.
PG  - 8675-87
AB  - To successfully interact with objects in the environment, sensory evidence must
     be continuously acquired, interpreted, and used to guide appropriate motor
     responses. For example, when driving, a red
AD  - Perception and Cognition Laboratory, Department of Psychology, University of
     California, San Diego, La Jolla, California 92093, USA.

PMID- 19583148
OWN - NLM
DP  - 2009 Jun
TI  - Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic
     amyloidosis.
PG  - 482-6
AB  - BACKGROUND: Amyloidosis represents a group of different diseases characterized by
     extracellular accumulation of pathologic fibrillar proteins in various tissues
AD  - Asklepios Hospital, Department of Medicine, Langen, Germany.
     [email protected]

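I want to write a regex that captures the text following PMID, TI and AB.

Is it possible to get all of these with a single regex, in one shot?

I have spent nearly a whole day trying to figure out a regex, and the closest I have got is: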

import re

reg4 = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG.*AB.*- (?P<abstract>.*)AD'
for i in re.finditer(reg4, data, re.S | re.M):
    print(i.groupdict())

This returns a match only for the second "set" of data, not for all of them.

Any ideas? Thanks!

Source

2009-09-01 e-Jah

How about:

import re 
reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI - (?P<title>.*?)^PG|AB - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL) 
for i in reg4.finditer(data): 
    print(i.groupdict())

Output:

{'pmid': '19587274', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': None, 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': None} 
{'pmid': '19583148', 'abstract': None, 'title': None} 
{'pmid': None, 'abstract': None, 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'} 
{'pmid': None, 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': None}

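Edit

Here it is as a verbose RE, to make it easier to understand (I think verbose REs should be used for anything but the simplest of expressions, but that's just my opinion!):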

#!/usr/bin/python 
import re 
reg4 = re.compile(r''' 
     ^     # Start of a line (due to re.MULTILINE, this may match at the start of any line) 
     (?:     # Non capturing group with multiple options, first option: 
      PMID-\s   # Literal "PMID-" followed by a space 
      (?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid' 
     |      # Next option: 
      TI\s{2}-\s  # "TI", two spaces, a hyphen and a space 
      (?P<title>.*?) # The title, a non greedy match that will capture everything up to... 
      ^PG    # The characters PG at the start of a line 
     |      # Next option 
      AB\s{2}-\s  # "AB - " 
      (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to... 
      ^AD    # "AD" at the start of a line 
     ) 
     ''', re.MULTILINE | re.DOTALL | re.VERBOSE) 
for i in reg4.finditer(data): 
    print(i.groupdict())

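Note that you could replace ^PG and ^AD with ^\S to make it more general (you want to match everything up to the first line that starts with a non-space character).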
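A minimal sketch of that generalisation, under stated assumptions: the sample text is a shortened, hypothetical record in the two-space tag layout, and the terminator is written as a lookahead, (?=^\S|\Z), rather than a plain ^\S. That is a deliberate variation on the suggestion above: a plain ^\S would consume the first character of the next line, so a tag directly following the title or abstract could be missed.

```python
import re

# Hypothetical, shortened sample in the same layout as the question's data.
data = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human\n"
    "      cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects in the environment.\n"
    "AD  - Perception and Cognition Laboratory.\n"
)

general = re.compile(r'''
    ^(?:
        PMID-\s(?P<pmid>[0-9]+)                  # the numeric id
      |
        TI\s{2}-\s(?P<title>.*?)(?=^\S|\Z)       # up to the next unindented line
      |
        AB\s{2}-\s(?P<abstract>.*?)(?=^\S|\Z)    # same terminator, no tag name needed
    )
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)

# Each match fills exactly one group, so merge the non-empty ones.
fields = {}
for m in general.finditer(data):
    for name, value in m.groupdict().items():
        if value is not None:
            fields[name] = value

print(fields)
```

Because the terminator no longer names PG or AD, the same pattern keeps working if other tags happen to follow the title or the abstract.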

Edit 2

If you want to catch the whole thing in one regex, get rid of the opening (?: and the closing ), and change each | character to .*?:

#!/usr/bin/python 
import re 
reg4 = re.compile(r''' 
     ^    # Start of a line (due to re.MULTILINE, this may match at the start of any line) 
     PMID-\s   # Literal "PMID-" followed by a space 
     (?P<pmid>[0-9]+) # Then a string of one or more digits, group as 'pmid' 
     .*?    # Next part: 
     TI\s{2}-\s  # "TI", two spaces, a hyphen and a space 
     (?P<title>.*?) # The title, a non greedy match that will capture everything up to... 
     ^PG    # The characters PG at the start of a line 
     .*?    # Next part:
     AB\s{2}-\s  # "AB - " 
     (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to... 
     ^AD    # "AD" at the start of a line 
     ''', re.MULTILINE | re.DOTALL | re.VERBOSE) 
for i in reg4.finditer(data): 
    print(i.groupdict())

This gives:

{'pmid': '19587274', 'abstract': 'To successfully interact with objects in the environment, sensory evidence must\n  be continuously acquired, interpreted, and used to guide appropriate motor\n  responses. For example, when driving, a red \n', 'title': 'Domain general mechanisms of perceptual decision making in human cortex.\n'} 
{'pmid': '19583148', 'abstract': 'BACKGROUND: Amyloidosis represents a group of different diseases characterized by\n  extracellular accumulation of pathologic fibrillar proteins in various tissues\n', 'title': 'Ursodeoxycholic acid for treatment of cholestasis in patients with hepatic\n  amyloidosis.\n'}

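Source

2009-09-01 09:14:15 DrAl

Just to add, e-Jah: one of the problems in your original regex was probably the overly greedy '.*' patterns. They matched too much and "ate" everything up to the last record as part of the greedy match, so you actually got the PMID of the first entry matched with the abstract/title of the last entry (and all the other entries were eaten by the first '.*' pattern in the first match). – Amber 2009-09-01 09:18:13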
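The effect described in this comment can be reproduced on a tiny sample. The sketch below is illustrative only: the two-record data is made up, the greedy pattern is a shortened form of the one in the question, and the lazy pattern is a simplified stand-in rather than any answerer's exact regex.

```python
import re

# Hypothetical two-record sample, reduced to the PMID, TI and PG lines.
data = (
    "PMID- 1\n"
    "TI  - first title\n"
    "PG  - 1-2\n"
    "\n"
    "PMID- 2\n"
    "TI  - second title\n"
    "PG  - 3-4\n"
)

# Greedy: each '.*' runs to the end of the data and backtracks as little as possible.
greedy = r'PMID- (?P<pmid>[0-9]*).*TI.*- (?P<title>.*)PG'
# Lazy: each '.*?' stops at the first place where the rest of the pattern fits.
lazy = r'PMID- (?P<pmid>[0-9]+).*?TI\s+- (?P<title>.*?)^PG'

greedy_hits = [m.groupdict() for m in re.finditer(greedy, data, re.S | re.M)]
lazy_hits = [m.groupdict() for m in re.finditer(lazy, data, re.S | re.M)]

print(greedy_hits)  # one match: first PMID paired with the last title
print(lazy_hits)    # one match per record
```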

The problem is the greedy qualifiers. Here is a regex that is more specific and non-greedy:

#!/usr/bin/python 
import re 
from pprint import pprint 
data = open("testdata.txt").read() 

reg4 = r'''
    ^PMID    # Start matching at the string PMID
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<pmid>[0-9]+) # Capture the field "pmid", accepting only numeric characters
    .*?TI    # Next, match any character up to the first occurrence of 'TI'
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<title>.*?)PG # Capture the field "title", accepting any character up to the next occurrence of 'PG'
    .*?AB    # Match any character up to the following occurrence of 'AB'
    \s*?-    # As little whitespace as possible up to the next '-'
    \s*?    # As little whitespace as possible
    (?P<abstract>.*?)AD # Capture the field "abstract", accepting any character up to the next occurrence of 'AD'
'''
for i in re.finditer(reg4, data, re.S | re.M | re.VERBOSE):
    print(78 * "-")
    pprint(i.groupdict())

Output:

------------------------------------------------------------------------------ 
{'abstract': ' To successfully interact with objects in the environment, 
    sensory evidence must\n  be continuously acquired, interpreted, and 
    used to guide appropriate motor\n  responses. For example, when 
    driving, a red \n', 
'pmid': '19587274', 
'title': ' Domain general mechanisms of perceptual decision making in 
    human cortex.\n'} 
------------------------------------------------------------------------------ 
{'abstract': ' BACKGROUND: Amyloidosis represents a group of different 
    diseases characterized by\n  extracellular accumulation of pathologic 
    fibrillar proteins in various tissues\n', 
'pmid': '19583148', 
'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients 
    with hepatic\n  amyloidosis.\n'}

You may want to strip the surrounding whitespace from each field after scanning.
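One possible way to do that clean-up is sketched below. The helper name clean_field is hypothetical, and it goes a little further than a plain strip(): it also collapses the embedded newlines and indentation of wrapped lines into single spaces, which may or may not be what you want.

```python
def clean_field(value):
    """Trim a captured field and collapse internal whitespace runs to one space."""
    # str.split() with no arguments splits on any whitespace, including newlines,
    # and discards leading/trailing whitespace.
    return " ".join(value.split())


# A captured record shaped like the output shown above.
raw = {
    'pmid': '19583148',
    'title': ' Ursodeoxycholic acid for treatment of cholestasis in patients '
             'with hepatic\n  amyloidosis.\n',
}

cleaned = {name: clean_field(value) for name, value in raw.items()}
print(cleaned['title'])
```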

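Source

2009-09-01 09:22:03 exhuma

Just one thing: this regex will have problems if the text 'PG' appears in the title, or 'AD' in the abstract. Adding the '^' start-of-line qualifier will fix that. – DrAl 2009-09-01 09:27:42

Thanks @Al. Fixed. – exhuma 2009-09-01 10:14:16

Another regex: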

reg4 = r'(?<=PMID-)(?P<pmid>.*?)(?=OWN -).*?(?<=TI -)(?P<title>.*?)(?=PG -).*?(?<=AB -)(?P<abstract>.*?)(?=AD -)'

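Source

2009-09-01 09:33:01

How about not using regular expressions for this task at all, and instead writing procedural code that splits the data on newlines and looks at the prefix codes with .startswith() and the like? The code will be longer, but everyone will be able to understand it without reaching for the help.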
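A rough sketch of what such procedural code might look like. The function name parse_records, its wanted parameter, and the shortened sample are all hypothetical; the sketch assumes that records are separated by blank lines and that continuation lines start with whitespace, as in the question's data.

```python
def parse_records(text, wanted=("PMID", "TI", "AB")):
    """Split tagged text into one dict per record, keeping only the wanted tags."""
    records = []
    current = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current record.
            if current:
                records.append(current)
            current, tag = {}, None
        elif line.startswith((" ", "\t")):
            # An indented line continues the previous field, if we are keeping it.
            if tag in current:
                current[tag] += " " + line.strip()
        else:
            # "TAG - value": split at the first hyphen only.
            tag, _, value = line.partition("-")
            tag = tag.strip()
            if tag in wanted:
                current[tag] = value.strip()
    if current:
        records.append(current)
    return records


sample = (
    "PMID- 19587274\n"
    "OWN - NLM\n"
    "TI  - Domain general mechanisms of perceptual decision making in human\n"
    "      cortex.\n"
    "PG  - 8675-87\n"
    "AB  - To successfully interact with objects.\n"
    "\n"
    "PMID- 19583148\n"
    "TI  - Ursodeoxycholic acid.\n"
)

records = parse_records(sample)
for record in records:
    print(record)
```

Unlike the single-regex approaches, this does not depend on the order of the tags and it tolerates records in which one of the fields is missing.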

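Source

2009-09-01 10:02:20

Having already answered this with a very long regex, I have to agree with Pēteris Caune: the '.startswith()' style of code may end up a little messy, but it will still be better than the complexity the regex requires. It is also easy to understand. You could also find some ready-made parsers on the web to do the job for you... – DrAl 2009-09-01 10:19:18

If the order of the lines can change, you could use this pattern: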

reg4 = re.compile(r""" 
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n 
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n 
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n 
    | .+\n 
    )+ 
""", re.MULTILINE | re.VERBOSE)

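It will match a run of consecutive non-empty lines and capture the PMID, TI and AB items.

Item values can span multiple lines, as long as the lines after the first one start with a whitespace character.

"[^\S\n]" matches any whitespace character (\s) except a newline (\n).
".* (?:\n[^\S\n].*)*" matches the rest of the current line plus any following lines that start with a whitespace character.
".+\n" matches any other non-empty line.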
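For illustration, here is how the pattern above might be applied. The sample text is hypothetical, with the tags deliberately in a different order in each record; note that the pattern expects every line, including the last one, to end with a newline, and that there is no trailing whitespace after the PMID digits.

```python
import re

reg4 = re.compile(r"""
    ^
    (?: PMID \s*-\s* (?P<pmid> [0-9]+) \n
    | TI \s*-\s* (?P<title> .* (?:\n[^\S\n].*)*) \n
    | AB \s*-\s* (?P<abstract> .* (?:\n[^\S\n].*)*) \n
    | .+\n
    )+
""", re.MULTILINE | re.VERBOSE)

# Hypothetical sample: two records, fields in a different order in each.
sample = (
    "TI  - A title that wraps\n"
    "      onto a second line.\n"
    "PMID- 123\n"
    "AB  - Short abstract.\n"
    "\n"
    "PMID- 456\n"
    "OWN - NLM\n"
    "AB  - Another abstract.\n"
    "TI  - Another title.\n"
)

# One match per block of consecutive non-empty lines; each named group keeps
# the value from the last line in the block that matched its alternative.
hits = [m.groupdict() for m in reg4.finditer(sample)]
for hit in hits:
    print(hit)
```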

Source

2009-09-01 10:23:10
