2016-09-28 130 views
2

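Regex to extract the title from text

Can anyone help with a regular expression that extracts the text following the phrase "Title:" from the text below? (The parts to extract were shown in bold in the original post to make them clear.)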

Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s): 



Effective date: 7/1/07 

Title: 

2003247 

or previous effective dates) 



Title: 

ST2 Assay for Chronic Heart Failure 

Description/Background 

Heart Failure 

HF is one among many cardiovascular diseases that comprises a major cause of morbidity 
and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .

I am using the regex: (?:Title: \n+(.*))|(?:Title:\n+(.*))|(?<=Title:)(.*)(?=Procedure)

However, it does not seem to capture the titles correctly! I am using Python 2.7.12.

+1

It looks like [the regex matches what you need](https://regex101.com/r/tWSH05/1). –

+0

A bit more context (more code) would help. – holdenweb

+0

How are you retrieving the matches? – Laurel

Answers

0

I suggest using

Title:\s*(.*?)\s*Procedure|Title:\s*(.*) 

regex demo

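Details

  • Title: - the literal text Title:
  • \s* - zero or more whitespace characters (line breaks included)
  • (.*?) - Group 1: zero or more characters other than line break characters, as few as possible, up to the first occurrence of the subpattern that follows
  • \s*Procedure - zero or more whitespace characters followed by the string Procedure
  • | - or
  • Title:\s* - the string Title: followed by zero or more whitespace characters
  • (.*) - Group 2: zero or more characters other than line break characters, as many as possible (the rest of the line).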
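The two alternatives described above can be tried separately to see why each is needed. This is only a minimal sketch on two fragments of the sample text; the variable names next_line_title and same_line_title are chosen here for illustration and are not part of the original answer.

```python
import re

# \s* crosses the blank lines after "Title:", while .* stops at the end of the line,
# so the second alternative picks up a title that sits on a later line.
m = re.search(r"Title:\s*(.*)",
              "Title:\n\nST2 Assay for Chronic Heart Failure\n\nDescription/Background")
next_line_title = m.group(1)

# The lazy (.*?) followed by \s*Procedure stops before the word "Procedure"
# and leaves the separating space out of the group.
m = re.search(r"Title:\s*(.*?)\s*Procedure",
              "Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):")
same_line_title = m.group(1)

print(next_line_title)
print(same_line_title)
```

The first alternative has to come first in the combined pattern: if it were second, the greedy (.*) would already consume "Procedure Code(s):" on the first line.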

Python code

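# -*- coding: utf-8 -*-
# The encoding line above is needed on Python 2 because test_str contains non-ASCII quotes.
import re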
regex = r"Title:\s*(.*?)\s*Procedure|Title:\s*(.*)" 
test_str = ("Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n" 
    "Effective date: 7/1/07\n\n" 
    "Title:\n\n" 
    "2003247\n\n" 
    "or previous effective dates)\n\n" 
    "Title:\n\n" 
    "ST2 Assay for Chronic Heart Failure\n\n" 
    "Description/Background\n\n" 
    "Heart Failure\n\n" 
    "HF is one among many cardiovascular diseases that comprises a major cause of morbidity and mortality worldwide. The term “heart failure” (HF) refers to a complex clinical syndrome .") 
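# Collect one title per match: group 1 is set when the first alternative
# (the one ending in "Procedure") matched, otherwise group 2 holds the rest of the line.
res = []
for m in re.finditer(regex, test_str):
    if m.group(1):
        res.append(m.group(1))
    else:
        res.append(m.group(2))
print(res)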
# => ['Anorectal Fistula (Fistula-in-Ano)', '2003247', 'ST2 Assay for Chronic Heart Failure']
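
On the question raised in the comments about how the matches are retrieved: the pattern from the question has three capturing groups, so re.findall returns a 3-tuple per match with empty strings for the groups that did not take part. That may be why the result looked wrong, although the question does not show the retrieval code, so this is a guess. The sketch below uses a shortened copy of the sample text (an assumption made to keep it brief) and joins each tuple into a single title.

```python
import re

# The pattern from the question, unchanged: three alternatives, three groups.
pattern = r"(?:Title: \n+(.*))|(?:Title:\n+(.*))|(?<=Title:)(.*)(?=Procedure)"

# Shortened sample text, used here only for illustration.
text = (
    "Title: Anorectal Fistula (Fistula-in-Ano) Procedure Code(s):\n\n"
    "Title:\n\n"
    "2003247\n\n"
    "Title:\n\n"
    "ST2 Assay for Chronic Heart Failure\n\n"
)

# Each item is a tuple of three strings; only one of them is non-empty.
raw = re.findall(pattern, text)

# Join the groups and strip the spaces that the third alternative leaves
# around the title on the "Procedure" line.
titles = ["".join(groups).strip() for groups in raw]
print(titles)
```

With the suggested pattern the same issue exists in a milder form (two groups instead of three), which is why the code above checks m.group(1) before falling back to m.group(2).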