2016-03-13 27 views
1

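I want to process the text below with a multi-line regular expression. For each course, I want the data that starts at the credits line and ends at the season and year. For the first class, it would look like this: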

3 credits in Philosophical Perspectives 
PHIL 101L 
PHILOSOPHICAL PERSPECTIVES 
B 
3 
Fall 2014 

I also need to get the classes they still need. Notice that they are missing 3 credits in History. Here is my text:

3 credits in Philosophical Perspectives 
PHIL 101L 
PHILOSOPHICAL PERSPECTIVES 
B 
3 
Fall 2014 
Student View 
3 credits in Fine Arts 
ART 160L 
HIST WEST ART I 
B+ 
3 
Fall 2014 
3 credits in History 
Still Needed: 
Click here to see classes that satisfy this requirement. 
3 credits in Literature 
ENG 201L 
INTRO LINGUISTIC 
IP 
(3) 
Spring 2016 
3 credits in Math 
Still Needed: 
Click here to see classes that satisfy this requirement. 
3 credits in Natural Science 
BIOL 225L 
TOPICS IN NUTRITION 
A- 
3 
Spring 2015 
3 credits Ethics/Applied Ethics/Religious Studies 
REST 209L 
WORLD RELIGIONS 
A- 
3 
Spring 2015 
3 credits in Social Science 
ECON 104L 
PRINC MACROECONOM 
T 
3 
Fall 2014 
+1

Also, have you tried anything? Regular expressions have a multi-line modifier. –

+0

I could only get this far: (\d credits)(.*)(?=\n). It only grabs the first line. I'm very new to regex and haven't really got the hang of it. – MrCokeman

Answers

0
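(?:^|(?<=\n))\d+\s+credits[\s\S]*?(?=\n\d+\s+credits|$)

You can use this pattern with findall: each match runs from an "N credits" line up to, but not including, the next one. See the demo.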

https://regex101.com/r/gK9aI6/1

import re 
p = re.compile(r'(?:^|(?<=\n))\d+\s+credits[\s\S]*?(?=\n\d+\s+credits|$)')
test_str = "3 credits in Philosophical Perspectives\nPHIL 101L\nPHILOSOPHICAL PERSPECTIVES\nB\n3\nFall 2014\nStudent View\n3 credits in Fine Arts\nART 160L\nHIST WEST ART I\nB+\n3\nFall 2014\n3 credits in History\nStill Needed:\nClick here to see classes that satisfy this requirement.\n3 credits in Literature\nENG 201L\nINTRO LINGUISTIC\nIP\n(3)\nSpring 2016\n3 credits in Math\nStill Needed:\nClick here to see classes that satisfy this requirement.\n3 credits in Natural Science\nBIOL 225L\nTOPICS IN NUTRITION\nA-\n3\nSpring 2015\n3 credits Ethics/Applied Ethics/Religious Studies\nREST 209L\nWORLD RELIGIONS\nA-\n3\nSpring 2015\n3 credits in Social Science\nECON 104L\nPRINC MACROECONOM\nT\n3\nFall 2014" 

# findall returns one string per "N credits" block
print(p.findall(test_str))
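
The question also asks for the classthat are still needed. A minimal sketch of one way to separate them, assuming an unmet requirement always contains the literal text "Still Needed:" (the helper name split_requirements and the shortened sample are illustrative, not part of the original answer):

```python
import re

# Block pattern: from an "N credits" line up to the next such line (or end of text).
BLOCK = re.compile(r"(?:^|(?<=\n))\d+\s+credits[\s\S]*?(?=\n\d+\s+credits|$)")


def split_requirements(text):
    """Return (completed, still_needed) lists of requirement blocks."""
    completed, still_needed = [], []
    for block in BLOCK.findall(text):
        # Assumption: unmet requirements are marked with "Still Needed:".
        target = still_needed if "Still Needed:" in block else completed
        target.append(block.strip())
    return completed, still_needed


sample = (
    "3 credits in Fine Arts\nART 160L\nHIST WEST ART I\nB+\n3\nFall 2014\n"
    "3 credits in History\nStill Needed:\n"
    "Click here to see classes that satisfy this requirement.\n"
    "3 credits in Literature\nENG 201L\nINTRO LINGUISTIC\nIP\n(3)\nSpring 2016"
)

completed, still_needed = split_requirements(sample)
for block in still_needed:
    # The first line of each block names the requirement.
    print(block.splitlines()[0])
```

Filtering after the match keeps the regex itself simple; the check for "Still Needed:" can be changed without touching the pattern.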
+0

Thanks, this answer worked best! – MrCokeman

0

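You can combine a non-greedy "anything" sequence with the known structure of the last line of each group to parse the text into chunks:

/((?:.\n?)*?(?:Fall|Summer|Spring|Winter)\s\d{4})/g 
  1. (?:.\n?)*? - lazily consumes any character (optionally followed by a newline)
  2. then simply matches the terminating sequence: (?:Fall|Summer|Spring|Winter)\s\d{4}

See the demo here, and note that each credit block is captured by a single regex match.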
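
For use from Python, the same pattern can be passed to findall in place of the /.../g form. This is a sketch on a shortened copy of the question's text, not the linked demo. Because a chunk only ends at a season and year, the stray "Student View" line starts the second chunk, and the History requirement (which has no term line) is merged with the Literature block:

```python
import re

# Python form of the pattern above; findall plays the role of the /g flag.
CHUNK = re.compile(r"(?:.\n?)*?(?:Fall|Summer|Spring|Winter)\s\d{4}")

sample = (
    "3 credits in Philosophical Perspectives\nPHIL 101L\n"
    "PHILOSOPHICAL PERSPECTIVES\nB\n3\nFall 2014\n"
    "Student View\n"
    "3 credits in Fine Arts\nART 160L\nHIST WEST ART I\nB+\n3\nFall 2014\n"
    "3 credits in History\nStill Needed:\n"
    "Click here to see classes that satisfy this requirement.\n"
    "3 credits in Literature\nENG 201L\nINTRO LINGUISTIC\nIP\n(3)\nSpring 2016"
)

chunks = CHUNK.findall(sample)
for chunk in chunks:
    print(chunk)
    print("-" * 28)
```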

0

Try the following snippet:

import re 

courses = r"....your...content" 

rx = re.compile(r"\d+.*?(?:FALL|SPRING)\s*\d{4}", re.IGNORECASE | re.DOTALL) 
for course in rx.finditer(courses): 
    print(course.group()) 
    print("----------------------------\n") 

If courses contains the sample content, the output will be:

3 credits in Philosophical Perspectives 
PHIL 101L 
PHILOSOPHICAL PERSPECTIVES 
B 
3 
Fall 2014 
---------------------------- 

3 credits in Fine Arts 
ART 160L 
HIST WEST ART I 
B+ 
3 
Fall 2014 
---------------------------- 

3 credits in History 
Still Needed: 
Click here to see classes that satisfy this requirement. 
3 credits in Literature 
ENG 201L 
INTRO LINGUISTIC 
IP 
(3) 
Spring 2016 
---------------------------- 

... omitting rest....
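
In the output above, the History requirement is merged into the Literature block, because the pattern only stops at a season and year. A sketch of an alternative that uses the multi-line modifier mentioned in the comments: find every line that starts with "N credits" and slice the text between those positions. The names HEADER and split_blocks are illustrative, and the sample is a shortened copy of the question's text.

```python
import re

# re.MULTILINE lets ^ match at the start of every line, not only at the start of the text.
HEADER = re.compile(r"^\d+\s+credits", re.MULTILINE)


def split_blocks(text):
    """Slice text into blocks that each begin at an 'N credits' line."""
    starts = [m.start() for m in HEADER.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


sample = (
    "3 credits in Fine Arts\nART 160L\nHIST WEST ART I\nB+\n3\nFall 2014\n"
    "3 credits in History\nStill Needed:\n"
    "Click here to see classes that satisfy this requirement.\n"
    "3 credits in Literature\nENG 201L\nINTRO LINGUISTIC\nIP\n(3)\nSpring 2016"
)

blocks = split_blocks(sample)
for block in blocks:
    print(block)
    print("-" * 28)
```

With this approach, any line that does not start a new requirement (such as "Student View" in the full text) stays attached to the block before it.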